Name | Income | Age | Education | Vendor |
Blue Blood Estates | High | 35-54 | College | PRIZM |
Shotguns and Pickups | Middle | 35-64 | High school | PRIZM |
Southside City | Low | Mix | Grade school | MicroVision |
Living off Land | Middle-Low | Families with children | Low | MicroVision |
University USA | Very Low | Young-Mix | Medium-High | MicroVision |
Sunset Years | Medium | Seniors | Medium | MicroVision |
ID | Name | Age | Balance ($) | Income | Eyes | Gender |
1 | Amy | 62 | 0 | Medium | Brown | F |
2 | Al | 53 | 1,800 | Medium | Green | M |
3 | Betty | 47 | 16,543 | High | Brown | F |
4 | Bob | 32 | 45 | Medium | Green | M |
5 | Carla | 21 | 2,300 | High | Blue | F |
6 | Carl | 27 | 5,400 | High | Brown | M |
7 | Donna | 50 | 165 | Low | Blue | F |
8 | Don | 46 | 0 | High | Blue | M |
9 | Edna | 27 | 500 | Low | Blue | F |
10 | Ed | 68 | 1,200 | Low | Blue | M |
How would you cluster this data?
For i in NumberOfClusters
    Randomly initialize i clusters
    Do
        Compute class likelihood vectors
        Compute normalized probabilities for each data point
        Update class model parameters
        Analyze new parameters that will maximize probabilities
            (For normal function, recalculate mean, variance, skewness, kurtosis)
    Until convergence (sum of classes' log marginal probability > threshold or no change)
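The loop above is the expectation-maximization pattern. As a rough illustration only, the following sketch fits a mixture of one-dimensional normal distributions to the Age column of the customer table. It is not AutoClass itself: the function names, the variance floor, the fixed random seed, and the use of the plain log-likelihood (rather than a marginal probability) as the convergence test are all assumptions made for this example, and only the mean and variance are re-estimated.

```python
import math
import random

# Age column from the customer table above.
AGES = [62, 53, 47, 32, 21, 27, 50, 46, 27, 68]


def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def em_gmm_1d(data, k, max_iter=200, tol=1e-6, seed=0, var_floor=1.0):
    """Fit k normal classes to 1-D data with EM (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n

    # Randomly initialize k classes: distinct data points serve as the means.
    means = [float(m) for m in rng.sample(sorted(set(data)), k)]
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    history = []

    for _ in range(max_iter):
        # E-step: class likelihoods, then normalized probabilities per point.
        resp = []
        log_lik = 0.0
        for x in data:
            logs = [math.log(w) + log_normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            top = max(logs)
            log_total = top + math.log(sum(math.exp(l - top) for l in logs))
            log_lik += log_total
            resp.append([math.exp(l - log_total) for l in logs])
        history.append(log_lik)

        # Stop when the log-likelihood no longer changes.
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break

        # M-step: update class weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)  # guard against empty class
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)  # keep a class from collapsing

    return weights, means, variances, history


weights, means, variances, history = em_gmm_1d(AGES, k=2)
print("weights:", [round(w, 3) for w in weights])
print("means:", [round(m, 1) for m in means])
print("iterations:", len(history))
```

Repeating the call for several values of k mirrors the outer loop of the pseudocode. Comparing those runs by raw log-likelihood alone would tend to favor more classes, which is one reason the pseudocode refers to a marginal probability instead.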
imports-85c.hd2
num_db2_format_defs 2
number_of_attributes 26
separator_char ','
; Can also supply comment char and unknown token
0 discrete nominal "symboling" range 7
1 real scalar "normalized-loses" zero_point 0.0 rel_error 0.01
2 discrete nominal "make" range 22
3 discrete nominal "fuel-type" range 2
4 discrete nominal "aspiration" range 2
5 discrete nominal "num-of-doors" range 2
6 discrete nominal "body-style" range 5
7 discrete nominal "drive-wheels" range 3
8 discrete nominal "engine-location" range 2
9 real scalar "wheel-base" zero_point 0.0 rel_error 0.001
10 real scalar "length" zero_point 0.0 rel_error 0.001
11 real scalar "width" zero_point 0.0 rel_error 0.001
12 real scalar "height" zero_point 0.0 rel_error 0.001
13 real scalar "curb-weight" zero_point 0.0 rel_error 0.0002
14 discrete nominal "engine-type" range 7
15 discrete nominal "num-of-cylinders" range 7
16 real scalar "engine-size" zero_point 0.0 rel_error 0.01
17 discrete nominal "fuel-system" range 8
18 real scalar "bore" zero_point 0.0 rel_error 0.003
19 real scalar "stroke" zero_point 0.0 rel_error 0.003
20 real scalar "compression-ratio" zero_point 0.0 rel_error 0.003
21 real scalar "horse-power" zero_point 0.0 rel_error 0.01
22 real scalar "peak-rpm" zero_point 0.0 rel_error 0.02
23 real scalar "city-mpg" zero_point 0.0 rel_error 0.04
24 real scalar "highway-mpg" zero_point 0.0 rel_error 0.04
25 real scalar "price" zero_point 0.0 rel_error 0.001
imports-85c.db2
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,16500
1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.50,171.20,65.50,52.40,2823,ohcv,six,152,mpfi,2.68,3.47,9.00,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.40,176.60,66.40,54.30,2824,ohc,five,136,mpfi,3.19,3.40,8.00,115,5500,18,22,17450
2,?,audi,gas,std,two,sedan,fwd,front,99.80,177.30,66.30,53.10,2507,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,front,105.80,192.70,71.40,55.70,2844,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,17710
1,?,audi,gas,std,four,wagon,fwd,front,105.80,192.70,71.40,55.70,2954,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,18920
1,158,audi,gas,turbo,four,sedan,fwd,front,105.80,192.70,71.40,55.90,3086,ohc,five,131,mpfi,3.13,3.40,8.30,140,5500,17,20,23875
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,3053,ohc,five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?
2,192,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2395,ohc,four,108,mpfi,3.50,2.80,8.80,101,5800,23,29,16430
...
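To make the link between the header file and the data file concrete, here is a small sketch that reads comma-separated records like the ones above, converts the columns declared "real scalar" in the header to floats, and maps "?" to a missing value. The helper name `parse_db2` and the choice of `None` for unknowns are assumptions for this example, and treating "?" as the unknown token is inferred from the sample data. AutoClass does its own parsing; this is only a way to inspect the file from Python.

```python
import csv
import io

# Two records copied from the data file above.
SAMPLE = (
    "3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,"
    "48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495\n"
    "0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,"
    "3053,ohc,five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?\n"
)

# Attribute indices declared "real scalar" in the header file.
REAL_COLUMNS = {1, 9, 10, 11, 12, 13, 16, 18, 19, 20, 21, 22, 23, 24, 25}


def parse_db2(text, real_columns, unknown="?", separator=","):
    """Parse data records into lists, using None for unknown values."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=separator):
        if not fields:
            continue  # skip blank lines
        row = []
        for index, raw in enumerate(fields):
            value = raw.strip()
            if value == unknown:
                row.append(None)          # missing value
            elif index in real_columns:
                row.append(float(value))  # real scalar attribute
            else:
                row.append(value)         # discrete nominal attribute
        rows.append(row)
    return rows


rows = parse_db2(SAMPLE, REAL_COLUMNS)
for row in rows:
    print(len(row), row[2], row[6], row[25])
```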
imports-85c.model
model_index 0 4
ignore 0
single_normal_cm 1 18 19 21 22 25
single_normal_cn 9 10 11 12 13 16 20 23 24
single_multinomial default
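One way to read the model file is as a mapping from attribute index to model term, with "default" catching every attribute not listed elsewhere. The sketch below builds that mapping. The function `parse_model` is hypothetical, and reading the second number after `model_index` as the count of term lines is an assumption based on the four lines that follow it.

```python
MODEL_TEXT = """model_index 0 4
ignore 0
single_normal_cm 1 18 19 21 22 25
single_normal_cn 9 10 11 12 13 16 20 23 24
single_multinomial default
"""


def parse_model(text, n_attributes):
    """Return a list giving the model term assigned to each attribute index."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, terms = lines[0], lines[1:]
    if header[0] != "model_index":
        raise ValueError("expected a model_index line first")
    # Assumption: the second number is the count of term lines that follow.
    if int(header[2]) != len(terms):
        raise ValueError("term count does not match the header")

    assigned = {}
    default_term = None
    for term, *attributes in terms:
        if attributes == ["default"]:
            default_term = term  # applies to every attribute not listed
        else:
            for attribute in attributes:
                assigned[int(attribute)] = term
    return [assigned.get(i, default_term) for i in range(n_attributes)]


terms_by_attribute = parse_model(MODEL_TEXT, n_attributes=26)
for index, term in enumerate(terms_by_attribute):
    print(index, term)
```

In the sample data, attributes 1 and 25 both contain "?" entries and both appear under `single_normal_cm`, while the discrete attributes such as "make" (index 2) fall through to `single_multinomial`.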
imports-85c.s-params (abbreviated)
# start_j_list = 2, 3, 5, 7, 10, 15, 25
# min_report_period = 30
# max_duration = 0
# max_n_tries = 0
# n_save = 2
...
imports-85c.log
AUTOCLASS C (version 2.5) STARTING at Mon Jun 26 16:30:39 1995

AUTOCLASS -SEARCH default parameters:
...

WELCOME TO AUTOCLASS.
1) Each time I have finished a new 'trial', or attempt to find a good classification, I will print the number of classes that trial started and ended with, such as 9->7.
2) If that trial results in a duplicate of a previous run, I will print print 'dup' first.
3) If that trial results in a classification better than any previous, I will print 'best' first.
4) If more than 30 seconds have passed since the last report, and a new classification has been found which is better than any previous ones, I will report on that classification and on the status of the search so far.
5) This report will include an estimate of the time it will take to find another even better classification, and how much better that will be. In addition, I will estiamte a lower bound on how long it might take to find the very best classification, and how much better that might be.
6) If you are warned about too much time in overhead, you may want to change the parameters n_save, min_save_period, min_report_period, or min_checkpoint_period.
7) To quit searching, type a 'q', hit <return>, and wait. Otherwise I'll go on until I complete trial number (12).
8) If needed, every 30 minutes I will save the best 2 classifications so far to file:
   /home/tove/p/autoclass-c/sample/imports-85c.results-bin
   and a description of the search to file:
   /home/tove/p/autoclass-c/sample/imports-85c.search
9) A record of this search will be printed to file:
   /home/tove/p/autoclass-c/sample/imports-85c.log

BEGINNING SEARCH at Mon Jun 26 16:30:40 1995

[j_in=2] [cs-3: cycles 15] best2->2(1)
[j_in=3] [cs-3: cycles 49] best3->3(2)
[j_in=5] [cs-3: cycles 12] best5->5(3)
[j_in=7] [cs-3: cycles 11] best7->7(4)
[j_in=10] [cs-3: cycles 14] best10->10(5)
[j_in=15] [cs-3: cycles 28] 15->15(6)
[j_in=25] [cs-3: cycles 10] 25->22(7)

---------------- NEW BEST CLASSIFICATION FOUND on try 5 -------------
It has 10 CLASSES with WEIGHTS 32 30 28 24 21 21 20 11 10 8
PROBABILITY of both the data and the classification = exp(-16368.367)
(Also found 4 other better than last report.)

----------- SEARCH STATUS as of Mon Jun 26 16:31:12 1995 -----------
It just took 32 seconds since beginning.
Estimate < 28 seconds to find a classification exp(61.7) [= 6.0e+26] times more probable.
Estimate >> 1 minute 6 seconds to find the very best classification, which may be exp(28.6) to exp(11764.5) times more probable.
Have seen 7 of the estimated > 21 possible classifications (based on 0 duplicates do far).
Log-Normal fit to classifications probabilities has M(ean) -16598.5, S(igma) 154.9
Choosing initial n-classes randomly from a log_normal [M-S, M, M+S] = [2.9, 7.0, 16.9]
Overhead time is 3.0 % of total search time

[j_in=9] [cs-3: cycles 10] 9->9(8)
[j_in=3] [cs-3: cycles 11] 3->3(9)
[j_in=5] [cs-3: cycles 48] 5->5(10)
[j_in=3] [cs-3: cycles 18] 3->3(11)
[j_in=5] [cs-3: cycles 35] 5->5(12)

ENDING SEARCH because max number of tries reached at Mon Jun 26 16:31:32 1995 after a total of 12 tries over 53 seconds
A log of this search is in file:
   /home/tove/p/autoclass-c/sample/imports-85c.log
The search results are stored in file:
   /home/tove/p/autoclass-c/sample/imports-85c.results-bin
This search can be restarted by having "force_new_search_p = false" in file:
   /home/tove/p/autoclass-c/sample/imports-85c.s-params
and reinvoking the "autoclass -search ..." form

------------------ SUMMARY OF 10 BEST RESULTS ------------------
PROBABILITY: exp(-16368.367) N_CLASSES: 10 FOUND ON TRY: 5 *SAVED*
PROBABILITY: exp(-16477.345) N_CLASSES: 9 FOUND ON TRY: 8 *SAVED*
PROBABILITY: exp(-16537.556) N_CLASSES: 15 FOUND ON TRY: 6
PROBABILITY: exp(-16542.413) N_CLASSES: 7 FOUND ON TRY: 4
PROBABILITY: exp(-16590.504) N_CLASSES: 5 FOUND ON TRY: 10
PROBABILITY: exp(-16617.452) N_CLASSES: 5 FOUND ON TRY: 3
PROBABILITY: exp(-16632.595) N_CLASSES: 5 FOUND ON TRY: 12
PROBABILITY: exp(-16673.545) N_CLASSES: 22 FOUND ON TRY: 7
PROBABILITY: exp(-16759.053) N_CLASSES: 3 FOUND ON TRY: 2
PROBABILITY: exp(-16898.385) N_CLASSES: 3 FOUND ON TRY: 9
...
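The probabilities in the summary are far too small to evaluate directly (exp(-16368.367) underflows to zero in double precision), so they are best compared through differences of their exponents. The sketch below does this for the ten results listed in the summary. The variable names are illustrative, and the comparison is only arithmetic on the reported numbers, not something AutoClass prints.

```python
import math

# (log probability, number of classes, try number) from the summary above.
RESULTS = [
    (-16368.367, 10, 5),
    (-16477.345, 9, 8),
    (-16537.556, 15, 6),
    (-16542.413, 7, 4),
    (-16590.504, 5, 10),
    (-16617.452, 5, 3),
    (-16632.595, 5, 12),
    (-16673.545, 22, 7),
    (-16759.053, 3, 2),
    (-16898.385, 3, 9),
]

best_log_p = max(log_p for log_p, _, _ in RESULTS)

# The best result is exp(gap) times more probable than each other result.
gaps = [best_log_p - log_p for log_p, _, _ in RESULTS]

# Relative probabilities with respect to the best result; subtracting the
# maximum first keeps the exponent in a range that math.exp can handle.
relative = [math.exp(log_p - best_log_p) for log_p, _, _ in RESULTS]

for (log_p, n_classes, try_number), gap in zip(RESULTS, gaps):
    print(f"try {try_number:2d}  classes {n_classes:2d}  "
          f"log p {log_p:.3f}  best is exp({gap:.3f}) times more probable")
```

The gap between the two saved classifications is about 109 in the exponent, so under this measure the 10-class result is overwhelmingly preferred over the 9-class one.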
imports-85c.class-text-1
CROSS REFERENCE: CLASS => CASE NUMBER MEMBERSHIP

AutoClass CLASSIFICATION for the 205 cases in:
   /home/centauri/cook/projects/ac/sample/imports-85c.db2
   /home/centauri/cook/projects/ac/sample/imports-85c.hd2
with log-A<X/H> (approximate marginal likelihood) = -16564.197
from classification results file:
   /home/centauri/cook/projects/ac/sample/imports-85c.results-bin
and using models:
   /home/centauri/cook/projects/ac/sample/imports-85c.model - index = 0

CLASS = 0

Case #  make           num-of-doors  body-style   (Cls Prob)
--------------------------------------------------------------------------------
     5  audi           four          sedan        0.99
     7  audi           four          sedan        1.00
     8  audi           four          wagon        1.00
     9  audi           four          sedan        1.00
    10  audi           two           hatchback    1.00
    15  bmw            four          sedan        1.00
    16  bmw            four          sedan        1.00
    17  bmw            two           sedan        1.00
    18  bmw            four          sedan        1.00
    48  jaguar         four          sedan        1.00
    49  jaguar         four          sedan        1.00
    50  jaguar         two           sedan        1.00
    68  mercedes-benz  four          sedan        1.00
    69  mercedes-benz  four          wagon        1.00
    70  mercedes-benz  two           hardtop      1.00
    71  mercedes-benz  four          sedan        1.00
...

CLASS = 1

Case #  make           num-of-doors  body-style   (Cls Prob)
--------------------------------------------------------------------------------
     1  alfa-romero    two           convertible  1.00
     2  alfa-romero    two           convertible  1.00
     3  alfa-romero    two           hatchback    1.00
    11  bmw            two           sedan        1.00
    12  bmw            four          sedan        1.00
    13  bmw            two           sedan        1.00
    14  bmw            four          sedan        0.99   0 0.01
    30  dodge          two           hatchback    1.00
    47  isuzu          two           hatchback    1.00
    56  mazda          two           hatchback    1.00
    57  mazda          two           hatchback    1.00
    58  mazda          two           hatchback    1.00
    59  mazda          two           hatchback    1.00
    66  mazda          four          sedan        0.99
    76  mercury        two           hatchback    1.00
    83  mitsubishi     two           hatchback    1.00
    84  mitsubishi     two           hatchback    1.00
    85  mitsubishi     two           hatchback    1.00
   105  nissan         two           hatchback
...

CLASS = 2

Case #  make           num-of-doors  body-style   (Cls Prob)
--------------------------------------------------------------------------------
    19  chevrolet      two           hatchback    1.00
    20  chevrolet      two           hatchback    1.00
    21  chevrolet      four          sedan        1.00
    22  dodge          two           hatchback    1.00
    23  dodge          two           hatchback    1.00
    31  honda          two           hatchback    1.00
    32  honda          two           hatchback    1.00
    33  honda          two           hatchback    1.00
    34  honda          two           hatchback    1.00
    35  honda          two           hatchback    1.00
    36  honda          four          sedan        1.00
    37  honda          four          wagon        1.00
    45  isuzu          two           sedan        1.00
    46  isuzu          four          sedan        1.00
    51  mazda          two           hatchback    1.00
...

CLASS = 9 (continued)

Case #  make           num-of-doors  body-style   (Cls Prob)
--------------------------------------------------------------------------------
    81  mitsubishi     two           hatchback    1.00
    88  mitsubishi     four          sedan        1.00
    89  mitsubishi     four          sedan        1.00
   120  plymouth       two           hatchback    1.00
   190  volkswagen     two           convertible  0.99
imports-85c.case-text-1
CROSS REFERENCE: CASE NUMBER => MOST PROBABLE CLASS

AutoClass CLASSIFICATION for the 205 cases in:
   /home/centauri/cook/projects/ac/sample/imports-85c.db2
   /home/centauri/cook/projects/ac/sample/imports-85c.hd2
with log-A<X/H> (approximate marginal likelihood) = -16564.197
from classification results file:
   /home/centauri/cook/projects/ac/sample/imports-85c.results-bin
and using models:
   /home/centauri/cook/projects/ac/sample/imports-85c.model - index = 0

Case #  Class  Prob     Case #  Class  Prob     Case #  Class  Prob
--------------------------------------------------------------------------------
     1      1  1.00         47      1  0.99         93      2  1.00
     2      1  1.00         48      0  1.00         94      2  0.99
     3      1  1.00         49      0  1.00         95      2  1.00
     4      3  0.99         50      0  1.00         96      2  0.99
     5      0  0.99         51      2  0.99         97      2  0.99
     6      4  0.99         52      2  0.99         98      2  0.99
     7      0  1.00         53      2  0.99         99      2  0.99
     8      0  1.00         54      2  0.99        100      3  0.99
     9      0  1.00         55      2  0.99        101      3  0.99
    10      0  0.99         56      1  1.00        102      0  0.99
...
imports-85c.influ-o-text-1
...
CLASSIFICATION HAS 10 POPULATED CLASSES: (max global influence value = 7.063)

We give below a heuristic measure of class strength: the approximate geometric mean probability for instances belonging to each class, computed from the class parameters and statistics. This approximates the contribution made, by any one instance "belonging" to the class, to the log probability of the data set w.r.t. the classification. It thus provides a heuristic measure of how strongly each class predicts "its" instances.

Class   Log of class   Relative         Class    Normalized
 num    strength       class strength   weight   class weight
   0    -8.25e+01      1.64e-10           51     0.249
   1    -8.01e+01      1.69e-09           39     0.190
   2    -6.99e+01      4.89e-05           29     0.141
   3    -6.86e+01      1.75e-04           18     0.088
   4    -7.25e+01      3.58e-06           16     0.078
   5    -6.86e+01      1.68e-04           14     0.068
   6    -7.11e+01      1.43e-05           12     0.059
   7    -5.99e+01      1.00e+00            9     0.044
   8    -6.95e+01      7.20e-05            9     0.044
   9    -6.95e+01      6.73e-05            8     0.039
...

ORDERED LIST OF NORMALIZED ATTRIBUTE INFLUENCE VALUES SUMMED OVER ALL CLASSES:

This gives a rough heuristic measure of relative influence of each attribute in differentiating the classes from the overall data set. Note that "influence values" are only computable with respect to the model terms. When multiple attributes are modeled by a single dependent term (e.g. multi_normal_cn), the term influence value is distributed equally over the modeled attributes.

num   description              I-*k
 38:  Log compression-ratio    1.000
 36:  Log curb-weight          0.607
 29:  Log horse-power          0.604
  2:  make                     0.589
 37:  Log engine-size          0.582
 32:  Log wheel-base           0.550
 28:  Log stroke               0.515
 33:  Log length               0.496
 31:  Log price                0.487
 34:  Log width                0.437
 17:  fuel-system              0.414
 27:  Log bore                 0.408
 26:  Log normalized-loses     0.305
 35:  Log height               0.292
 39:  Log city-mpg             0.222
  7:  drive-wheels             0.209
 40:  Log highway-mpg          0.191
 14:  engine-type              0.160
  6:  body-style               0.130
  3:  fuel-type                0.121
  5:  num-of-doors             0.106
 30:  Log peak-rpm             0.106
 15:  num-of-cylinders         0.089
  4:  aspiration               0.075
  8:  engine-location          0.009
  0:  symboling                -----
  1:  normalized-loses         -----
...
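The last two columns of the class strength table in the report above can be approximately rederived from the first and third. The sketch below assumes that the relative class strength is exp(log strength minus the largest log strength) and that the normalized class weight is the class weight divided by the total number of cases. Both are readings of the column headings, not documented formulas. Because the report prints the log strengths to three significant digits, the recomputed relative strengths agree with the printed ones only roughly.

```python
import math

# (class number, log of class strength, class weight) from the report.
CLASSES = [
    (0, -82.5, 51),
    (1, -80.1, 39),
    (2, -69.9, 29),
    (3, -68.6, 18),
    (4, -72.5, 16),
    (5, -68.6, 14),
    (6, -71.1, 12),
    (7, -59.9, 9),
    (8, -69.5, 9),
    (9, -69.5, 8),
]

max_log_strength = max(log_s for _, log_s, _ in CLASSES)
total_weight = sum(weight for _, _, weight in CLASSES)

# Assumed formulas: strength relative to the strongest class, and weight
# as a fraction of all cases.
relative_strength = {c: math.exp(log_s - max_log_strength)
                     for c, log_s, _ in CLASSES}
normalized_weight = {c: weight / total_weight for c, _, weight in CLASSES}

print("total weight:", total_weight)
for c, log_s, weight in CLASSES:
    print(f"class {c}: relative strength {relative_strength[c]:.2e}, "
          f"normalized weight {normalized_weight[c]:.3f}")
```

The class weights add up to 205, the number of cases in the data file, and the normalized weights reproduce the printed column to three decimals.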
Results
Results after removing gene duplication
Results