| Name | Income | Age | Education | Vendor |
| Blue Blood Estates | High | 35-54 | College | PRIZM |
| Shotguns and Pickup | Middle | 35-64 | High school | PRIZM |
| Southside City | Low | Mix | Grade school | MicroVision |
| Living off Land | Middle-Low | Families with children | Low | MicroVision |
| University USA | Very Low | Young-Mix | Medium-High | MicroVision |
| Sunset Years | Medium | Seniors | Medium | MicroVision |

| ID | Name | Age | Balance ($) | Income | Eyes | Gender |
| 1 | Amy | 62 | 0 | Medium | Brown | F |
| 2 | Al | 53 | 1,800 | Medium | Green | M |
| 3 | Betty | 47 | 16,543 | High | Brown | F |
| 4 | Bob | 32 | 45 | Medium | Green | M |
| 5 | Carla | 21 | 2,300 | High | Blue | F |
| 6 | Carl | 27 | 5,400 | High | Brown | M |
| 7 | Donna | 50 | 165 | Low | Blue | F |
| 8 | Don | 46 | 0 | High | Blue | F |
| 9 | Edna | 27 | 500 | Low | Blue | F |
| 10 | Ed | 68 | 1,200 | Low | Blue | M |
How would you cluster this data?

For i in NumberOfClusters
    Randomly initialize i clusters
    Do
        Compute class likelihood vectors
        Compute normalized probabilities for each data point
        Update class model parameters
            Analyze new parameters that will maximize probabilities
            (For normal function, recalculate mean, variance, skewness, kurtosis)
    Until convergence (sum of classes' log marginal probability > threshold or no change)
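The loop above has the shape of an expectation-maximization (EM) iteration. As a rough illustration only, the Python sketch below runs that loop for a one-dimensional mixture of normal distributions over the Age column of the customer table. It is not AutoClass's implementation: the name `em_gaussian_mixture`, the variance floor `min_var`, and the stopping tolerance `tol` are assumptions made for this sketch. It updates only class weights, means, and variances (no skewness, kurtosis, or priors), and it stops when the log-likelihood stops changing.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gaussian_mixture(data, k, seed=0, max_iter=200, tol=1e-6, min_var=1.0):
    """Fit k one-dimensional normal classes to data with plain EM (illustrative)."""
    rng = random.Random(seed)
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n

    # Randomly initialize k clusters: distinct data values serve as starting means.
    means = rng.sample(sorted(set(data)), k)
    variances = [max(min_var, grand_var)] * k
    weights = [1.0 / k] * k
    history = []  # log-likelihood after each pass

    for _ in range(max_iter):
        # Class likelihood vectors, normalized into per-point membership probabilities.
        memberships = []
        log_likelihood = 0.0
        for x in data:
            lik = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(lik)
            log_likelihood += math.log(total)
            memberships.append([value / total for value in lik])
        history.append(log_likelihood)

        # Until convergence: stop when the log-likelihood no longer changes.
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break

        # Update class model parameters to maximize the expected log-likelihood.
        for j in range(k):
            nj = sum(row[j] for row in memberships)
            weights[j] = nj / n
            means[j] = sum(row[j] * x for row, x in zip(memberships, data)) / nj
            var = sum(row[j] * (x - means[j]) ** 2 for row, x in zip(memberships, data)) / nj
            variances[j] = max(min_var, var)  # assumed floor; avoids collapsed classes

    return weights, means, variances, history


# Outer loop of the pseudocode: try several numbers of clusters.
ages = [62, 53, 47, 32, 21, 27, 50, 46, 27, 68]
fits = {k: em_gaussian_mixture(ages, k) for k in (1, 2, 3)}
for k, (weights, means, variances, history) in fits.items():
    print(k, [round(m, 1) for m in means], round(history[-1], 3))
```

Comparing the final log-likelihoods across values of k is only a crude stand-in for the marginal-probability comparison reported in the AutoClass log, since a plain likelihood never penalizes extra classes.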
imports-85c.hd2
num_db2_format_defs 2
number_of_attributes 26
separator_char ','
; Can also supply comment char and unknown token
0 discrete nominal "symboling" range 7
1 real scalar "normalized-loses" zero_point 0.0 rel_error 0.01
2 discrete nominal "make" range 22
3 discrete nominal "fuel-type" range 2
4 discrete nominal "aspiration" range 2
5 discrete nominal "num-of-doors" range 2
6 discrete nominal "body-style" range 5
7 discrete nominal "drive-wheels" range 3
8 discrete nominal "engine-location" range 2
9 real scalar "wheel-base" zero_point 0.0 rel_error 0.001
10 real scalar "length" zero_point 0.0 rel_error 0.001
11 real scalar "width" zero_point 0.0 rel_error 0.001
12 real scalar "height" zero_point 0.0 rel_error 0.001
13 real scalar "curb-weight" zero_point 0.0 rel_error 0.0002
14 discrete nominal "engine-type" range 7
15 discrete nominal "num-of-cylinders" range 7
16 real scalar "engine-size" zero_point 0.0 rel_error 0.01
17 discrete nominal "fuel-system" range 8
18 real scalar "bore" zero_point 0.0 rel_error 0.003
19 real scalar "stroke" zero_point 0.0 rel_error 0.003
20 real scalar "compression-ratio" zero_point 0.0 rel_error 0.003
21 real scalar "horse-power" zero_point 0.0 rel_error 0.01
22 real scalar "peak-rpm" zero_point 0.0 rel_error 0.02
23 real scalar "city-mpg" zero_point 0.0 rel_error 0.04
24 real scalar "highway-mpg" zero_point 0.0 rel_error 0.04
25 real scalar "price" zero_point 0.0 rel_error 0.001
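To make the layout of the header file concrete, here is a small Python sketch that reads attribute definitions of the form shown above. It is an assumption-laden illustration, not AutoClass's own reader: the function name `parse_hd2` and the dictionary layout are hypothetical, and only the constructs visible in this sample (keyword lines, `;` comment lines, and numbered attribute lines with option pairs) are handled.

```python
import shlex

# First few lines of the header file, copied from the sample above.
HEADER_SAMPLE = """\
num_db2_format_defs 2
number_of_attributes 26
separator_char ','
; Can also supply comment char and unknown token
0 discrete nominal "symboling" range 7
1 real scalar "normalized-loses" zero_point 0.0 rel_error 0.01
2 discrete nominal "make" range 22
"""


def parse_hd2(text):
    """Split header text into keyword settings and attribute definitions."""
    settings, attributes = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        tokens = shlex.split(line)  # honors the quotes around names and ','
        if tokens[0].isdigit():
            index, kind, subtype, name = tokens[:4]
            # Remaining tokens come in (option, value) pairs, e.g. range 7.
            options = dict(zip(tokens[4::2], tokens[5::2]))
            attributes.append(
                {"index": int(index), "type": kind, "subtype": subtype,
                 "name": name, "options": options}
            )
        else:
            settings[tokens[0]] = tokens[1]
    return settings, attributes


settings, attributes = parse_hd2(HEADER_SAMPLE)
for attribute in attributes:
    print(attribute["index"], attribute["name"], attribute["type"], attribute["options"])
```

Option values are kept as strings here; converting `range`, `zero_point`, and `rel_error` to numbers is left out to keep the sketch short.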
imports-85c.db2
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,16500
1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.50,171.20,65.50,52.40,2823,ohcv,six,152,mpfi,2.68,3.47,9.00,154,5000,19,26,16500
2,164,audi,gas,std,four,sedan,fwd,front,99.80,176.60,66.20,54.30,2337,ohc,four,109,mpfi,3.19,3.40,10.00,102,5500,24,30,13950
2,164,audi,gas,std,four,sedan,4wd,front,99.40,176.60,66.40,54.30,2824,ohc,five,136,mpfi,3.19,3.40,8.00,115,5500,18,22,17450
2,?,audi,gas,std,two,sedan,fwd,front,99.80,177.30,66.30,53.10,2507,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,15250
1,158,audi,gas,std,four,sedan,fwd,front,105.80,192.70,71.40,55.70,2844,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,17710
1,?,audi,gas,std,four,wagon,fwd,front,105.80,192.70,71.40,55.70,2954,ohc,five,136,mpfi,3.19,3.40,8.50,110,5500,19,25,18920
1,158,audi,gas,turbo,four,sedan,fwd,front,105.80,192.70,71.40,55.90,3086,ohc,five,131,mpfi,3.13,3.40,8.30,140,5500,17,20,23875
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,3053,ohc,five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?
2,192,bmw,gas,std,two,sedan,rwd,front,101.20,176.80,64.80,54.30,2395,ohc,four,108,mpfi,3.50,2.80,8.80,101,5800,23,29,16430
...
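The data file pairs with the header: each record holds 26 comma-separated fields, and `?` marks a missing value. As a minimal sketch (not part of AutoClass; the function name `read_db2` and its defaults are assumptions), the records can be loaded in Python with the standard `csv` module, mapping the unknown token to `None`:

```python
import csv
import io

# Two records copied from the sample above (the first and the tenth).
DATA_SAMPLE = """\
3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,168.80,64.10,48.80,2548,dohc,four,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
0,?,audi,gas,turbo,two,hatchback,4wd,front,99.50,178.20,67.90,52.00,3053,ohc,five,131,mpfi,3.13,3.40,7.00,160,5500,16,22,?
"""


def read_db2(text, separator=",", unknown="?"):
    """Return one list of fields per record, with unknown values as None."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=separator):
        if not record:
            continue  # ignore blank lines
        fields = [field.strip() for field in record]
        rows.append([None if field == unknown else field for field in fields])
    return rows


rows = read_db2(DATA_SAMPLE)
for row in rows:
    missing = [index for index, value in enumerate(row) if value is None]
    print(row[2], len(row), "fields; missing attribute indexes:", missing)
```

The missing indexes line up with the header numbering: attribute 1 is "normalized-loses" and attribute 25 is "price".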
imports-85c.model
model_index 0 4
ignore 0
single_normal_cm 1 18 19 21 22 25
single_normal_cn 9 10 11 12 13 16 20 23 24
single_multinomial default
imports-85c.s-params (abbreviated)
# start_j_list = 2, 3, 5, 7, 10, 15, 25
# min_report_period = 30
# max_duration = 0
# max_n_tries = 0
# n_save = 2
...
imports-85c.log
AUTOCLASS C (version 2.5) STARTING at Mon Jun 26 16:30:39 1995
AUTOCLASS -SEARCH default parameters:
...
WELCOME TO AUTOCLASS.
1) Each time I have finished a new 'trial', or attempt to find a good
classification, I will print the number of classes that trial
started and ended with, such as 9->7.
2) If that trial results in a duplicate of a previous run, I will print
'dup' first.
3) If that trial results in a classification better than any previous,
I will print 'best' first.
4) If more than 30 seconds have passed since the last report, and a new
classification has been found which is better than any previous ones,
I will report on that classification and on the status of the search
so far.
5) This report will include an estimate of the time it will take to find
another even better classification, and how much better that will be.
In addition, I will estimate a lower bound on how long it might take to
find the very best classification, and how much better that might be.
6) If you are warned about too much time in overhead, you may want to
change the parameters n_save, min_save_period, min_report_period, or
min_checkpoint_period.
7) To quit searching, type a 'q', hit <return>, and wait. Otherwise I'll
go on until I complete trial number (12).
8) If needed, every 30 minutes I will save the best 2 classifications
so far to file:
/home/tove/p/autoclass-c/sample/imports-85c.results-bin
and a description of the search to file:
/home/tove/p/autoclass-c/sample/imports-85c.search
9) A record of this search will be printed to file:
/home/tove/p/autoclass-c/sample/imports-85c.log
BEGINNING SEARCH at Mon Jun 26 16:30:40 1995
[j_in=2] [cs-3: cycles 15] best2->2(1)
[j_in=3] [cs-3: cycles 49] best3->3(2)
[j_in=5] [cs-3: cycles 12] best5->5(3)
[j_in=7] [cs-3: cycles 11] best7->7(4)
[j_in=10] [cs-3: cycles 14] best10->10(5)
[j_in=15] [cs-3: cycles 28] 15->15(6)
[j_in=25] [cs-3: cycles 10] 25->22(7)
---------------- NEW BEST CLASSIFICATION FOUND on try 5 -------------
It has 10 CLASSES with WEIGHTS 32 30 28 24 21 21 20 11 10 8
PROBABILITY of both the data and the classification = exp(-16368.367)
(Also found 4 other better than last report.)
----------- SEARCH STATUS as of Mon Jun 26 16:31:12 1995 -----------
It just took 32 seconds since beginning.
Estimate < 28 seconds to find a classification
exp(61.7) [= 6.0e+26] times more probable.
Estimate >> 1 minute 6 seconds to find the very best classification,
which may be exp(28.6) to exp(11764.5) times more probable.
Have seen 7 of the estimated > 21 possible classifications (based on 0
duplicates so far).
Log-Normal fit to classifications probabilities has M(ean) -16598.5,
S(igma) 154.9
Choosing initial n-classes randomly from a log_normal [M-S, M, M+S] =
[2.9, 7.0, 16.9]
Overhead time is 3.0 % of total search time
[j_in=9] [cs-3: cycles 10] 9->9(8)
[j_in=3] [cs-3: cycles 11] 3->3(9)
[j_in=5] [cs-3: cycles 48] 5->5(10)
[j_in=3] [cs-3: cycles 18] 3->3(11)
[j_in=5] [cs-3: cycles 35] 5->5(12)
ENDING SEARCH because max number of tries reached at Mon Jun 26 16:31:32 1995
after a total of 12 tries over 53 seconds
A log of this search is in file:
/home/tove/p/autoclass-c/sample/imports-85c.log
The search results are stored in file:
/home/tove/p/autoclass-c/sample/imports-85c.results-bin
This search can be restarted by having "force_new_search_p = false" in file:
/home/tove/p/autoclass-c/sample/imports-85c.s-params
and reinvoking the "autoclass -search ..." form
------------------ SUMMARY OF 10 BEST RESULTS ------------------
PROBABILITY: exp(-16368.367) N_CLASSES: 10 FOUND ON TRY: 5 *SAVED*
PROBABILITY: exp(-16477.345) N_CLASSES: 9 FOUND ON TRY: 8 *SAVED*
PROBABILITY: exp(-16537.556) N_CLASSES: 15 FOUND ON TRY: 6
PROBABILITY: exp(-16542.413) N_CLASSES: 7 FOUND ON TRY: 4
PROBABILITY: exp(-16590.504) N_CLASSES: 5 FOUND ON TRY: 10
PROBABILITY: exp(-16617.452) N_CLASSES: 5 FOUND ON TRY: 3
PROBABILITY: exp(-16632.595) N_CLASSES: 5 FOUND ON TRY: 12
PROBABILITY: exp(-16673.545) N_CLASSES: 22 FOUND ON TRY: 7
PROBABILITY: exp(-16759.053) N_CLASSES: 3 FOUND ON TRY: 2
PROBABILITY: exp(-16898.385) N_CLASSES: 3 FOUND ON TRY: 9
...
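The probabilities in the summary are reported as exponents because values such as exp(-16368.367) are far too small for double-precision arithmetic, so classifications are best compared by differences of log probabilities. The Python sketch below shows one way to do that for the first three summary lines; the regular expression and the helper names `parse_summary` and `log10_ratio` are assumptions for illustration and are not part of AutoClass.

```python
import math
import re

# First three lines of the summary above.
SUMMARY = """\
PROBABILITY: exp(-16368.367) N_CLASSES: 10 FOUND ON TRY: 5 *SAVED*
PROBABILITY: exp(-16477.345) N_CLASSES: 9 FOUND ON TRY: 8 *SAVED*
PROBABILITY: exp(-16537.556) N_CLASSES: 15 FOUND ON TRY: 6
"""

LINE = re.compile(
    r"exp\((-?\d+(?:\.\d+)?)\)\s+N_CLASSES:\s+(\d+)\s+FOUND ON TRY:\s+(\d+)"
)


def parse_summary(text):
    """Return (log probability, number of classes, try number) per summary line."""
    results = []
    for match in LINE.finditer(text):
        log_prob, n_classes, trial = match.groups()
        results.append((float(log_prob), int(n_classes), int(trial)))
    return results


def log10_ratio(log_p_a, log_p_b):
    """Base-10 exponent of the probability ratio, computed entirely in log space."""
    return (log_p_a - log_p_b) / math.log(10.0)


results = parse_summary(SUMMARY)
best, runner_up = sorted(results, reverse=True)[:2]
gap = best[0] - runner_up[0]  # natural-log difference between the two best
print("best: %d classes (try %d)" % (best[1], best[2]))
print("log ratio over runner-up: %.3f" % gap)
print("about 10^%.1f times more probable" % log10_ratio(best[0], runner_up[0]))
```

Calling `math.exp` on a single log probability of this size simply returns 0.0, which is why the ratio is formed before any exponentiation.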
imports-85c.class-text-1
CROSS REFERENCE: CLASS => CASE NUMBER MEMBERSHIP
AutoClass CLASSIFICATION for the 205 cases in:
/home/centauri/cook/projects/ac/sample/imports-85c.db2
/home/centauri/cook/projects/ac/sample/imports-85c.hd2
with log-A<X/H> (approximate marginal likelihood) = -16564.197
from classification results file:
/home/centauri/cook/projects/ac/sample/imports-85c.results-bin
and using models:
/home/centauri/cook/projects/ac/sample/imports-85c.model - index = 0
CLASS = 0
Case # make num-of-doors body-style (Cls Prob)
--------------------------------------------------------------------------------
5 audi four sedan 0.99
7 audi four sedan 1.00
8 audi four wagon 1.00
9 audi four sedan 1.00
10 audi two hatchback 1.00
15 bmw four sedan 1.00
16 bmw four sedan 1.00
17 bmw two sedan 1.00
18 bmw four sedan 1.00
48 jaguar four sedan 1.00
49 jaguar four sedan 1.00
50 jaguar two sedan 1.00
68 mercedes-benz four sedan 1.00
69 mercedes-benz four wagon 1.00
70 mercedes-benz two hardtop 1.00
71 mercedes-benz four sedan 1.00
...
CLASS = 1
Case # make num-of-doors body-style (Cls Prob)
--------------------------------------------------------------------------------
1 alfa-romero two convertible 1.00
2 alfa-romero two convertible 1.00
3 alfa-romero two hatchback 1.00
11 bmw two sedan 1.00
12 bmw four sedan 1.00
13 bmw two sedan 1.00
14 bmw four sedan 0.99
0 0.01
30 dodge two hatchback 1.00
47 isuzu two hatchback 1.00
56 mazda two hatchback 1.00
57 mazda two hatchback 1.00
58 mazda two hatchback 1.00
59 mazda two hatchback 1.00
66 mazda four sedan 0.99
76 mercury two hatchback 1.00
83 mitsubishi two hatchback 1.00
84 mitsubishi two hatchback 1.00
85 mitsubishi two hatchback 1.00
105 nissan two hatchback
...
CLASS = 2
Case # make num-of-doors body-style (Cls Prob)
--------------------------------------------------------------------------------
19 chevrolet two hatchback 1.00
20 chevrolet two hatchback 1.00
21 chevrolet four sedan 1.00
22 dodge two hatchback 1.00
23 dodge two hatchback 1.00
31 honda two hatchback 1.00
32 honda two hatchback 1.00
33 honda two hatchback 1.00
34 honda two hatchback 1.00
35 honda two hatchback 1.00
36 honda four sedan 1.00
37 honda four wagon 1.00
45 isuzu two sedan 1.00
46 isuzu four sedan 1.00
51 mazda two hatchback 1.00
...
CLASS = 9 (continued)
Case # make num-of-doors body-style (Cls Prob)
--------------------------------------------------------------------------------
81 mitsubishi two hatchback 1.00
88 mitsubishi four sedan 1.00
89 mitsubishi four sedan 1.00
120 plymouth two hatchback 1.00
190 volkswagen two convertible 0.99
imports-85c.case-text-1
CROSS REFERENCE: CASE NUMBER => MOST PROBABLE CLASS
AutoClass CLASSIFICATION for the 205 cases in:
/home/centauri/cook/projects/ac/sample/imports-85c.db2
/home/centauri/cook/projects/ac/sample/imports-85c.hd2
with log-A<X/H> (approximate marginal likelihood) = -16564.197
from classification results file:
/home/centauri/cook/projects/ac/sample/imports-85c.results-bin
and using models:
/home/centauri/cook/projects/ac/sample/imports-85c.model - index = 0
Case # Class Prob Case # Class Prob Case # Class Prob
--------------------------------------------------------------------------------
1 1 1.00 47 1 0.99 93 2 1.00
2 1 1.00 48 0 1.00 94 2 0.99
3 1 1.00 49 0 1.00 95 2 1.00
4 3 0.99 50 0 1.00 96 2 0.99
5 0 0.99 51 2 0.99 97 2 0.99
6 4 0.99 52 2 0.99 98 2 0.99
7 0 1.00 53 2 0.99 99 2 0.99
8 0 1.00 54 2 0.99 100 3 0.99
9 0 1.00 55 2 0.99 101 3 0.99
10 0 0.99 56 1 1.00 102 0 0.99
...
imports-85c.influ-o-text-1
...
CLASSIFICATION HAS 10 POPULATED CLASSES: (max global influence value = 7.063)
We give below a heuristic measure of class strength: the approximate
geometric mean probability for instances belonging to each class,
computed from the class parameters and statistics. This approximates
the contribution made, by any one instance "belonging" to the class,
to the log probability of the data set w.r.t. the classification. It
thus provides a heuristic measure of how strongly each class predicts
"its" instances.
Class Log of class Relative Class Normalized
num strength class strength weight class weight
0 -8.25e+01 1.64e-10 51 0.249
1 -8.01e+01 1.69e-09 39 0.190
2 -6.99e+01 4.89e-05 29 0.141
3 -6.86e+01 1.75e-04 18 0.088
4 -7.25e+01 3.58e-06 16 0.078
5 -6.86e+01 1.68e-04 14 0.068
6 -7.11e+01 1.43e-05 12 0.059
7 -5.99e+01 1.00e+00 9 0.044
8 -6.95e+01 7.20e-05 9 0.044
9 -6.95e+01 6.73e-05 8 0.039
...
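The two derived columns in the class table can be checked by hand. As a sketch under stated assumptions, the Python below recomputes the normalized class weight as class weight divided by the total weight (205, the number of cases), and the relative class strength as the exponential of each log strength minus the largest one. The second formula is inferred from the table rather than taken from AutoClass documentation, and because the logs are printed to only three significant figures it reproduces the printed strengths only approximately.

```python
import math

# (class number, log of class strength, class weight), copied from the table above.
CLASS_TABLE = [
    (0, -82.5, 51), (1, -80.1, 39), (2, -69.9, 29), (3, -68.6, 18), (4, -72.5, 16),
    (5, -68.6, 14), (6, -71.1, 12), (7, -59.9, 9), (8, -69.5, 9), (9, -69.5, 8),
]

total_weight = sum(weight for _, _, weight in CLASS_TABLE)
strongest = max(log_strength for _, log_strength, _ in CLASS_TABLE)

# Normalized weight: share of the cases assigned to each class.
normalized_weight = {cls: weight / total_weight for cls, _, weight in CLASS_TABLE}

# Assumed formula: strength relative to the strongest class.
relative_strength = {
    cls: math.exp(log_strength - strongest) for cls, log_strength, _ in CLASS_TABLE
}

print("total weight:", total_weight)
for cls, _, _ in CLASS_TABLE:
    print(cls, "%.3f" % normalized_weight[cls], "%.2e" % relative_strength[cls])
```

Class 7 is the strongest class, so its relative strength is 1 by construction, matching the 1.00e+00 entry in the table.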
ORDERED LIST OF NORMALIZED ATTRIBUTE INFLUENCE VALUES SUMMED OVER ALL CLASSES:
This gives a rough heuristic measure of relative influence of each
attribute in differentiating the classes from the overall data set.
Note that "influence values" are only computable with respect to the
model terms. When multiple attributes are modeled by a single
dependent term (e.g. multi_normal_cn), the term influence value is
distributed equally over the modeled attributes.
num description I-*k
38: Log compression-ratio 1.000
36: Log curb-weight 0.607
29: Log horse-power 0.604
2: make 0.589
37: Log engine-size 0.582
32: Log wheel-base 0.550
28: Log stroke 0.515
33: Log length 0.496
31: Log price 0.487
34: Log width 0.437
17: fuel-system 0.414
27: Log bore 0.408
26: Log normalized-loses 0.305
35: Log height 0.292
39: Log city-mpg 0.222
7: drive-wheels 0.209
40: Log highway-mpg 0.191
14: engine-type 0.160
6: body-style 0.130
3: fuel-type 0.121
5: num-of-doors 0.106
30: Log peak-rpm 0.106
15: num-of-cylinders 0.089
4: aspiration 0.075
8: engine-location 0.009
0: symboling -----
1: normalized-loses -----
...
Results
Results after removing gene duplication