Consider hypotheses h1 and h2 learned by learners L1 and L2:
- How can we learn a hypothesis h and estimate its accuracy with limited data?
- How well does the observed accuracy of h over a limited sample estimate its accuracy over unseen data?
- If h1 outperforms h2 on the sample, will h1 outperform h2 in general?
- Does the same conclusion hold for L1 and L2?
If
- S contains n examples, drawn independently of h and of each other
Then
- errorS(h) follows a Binomial distribution, with
  - mean errorD(h)
  - standard deviation sqrt( errorD(h) (1 − errorD(h)) / n )
- Approximate this by a Normal distribution with
  - mean errorD(h)
  - standard deviation approximately sqrt( errorS(h) (1 − errorS(h)) / n )
If
- S contains n examples, drawn independently of h and of each other
Then
- With approximately 95% probability, errorS(h) lies in the interval
  errorD(h) ± 1.96 · sqrt( errorD(h) (1 − errorD(h)) / n )
  equivalently, errorD(h) lies in the interval
  errorS(h) ± 1.96 · sqrt( errorD(h) (1 − errorD(h)) / n )
  which is approximately
  errorS(h) ± 1.96 · sqrt( errorS(h) (1 − errorS(h)) / n )
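A minimal sketch of this interval computation in Python; the sample error 0.20 and sample size n = 40 are illustrative assumptions, not values from the text.

import math

def error_confidence_interval(error_s, n, z=1.96):
    """Approximate CI: errorS(h) ± z · sqrt(errorS(h)(1 − errorS(h)) / n)."""
    half_width = z * math.sqrt(error_s * (1.0 - error_s) / n)
    return error_s - half_width, error_s + half_width

low, high = error_confidence_interval(error_s=0.20, n=40)
print(f"95% CI for errorD(h): [{low:.3f}, {high:.3f}]")   # roughly [0.076, 0.324]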
General procedure for deriving confidence intervals:
- 1. Pick the parameter p to estimate
- 2. Choose an estimator
- 3. Determine the probability distribution that governs the estimator:
  errorS(h) is governed by a Binomial distribution, approximated by a Normal distribution when n ≥ 30
- 4. Find the interval (L, U) such that N% of the probability mass falls in the interval
Test h1 on sample S1, test h2 on sample S2:
- 1. Pick the parameter to estimate: d ≡ errorD(h1) − errorD(h2)
- 2. Choose an estimator: d̂ ≡ errorS1(h1) − errorS2(h2)
- 3. Determine the probability distribution that governs the estimator:
  σ_d̂ ≈ sqrt( errorS1(h1)(1 − errorS1(h1)) / n1 + errorS2(h2)(1 − errorS2(h2)) / n2 )
- 4. Find the interval (L, U) such that N% of the probability mass falls in the interval:
  d̂ ± zN · σ_d̂
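A sketch of this interval in Python; the zN values are the standard two-sided table entries, and the sample sizes n1 = n2 = 100 in the example call are illustrative assumptions.

import math

Z_N = {0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}    # standard two-sided zN values

def difference_interval(e1, n1, e2, n2, confidence=0.95):
    d_hat = e1 - e2                                        # estimator for d
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = Z_N[confidence]
    return d_hat - z * sigma, d_hat + z * sigma

print(difference_interval(0.30, 100, 0.20, 100))           # sample errors from the example below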
P(errorD(h1) > errorD(h2)) = ?
- Example:
  - errorS1(h1) = 0.30
  - errorS2(h2) = 0.20
  - so d̂ = errorS1(h1) − errorS2(h2) = 0.10
- P(errorD(h1) > errorD(h2)) = P(d > 0)
  = probability that d̂ does not overestimate d by more than 0.10
  = P(d̂ < d + 0.10)
- Under the Normal approximation, 0.10 here corresponds to zN · σ_d̂ with
  zN = 1.64
  so this (one-sided) probability is approximately 0.95
- I.e., reject the null hypothesis at the 0.05 level of significance
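A sketch of the same one-sided computation in Python; the sample sizes (100 each) are illustrative assumptions, while the observed errors 0.30 and 0.20 come from the example above.

import math
from statistics import NormalDist

def prob_h1_worse(e1, n1, e2, n2):
    """Approximate P(errorD(h1) > errorD(h2)) = P(d > 0), given d_hat = e1 - e2."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    # P(d > 0) ≈ P(d_hat does not overestimate d by more than d_hat) = Phi(d_hat / sigma)
    return NormalDist().cdf(d_hat / sigma)

p = prob_h1_worse(0.30, 100, 0.20, 100)
print(f"P(errorD(h1) > errorD(h2)) ≈ {p:.2f}")   # ~0.95, i.e. z ≈ 1.64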
Paired t test (comparing h1 and h2 on identical test sets):
- 1. Partition the data into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
- 2. For i from 1 to k, do
  δi ← errorTi(h1) − errorTi(h2)
- 3. Return the value δ̄, where
  δ̄ ≡ (1/k) Σi δi
N% confidence interval estimate for d:
  δ̄ ± t(N, k−1) · s_δ̄, where t(N, k−1) is the two-sided N% value of the t distribution with k−1 degrees of freedom, and
  s_δ̄ ≡ sqrt( (1 / (k(k−1))) Σi (δi − δ̄)² )
Note: the δi are approximately Normally distributed.
- Good for comparing two learners, not for multiple pairwise comparisons
- Used to determine the probability of rejecting the null hypothesis (that the learners perform equally)
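A sketch of the paired t computation in Python; the per-partition differences δi and the critical t value (2.776, the two-sided 95% value for k − 1 = 4 degrees of freedom) are illustrative assumptions.

import math

def paired_t_interval(deltas, t_critical):
    k = len(deltas)
    delta_bar = sum(deltas) / k
    s_delta_bar = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return delta_bar, (delta_bar - t_critical * s_delta_bar, delta_bar + t_critical * s_delta_bar)

deltas = [0.05, 0.02, 0.08, 0.04, 0.06]          # delta_i = errorTi(h1) − errorTi(h2) on each test set
mean, interval = paired_t_interval(deltas, t_critical=2.776)
print(mean, interval)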
What we'd like to estimate:
  E_{S⊂D} [ errorD(LA(S)) − errorD(LB(S)) ]
where L(S) is the hypothesis output by learner L using training set S,
i.e., the expected difference in true error between the hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D.
But, given limited data D0, what is a good estimator?
- 1. Partition the data D0 into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
- 2. For i from 1 to k, do:
  use Ti for the test set, and the remaining data for training set Si
  - Si ← {D0 − Ti}
  - hA ← LA(Si)
  - hB ← LB(Si)
  - δi ← errorTi(hA) − errorTi(hB)
- 3. Return the value δ̄, where
  δ̄ ≡ (1/k) Σi δi
- 4. This is an approximation (not really correct, because the training sets are not independent; they overlap)
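A sketch of this k-fold procedure in Python. The learners are represented abstractly as functions from a training set to a hypothesis, and the error helper is a hypothetical stand-in; nothing here is tied to a particular library.

def error(hypothesis, test_set):
    """Fraction of test examples (x, y) the hypothesis misclassifies."""
    return sum(1 for x, y in test_set if hypothesis(x) != y) / len(test_set)

def k_fold_delta_bar(data, k, learner_a, learner_b):
    folds = [data[i::k] for i in range(k)]        # k disjoint test sets T1..Tk
    deltas = []
    for i in range(k):
        test_i = folds[i]
        train_i = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_a, h_b = learner_a(train_i), learner_b(train_i)
        deltas.append(error(h_a, test_i) - error(h_b, test_i))   # delta_i
    return sum(deltas) / k                        # delta_bar estimates the expected difference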
- Useful when comparing a large number of learning systems
- Many pairwise comparisons to make
- Is the set of significance values significant?
- Let j = number of groups
- Let k = number of trials per group
- Increased F leads to decreased P(the group means are equal)
- Degrees of freedom for the numerator = j − 1
- Degrees of freedom for the denominator = j(k − 1)
- Look up the value in an F table
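The quantity described here is a one-way ANOVA F statistic. Below is a sketch in Python for a balanced design (j groups, k trials per group); the error values in the example call are illustrative assumptions.

def f_statistic(groups):
    """One-way ANOVA F statistic; groups is a list of j lists, each with k trial results."""
    j = len(groups)                                # number of groups
    k = len(groups[0])                             # trials per group
    grand_mean = sum(sum(g) for g in groups) / (j * k)
    group_means = [sum(g) / k for g in groups]
    ss_between = k * sum((m - grand_mean) ** 2 for m in group_means)
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (j - 1)              # numerator degrees of freedom = j − 1
    ms_within = ss_within / (j * (k - 1))          # denominator degrees of freedom = j(k − 1)
    return ms_between / ms_within

print(f_statistic([[0.10, 0.12, 0.11], [0.15, 0.14, 0.16], [0.09, 0.10, 0.08]]))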
Method 1: Learn decision tree, convert to rules
Method 2: Sequential covering algorithm (see the sketch after the SEQUENTIAL-COVERING header below):
- 1. Learn one rule with high accuracy, any coverage
- 2. Remove positive examples covered by this rule
- 3. Repeat
SEQUENTIAL-COVERING( )
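A minimal sketch of the loop above in Python. The helpers learn_one_rule, covers, and performance, and the parameter names, are hypothetical stand-ins, since the original argument list and pseudocode body are elided here.

def sequential_covering(target_attribute, attributes, examples, threshold,
                        learn_one_rule, covers, performance):
    learned_rules = []
    remaining = list(examples)
    rule = learn_one_rule(target_attribute, attributes, remaining)
    while performance(rule, remaining) > threshold:
        learned_rules.append(rule)
        # remove the positive examples covered by this rule, then learn the next rule
        remaining = [ex for ex in remaining if not covers(rule, ex)]
        rule = learn_one_rule(target_attribute, attributes, remaining)
    return learned_rules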
LEARN-ONE-RULE:
- 1. May use beam search
- 2. Easily generalizes to multi-valued target functions
- 3. Choose an evaluation function to guide the search (sketched in code below):
  - Entropy (i.e., information gain)
  - Sample accuracy: nc / n, where nc = correct rule predictions, n = all predictions
  - m-estimate: (nc + m·p) / (n + m), where p is the prior probability and m a weight on it
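A sketch of these three evaluation functions in Python; the counts in the example call (18 correct out of 20 covered, prior p = 0.5, m = 2) are illustrative assumptions.

import math

def sample_accuracy(n_c, n):
    return n_c / n

def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

def entropy(class_counts):
    """Entropy of the examples covered by the rule (lower is better)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(sample_accuracy(18, 20), m_estimate(18, 20, p=0.5, m=2), entropy([18, 2]))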
- Sequential or simultaneous covering of data?
- General → specific, or specific → general?
- Generate-and-test, or example-driven?
- Whether and how to post-prune?
- What statistical evaluation function?
Why do that?
- Can learn sets of first-order rules (rules containing variables), such as the recursive CanReach rules in the example below
- General-purpose programming language PROLOG: programs are sets of such rules
[Slattery, 1997]
course(A) ←
  has-word(A, instructor),
  NOT has-word(A, good),
  link-from(A, B),
  has-word(B, assign),
  NOT link-from(B, C)
Train: 31/31, Test: 31/34
FOIL( )
Learning rule: P(x1, x2, ..., xk) ← L1 ∧ ... ∧ Ln
Candidate specializations add a new literal of the form:
- Q(v1, ..., vr), where at least one of the vi in the created literal must already exist as a variable in the rule
- Equal(xj, xk), where xj and xk are variables already present in the rule
- The negation of either of the above forms of literals
FOIL_Gain(L, R) ≡ t · ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )
Where
- L is the candidate literal to add to rule R
- p0 = number of positive bindings of R
- n0 = number of negative bindings of R
- p1 = number of positive bindings of R + L
- n1 = number of negative bindings of R + L
- t is the number of positive bindings of R also covered by R + L
Note
- −log2( p0 / (p0 + n0) ) is the optimal number of bits needed to indicate the class of a positive binding covered by R
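A sketch of the FOIL_Gain computation in Python; the binding counts in the example call are illustrative assumptions.

import math

def foil_gain(p0, n0, p1, n1, t):
    """t · ( log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0)) )."""
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Rule R covers 10 positive and 10 negative bindings; adding literal L leaves
# 8 positive and 2 negative bindings, with all 8 positives also covered by R.
print(foil_gain(p0=10, n0=10, p1=8, n1=2, t=8))   # ≈ 5.42 bits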
Instances:
- pairs of nodes, with the graph described by literals LinkedTo(0,1), LinkedTo(0,8), etc.
Target function:
- CanReach(x, y) true iff there is a directed path from x to y
Hypothesis space:
- Each hypothesis is a set of Horn clauses using the predicates LinkedTo (and CanReach)
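One hypothesis in this space is the usual pair of Horn clauses CanReach(x, y) ← LinkedTo(x, y) and CanReach(x, y) ← LinkedTo(x, z) ∧ CanReach(z, y). Below is a sketch in Python that evaluates these two clauses over a set of LinkedTo facts; only LinkedTo(0, 1) and LinkedTo(0, 8) appear in the text, the remaining facts are illustrative assumptions.

linked_to = {(0, 1), (0, 8), (1, 2), (2, 5)}       # ground LinkedTo literals

def can_reach(x, y, visited=frozenset()):
    if (x, y) in linked_to:                         # clause 1: CanReach(x, y) ← LinkedTo(x, y)
        return True
    return any(can_reach(z, y, visited | {x})       # clause 2: CanReach(x, y) ← LinkedTo(x, z) ∧ CanReach(z, y)
               for (a, z) in linked_to if a == x and z not in visited)

print(can_reach(0, 5))   # True, via 0 → 1 → 2 → 5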