Consider hypotheses h1 and h2 learned by learners L1 and L2:
- How can we learn a hypothesis h and estimate its accuracy with limited data?
- How well does the observed accuracy of h over a limited sample estimate its accuracy over unseen data?
- If h1 outperforms h2 on the sample, will h1 outperform h2 in general?
- Does the same conclusion hold for L1 and L2?
If
- S contains n examples, drawn independently of h and of each other
Then
- errorS(h) follows a Binomial distribution, with
  - mean errorD(h)
  - standard deviation sqrt( errorD(h) (1 − errorD(h)) / n )
- Approximate this by a Normal distribution with
  - mean errorD(h)
  - standard deviation approximately sqrt( errorS(h) (1 − errorS(h)) / n )
If
- S contains n examples, drawn independently of h and of each other
Then
- With approximately 95% probability, errorS(h) lies in the interval
  errorD(h) ± 1.96 · sqrt( errorD(h) (1 − errorD(h)) / n )
  equivalently, errorD(h) lies in the interval
  errorS(h) ± 1.96 · sqrt( errorD(h) (1 − errorD(h)) / n )
  which is approximately
  errorS(h) ± 1.96 · sqrt( errorS(h) (1 − errorS(h)) / n )
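A minimal sketch of this interval computation in Python; the sample error 0.20 and sample size n = 40 are illustrative assumptions, not values from the text.

import math

def error_confidence_interval(error_s, n, z=1.96):
    """Approximate CI: errorS(h) ± z · sqrt(errorS(h)(1 − errorS(h)) / n)."""
    half_width = z * math.sqrt(error_s * (1.0 - error_s) / n)
    return error_s - half_width, error_s + half_width

low, high = error_confidence_interval(error_s=0.20, n=40)
print(f"95% CI for errorD(h): [{low:.3f}, {high:.3f}]")   # roughly [0.076, 0.324]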
General procedure for deriving confidence intervals:
- 1. Pick the parameter p to estimate
- 2. Choose an estimator
- 3. Determine the probability distribution that governs the estimator:
  errorS(h) is governed by a Binomial distribution, approximated by a Normal distribution when n ≥ 30
- 4. Find the interval (L, U) such that N% of the probability mass falls in the interval
Test h1 on sample S1, test h2 on sample S2:
- 1. Pick the parameter to estimate: d ≡ errorD(h1) − errorD(h2)
- 2. Choose an estimator: d̂ ≡ errorS1(h1) − errorS2(h2)
- 3. Determine the probability distribution that governs the estimator:
  σ_d̂ ≈ sqrt( errorS1(h1)(1 − errorS1(h1)) / n1 + errorS2(h2)(1 − errorS2(h2)) / n2 )
- 4. Find the interval (L, U) such that N% of the probability mass falls in the interval:
  d̂ ± zN · σ_d̂
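A sketch of this interval in Python; the zN values are the standard two-sided table entries, and the sample sizes n1 = n2 = 100 in the example call are illustrative assumptions.

import math

Z_N = {0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}    # standard two-sided zN values

def difference_interval(e1, n1, e2, n2, confidence=0.95):
    d_hat = e1 - e2                                        # estimator for d
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = Z_N[confidence]
    return d_hat - z * sigma, d_hat + z * sigma

print(difference_interval(0.30, 100, 0.20, 100))           # sample errors from the example below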
P(errorD(h1) > errorD(h2)) = ?
- Example:
  - errorS1(h1) = 0.30
  - errorS2(h2) = 0.20
  - so d̂ = errorS1(h1) − errorS2(h2) = 0.10
- P(errorD(h1) > errorD(h2)) = P(d > 0)
  = probability that d̂ does not overestimate d by more than 0.10
  = P(d̂ < d + 0.10)
- Under the Normal approximation, 0.10 here corresponds to zN · σ_d̂ with
  zN = 1.64
  so this (one-sided) probability is approximately 0.95
- I.e., reject the null hypothesis at the 0.05 level of significance
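A sketch of the same one-sided computation in Python; the sample sizes (100 each) are illustrative assumptions, while the observed errors 0.30 and 0.20 come from the example above.

import math
from statistics import NormalDist

def prob_h1_worse(e1, n1, e2, n2):
    """Approximate P(errorD(h1) > errorD(h2)) = P(d > 0), given d_hat = e1 - e2."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    # P(d > 0) ≈ P(d_hat does not overestimate d by more than d_hat) = Phi(d_hat / sigma)
    return NormalDist().cdf(d_hat / sigma)

p = prob_h1_worse(0.30, 100, 0.20, 100)
print(f"P(errorD(h1) > errorD(h2)) ≈ {p:.2f}")   # ~0.95, i.e. z ≈ 1.64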
Paired t test (comparing h1 and h2 on identical test sets):
- 1. Partition the data into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
- 2. For i from 1 to k, do
  δi ← errorTi(h1) − errorTi(h2)
- 3. Return the value δ̄, where
  δ̄ ≡ (1/k) Σi δi
N% confidence interval estimate for d:
  δ̄ ± t(N, k−1) · s_δ̄, where t(N, k−1) is the two-sided N% value of the t distribution with k−1 degrees of freedom, and
  s_δ̄ ≡ sqrt( (1 / (k(k−1))) Σi (δi − δ̄)² )
Note: the δi are approximately Normally distributed.
- Good for comparing two learners, not for multiple pairwise comparisons
- Used to determine the probability of rejecting the null hypothesis (that the learners perform equally)
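A sketch of the paired t computation in Python; the per-partition differences δi and the critical t value (2.776, the two-sided 95% value for k − 1 = 4 degrees of freedom) are illustrative assumptions.

import math

def paired_t_interval(deltas, t_critical):
    k = len(deltas)
    delta_bar = sum(deltas) / k
    s_delta_bar = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return delta_bar, (delta_bar - t_critical * s_delta_bar, delta_bar + t_critical * s_delta_bar)

deltas = [0.05, 0.02, 0.08, 0.04, 0.06]          # delta_i = errorTi(h1) − errorTi(h2) on each test set
mean, interval = paired_t_interval(deltas, t_critical=2.776)
print(mean, interval)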
What we'd like to estimate:
  E_{S⊂D} [ errorD(LA(S)) − errorD(LB(S)) ]
where L(S) is the hypothesis output by learner L using training set S,
i.e., the expected difference in true error between the hypotheses output by learners LA and LB, when trained using randomly selected training sets S drawn according to distribution D.
But, given limited data D0, what is a good estimator?
- 1. Partition the data D0 into k disjoint test sets T1, T2, ..., Tk of equal size, where this size is at least 30.
- 2. For i from 1 to k, do:
  use Ti for the test set, and the remaining data for training set Si
  - Si ← {D0 − Ti}
  - hA ← LA(Si)
  - hB ← LB(Si)
  - δi ← errorTi(hA) − errorTi(hB)
- 3. Return the value δ̄, where
  δ̄ ≡ (1/k) Σi δi
- 4. This is an approximation (not really correct, because the training sets are not independent; they overlap)
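A sketch of this k-fold procedure in Python. The learners are represented abstractly as functions from a training set to a hypothesis, and the error helper is a hypothetical stand-in; nothing here is tied to a particular library.

def error(hypothesis, test_set):
    """Fraction of test examples (x, y) the hypothesis misclassifies."""
    return sum(1 for x, y in test_set if hypothesis(x) != y) / len(test_set)

def k_fold_delta_bar(data, k, learner_a, learner_b):
    folds = [data[i::k] for i in range(k)]        # k disjoint test sets T1..Tk
    deltas = []
    for i in range(k):
        test_i = folds[i]
        train_i = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_a, h_b = learner_a(train_i), learner_b(train_i)
        deltas.append(error(h_a, test_i) - error(h_b, test_i))   # delta_i
    return sum(deltas) / k                        # delta_bar estimates the expected difference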
- Useful when comparing a large number of learning systems
- Many pairwise comparisons to make
- Is the set of significance values significant?
- Let j = number of groups
- Let k = number of trials per group
- Increased F leads to decreased P(the group means are equal)
- Degrees of freedom for the numerator = j − 1
- Degrees of freedom for the denominator = j(k − 1)
- Look up the value in an F table
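The quantity described here is a one-way ANOVA F statistic. Below is a sketch in Python for a balanced design (j groups, k trials per group); the error values in the example call are illustrative assumptions.

def f_statistic(groups):
    """One-way ANOVA F statistic; groups is a list of j lists, each with k trial results."""
    j = len(groups)                                # number of groups
    k = len(groups[0])                             # trials per group
    grand_mean = sum(sum(g) for g in groups) / (j * k)
    group_means = [sum(g) / k for g in groups]
    ss_between = k * sum((m - grand_mean) ** 2 for m in group_means)
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (j - 1)              # numerator degrees of freedom = j − 1
    ms_within = ss_within / (j * (k - 1))          # denominator degrees of freedom = j(k − 1)
    return ms_between / ms_within

print(f_statistic([[0.10, 0.12, 0.11], [0.15, 0.14, 0.16], [0.09, 0.10, 0.08]]))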
Method 1: Learn decision tree, convert to rules
Method 2: Sequential covering algorithm (see the sketch after the SEQUENTIAL-COVERING header below):
- 1. Learn one rule with high accuracy, any coverage
- 2. Remove positive examples covered by this rule
- 3. Repeat
SEQUENTIAL-COVERING( )
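A minimal sketch of the loop above in Python. The helpers learn_one_rule, covers, and performance, and the parameter names, are hypothetical stand-ins, since the original argument list and pseudocode body are elided here.

def sequential_covering(target_attribute, attributes, examples, threshold,
                        learn_one_rule, covers, performance):
    learned_rules = []
    remaining = list(examples)
    rule = learn_one_rule(target_attribute, attributes, remaining)
    while performance(rule, remaining) > threshold:
        learned_rules.append(rule)
        # remove the positive examples covered by this rule, then learn the next rule
        remaining = [ex for ex in remaining if not covers(rule, ex)]
        rule = learn_one_rule(target_attribute, attributes, remaining)
    return learned_rules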
LEARN-ONE-RULE:
- 1. May use beam search
- 2. Easily generalizes to multi-valued target functions
- 3. Choose an evaluation function to guide the search (sketched in code below):
  - Entropy (i.e., information gain)
  - Sample accuracy: nc / n, where nc = correct rule predictions, n = all predictions
  - m-estimate: (nc + m·p) / (n + m), where p is the prior probability and m a weight on it
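A sketch of these three evaluation functions in Python; the counts in the example call (18 correct out of 20 covered, prior p = 0.5, m = 2) are illustrative assumptions.

import math

def sample_accuracy(n_c, n):
    return n_c / n

def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

def entropy(class_counts):
    """Entropy of the examples covered by the rule (lower is better)."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

print(sample_accuracy(18, 20), m_estimate(18, 20, p=0.5, m=2), entropy([18, 2]))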
- Sequential or simultaneous covering of data?
- General → specific, or specific → general?
- Generate-and-test, or example-driven?
- Whether and how to post-prune?
- What statistical evaluation function?
Why do that?
- Can learn sets of first-order rules (rules containing variables), such as the recursive CanReach rules in the example below
- General-purpose programming language PROLOG: programs are sets of such rules
[Slattery, 1997]
course(A) ←
  has-word(A, instructor),
  NOT has-word(A, good),
  link-from(A, B),
  has-word(B, assign),
  NOT link-from(B, C)
Train: 31/31, Test: 31/34
FOIL( )
Learning rule: P(x1, x2, ..., xk) ← L1 ∧ ... ∧ Ln
Candidate specializations add a new literal of the form:
- Q(v1, ..., vr), where at least one of the vi in the created literal must already exist as a variable in the rule
- Equal(xj, xk), where xj and xk are variables already present in the rule
- The negation of either of the above forms of literals
FOIL_Gain(L, R) ≡ t · ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )
Where
- L is the candidate literal to add to rule R
- p0 = number of positive bindings of R
- n0 = number of negative bindings of R
- p1 = number of positive bindings of R + L
- n1 = number of negative bindings of R + L
- t is the number of positive bindings of R also covered by R + L
Note
- −log2( p0 / (p0 + n0) ) is the optimal number of bits needed to indicate the class of a positive binding covered by R
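A sketch of the FOIL_Gain computation in Python; the binding counts in the example call are illustrative assumptions.

import math

def foil_gain(p0, n0, p1, n1, t):
    """t · ( log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0)) )."""
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Rule R covers 10 positive and 10 negative bindings; adding literal L leaves
# 8 positive and 2 negative bindings, with all 8 positives also covered by R.
print(foil_gain(p0=10, n0=10, p1=8, n1=2, t=8))   # ≈ 5.42 bits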
Instances:
- pairs of nodes, with the graph described by literals LinkedTo(0,1), LinkedTo(0,8), etc.
Target function:
- CanReach(x, y) true iff there is a directed path from x to y
Hypothesis space:
- Each hypothesis is a set of Horn clauses using the predicates LinkedTo (and CanReach)
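One hypothesis in this space is the usual pair of Horn clauses CanReach(x, y) ← LinkedTo(x, y) and CanReach(x, y) ← LinkedTo(x, z) ∧ CanReach(z, y). Below is a sketch in Python that evaluates these two clauses over a set of LinkedTo facts; only LinkedTo(0, 1) and LinkedTo(0, 8) appear in the text, the remaining facts are illustrative assumptions.

linked_to = {(0, 1), (0, 8), (1, 2), (2, 5)}       # ground LinkedTo literals

def can_reach(x, y, visited=frozenset()):
    if (x, y) in linked_to:                         # clause 1: CanReach(x, y) ← LinkedTo(x, y)
        return True
    return any(can_reach(z, y, visited | {x})       # clause 2: CanReach(x, y) ← LinkedTo(x, z) ∧ CanReach(z, y)
               for (a, z) in linked_to if a == x and z not in visited)

print(can_reach(0, 5))   # True, via 0 → 1 → 2 → 5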