# Bayesian Learning

Provides practical learning algorithms:

• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

Provides useful conceptual framework

• Provides a "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor

# Bayes Theorem

 P(h|D) = P(D|h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = posterior probability of h given D
• P(D|h) = probability of D given h

# Choosing Hypotheses

Generally want the most probable hypothesis given the training data

Maximum a posteriori (MAP) hypothesis hMAP:

 hMAP = argmax_{h∈H} P(h|D)
      = argmax_{h∈H} P(D|h) P(h) / P(D)
      = argmax_{h∈H} P(D|h) P(h)

If we assume P(hi) = P(hj) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:

 hML = argmax_{hi∈H} P(D|hi)

# Bayes Theorem

Does patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

 P(cancer) = .008          P(¬cancer) = .992
 P(+|cancer) = .98         P(-|cancer) = .02
 P(+|¬cancer) = .03        P(-|¬cancer) = .97
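The MAP decision for a positive test can be sketched in a few lines of Python, using this example's figures (.98 true-positive rate, .97 true-negative rate, .008 prior):

```python
# MAP computation for the cancer test example.
p_cancer = 0.008
p_not = 1 - p_cancer                 # 0.992
p_pos_given_cancer = 0.98
p_pos_given_not = 1 - 0.97           # false-positive rate, 0.03

# Unnormalized posteriors P(+|h) P(h) for each hypothesis
map_cancer = p_pos_given_cancer * p_cancer   # 0.00784
map_not = p_pos_given_not * p_not            # 0.02976

h_map = "cancer" if map_cancer > map_not else "no cancer"
# Normalizing gives the true posterior P(cancer | +)
p_cancer_given_pos = map_cancer / (map_cancer + map_not)  # about 0.21
```

Even with a positive test, hMAP is "no cancer": the disease is so rare that false positives dominate.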

# Basic Formulas for Probabilities

• Product rule: probability of a conjunction of two events A and B:

 P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

• Sum rule: probability of a disjunction of two events A and B:

 P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

• Theorem of total probability: if events A1, ..., An are mutually exclusive with Σ_{i=1}^{n} P(Ai) = 1, then

 P(B) = Σ_{i=1}^{n} P(B|Ai) P(Ai)
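The three rules can be checked numerically on a toy joint distribution (the probabilities below are made up for illustration):

```python
# Toy joint distribution over two binary events A and B.
p = {(True, True): 0.2, (True, False): 0.3,
     (False, True): 0.1, (False, False): 0.4}

p_a = sum(v for (a, b), v in p.items() if a)   # P(A) = 0.5
p_b = sum(v for (a, b), v in p.items() if b)   # P(B) = 0.3
p_ab = p[(True, True)]                          # P(A and B)

# Product rule: P(A and B) = P(A|B) P(B)
p_a_given_b = p_ab / p_b
assert abs(p_ab - p_a_given_b * p_b) < 1e-9

# Sum rule: P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = sum(v for (a, b), v in p.items() if a or b)
assert abs(p_a_or_b - (p_a + p_b - p_ab)) < 1e-9

# Total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b_total = (p[(True, True)] / p_a) * p_a + (p[(False, True)] / (1 - p_a)) * (1 - p_a)
assert abs(p_b - p_b_total) < 1e-9
```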

# Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability

 P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability

 hMAP = argmax_{h∈H} P(h|D)
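The two steps above can be sketched directly; the coin-bias hypothesis space and data below are hypothetical, not from the text:

```python
# Brute-force MAP learner: score every hypothesis in H.
def map_hypothesis(hypotheses, prior, likelihood, data):
    """Return argmax_h P(D|h) P(h); P(D) is constant and can be dropped."""
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))

# Hypothetical example: which coin bias best explains 8 heads in 10 flips?
hypotheses = [0.3, 0.5, 0.8]                   # candidate values of P(heads)
prior = lambda h: 1 / 3                         # uniform prior over H
likelihood = lambda d, h: h ** d["heads"] * (1 - h) ** d["tails"]
best = map_hypothesis(hypotheses, prior, likelihood, {"heads": 8, "tails": 2})
```

With a uniform prior this reduces to the ML hypothesis, here 0.8.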

# Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., hMAP)

Given new instance x, what is its most probable classification?

• hMAP(x) is not the most probable classification!

Consider:

• Three possible hypotheses: P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
• Given new instance x, h1(x) = +, h2(x) = -, h3(x) = -
• What's the most probable classification of x?

# Bayes Optimal Classifier

Bayes optimal classification:

 argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)

Example:

 P(h1|D) = .4, P(-|h1) = 0, P(+|h1) = 1
 P(h2|D) = .3, P(-|h2) = 1, P(+|h2) = 0
 P(h3|D) = .3, P(-|h3) = 1, P(+|h3) = 0

therefore

 Σ_{hi∈H} P(+|hi) P(hi|D) = 0.4
 Σ_{hi∈H} P(-|hi) P(hi|D) = 0.6

and

 argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D) = -
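This weighted vote over hypotheses is short to code; the sketch below reproduces the three-hypothesis example:

```python
# Bayes optimal classifier: weight each hypothesis's prediction
# by its posterior P(h|D), then pick the class with the largest sum.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_v_given_h = {"h1": {"+": 1.0, "-": 0.0},
               "h2": {"+": 0.0, "-": 1.0},
               "h3": {"+": 0.0, "-": 1.0}}

def bayes_optimal(values, posteriors, p_v_given_h):
    score = lambda v: sum(p_v_given_h[h][v] * p_h
                          for h, p_h in posteriors.items())
    return max(values, key=score)

label = bayes_optimal(["+", "-"], posteriors, p_v_given_h)
```

Even though hMAP = h1 predicts +, the classifier returns -, which gets total weight .6.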

# Naive Bayes Classifier

Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.

When to use

• Moderate or large training set available
• Attributes that describe instances are conditionally independent given classification

Successful applications:

• Diagnosis
• Classifying text documents

# Naive Bayes Classifier

Assume target function f: X → V, where each instance x is described by attributes ⟨a1, a2, ..., an⟩.

Most probable value of f(x) is:

 vMAP = argmax_{vj∈V} P(vj | a1, a2, ..., an)
      = argmax_{vj∈V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
      = argmax_{vj∈V} P(a1, a2, ..., an | vj) P(vj)

Naive Bayes assumption:

 P(a1, a2, ..., an | vj) = Π_i P(ai | vj)

which gives the Naive Bayes classifier:

 vNB = argmax_{vj∈V} P(vj) Π_i P(ai | vj)

# Naive Bayes Algorithm

Naive_Bayes_Learn(examples)
  For each target value vj
    estimate P(vj)
    For each attribute value ai of each attribute a
      estimate P(ai|vj)

Classify_New_Instance(x)
  vNB = argmax_{vj∈V} P(vj) Π_{ai∈x} P(ai|vj)
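A minimal Python sketch of the two procedures, using raw frequency estimates and positional attributes (function names and the toy data are mine, not from the text):

```python
from collections import Counter, defaultdict
from math import prod

def naive_bayes_learn(examples):
    """examples: list of (attribute_tuple, target_value) pairs."""
    class_counts = Counter(v for _, v in examples)
    attr_counts = defaultdict(Counter)        # per class: (position, value) -> count
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            attr_counts[v][(i, a)] += 1
    n = len(examples)
    p_v = {v: c / n for v, c in class_counts.items()}
    p_a_given_v = {v: {k: c / class_counts[v] for k, c in cnts.items()}
                   for v, cnts in attr_counts.items()}
    return p_v, p_a_given_v

def classify_new_instance(x, p_v, p_a_given_v):
    # Unseen (position, value) pairs get probability 0 (no smoothing here).
    score = lambda v: p_v[v] * prod(p_a_given_v[v].get((i, a), 0.0)
                                    for i, a in enumerate(x))
    return max(p_v, key=score)

examples = [(("Sunny", "Hot"), "No"), (("Overcast", "Hot"), "Yes"),
            (("Rain", "Mild"), "Yes"), (("Sunny", "Mild"), "No")]
p_v, p_a = naive_bayes_learn(examples)
label = classify_new_instance(("Sunny", "Hot"), p_v, p_a)
```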

# Naive Bayes: Example

| Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis |
|-----|----------|-------------|----------|--------|------------|
| D1  | Sunny    | Hot         | High     | Weak   | No         |
| D2  | Sunny    | Hot         | High     | Strong | No         |
| D3  | Overcast | Hot         | High     | Weak   | Yes        |
| D4  | Rain     | Mild        | High     | Weak   | Yes        |
| D5  | Rain     | Cool        | Normal   | Weak   | Yes        |
| D6  | Rain     | Cool        | Normal   | Strong | No         |
| D7  | Overcast | Cool        | Normal   | Strong | Yes        |
| D8  | Sunny    | Mild        | High     | Weak   | No         |
| D9  | Sunny    | Cool        | Normal   | Weak   | Yes        |
| D10 | Rain     | Mild        | Normal   | Weak   | Yes        |
| D11 | Sunny    | Mild        | Normal   | Strong | Yes        |
| D12 | Overcast | Mild        | High     | Strong | Yes        |
| D13 | Overcast | Hot         | Normal   | Weak   | Yes        |
| D14 | Rain     | Mild        | High     | Strong | No         |

Consider PlayTennis again, and new instance

 ⟨Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong⟩

Want to compute:

 vNB = argmax_{vj∈V} P(vj) Π_i P(ai|vj)
# Naive Bayes: Subtleties

1. The conditional independence assumption is often violated

• ...but it works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(vj|x) to be correct; we need only that

 argmax_{vj∈V} P̂(vj) Π_i P̂(ai|vj) = argmax_{vj∈V} P(vj) P(a1, ..., an | vj)

• see [Domingos & Pazzani, 1996] for analysis
• Naive Bayes posteriors are often unrealistically close to 1 or 0

# Naive Bayes: Subtleties

2. What if none of the training instances with target value vj have attribute value ai? Then

 P̂(ai|vj) = 0, and so P̂(vj) Π_i P̂(ai|vj) = 0

Typical solution is a Bayesian estimate (the m-estimate) for P̂(ai|vj):

 P̂(ai|vj) = (nc + m·p) / (n + m)

where

• n is the number of training examples for which v = vj
• nc is the number of examples for which v = vj and a = ai
• p is a prior estimate for P̂(ai|vj)
• m is the weight given to the prior (i.e., the number of "virtual" examples)
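The m-estimate blends the observed frequency with the prior; a minimal sketch (the numbers in the example are made up):

```python
# m-estimate: behave as if we had seen m extra "virtual" examples
# distributed according to the prior p.
def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

# An attribute value never observed with this class (n_c = 0),
# with a uniform prior p = 1/3 over three attribute values and m = 3:
# the estimate becomes 1/13 instead of 0, so the product never collapses.
est = m_estimate(0, 10, 1/3, 3)
```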

# Learning to Classify Text

Why?

• Learn which news articles are of interest
• Learn to classify web pages by topic

Naive Bayes is among the most effective algorithms

What attributes shall we use to represent text documents?

# Learning to Classify Text

Target concept Interesting?: Document → {+, -}

1. Represent each document by a vector of words
• one attribute per word position in the document

2. Learning: use training examples to estimate
• P(+)
• P(-)
• P(doc|+)
• P(doc|-)

Naive Bayes conditional independence assumption:

 P(doc|vj) = Π_{i=1}^{length(doc)} P(ai = wk | vj)

where P(ai = wk | vj) is the probability that the word in position i is wk, given vj

one more assumption (word position does not matter):

 P(ai = wk | vj) = P(am = wk | vj), ∀ i, m
# Pseudocode

LEARN_NAIVE_BAYES_TEXT(Examples, V)

1. Collect all words and other tokens that occur in Examples

• Vocabulary ← all distinct words and other tokens in Examples

2. Calculate the required P(vj) and P(wk|vj) probability terms

For each target value vj in V do
• docsj ← subset of Examples for which the target value is vj
• P(vj) ← |docsj| / |Examples|
• Textj ← a single document created by concatenating all members of docsj
• n ← total number of words in Textj (counting duplicate words multiple times)
• for each word wk in Vocabulary
  • nk ← number of times word wk occurs in Textj
  • P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)

CLASSIFY_NAIVE_BAYES_TEXT(Doc)

• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return vNB, where

 vNB = argmax_{vj∈V} P(vj) Π_{i∈positions} P(ai|vj)
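A compact Python rendering of the two procedures (names and the two-document toy corpus are mine; documents are whitespace-tokenized strings):

```python
from collections import Counter
from math import log

def learn(examples):
    """examples: list of (document_string, target_value) pairs."""
    vocab = {w for doc, _ in examples for w in doc.split()}
    priors, word_probs = {}, {}
    for v in {t for _, t in examples}:
        docs = [doc for doc, t in examples if t == v]
        priors[v] = len(docs) / len(examples)
        counts = Counter(w for doc in docs for w in doc.split())
        n = sum(counts.values())
        # add-one (Laplace) smoothing, as in the pseudocode above
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, priors, word_probs

def classify(doc, vocab, priors, word_probs):
    words = [w for w in doc.split() if w in vocab]   # keep only known tokens
    # Sum of logs avoids floating-point underflow on long documents.
    score = lambda v: log(priors[v]) + sum(log(word_probs[v][w]) for w in words)
    return max(priors, key=score)

vocab, priors, wp = learn([("puck goalie hockey", "rec.sport.hockey"),
                           ("engine car drive", "rec.autos")])
label = classify("hockey goalie", vocab, priors, wp)
```

Working in log space is the standard trick here: a product of hundreds of small probabilities underflows to 0 long before the argmax is taken.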

# Twenty NewsGroups

Given 1000 training documents from each group, learn to classify new documents according to which newsgroup they came from:

• comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
• misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
• alt.atheism, soc.religion.christian, talk.religion.misc
• sci.space, sci.crypt, sci.electronics, sci.med
• talk.politics.mideast, talk.politics.misc, talk.politics.guns

Naive Bayes: 89% classification accuracy

# Article from rec.sport.hockey

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year's biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most
obvious candidate for pleasant surprise is Alex
Zhitnik. He came highly touted as a defensive
defenseman, but he's clearly much more than that.
Great skater and hard shot (though wish he were
more accurate). In fact, he pretty much allowed
the Kings to trade away that huge defensive
liability Paul Coffey. Kelly Hrudey is only the
biggest disappointment if you thought he was any
good to begin with. But, at best, he's only a
mediocre goaltender. A better choice would be
Tomas Sandstrom, though not through any fault of
his own, but because some thugs in Toronto decided


# Learning Curve for 20 Newsgroups

Accuracy vs. Training set size (1/3 withheld for test)

# Bayesian Belief Networks

Interesting because:

• The Naive Bayes assumption of conditional independence is too restrictive
• But it's intractable without some such assumptions...
• Bayesian belief networks describe conditional independence among subsets of variables
• This allows combining prior knowledge about (in)dependencies among variables with observed training data

(also called Bayes Nets)

# Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

 (∀ x, y, z) P(X = x | Y = y, Z = z) = P(X = x | Z = z)

more compactly, we write

P(X | Y,Z) = P(X | Z)

Example: Thunder is conditionally independent of Rain, given Lightning

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses cond. indep. to justify

 P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)
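The Thunder/Rain/Lightning example can be checked numerically: build a joint distribution that factors as P(L) P(T|L) P(R|L) and verify that conditioning on Rain changes nothing (all probabilities below are made up for illustration):

```python
# Toy check that Thunder is independent of Rain given Lightning
# when the joint factors as P(L) P(T|L) P(R|L).
p_l = {True: 0.1, False: 0.9}
p_t_given_l = {True: 0.8, False: 0.05}   # P(Thunder=True | L)
p_r_given_l = {True: 0.7, False: 0.2}    # P(Rain=True | L)

def joint(t, r, l):
    pt = p_t_given_l[l] if t else 1 - p_t_given_l[l]
    pr = p_r_given_l[l] if r else 1 - p_r_given_l[l]
    return p_l[l] * pt * pr

# P(Thunder=True | Rain=True, Lightning=True) ...
num = joint(True, True, True)
p_t_given_rl = num / (num + joint(False, True, True))
# ... equals P(Thunder=True | Lightning=True)
assert abs(p_t_given_rl - p_t_given_l[True]) < 1e-9
```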

# Bayesian Belief Network

Network represents a set of conditional independence assertions:

• Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
• Directed acyclic graph

# Bayesian Belief Networks

A belief network represents the dependence between variables.

• nodes