Provides practical learning algorithms:
* Naive Bayes learning
* Bayesian belief network learning
* Combine prior knowledge (prior probabilities) with observed data
Provides useful conceptual framework
Generally want the most probable hypothesis given the training data
Maximum a posteriori hypothesis $h_{MAP}$:
$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)$
If we assume $P(h_i) = P(h_j)$ for all $i, j$, we can simplify further and choose the maximum likelihood (ML) hypothesis
$h_{ML} = \arg\max_{h_i \in H} P(D \mid h_i)$
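A minimal sketch of picking $h_{MAP}$ and $h_{ML}$ by brute force over a small hypothesis space (the priors and likelihoods below are made-up illustrative numbers):

# Brute-force MAP and ML hypothesis selection over a tiny hypothesis space.
# The priors P(h) and likelihoods P(D|h) are illustrative placeholders.
priors = {"h1": 0.7, "h2": 0.3}        # P(h)
likelihoods = {"h1": 0.5, "h2": 0.9}   # P(D|h) for the observed data D

h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])  # argmax P(D|h) P(h)
h_ml = max(priors, key=lambda h: likelihoods[h])               # argmax P(D|h)
print(h_map, h_ml)  # h1 h2 -- with a strong enough prior, MAP and ML can disagree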
Does patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) = .008          P(¬cancer) = .992
P(+ | cancer) = .98       P(- | cancer) = .02
P(+ | ¬cancer) = .03      P(- | ¬cancer) = .97
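Applying $h_{MAP}$ to a positive test result, using the probabilities above:
P(+ | cancer) P(cancer) = .98 × .008 = .0078
P(+ | ¬cancer) P(¬cancer) = .03 × .992 = .0298
So $h_{MAP}$ = ¬cancer; the posterior probability of cancer given the positive result is .0078 / (.0078 + .0298) ≈ .21.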
So far we've sought the most probable hypothesis given the data D (i.e., $h_{MAP}$)
Given new instance x, what is its most probable classification?
Consider three possible hypotheses with posteriors P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3. Given a new instance x, suppose h1(x) = +, h2(x) = -, h3(x) = -. What is the most probable classification of x? ($h_{MAP}$ = h1 says +, but the hypotheses that say - carry combined posterior weight .6.)
Bayes optimal classification:
$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)$
Example:
P(h1|D) = .4,   P(-|h1) = 0,   P(+|h1) = 1
P(h2|D) = .3,   P(-|h2) = 1,   P(+|h2) = 0
P(h3|D) = .3,   P(-|h3) = 1,   P(+|h3) = 0
$\sum_{h_i \in H} P(+ \mid h_i)\, P(h_i \mid D) = .4$
$\sum_{h_i \in H} P(- \mid h_i)\, P(h_i \mid D) = .6$
and therefore
$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D) = -$
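The same computation as a short Python sketch, using the posteriors and per-hypothesis predictions from the example above:

# Bayes optimal classification for the three-hypothesis example.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}  # label each hypothesis assigns to x

# For each label v, sum P(v|h) P(h|D); here P(v|h) is 1 if h predicts v, else 0.
scores = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
          for v in ("+", "-")}
print(max(scores, key=scores.get), scores)   # '-' {'+': 0.4, '-': 0.6}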
Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.
When to use:
* Moderate or large training set available
* Attributes that describe instances are conditionally independent given the classification
Successful applications:
* Diagnosis
* Classifying text documents
Assume target function $f: X \to V$, where each instance x is described by attributes $\langle a_1, a_2, \ldots, a_n \rangle$.
Most probable value of f(x) is:
$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$
$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$
Naive Bayes assumption:
$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$
which gives the Naive Bayes classifier:
$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$
Naive_Bayes_Learn(examples)
  For each target value vj: estimate P̂(vj); for each value ai of each attribute a, estimate P̂(ai | vj)
Classify_New_Instance(x)
  Return vNB = argmax_{vj ∈ V} P̂(vj) ∏_{ai ∈ x} P̂(ai | vj)
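A minimal Python sketch of these two procedures, using simple relative-frequency estimates and no smoothing; the function and variable names are mine, not from the text:

from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_tuple, target_value) pairs.
    Returns estimates of P(v) and P(a_i = value | v)."""
    class_counts = Counter(v for _, v in examples)
    p_v = {v: n / len(examples) for v, n in class_counts.items()}
    value_counts = defaultdict(Counter)        # value_counts[(i, v)][a] = count
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            value_counts[(i, v)][a] += 1
    p_a_given_v = {(i, a, v): c / class_counts[v]
                   for (i, v), counter in value_counts.items()
                   for a, c in counter.items()}
    return p_v, p_a_given_v

def classify_new_instance(x, p_v, p_a_given_v):
    """Return v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    def score(v):
        s = p_v[v]
        for i, a in enumerate(x):
            s *= p_a_given_v.get((i, a, v), 0.0)   # unseen value -> estimate of 0
        return s
    return max(p_v, key=score)

Run on the PlayTennis table below, this classifies the new instance ⟨Sunny, Cool, High, Strong⟩ as No, consistent with the worked computation after the table.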
Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No |
D3 | Overcast | Hot | High | Weak | Yes |
D4 | Rain | Mild | High | Weak | Yes |
D5 | Rain | Cool | Normal | Weak | Yes |
D6 | Rain | Cool | Normal | Strong | No |
D7 | Overcast | Cool | Normal | Strong | Yes |
D8 | Sunny | Mild | High | Weak | No |
D9 | Sunny | Cool | Normal | Weak | Yes |
D10 | Rain | Mild | Normal | Weak | Yes |
D11 | Sunny | Mild | Normal | Strong | Yes |
D12 | Overcast | Mild | High | Strong | Yes |
D13 | Overcast | Hot | Normal | Weak | Yes |
D14 | Rain | Mild | High | Strong | No |
Consider PlayTennis again, and new instance ⟨Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong⟩
Want to compute:
$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$
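Reading the required estimates off the table above and multiplying:
P(Yes) = 9/14, P(No) = 5/14
P(Sunny|Yes) = 2/9, P(Sunny|No) = 3/5
P(Cool|Yes) = 3/9, P(Cool|No) = 1/5
P(High|Yes) = 3/9, P(High|No) = 4/5
P(Strong|Yes) = 3/9, P(Strong|No) = 3/5
so
P(Yes) P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes) ≈ .005
P(No) P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No) ≈ .021
→ vNB = No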
If none of the training examples with target value vj have attribute value ai, the estimate P̂(ai|vj) = 0 drives the whole product $P(v_j) \prod_i \hat{P}(a_i \mid v_j)$ to zero. Typical solution is a Bayesian estimate for $\hat{P}(a_i \mid v_j)$:
$\hat{P}(a_i \mid v_j) \leftarrow \frac{n_c + m p}{n + m}$
where n is the number of training examples for which v = vj, nc is the number of those for which a = ai, p is a prior estimate for P̂(ai|vj), and m is the weight given to the prior (a number of "virtual" examples).
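A one-function sketch of this estimate (the names and the example numbers at the end are mine):

def m_estimate(n_c, n, p, m):
    """Bayesian (m-)estimate of P(a_i | v_j):
    n   : number of training examples with target value v_j
    n_c : number of those that also have attribute value a_i
    p   : prior estimate of P(a_i | v_j)
    m   : weight given to the prior (number of "virtual" examples)."""
    return (n_c + m * p) / (n + m)

# e.g. Wind = Strong among the 5 PlayTennis = No days: n_c = 3, n = 5;
# with uniform prior p = 0.5 and m = 1 this gives (3 + 0.5) / 6 ≈ .58
# instead of the raw relative frequency 3/5 = .6.
print(m_estimate(3, 5, 0.5, 1))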
Why learn to classify text? For example:
* Learn which news articles are of interest
* Learn to classify web pages by topic
Naive Bayes is among the most effective algorithms for this task.
What attributes shall we use to represent text documents?
Target concept Interesting?: Document → {+, -}
Represent each document by a vector of words: one attribute per word position in the document.
Naive Bayes conditional independence assumption:
$P(doc \mid v_j) = \prod_{i=1}^{\text{length}(doc)} P(a_i = w_k \mid v_j)$
where $P(a_i = w_k \mid v_j)$ is the probability that the word in position i is $w_k$, given $v_j$
one more assumption: word position does not matter, i.e. $P(a_i = w_k \mid v_j) = P(a_m = w_k \mid v_j)$ for all i, m
LEARN_NAIVE_BAYES_TEXT(Examples, V)
CLASSIFY_NAIVE_BAYES_TEXT(Doc)
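A condensed Python sketch of both procedures under the assumptions above (bag-of-words positions, add-one smoothing over the vocabulary); the function names mirror the pseudocode, but the implementation details are mine:

from collections import Counter
from math import log

def learn_naive_bayes_text(examples, vocabulary):
    """examples: list of (list_of_words, target_value) pairs."""
    docs_per_class = Counter(v for _, v in examples)
    p_v = {v: n / len(examples) for v, n in docs_per_class.items()}
    p_w_given_v = {}
    for v in p_v:
        # Text_v: all word positions in documents of class v, restricted to the vocabulary.
        counts = Counter(w for words, t in examples if t == v
                           for w in words if w in vocabulary)
        n = sum(counts.values())
        for w in vocabulary:
            p_w_given_v[(w, v)] = (counts[w] + 1) / (n + len(vocabulary))  # add-one smoothing
    return p_v, p_w_given_v

def classify_naive_bayes_text(doc, p_v, p_w_given_v, vocabulary):
    """Return argmax_v P(v) * prod over word positions of P(w_i | v),
    using log probabilities to avoid underflow; unknown words are skipped."""
    def log_score(v):
        return log(p_v[v]) + sum(log(p_w_given_v[(w, v)]) for w in doc if w in vocabulary)
    return max(p_v, key=log_score)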
Given 1000 training documents from each group
Learn to classify new documents according to which newsgroup each came from
comp.graphics | misc.forsale |
comp.os.ms-windows.misc | rec.autos |
comp.sys.ibm.pc.hardware | rec.motorcycles |
comp.sys.mac.hardware | rec.sport.baseball |
comp.windows.x | rec.sport.hockey |
alt.atheism | sci.space |
soc.religion.christian | sci.crypt |
talk.religion.misc | sci.electronics |
talk.politics.mideast | sci.med |
talk.politics.misc | |
talk.politics.guns |
Naive Bayes: 89% classification accuracy
Example article from rec.sport.hockey:
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: xxx@yyy.zzz.edu (John Doe)
Subject: Re: This year's biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT
I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he's clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he's only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided
Bayesian belief networks (also called Bayes nets) are interesting because the Naive Bayes assumption of conditional independence is too restrictive, yet some such assumption is needed for tractability; belief networks instead describe conditional independence among subsets of variables, combining prior knowledge about (in)dependencies among variables with observed training data.
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
$(\forall x_i, y_j, z_k)\; P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$
more compactly, we write
P(X | Y,Z) = P(X | Z)
Example: Thunder is conditionally independent of Rain, given Lightning:
P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Naive Bayes uses conditional independence to justify
P(X,Y | Z) = P(X | Y,Z) P(Y | Z)
           = P(X | Z) P(Y | Z)
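A quick numerical check on a tiny made-up joint distribution that is constructed so X and Y are conditionally independent given Z (all numbers are illustrative):

from itertools import product

p_z = {0: 0.6, 1: 0.4}
p_x_given_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}   # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}   # p_y_given_z[z][y]

# Joint distribution built from the factorization P(X,Y,Z) = P(X|Z) P(Y|Z) P(Z),
# so it satisfies P(X,Y|Z) = P(X|Z) P(Y|Z) by construction.
joint = {(x, y, z): p_x_given_z[z][x] * p_y_given_z[z][y] * p_z[z]
         for x, y, z in product((0, 1), repeat=3)}

# Verify the definition above: P(X | Y, Z) = P(X | Z) for every assignment.
for (x, y, z), p_xyz in joint.items():
    p_yz = sum(joint[(xx, y, z)] for xx in (0, 1))
    assert abs(p_xyz / p_yz - p_x_given_z[z][x]) < 1e-12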
Network represents a set of conditional independence assertions: each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors. (The network is a directed acyclic graph.)
A belief network represents the dependence between variables.
* nodes: one node per random variable
* links: directed links represent direct dependence between variables
* conditional probability tables: each node stores P(node | its parents)
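A minimal sketch of these three ingredients for a three-node chain Rain → Lightning → Thunder, echoing the earlier conditional-independence example; the CPT numbers are made up:

# Nodes: one Boolean variable each.  Links: Rain -> Lightning -> Thunder.
# Conditional probability tables: one per node, conditioned on its parents.
P_rain = {True: 0.2, False: 0.8}
P_lightning_given_rain = {True: {True: 0.6, False: 0.4},
                          False: {True: 0.05, False: 0.95}}    # [rain][lightning]
P_thunder_given_lightning = {True: {True: 0.95, False: 0.05},
                             False: {True: 0.01, False: 0.99}}  # [lightning][thunder]

def joint(rain, lightning, thunder):
    """P(Rain, Lightning, Thunder) = product over nodes of P(node | parents(node))."""
    return (P_rain[rain]
            * P_lightning_given_rain[rain][lightning]
            * P_thunder_given_lightning[lightning][thunder])

print(joint(True, True, True))   # 0.2 * 0.6 * 0.95 = 0.114

# Because Thunder's only parent is Lightning, this network encodes
# P(Thunder | Rain, Lightning) = P(Thunder | Lightning), as in the example above.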