Bayesian Learning

Bayesian methods provide practical learning algorithms (such as naive Bayes and Bayesian belief networks) and a useful conceptual framework for analyzing other learning methods.

Bayes Theorem

\begin{displaymath}P(h\vert D) = \frac{P(D\vert h) P(h)}{P(D)} \end{displaymath}

Choosing Hypotheses

\begin{displaymath}P(h\vert D) = \frac{P(D\vert h) P(h)}{P(D)} \end{displaymath}

Generally want the most probable hypothesis given the training data

Maximum a posteriori (MAP) hypothesis $h_{MAP}$:

\begin{eqnarray*}
h_{MAP} & = & \arg \max_{h \in H} P(h\vert D) \\
 & = & \arg \max_{h \in H} \frac{P(D\vert h) P(h)}{P(D)} \\
 & = & \arg \max_{h \in H} P(D\vert h) P(h)
\end{eqnarray*}

If we assume uniform priors, $P(h_{i}) = P(h_{j})$ for all $i, j$, we can simplify further and choose the maximum likelihood (ML) hypothesis:

\begin{displaymath}h_{ML} = \arg \max_{h_{i} \in H} P(D\vert h_{i}) \end{displaymath}

Bayes Theorem

Does patient have cancer or not?

A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only $98\%$ of the cases in which the disease is actually present, and a correct negative result in only $97\%$ of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

P(cancer) = .008
P($\neg$ cancer) = .992
P(+ $\mid$ cancer) = .98
P(- $\mid$ cancer) = .02
P(+ $\mid$ $\neg$ cancer) = .03
P(- $\mid$ $\neg$ cancer) = .97
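The cancer diagnosis can be carried through numerically. A minimal sketch (variable names are mine, probabilities are the ones stated above) comparing $P(+\vert h)P(h)$ for the two hypotheses after a positive test:

```python
# MAP reasoning for the cancer example: compare P(+|h) P(h)
# for h = cancer and h = ~cancer; P(+) cancels in the argmax.
p_cancer = 0.008
p_not_cancer = 1 - p_cancer            # 0.992
p_pos_given_cancer = 0.98
p_pos_given_not_cancer = 0.03          # 1 - 0.97

# Unnormalized posteriors after a positive test result:
score_cancer = p_pos_given_cancer * p_cancer              # 0.00784
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # 0.02976

# h_MAP = ~cancer, even though the test came back positive
assert score_not_cancer > score_cancer

# Normalized posterior P(cancer | +) is only about 0.21
p_cancer_given_pos = score_cancer / (score_cancer + score_not_cancer)
```

The perhaps surprising conclusion: because the disease is so rare, the MAP hypothesis after a positive test is still $\neg$cancer.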

Basic Formulas for Probabilities

  • Product rule: $P(A \wedge B) = P(A \vert B) P(B) = P(B \vert A) P(A)$

  • Sum rule: $P(A \vee B) = P(A) + P(B) - P(A \wedge B)$

  • Theorem of total probability: if events $A_{1}, \ldots, A_{n}$ are mutually exclusive with $\sum_{i=1}^{n} P(A_{i}) = 1$, then

\begin{displaymath}P(B) = \sum_{i=1}^{n} P(B \vert A_{i}) P(A_{i}) \end{displaymath}

Brute Force MAP Hypothesis Learner

For each hypothesis h in H, calculate the posterior probability

\begin{displaymath}P(h\vert D) = \frac{P(D\vert h) P(h)}{P(D)} \end{displaymath}

Output the hypothesis hMAP with the highest posterior probability

\begin{displaymath}h_{MAP} = \mathop{\rm argmax}_{h \in H} P(h\vert D)\end{displaymath}
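The brute-force learner above can be sketched directly, since $P(D)$ is constant across hypotheses and can be dropped. The function names and the coin-bias toy example are mine, not part of the notes:

```python
# Minimal sketch of the brute-force MAP learner over a finite
# hypothesis space H: return argmax_h P(D|h) P(h).
def map_hypothesis(H, prior, likelihood, D):
    return max(H, key=lambda h: likelihood(D, h) * prior(h))

# Toy example: three candidate coin biases, uniform prior,
# data D = (heads, tails) = 8 heads out of 10 flips.
H = [0.2, 0.5, 0.8]
prior = lambda h: 1.0 / len(H)

def likelihood(D, h):
    heads, tails = D
    return h**heads * (1 - h)**tails

print(map_hypothesis(H, prior, likelihood, (8, 2)))  # 0.8
```

With a uniform prior this reduces to the ML hypothesis, as in the simplification above.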

Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data $D$ (i.e., $h_{MAP}$)

Given new instance x, what is its most probable classification?


Bayes Optimal Classifier

Bayes optimal classification:

\begin{displaymath}\arg \max_{v_{j} \in V} \sum_{h_{i} \in H} P(v_{j}\vert h_{i}) P(h_{i}\vert D)\end{displaymath}


P(h1|D)=.4, P(-|h1)=0, P(+|h1)=1  
P(h2|D)=.3, P(-|h2)=1, P(+|h2)=0  
P(h3|D)=.3, P(-|h3)=1, P(+|h3)=0  

$\displaystyle \sum_{h_{i} \in H} P(+\vert h_{i}) P(h_{i}\vert D)$ = .4  
$\displaystyle \sum_{h_{i} \in H} P(-\vert h_{i}) P(h_{i}\vert D)$ = .6  

$\displaystyle \arg \max_{v_{j} \in V} \sum_{h_{i} \in H} P(v_{j}\vert h_{i}) P(h_{i}\vert D)$ = -  
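The three-hypothesis example can be checked by summing votes weighted by posterior. A short sketch (data structures are mine; the numbers are the ones on the slide):

```python
from collections import defaultdict

# Bayes optimal classification: weight each hypothesis's vote
# P(v | h_i) by its posterior P(h_i | D), then take the argmax.
posterior = {'h1': 0.4, 'h2': 0.3, 'h3': 0.3}   # P(h_i | D)
p_label = {                                      # P(v | h_i)
    'h1': {'+': 1.0, '-': 0.0},
    'h2': {'+': 0.0, '-': 1.0},
    'h3': {'+': 0.0, '-': 1.0},
}

score = defaultdict(float)
for h, p_h in posterior.items():
    for v, p_v in p_label[h].items():
        score[v] += p_v * p_h

best = max(score, key=score.get)
assert abs(score['+'] - 0.4) < 1e-9 and abs(score['-'] - 0.6) < 1e-9
assert best == '-'   # even though h_MAP = h1 predicts +
```

Note the contrast: the MAP hypothesis $h_1$ predicts $+$, but the Bayes optimal classification is $-$.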

Naive Bayes Classifier

Along with decision trees, neural networks, and nearest neighbor, one of the most practical learning methods.

When to use:

  • Moderate or large training set available
  • Attributes that describe instances are (approximately) conditionally independent given the classification

Successful applications:

  • Diagnosis
  • Classifying text documents

Naive Bayes Classifier

Assume target function $f: X \rightarrow V$, where each instance x described by attributes $\langle a_{1}, a_{2} \ldots a_{n} \rangle$.

Most probable value of f(x) is:

\begin{eqnarray*}
v_{MAP} & = & \mathop{\rm argmax}_{v_{j} \in V} P(v_{j} \vert a_{1}, a_{2} \ldots a_{n}) \\
 & = & \mathop{\rm argmax}_{v_{j} \in V} \frac{P(a_{1}, a_{2} \ldots a_{n}\vert v_{j}) P(v_{j})}{P(a_{1}, a_{2} \ldots a_{n})} \\
 & = & \mathop{\rm argmax}_{v_{j} \in V} P(a_{1}, a_{2} \ldots a_{n}\vert v_{j}) P(v_{j})
\end{eqnarray*}

Naive Bayes assumption:

\begin{displaymath}P(a_{1}, a_{2} \ldots a_{n}\vert v_{j}) = \prod_{i} P(a_{i} \vert v_{j}) \end{displaymath}

which gives

\begin{displaymath}\mbox{\bf Naive Bayes classifier: } v_{NB} = \mathop{\rm argmax}_{v_{j} \in V} P(v_{j})
\prod_{i} P(a_{i} \vert v_{j}) \end{displaymath}

Naive Bayes Algorithm


Naive_Bayes_Learn(examples)

  For each target value vj
    • $\hat{P}(v_j) \leftarrow$ estimate P(vj)
    • For each attribute value ai of each attribute a
      • $\hat{P}(a_i\vert v_j) \leftarrow$ estimate P(ai|vj)

Classify_New_Instance(x)

\begin{displaymath}v_{NB} = \mathop{\rm argmax}_{v_{j} \in V} \hat{P}(v_{j}) \prod_{a_i \in x} \hat{P}(a_{i} \vert v_{j}) \end{displaymath}
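The two procedures above can be sketched in a few lines, using raw relative frequencies as the estimates $\hat{P}(v_j)$ and $\hat{P}(a_i \vert v_j)$ (the function and variable names here are mine):

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_tuple, target_value) pairs."""
    class_counts = Counter(v for _, v in examples)
    cond_counts = defaultdict(Counter)   # per-class counts of (position, value)
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            cond_counts[v][(i, a)] += 1
    n = len(examples)
    p_v = {v: c / n for v, c in class_counts.items()}          # P^(v_j)
    p_a = {v: {k: c / class_counts[v] for k, c in counts.items()}
           for v, counts in cond_counts.items()}               # P^(a_i | v_j)
    return p_v, p_a

def classify(model, attrs):
    p_v, p_a = model
    def score(v):
        s = p_v[v]
        for i, a in enumerate(attrs):
            s *= p_a[v].get((i, a), 0.0)  # unseen value -> 0 (see "Subtleties")
        return s
    return max(p_v, key=score)
```

Note the zero for unseen attribute values; the m-estimate discussed under "Subtleties" below is the usual remedy.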

Naive Bayes: Example

Day  Outlook   Temperature  Humidity  Wind    PlayTennis

D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Consider PlayTennis again, and new instance

\begin{displaymath}\langle Outlk=sun, Temp=cool, Humid=high, Wind=strong \rangle \end{displaymath}

Want to compute:

\begin{displaymath}v_{NB} = \mathop{\rm argmax}_{v_{j} \in V} P(v_{j}) \prod_{i} P(a_{i} \vert v_{j}) \end{displaymath}

\begin{displaymath}P(y)\ P(sun\vert y)\ P(cool\vert y)\ P(high\vert y)\ P(strong\vert y) = .005 \end{displaymath}

\begin{displaymath}P(n)\ P(sun\vert n)\ P(cool\vert n)\ P(high\vert n)\ P(strong\vert n) = .021 \end{displaymath}

\begin{displaymath}\rightarrow v_{NB} = n \end{displaymath}
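The two products can be verified by reading the counts off the table: 9 of the 14 days are Yes, 5 are No, and the conditional counts give the fractions below.

```python
# P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y)
p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9

# P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n)
p_no = 5/14 * 3/5 * 1/5 * 4/5 * 3/5

assert round(p_yes, 3) == 0.005
assert round(p_no, 3) == 0.021
assert p_no > p_yes    # hence v_NB = n
```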

Naive Bayes: Subtleties

Conditional independence assumption is often violated

\begin{displaymath}P(a_{1}, a_{2} \ldots a_{n}\vert v_{j}) = \prod_{i} P(a_{i} \vert v_{j}) \end{displaymath}

...but it works surprisingly well anyway. Note that we don't need the estimated posteriors $\hat{P}(v_{j}\vert x)$ to be correct; we need only that

\begin{displaymath}\mathop{\rm argmax}_{v_{j} \in V} \hat{P}(v_{j}) \prod_{i} \hat{P}(a_{i} \vert v_{j}) = \mathop{\rm argmax}_{v_{j} \in V} P(v_{j}) P(a_{1}, a_{2} \ldots a_{n} \vert v_{j}) \end{displaymath}

Naive Bayes: Subtleties

What if none of the training instances with target value $v_{j}$ have attribute value $a_{i}$? Then

\begin{displaymath}\hat{P}(a_i\vert v_j) = 0 \mbox{, and...}\end{displaymath}

\begin{displaymath}\hat{P}(v_{j}) \prod_{i} \hat{P}(a_{i} \vert v_{j}) = 0 \end{displaymath}

Typical solution is a Bayesian ($m$-)estimate for $\hat{P}(a_{i} \vert v_{j})$:

\begin{displaymath}\hat{P}(a_{i} \vert v_{j}) \leftarrow\frac{n_{c} + mp}{n + m} \end{displaymath}

where

  • $n$ is the number of training examples for which $v = v_{j}$
  • $n_{c}$ is the number of those examples for which $a = a_{i}$
  • $p$ is a prior estimate for $\hat{P}(a_{i} \vert v_{j})$
  • $m$ is the weight given to the prior (i.e., the number of ``virtual'' examples)
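A one-line sketch of this estimate (function name mine), showing that a zero count no longer forces the product to zero:

```python
# m-estimate: n_c occurrences of attribute value a_i among the n
# examples of class v_j, prior p, equivalent sample size m.
def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

# With n_c = 0 the estimate falls back toward the prior instead of 0:
assert m_estimate(0, 5, p=0.5, m=1) == 0.5 / 6
```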


Learning to Classify Text


Naive Bayes is among the most effective algorithms

What attributes shall we use to represent text documents?

Learning to Classify Text

Target concept $Interesting? : Document \rightarrow\{+,-\}$

  • Represent each document by a vector of words: one attribute per word position in the document
  • Learning: use training examples to estimate P(+), P(-), P(doc|+), P(doc|-)

Naive Bayes conditional independence assumption:

\begin{displaymath}P(doc\vert v_j) = \prod_{i=1}^{length(doc)} P(a_i=w_k \vert v_j) \end{displaymath}

where P(ai=wk| vj) is the probability that the word in position i is wk, given vj

one more assumption: $P(a_i=w_k\vert v_j) = P(a_m=w_k\vert v_j), \forall i,m$



1. collect all words and other tokens that occur in Examples

$Vocabulary \leftarrow$ all distinct words and other tokens in Examples

2. calculate the required P(vj) and P(wk|vj) probability terms

For each target value vj in V do
  • $docs_{j} \leftarrow$ subset of Examples for which the target value is vj

  • $P(v_{j}) \leftarrow\frac{\vert docs_{j}\vert}{\vert Examples\vert}$

  • $Text_{j} \leftarrow$ a single document created by concatenating all members of docsj

  • $n \leftarrow$ total number of words in Textj (counting duplicate words multiple times)

  • for each word wk in Vocabulary
    • $n_{k} \leftarrow$ number of times word wk occurs in Textj

    • $P(w_{k}\vert v_{j}) \leftarrow\frac{n_{k} + 1}{n + \vert Vocabulary\vert}$
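The two steps above translate almost line for line into code. A minimal sketch (function names mine), with a shared word distribution per class and the add-one estimate $(n_k + 1)/(n + \vert Vocabulary \vert)$:

```python
from collections import Counter

def learn(examples):
    """examples: list of (list_of_words, target_value) pairs."""
    vocabulary = {w for doc, _ in examples for w in doc}
    p_v, p_w = {}, {}
    for v in {label for _, label in examples}:
        docs_j = [doc for doc, label in examples if label == v]
        p_v[v] = len(docs_j) / len(examples)                 # P(v_j)
        text_j = Counter(w for doc in docs_j for w in doc)   # word counts n_k
        n = sum(text_j.values())                             # words in Text_j
        p_w[v] = {w: (text_j[w] + 1) / (n + len(vocabulary))
                  for w in vocabulary}                       # P(w_k | v_j)
    return p_v, p_w

def classify(model, doc):
    p_v, p_w = model
    def score(v):
        s = p_v[v]
        for w in doc:
            if w in p_w[v]:          # ignore words outside Vocabulary
                s *= p_w[v][w]
        return s
    return max(p_v, key=score)
```

In practice the product of many small probabilities underflows; summing log-probabilities is the standard fix.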


Twenty NewsGroups

Given 1000 training documents from each group

Learn to classify new documents according to the newsgroup they came from; the groups include, for example:

soc.religion.christian, sci.crypt, talk.religion.misc, sci.electronics

Naive Bayes: 89% classification accuracy

Article from one of the newsgroups:

From: (John Doe)
Subject: Re: This year's biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most 
obvious candidate for pleasant surprise is Alex
Zhitnik. He came highly touted as a defensive 
defenseman, but he's clearly much more than that. 
Great skater and hard shot (though wish he were 
more accurate). In fact, he pretty much allowed 
the Kings to trade away that huge defensive 
liability Paul Coffey. Kelly Hrudey is only the 
biggest disappointment if you thought he was any 
good to begin with. But, at best, he's only a 
mediocre goaltender. A better choice would be 
Tomas Sandstrom, though not through any fault of 
his own, but because some thugs in Toronto decided

Learning Curve for 20 Newsgroups

[Figure: accuracy vs. training-set size, with 1/3 of the data withheld for testing]

Bayesian Belief Networks

Interesting because:

  • Naive Bayes assumption of conditional independence is too restrictive
  • But it's intractable without some such assumptions...
  • Bayesian belief networks describe conditional independence among subsets of variables
  • This allows combining prior knowledge about (in)dependencies among variables with observed training data

(also called Bayes Nets)

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

\begin{displaymath}(\forall x_i,y_j,z_k) \ P(X = x_i \vert Y = y_j, Z = z_k) = P(X = x_i \vert Z
= z_k) \end{displaymath}

more compactly, we write

P(X | Y,Z) = P(X | Z)

Example: Thunder is conditionally independent of Rain, given Lightning

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses cond. indep. to justify

P(X,Y | Z) = P(X | Y,Z) P(Y | Z)
           = P(X | Z) P(Y | Z)
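This factorization can be checked numerically. A sketch with made-up toy numbers, where the joint over binary X, Y, Z is constructed so that X and Y are conditionally independent given Z:

```python
import itertools

p_z = {0: 0.7, 1: 0.3}           # P(Z = z)
px1 = {0: 0.9, 1: 0.2}           # P(X = 1 | Z = z)
py1 = {0: 0.5, 1: 0.6}           # P(Y = 1 | Z = z)

def joint(x, y, z):
    px = px1[z] if x else 1 - px1[z]
    py = py1[z] if y else 1 - py1[z]
    return px * py * p_z[z]

# Verify P(X, Y | Z) = P(X | Z) P(Y | Z) for every assignment:
for x, y, z in itertools.product([0, 1], repeat=3):
    px = px1[z] if x else 1 - px1[z]
    py = py1[z] if y else 1 - py1[z]
    assert abs(joint(x, y, z) / p_z[z] - px * py) < 1e-12
```

Naive Bayes applies exactly this step, with Z playing the role of the target value and X, Y the attributes.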

Bayesian Belief Network


Network represents a set of conditional independence assertions:

  • Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
  • Directed acyclic graph

Bayesian Belief Networks

A belief network represents the dependence between variables. It is specified by:

* nodes: one node per random variable

* links: a directed edge from X to Y, meaning X has a direct influence on Y

* conditional probability tables: one table per node, giving the probability of each node value conditioned on its parents

In the Spotlight

Online Airline Pricing