Naive Bayes
From NLPWiki
Contents |
Links
Before we dive into this topic, here are some usefule links.
- http://en.wikipedia.org/wiki/Naive_Bayes
- http://www.intellektik.informatik.tu-darmstadt.de/~tom/IJCAI01/Rish.pdf
- http://citeseer.ist.psu.edu/30545.html
- http://jbnc.sourceforge.net/
- http://nltk.sourceforge.net/tutorial/classifying/index.html
- http://www.geocities.com/ResearchTriangle/Forum/1203/NaiveBayes.html
- http://www.resample.com/xlminer/help/NaiveBC/classiNB_intro.htm
Notation
This section outlines the notation used throughout this document.
- Di - Document i.
- Fj - Feature j (word in a text document).
- Fi,j - Feature j in Document i.
- C - Class label.
- Ci - Class label for Document i.
-
- Values representing specific labels.
-
- This is a notation for a conditional probability.
-
- This is also a notation for a conditional probability.
-
- This is notation for a probability distribution.
-
- This is also notation for a probability distribution.
Derivation of Naive Bayes for Classification
First, what we're looking for is
, where
is the feature vector for document i which is given. In other words, we have a document and its feature vector (that is, the words in the dcoument) and we want to know the probability that the random variable C takes on a specific value or label given this document and its feature vector. In English, what is the probability that this document belongs to class Ci?
Now that we know what we want, here is the derivation:
using Bayes' Theorem:
Note that with a a constant:
Therefore:
(Note: See below for explanation of Bayesian Networks and the naive bayes assumption, which we make at this point).
By the multiplication rule
Because of the naive bayes assumption,
Now, the second part of the right-hand-side of this last equation can be written in short-hand as
so we now have
leading to
For the sake of representing small probabilities in digital hardware, the log function is useful for overcoming floating point underflow.
The MLE (maximum likelihood estimator) for the first term, P(Ci = c) is computed by taking the number of documents with label c divided by the total number of documents in the training set.
The MLE for the term in the sumation ...
Probability Theory
Bayes' Theorem relies on some probability theory, which is covered here.
Sample Spaces and Events
An experiment is any action or process with uncertain outcomes. The possible outcomes of an experiment are represented by a sample space. Each of the possible outcomes is called an event. Events form a set, making up the sample space for a particular experiment.
Sample Space
Consider the experiment of flipping a coin. This experiment is subject to uncertainty, because the outcome could be one of two possibilities. Formally, the sample space S for this experiment is S = {H,T}; the only possible outcomes are a head or tail facing upward.
Events
An event is an outcome from (or a result of) an experiment. In the above experiment of flipping a coin, the sample space, S = {H,T}, contains two events, H and T, heads and tails respectively.
Some Set Theory
Conditional Probability
Conditional probability allows us to use prior knowledge about how one random variable affects another random variable. For example, a lecturer, Stan, has a tendency of being late to class when the weather is good 5% of the time. However, when the weather is bad, Stan is late 25% of the time. Let A be the event that Stan is late, and B be the event of inclement weather. Then, P(A | B) is read, the probability that Stan will be late given that the weather is bad. Or, the probability of A given B.
In this case, P(A | B) = .25, and P(A | B') = 0.5 where B' is the compliment of B, that is, the weather is not bad.
Formally, conditional probability is defined as follows:
This formula is interpreted as follows: The Probability of class Ai given feature B is equal to the probability of B given class Ai
Let Ai be the ith class that we are interested in, and B be a feature of a document.
The Law of Total Probability
Let
be mutually exclusive and exhaustive events. Then for any other event B,
.
Bayes' Theorem
.
