Naive Bayes

From NLPWiki

Jump to: navigation, search

Contents

Links

Before we dive into this topic, here are some usefule links.

Notation

This section outlines the notation used throughout this document.

  • Failed to parse (Missing texvc executable; please see math/README to configure.): D_i
- Document Failed to parse (Missing texvc executable; please see math/README to configure.): i

.

  • Failed to parse (Missing texvc executable; please see math/README to configure.): F_j
- Feature Failed to parse (Missing texvc executable; please see math/README to configure.): j
(word in a text document).
  • Failed to parse (Missing texvc executable; please see math/README to configure.): F_{i,j}
- Feature Failed to parse (Missing texvc executable; please see math/README to configure.): j
in Document Failed to parse (Missing texvc executable; please see math/README to configure.): i

.

  • Failed to parse (Missing texvc executable; please see math/README to configure.): C
- Class label.
  • Failed to parse (Missing texvc executable; please see math/README to configure.): C_i
- Class label for Document Failed to parse (Missing texvc executable; please see math/README to configure.): i

.

  • Failed to parse (Missing texvc executable; please see math/README to configure.): \{c_1, \ldots , c_n\}
- Values representing specific labels.
  • Failed to parse (Missing texvc executable; please see math/README to configure.): P(C=c_i| \ldots )
- This is a notation for a conditional probability.
  • Failed to parse (Missing texvc executable; please see math/README to configure.): P(c_i| \ldots )
- This is also a notation for a conditional probability.
  • Failed to parse (Missing texvc executable; please see math/README to configure.): P(C| \ldots )
- This is notation for a probability distribution.
  • Failed to parse (Missing texvc executable; please see math/README to configure.): P(C_i| \ldots )
- This is also notation for a probability distribution.

Derivation of Naive Bayes for Classification

First, what we're looking for is Failed to parse (Missing texvc executable; please see math/README to configure.): \hat c = argmax_c P(C_i = c| \underline{D_i}) , where Failed to parse (Missing texvc executable; please see math/README to configure.): \underline{D_i}

is the feature vector for document Failed to parse (Missing texvc executable; please see math/README to configure.): i
which is given. In other words, we have a document and its feature vector (that is, the words in the dcoument) and we want to know the probability that the random variable Failed to parse (Missing texvc executable; please see math/README to configure.): C
takes on a specific value or label given this document and its feature vector. In English, what is the probability that this document belongs to class Failed to parse (Missing texvc executable; please see math/README to configure.): C_i

?

Now that we know what we want, here is the derivation:

using Bayes' Theorem:

Failed to parse (Missing texvc executable; please see math/README to configure.): \hat c = argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})}


Note that with Failed to parse (Missing texvc executable; please see math/README to configure.): a

a constant:

Failed to parse (Missing texvc executable; please see math/README to configure.): argmax_x \frac{f(x)}{a} = argmax_x f(x)


Therefore:

Failed to parse (Missing texvc executable; please see math/README to configure.): argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})} = argmax_c P(\underline{D_i}|C_i = c)P(C_i = c)


(Note: See below for explanation of Bayesian Networks and the naive bayes assumption, which we make at this point).

By the multiplication rule

Failed to parse (Missing texvc executable; please see math/README to configure.): P(\underline{D_i}|C_i = c)P(C_i = c) = P(\underline{D_i},C_i = c)


Because of the naive bayes assumption,

Failed to parse (Missing texvc executable; please see math/README to configure.): P(C_i, F_{i,1}, F_{i,2}, \ldots ,F_{i,n}) = P(C_i)P(F_{i,1}|C_i) \cdots P(F_{i,n}|C_i)


Now, the second part of the right-hand-side of this last equation can be written in short-hand as

Failed to parse (Missing texvc executable; please see math/README to configure.): \prod_{j=1}^n P(F_{i,j}|C_i = c)


so we now have

Failed to parse (Missing texvc executable; please see math/README to configure.): P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)


leading to

Failed to parse (Missing texvc executable; please see math/README to configure.): \hat c = argmax_c P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)


For the sake of representing small probabilities in digital hardware, the log function is useful for overcoming floating point underflow.

Failed to parse (Missing texvc executable; please see math/README to configure.): \hat c = argmax_c log(P(C_i = c)) + \sum_{j=1}^n log(P(F_{i,j}|C_i = c))


The MLE (maximum likelihood estimator) for the first term, Failed to parse (Missing texvc executable; please see math/README to configure.): P(C_i = c)

is computed by taking the number of documents with label Failed to parse (Missing texvc executable; please see math/README to configure.): c
divided by the total number of documents in the training set.

The MLE for the term in the sumation ...

Probability Theory

Bayes' Theorem relies on some probability theory, which is covered here.

Sample Spaces and Events

An experiment is any action or process with uncertain outcomes. The possible outcomes of an experiment are represented by a sample space. Each of the possible outcomes is called an event. Events form a set, making up the sample space for a particular experiment.

Sample Space

Consider the experiment of flipping a coin. This experiment is subject to uncertainty, because the outcome could be one of two possibilities. Formally, the sample space Failed to parse (Missing texvc executable; please see math/README to configure.): S

for this experiment is Failed to parse (Missing texvc executable; please see math/README to configure.): S = \{H,T\}
the only possible outcomes are a head or tail facing upward.
Events

An event is an outcome from (or a result of) an experiment. In the above experiment of flipping a coin, the sample space, Failed to parse (Missing texvc executable; please see math/README to configure.): S = \{H,T\} , contains two events, Failed to parse (Missing texvc executable; please see math/README to configure.): H

and Failed to parse (Missing texvc executable; please see math/README to configure.): T

, heads and tails respectively.

Some Set Theory

Conditional Probability

Conditional probability allows us to use prior knowledge about how one random variable affects another random variable. For example, a lecturer, Stan, has a tendency of being late to class when the weather is good 5% of the time. However, when the weather is bad, Stan is late 25% of the time. Let Failed to parse (Missing texvc executable; please see math/README to configure.): A

be the event that Stan is late, and Failed to parse (Missing texvc executable; please see math/README to configure.): B
be the event of inclement weather. Then, Failed to parse (Missing texvc executable; please see math/README to configure.): P(A|B)
is read, the probability that Stan will be late given that the weather is bad. Or, the probability of Failed to parse (Missing texvc executable; please see math/README to configure.): A
given Failed to parse (Missing texvc executable; please see math/README to configure.): B

.

In this case, Failed to parse (Missing texvc executable; please see math/README to configure.): P(A|B) = .25 , and Failed to parse (Missing texvc executable; please see math/README to configure.): P(A|B^{'}) = 0.5

where Failed to parse (Missing texvc executable; please see math/README to configure.): B^{'}
is the compliment of Failed to parse (Missing texvc executable; please see math/README to configure.): B

, that is, the weather is not bad.

Formally, conditional probability is defined as follows:

Failed to parse (Missing texvc executable; please see math/README to configure.): P(A|B) = \frac{P(A \cap B)}{P(B)} \!


This formula is interpreted as follows: The Probability of class Failed to parse (Missing texvc executable; please see math/README to configure.): A_i

given feature Failed to parse (Missing texvc executable; please see math/README to configure.): B
is equal to the probability of Failed to parse (Missing texvc executable; please see math/README to configure.): B
given class Failed to parse (Missing texvc executable; please see math/README to configure.): A_i


Let Failed to parse (Missing texvc executable; please see math/README to configure.): A_i

be the Failed to parse (Missing texvc executable; please see math/README to configure.): i^{th}
class that we are interested in, and Failed to parse (Missing texvc executable; please see math/README to configure.): B
be a feature of a document.

The Law of Total Probability

Let Failed to parse (Missing texvc executable; please see math/README to configure.): A_1, \ldots , A_k

be mutually exclusive and exhaustive events. Then for any other event Failed to parse (Missing texvc executable; please see math/README to configure.): B

, Failed to parse (Missing texvc executable; please see math/README to configure.): P(B) = \sum_{i=1}^k P(B|A_i)P(A_i) .

Bayes' Theorem

Failed to parse (Missing texvc executable; please see math/README to configure.): P(A|B) = \frac{P(B | A)\, P(A)}{P(B)} \!


Failed to parse (Missing texvc executable; please see math/README to configure.): P(A_i|B) = \frac{P(B | A_i)\, P(A_i)}{\sum_j P(B|A_j)\,P(A_j)} \!


.

Personal tools