Maximum entropy
From NLPWiki
Contents |
Preliminary definitions
Notation
-
matrix
-
row vector in matrix
- aij cell i,j in matrix
- K number of classes
- F number of features
-
the input vectors in the training data; each row is an input vector
-
the classes in the training data; yi is the class of input vector
- N the amount of training data, i.e. the number of rows in
and
-
weight matrix with K rows and F columns
-
feature function of an input vector that returns a matrix, one row per output class; the number of columns is F. On analogy with matrix notation, we use
to indicate a specific row in the resulting matrix, and φyf to reference a single cell in the resulting matrix. A block feature vector is one in which
, i.e. the feature function is independent of class. This is the most common scenario and can simplify implementation. In that case, the subscripts can be ignored and the output of
can treated as a row vector.
Log-linear models
The model we will be dealing with has the form of a log-linear model, i.e. if
(
is a column vector) then the likelihood function f is defined as:
In the context of maximum entropy and related models, we usually extract features from the original input vector. This can be represented as:
The maximum entropy model (parameter estimation)
The maximum entropy model is the log-linear model having parameters
such that the resulting distribution (edit: distribution over what?) has maximum entropy subject to the constraints represented by the training data
. It can be shown that the parameters
that maximize the log-likelihood of the data
constitute the maximum entropy model. Using standard calculus techniques to find the maximum, we need to find the gradient
and therefore the partial derivatives
.
Log-likelihood and conditional log-likelihood
The log-likelihood of a log-linear model is:
Since the term
is independent of the parameters
and the input vectors are fixed and known (i.e. observed), this term is a constant that can be ignored in the context of our maximization problem. The result is the conditional log-likelihood (so called because it consists of the product of the conditional likelihoods
, i.e. ignoring
):
Partial Derivatives
When taking the partial derivatives of the conditional log-likelihood, it will be beneficial to further split the conditional log-likelihood of the data as follows:
i.e. we split the left side of the equation into a sum over partitions of the training data where each partition consists of instances in the training data having the same class.
It is possible to distribute the partial derivative of both operands of the subtraction.
Numerator (Left Term)
Denominator (Right Term)
Bayesian parameter estimation
From a Bayesian perspective, we can easily put a prior over
. One common choice for a prior are independent Normals centered at 0 with variance σ2, i.e.
Typical values of σ2 are 0.5-1.0.
We are now interested in the posterior distribution
although this is no longer the maximum entropy solution (edit: double check).
Evaluating the posterior is intractable, although, in theory, MCMC approaches could be used to generate samples from the posterior. Instead, often times the MAP parameter
is sought. As before, this optimization problem requires computing the gradient of the log of the posterior. Since the posterior is proportional to the likelihood times the prior, all that changes in the math above is the addition of the log of the prior:
Other priors
Another prior that has been used is the Laplacian (see Goodman). The advantage of this prior is that it is spiked at zero so that most weights are driven to zero (a form of feature selection).
Relation to regularization
It can be shown that adding a Gaussian prior is the same as L2 normalization of the weights and adding a Laplacian prior is equivalent to L1 normalization of the weights in the MAP solution.
Acknowledgments
This page is based heavily on material originally provided by Dan Klein of Berkeley and further refinements by Eric Ringger and students. See below for more links from both sources.
Links
- Eric Ringger's maxent slides in the context of word sense disambiguation (see also Intro and Smoothing MaxEnt)
- Eric Ringger's maxent slides in the context of document classification
- Eric Ringger's slides on the relation of maxent to logistic regression
- Dan Klein's current slides (2PP) and previous slides (2PP)
- http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/tutorial.html
- http://maxent.sourceforge.net/about.html
- http://opennlp.sourceforge.net/
