Now that we have seen linear
regression, which is a discriminative model, let us look at Naïve
Bayes to get a flavor of what a typical generative machine learning model looks
like. So how is a generative model different from a discriminative one? To give
a crude, sloppy description: a generative model, while classifying, models how
the data in each class is distributed, whereas a discriminative model only
learns the decision boundary between the classes. The coolest thing that I find about a
generative model is that it also allows us to generate ‘synthetic
observations’.
So coming to the Naïve Bayes
classifier, it is a Bayesian probabilistic classifier that assigns a given
x to the class that is most probable given the observation:
$\hat{y} = \arg\max_{y} P(y|x)$
where
$P(y|x) = \frac{P(x|y)P(y)}{\sum_{y'}P(x|y')P(y')}$
· P(x|y) is called the class model and tells us how probable x is, assuming y
is the correct label.
· P(y) is the prior and scales P(y|x) up or down according to how probable y
is in general.
· The denominator plays no role in classifying a single example, since it is
the same for every class and only normalizes the probabilities so that they
sum to one. It does, however, become crucial if you wish to rank a set of
different x by their probability of belonging to a class.
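To make the roles of the three terms concrete, here is a minimal sketch of Bayes' rule for a single observation x. The class names and all the numbers are made up for illustration, and `posterior` is a hypothetical helper, not part of any library.

```python
# Hypothetical numbers for a two-class problem (illustrative only).
prior = {"spam": 0.3, "ham": 0.7}           # P(y)
likelihood = {"spam": 0.05, "ham": 0.01}    # P(x|y) for one fixed observation x


def posterior(prior, likelihood):
    """Apply Bayes' rule: P(y|x) = P(x|y) P(y) / sum over y' of P(x|y') P(y')."""
    unnormalized = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(unnormalized.values())   # the denominator, identical for every class
    normalized = {y: score / evidence for y, score in unnormalized.items()}
    return normalized, unnormalized


post, scores = posterior(prior, likelihood)

# The most probable class is the same with or without the denominator.
y_hat = max(post, key=post.get)
y_hat_unnormalized = max(scores, key=scores.get)
```

Dividing by the denominator only rescales the scores for this particular x, which is why it can be skipped when all you need is the winning class.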
INDEPENDENCE ASSUMPTION:
So far everything seems to be good.
But notice that while calculating P(x|y),
what we are really calculating is $P(x_{1},x_{2},\ldots,x_{d}|y)$, where x consists of
d features.
By the chain rule this can be
expanded as:
$P(x_{1},x_{2},\ldots,x_{d}|y) = \prod_{i=1}^{d}P(x_{i}|x_{1},\ldots,x_{i-1},y)$
Even for a modest size feature
vector these calculations will explode with so many possible combinations. The way
we get around this is by making a slightly bold assumption: that the different
features of x are conditionally
independent given y. Therefore, our probability boils down to:
$P(x_{1},x_{2},\ldots,x_{d}|y) = \prod_{i=1}^{d}P(x_{i}|x_{1},\ldots,x_{i-1},y)$ (chain rule)
$\simeq \prod_{i=1}^{d} P(x_{i}|y)$ (independence assumption)
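As a rough sketch of what the assumption buys us, consider the special case of d binary features (an assumption made here only to keep the counting simple). A full table for $P(x_{1},\ldots,x_{d}|y)$ needs one number per feature combination, while the factorized form needs one number per feature. The function names below are hypothetical.

```python
import math


def full_joint_params(d):
    """Free parameters per class for a full table over d binary features."""
    return 2 ** d - 1


def naive_bayes_params(d):
    """Free parameters per class under conditional independence: one P(x_i = 1 | y) each."""
    return d


def naive_likelihood(x, theta):
    """Product over i of P(x_i | y), where theta[i] is P(x_i = 1 | y) for binary features."""
    return math.prod(t if xi == 1 else 1.0 - t for xi, t in zip(x, theta))


# Illustrative, made-up per-feature probabilities for one class.
theta = [0.8, 0.1, 0.5]
p = naive_likelihood([1, 0, 1], theta)   # 0.8 * (1 - 0.1) * 0.5
```

For d = 30 the full table already needs over a billion entries per class, against 30 for the factorized model.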
As hinted earlier, conditional
independence can prove to be a costly assumption. Because of it, if
the only thing that separates two classes is the way in which the attributes
are correlated, then Naïve Bayes will be helpless. Feature dependence
is a very important instrument in teasing out information from data, especially
in applications like spam detection.
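A small, artificial illustration of this failure mode: in the XOR-style toy data below (made up for this sketch), the label depends only on whether the two features agree. Each feature on its own looks identical in both classes, so a Naïve Bayes model estimated by simple counting cannot tell the classes apart.

```python
from collections import Counter

# XOR-style toy data: the label is 1 exactly when the two features differ.
data = [((0, 0), 0), ((1, 1), 0), ((0, 1), 1), ((1, 0), 1)]

class_counts = Counter(label for _, label in data)
prior = {y: count / len(data) for y, count in class_counts.items()}

# theta[y][i] = estimated P(x_i = 1 | y), by simple counting (no smoothing).
theta = {
    y: [sum(x[i] for x, label in data if label == y) / class_counts[y] for i in range(2)]
    for y in class_counts
}


def nb_posterior(x):
    """Naive Bayes posterior P(y|x) for a binary feature vector x."""
    scores = {}
    for y in prior:
        p = prior[y]
        for xi, t in zip(x, theta[y]):
            p *= t if xi == 1 else 1.0 - t
        scores[y] = p
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


posteriors = {x: nb_posterior(x) for x, _ in data}
```

Every per-feature estimate comes out as 0.5 for both classes, so the posterior is an even split for all four points even though the labels are perfectly determined by the features.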
However, in examples in which the
features are intuitively independent, the conditional independence assumption can actually
prove to be an asset. For example, in cases where the value of one of the
dimensions of an example x is not known, you don’t need to discard the whole x
altogether. You can just ignore the missing dimension and go about training
your model.
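One way this could look in code, as a sketch only: because each $P(x_{i}|y)$ is estimated separately, a missing value (marked here with `None`, a convention chosen for this example) is simply left out of the count for that one feature. The data and the helper name `fit_feature_probs` are hypothetical, and the features are assumed to be binary.

```python
def fit_feature_probs(X, y, label):
    """Estimate P(x_i = 1 | y = label) per feature, skipping missing (None) entries.

    Assumes every feature is observed at least once for the given label.
    """
    d = len(X[0])
    probs = []
    for i in range(d):
        # Keep only the examples of this class where feature i is known.
        observed = [x[i] for x, yi in zip(X, y) if yi == label and x[i] is not None]
        probs.append(sum(observed) / len(observed))
    return probs


# Made-up training set; the second example has an unknown second feature.
X = [(1, 0), (1, None), (0, 1), (0, 0)]
y = [1, 1, 1, 0]

probs_class_1 = fit_feature_probs(X, y, label=1)
```

The example with the missing value still contributes to the estimate for the first feature; it is only dropped from the estimate for the second.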
IMPLEMENTATION IN PYTHON: