Wednesday, April 13, 2016

NAÏVE BAYES


Now that we have seen linear regression, which is a discriminative model, let us look at this model called Naïve Bayes to get a flavour of what a typical generative machine learning model looks like. So how is a generative model different from a discriminative one? To give a crude, sloppy description: a generative model, while classifying, models the whole data distribution (how the x's themselves arise for each class), unlike the discriminative model, which discriminates only on the basis of the decision boundary. The coolest thing that I find about a generative model is that it also allows us to generate ‘synthetic observations’.

So, coming to the Naïve Bayes classifier: it is a Bayesian probabilistic classifier that assigns a given x to the most probable class given the observation,

$\hat{y} = \arg\max_{y} P(y|x)$

where $P(y|x) = \frac{P(x|y)P(y)}{\sum_{y'}P(x|y')P(y')}$

·        P(x|y) is called the class model and tells us how probable x is, assuming y to be the correct label.

·        P(y) is the prior and scales P(y|x) up or down according to how probable class y is in general.

·        The denominator plays no role when classifying a single example, as it is just a constant that normalizes the probabilities across the classes. It becomes important, however, if you wish to rank a set of examples by their probability of belonging to a class.
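To make this concrete, here is a toy numerical example (the numbers are made up purely for illustration). Suppose the two classes are spam and ham, with priors P(spam) = 0.3 and P(ham) = 0.7, and suppose that for a particular message x the class models give P(x|spam) = 0.02 and P(x|ham) = 0.005. Then

$P(spam|x) = \frac{0.02 \times 0.3}{0.02 \times 0.3 + 0.005 \times 0.7} = \frac{0.006}{0.0095} \approx 0.63$

so Naïve Bayes labels x as spam, even though the prior alone favours ham.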


INDEPENDENCE ASSUMPTION:

So far everything seems good. But notice that in calculating P(x|y) what we are really computing is $P(x_{1},x_{2},…,x_{d}|y)$, where x consists of d features.
By the chain rule this can be written as:

$P(x_{1},x_{2},…,x_{d}|y) = \prod_{i=1}^{d}P(x_{i}|x_{1},…,x_{i-1},y)$

Even for a modestly sized feature vector these calculations explode, because there are far too many possible combinations of feature values to estimate: with d binary features, for instance, the unrestricted conditional needs on the order of $2^{d}$ parameters per class. The way we get around this is to make a slightly bold assumption that the different features of x are conditionally independent given y. Therefore, our probability boils down to:

$P(x_{1},x_{2},…,x_{d}|y) = \prod_{i=1}^{d}P(x_{i}|x_{1},…,x_{i-1},y)$ ………..(chain rule)
                                        $\simeq \prod_{i=1}^{d} P(x_{i}|y)$ .........(independence assumption)
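In code, this factorized form is usually evaluated in log space, so that the product of many small per-feature probabilities does not underflow to zero. A quick sketch of that trick (the numbers are arbitrary, purely for illustration):

import math

def log_score(prior, per_feature_likelihoods):
    """Return log P(y) + sum_i log P(x_i|y) for one candidate class.
    prior: P(y); per_feature_likelihoods: [P(x_1|y), ..., P(x_d|y)]"""
    return math.log(prior) + sum(math.log(p) for p in per_feature_likelihoods)

# Pick whichever class gets the larger log-score.
score_spam = log_score(0.3, [0.8, 0.1, 0.6])
score_ham = log_score(0.7, [0.2, 0.3, 0.4])
print('spam' if score_spam > score_ham else 'ham')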

As hinted earlier, conditional independence can prove to be a costly assumption: if the only thing that separates two classes is the way in which their attributes are correlated, then Naïve Bayes will be helpless. Feature dependence is a very important instrument for teasing information out of data, especially in applications like spam detection.
However, in problems where the features are intuitively independent, the conditional independence assumption can actually prove to be an asset. For example, when the value of one of the dimensions $x_{j}$ of an example is not known, you don't need to scrap the whole x altogether; you can just ignore the missing dimension and carry on training (and scoring with) your model, as sketched below.
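A minimal sketch of that idea (the helper per_feature_log_prob is hypothetical, not any particular library's API): because the likelihood factorizes over features, a missing dimension is simply dropped from the sum of log-probabilities.

def factorized_log_likelihood(x, per_feature_log_prob):
    """x: list of feature values, with None marking a missing dimension.
    per_feature_log_prob(i, value): returns log P(x_i = value | y)."""
    # Skip unknown dimensions instead of discarding the whole example.
    return sum(per_feature_log_prob(i, v) for i, v in enumerate(x) if v is not None)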

IMPLEMENTATION IN PYTHON:
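As a concrete illustration, here is a minimal Gaussian Naïve Bayes sketch in NumPy. It assumes real-valued features, models each $P(x_{i}|y)$ as a univariate Gaussian, and scores classes with the log-space argmax derived above; treat it as a sketch rather than a production implementation (scikit-learn's GaussianNB is the battle-tested version of the same idea).

import numpy as np

class GaussianNaiveBayes:
    """Each P(x_i | y) is modelled as a univariate Gaussian per feature."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.vars_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_[c] = len(Xc) / len(X)        # P(y)
            self.means_[c] = Xc.mean(axis=0)          # per-feature means
            self.vars_[c] = Xc.var(axis=0) + 1e-9     # per-feature variances (smoothed)
        return self

    def _log_joint(self, x, c):
        # log P(y) + sum_i log P(x_i | y), using the independence assumption
        m, v = self.means_[c], self.vars_[c]
        log_likelihood = np.sum(-0.5 * np.log(2 * np.pi * v) - (x - m) ** 2 / (2 * v))
        return np.log(self.priors_[c]) + log_likelihood

    def predict(self, X):
        # argmax over classes of the log joint, one prediction per row of X
        return np.array([max(self.classes_, key=lambda c: self._log_joint(x, c)) for x in X])

# Toy usage on synthetic data (each class is a Gaussian blob).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = GaussianNaiveBayes().fit(X, y)
print(model.predict(np.array([[0.1, -0.2], [2.9, 3.1]])))   # expect [0 1]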