Wednesday, May 25, 2016

Decision Trees


Decision trees, as the name suggests, have a hierarchical tree structure consisting of two types of nodes: internal decision nodes and prediction nodes (which are basically the leaf nodes of the tree). Graphically, it looks something like this:

[Figure: a decision tree with internal decision nodes branching down to leaf prediction nodes]

To get a flavor of how this works, assume that someone has built a tree for you. Now, how do you make a prediction? Your problem, like any generic machine learning problem, is to predict $y_{i}$ given $x_{i}$. So, to get your answer, what you do is drop each $x_{i}$ down the tree (along the path dictated by the decision nodes) until you finally hit a leaf/prediction node, whose stored value is your predicted $y_{i}$.
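To make this concrete, here is a minimal sketch of that traversal in Python, assuming a hypothetical Node structure with a value stored at each leaf (the names Node, feature, threshold and predict are illustrative, not from any particular library):

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index of the feature tested at this decision node
        self.threshold = threshold  # threshold used for the test
        self.left = left            # subtree for x[feature] <= threshold
        self.right = right          # subtree for x[feature] > threshold
        self.value = value          # prediction stored at a leaf (None for internal nodes)

def predict(node, x):
    # Drop a single example x down the tree until a leaf/prediction node is reached.
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value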

Looks pretty straightforward? It is. But the real problem lies not in making predictions from an already constructed tree, but in actually building, or learning, the tree. The basic idea behind constructing a decision tree is that you keep dividing your feature space into regions using decision nodes until you are left with pure regions populated almost entirely by one type of label. So building a decision tree can be formulated as a recursive function in which we first check whether our pre-defined stopping criterion is met. If it is met, we turn the node into a leaf and store a prediction; if not, we use some method to split the node into two parts. Then we recurse on the left and right children of the node to do a similar analysis and eventually get our tree.
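As a rough sketch, that recursion could look like the following, reusing the hypothetical Node class from the snippet above plus placeholder helpers should_stop, best_split and leaf_value (all illustrative names; variants of them are sketched further down the post):

def build_tree(X, y):
    # Stopping criterion met: make a leaf and store a prediction (points 1 and 3 below).
    if should_stop(y):
        return Node(value=leaf_value(y))
    # Otherwise find the best split (point 2 below) and recurse on the two children.
    feature, threshold = best_split(X, y)
    left_idx = [i for i, x in enumerate(X) if x[feature] <= threshold]
    right_idx = [i for i, x in enumerate(X) if x[feature] > threshold]
    return Node(feature=feature, threshold=threshold,
                left=build_tree([X[i] for i in left_idx], [y[i] for i in left_idx]),
                right=build_tree([X[i] for i in right_idx], [y[i] for i in right_idx]))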

So the three things we need to know for building a decision tree are:

1) How to decide on a stopping criterion:
This does not have as concrete an answer as you might be looking for. It may vary from problem to problem, but there are two basic ideas that give us a heuristic (sketched in code right after this list):
·  Stop when the leaf is pure, i.e. the impurity of the target variable (for example its variance) is less than some threshold $\epsilon$ value that you decide.
·  Stop when the number of examples reaching that node becomes lower than some minimum value.
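A minimal sketch of such a check, assuming a regression-style setting with a variance threshold and a minimum sample count (EPSILON, MIN_SAMPLES and should_stop are hypothetical names and values):

import statistics

EPSILON = 1e-3    # purity threshold on the variance of the targets (hypothetical value)
MIN_SAMPLES = 5   # minimum number of examples allowed at a node (hypothetical value)

def should_stop(y):
    # Stop if too few examples reach the node, or if the node is (almost) pure.
    if len(y) < max(MIN_SAMPLES, 2):
        return True
    return statistics.variance(y) < EPSILON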

2) How to find the best split:
To answer this question we use a different strategy depending on whether the target variable is categorical (classification) or continuous (regression):

·  Classification (INFORMATION GAIN):
In the case of classification problems we use the concept of information gain. To understand this concept we first need to understand ‘entropy’.
The entropy of X is basically a measure of how spread out and unpredictable its distribution is. If X is uniform, every value is equally likely and X is hard to guess, so it has a high entropy; while if the distribution of X is full of peaks and valleys, X is much easier to guess and can be said to have a low entropy. Mathematically, entropy is defined as follows:

Entropy of X = H(X) = $-\sum_{j=1}^{m} p_{j}\log p_{j}$

Where $p_{j}$ is the probability of X taking its $j^{th}$ distinct value, with j running over all m distinct values that X can take.

So physically the entropy of X is just a measure of how hard it is to predict the value of X.

Now that we have an idea of what entropy basically is let’s delve a little bit further into it and define the following:
§  Specific conditional entropy H(Y|X=v): the entropy of Y computed only among the records in which X has the value v.
§  Conditional entropy H(Y|X): the average specific conditional entropy of Y, weighted by how often each value of X occurs, i.e. $H(Y|X) = \sum_{v} P(X=v)\, H(Y|X=v)$.
Now that we have all the tools let us finally get down to the business of information gain.

Information Gain = $IG(Y|X) = H(Y) - H(Y|X)$

So information gain basically tells us how many bits we would save, on average, while transmitting Y from one end to the other if the receiving end had the information about X ahead of time.
So, while splitting, what we can do is compute the information gain for each candidate feature $X_{i}$, rank them in descending order, and split on the feature with the highest information gain (a sketch of this computation follows below).
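A minimal sketch of how these quantities could be computed for a categorical feature, in plain Python (the function names are illustrative):

import math
from collections import Counter

def entropy(values):
    # H(X) = -sum_j p_j * log2(p_j), with p_j the fraction of each distinct value.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    # H(Y|X): average of the specific conditional entropies H(Y|X=v), weighted by P(X=v).
    n = len(x)
    return sum((x.count(v) / n) * entropy([yi for xi, yi in zip(x, y) if xi == v])
               for v in set(x))

def information_gain(y, x):
    # IG(Y|X) = H(Y) - H(Y|X)
    return entropy(y) - conditional_entropy(y, x)

At each node you would evaluate information_gain for every candidate feature and split on the winner.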

·   Regression:
In the case of continuous target variables we use what we call the ‘impurity’, measured here by the variance. A node D is split into subtrees $D_{L}$ and $D_{R}$ so as to maximize the reduction in impurity (sketched in code right after the formula), given by:

$|D|\cdot Var(D) - |D_{R}|\cdot Var(D_{R}) - |D_{L}|\cdot Var(D_{L})$
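A minimal sketch of how such a split could be searched for a single continuous feature (the names weighted_variance, variance_reduction and best_threshold are illustrative):

import statistics

def weighted_variance(y):
    # |D| * Var(D); taken as zero for nodes with fewer than two examples.
    return len(y) * statistics.pvariance(y) if len(y) > 1 else 0.0

def variance_reduction(y, y_left, y_right):
    # |D|*Var(D) - |D_L|*Var(D_L) - |D_R|*Var(D_R)
    return weighted_variance(y) - weighted_variance(y_left) - weighted_variance(y_right)

def best_threshold(feature_values, y):
    # Try the midpoint between every pair of consecutive sorted feature values.
    pairs = sorted(zip(feature_values, y))
    best_gain, best_t = float("-inf"), None
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        y_left = [yi for xi, yi in pairs if xi <= t]
        y_right = [yi for xi, yi in pairs if xi > t]
        gain = variance_reduction(y, y_left, y_right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain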

3) How to predict:

This one is more obvious than the rest. If you have categorical labels, just predict the value that the majority of the examples in that particular leaf node have. In the case of continuous targets, you may take the average value of the examples instead. Another way would be to fit a small regression model to the examples at that node.
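For completeness, here is a sketch of the leaf_value helper assumed in the earlier build_tree snippet, covering the majority-vote and mean cases (the task flag is illustrative):

import statistics
from collections import Counter

def leaf_value(y, task="classification"):
    # Majority vote for categorical labels, mean for continuous targets.
    if task == "classification":
        return Counter(y).most_common(1)[0][0]
    return statistics.mean(y)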

So decision trees are quite a handy large-scale machine learning algorithm, especially for multi-class classification. However, the downside is that they are more prone to overfitting if the amount of training data is small. To mitigate this problem there is another method called 'Random Forests', which basically averages a number of randomized decision trees to reduce the variance.



