Sunday, December 18, 2016

Traditional Machine learning, Deep learning and Transfer learning on CIFAR 10


To get an overview of various image classification techniques, I trained models on the CIFAR 10 data set using several different approaches. This post shares what I learned from the experiment.

1)      FEATURE EXTRACTION + MACHINE LEARNING CLASSIFIER:

Before the rise of deep learning over the last decade, this was basically how people classified images: they first used some feature extractor to convert each H x W x C dimensional image into a modest sized feature vector that tries to capture all the information needed to solve the classification problem. This feature vector is then fed to a machine learning classifier, say an SVM with a nonlinear kernel.
For this method I used a concatenation of a Histogram of Oriented Gradients (HOG) and a colour histogram. Since HOG captures the texture information of an image and ignores colour, while the colour histogram captures colour information and ignores texture, concatenating the two gave me a good enough feature vector to work with. On these feature vectors I tried running classifiers like random forests, shallow ANNs, and SVMs with linear and RBF kernels. Of these, the SVM with an RBF kernel gave me the best accuracy, a modest 58 percent on 10-fold cross validation.
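The post doesn't include code for this pipeline; a minimal sketch with skimage and scikit-learn might look like the following (HOG parameters, histogram bins and the SVM's C value are illustrative, and X_images / y stand in for the loaded CIFAR-10 images and labels):

import numpy as np
from skimage.feature import hog
from skimage.color import rgb2gray
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def extract_features(image):
    """Concatenate HOG (texture) and colour histogram (colour) features for one RGB image."""
    hog_vec = hog(rgb2gray(image), orientations=9,
                  pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    # 8-bin histogram per colour channel, normalised to sum to 1
    colour_hist = np.concatenate(
        [np.histogram(image[:, :, c], bins=8, range=(0, 255))[0] for c in range(3)]
    ).astype(float)
    colour_hist /= colour_hist.sum()
    return np.concatenate([hog_vec, colour_hist])

# X_images: N x 32 x 32 x 3 array of CIFAR-10 images (uint8), y: N labels; loading omitted
X = np.array([extract_features(img) for img in X_images])
clf = SVC(kernel='rbf', C=10)
print(cross_val_score(clf, X, y, cv=10).mean())   # 10-fold cross-validation accuracy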

2)      A SHALLOW CNN TRAINED FROM SCRATCH:

The next thing I tried, and was particularly excited about, was training my own CNN model. Because of limited hardware resources, I could only train a minimally deep model with the architecture conv-relu-pool -> conv-relu-pool -> conv-relu-pool -> FC -> FC -> softmax. I added dropout in between so as not to overfit, and the final error function was of course cross entropy, since we have softmax at the end. After experimenting with different learning rates (advisable to decrease in steps of 10, as in 0.1, 0.01, 0.001, 0.0001) and optimizers (stochastic gradient descent with and without momentum, Adam, AdaGrad, RMSprop, etc.), the best I could do was close to 70 percent validation and test accuracy. This is quite an increase in accuracy compared to the earlier approach, but the computation was a lot more expensive: to put this in context, training with the first approach took less than half an hour, while the convnet took the whole night to train on my PC.
For implementing this I used Keras, which is a high level wrapper over TensorFlow or Theano and is quite useful for fast prototyping of generic deep learning models.
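Since the post doesn't reproduce the code, here is a rough Keras sketch of the architecture described above; the filter counts, dense layer size and dropout rate are illustrative guesses rather than the exact values used:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(256, activation='relu'),
    Dropout(0.5),                      # dropout between the FC layers to reduce overfitting
    Dense(10, activation='softmax'),   # 10 CIFAR-10 classes
])

# cross entropy loss with the softmax output; the optimizer is whichever worked best in the experiments
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train_onehot, epochs=50, batch_size=64, validation_split=0.1)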

3)      TRANSFER LEARNING:

Transfer learning is basically piggybacking on models previously trained on large generic data sets and customizing them by fine-tuning them for your own application. There are quite a number of very deep models trained on multiple GPUs whose weights are openly available online, like AlexNet, VGGNet, GoogLeNet, etc. These models were trained for the ImageNet challenge, on a much larger data set and with a much larger number of classes. We use transfer learning to tweak them to perform well on CIFAR 10 data. Transfer learning comes in two flavours. The first option is to use the trained model as a feature extractor for our image data and run a softmax regression or SVM on top of it. This gave an accuracy of 55 percent, which is less than what we got by using HOG and colour histograms as feature extractors. The other and more commonly used approach is to use the previously trained model as an initialization and fine-tune some layers of that network on our data. This gave me an accuracy of 80 percent, with roughly the same computation time as the approach where I trained the CNN from scratch.
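As an illustration of the fine-tuning flavour, here is a hedged Keras 2-style sketch using VGG16 with ImageNet weights; the post doesn't say which pretrained network was used or which layers were unfrozen, so those choices (and the upsampled 64 x 64 input size) are assumptions:

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense
from keras.optimizers import SGD

# pretrained convolutional base (ImageNet weights) without the original 1000-class head;
# CIFAR-10 images would need to be upsampled from 32x32 to the chosen input size
base = VGG16(weights='imagenet', include_top=False, input_shape=(64, 64, 3))

for layer in base.layers[:-4]:        # freeze the early layers, fine-tune only the last block
    layer.trainable = False

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
out = Dense(10, activation='softmax')(x)      # new CIFAR-10 head
model = Model(inputs=base.input, outputs=out)

# a small learning rate so fine-tuning doesn't destroy the pretrained weights
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train_resized, y_train_onehot, epochs=20, batch_size=64, validation_split=0.1)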




Saturday, October 29, 2016

Text Based Information Retrieval System

As a part of the information retrieval course that I am taking in my college this semester, I had the opportunity to study the Vector Space Model (VSM) of information retrieval, among other things. As a part of an assignment, and also to reinforce my own understanding, I coded up my own search engine in python. The corpus I used was the standard Reuters corpus available in NLTK, consisting of about 10000 documents and 53000 unique tokens after normalization (case folding, stemming, lemmatization, etc.).

INDEXING:


After tokenization and normalization (done using the built-in functions of the NLTK library), the next step in the process is building an inverted index. I kept it very simple here, using a python dictionary (implemented as a hash table) keyed by term, with the corresponding value being the list of doc ids in which that term appears. This index handles normal queries effectively, but proves helpless in the case of prefix and suffix wildcard queries. To take care of those I used an additional index built using a ternary trie data structure. For taking care of spelling mistakes, I used the Levenshtein distance technique of dynamic programming. Unlike the rest of the project, which was done in python, the trie and spelling correction modules were developed in C++.
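A minimal sketch of such an index, a plain dictionary from term to a sorted posting list of doc ids (the trie and spelling correction modules are not shown):

from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> list of normalized tokens."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}   # sorted postings for easy merging

def and_query(index, terms):
    """Doc ids containing all the query terms (simple boolean AND)."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []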




The Vector Space Model (VSM):


What VSM essentially does is use some method of feature extraction to represent each document as a vector. When a query is given by the user, the query too is converted into a vector in the same space. The cosine similarity between the query and each document is then computed and the top K most similar documents are retrieved.
The vector representation of a document is built by computing tf-idf scores for each term in the document. The document vector is a V dimensional vector (where V is the size of the vocabulary of the corpus) and each term in the document is weighted by the product of its tf and idf scores. Here, tf is the term frequency, a measure of how many times the word appears in that particular document, while idf is the inverse document frequency, a measure of how rare the word is in the corpus. For example, an uncommon word like, say, 'Pillsbury' will have a high idf while common stop words like 'a', 'the', etc. will have a small idf. So by multiplying the tf and idf weights, a term gets a large weight only when a rare word appears many times in a document. And of course, each vector is length normalized at the end.
To store the idf and tf values I used a dictionary and a dictionary of tuples respectively.
Also note that most documents don't contain most words, so if you want to pre-compute the document vectors, try using a sparse matrix representation from scipy.
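A small sketch of this tf-idf weighting using a scipy sparse matrix (raw counts for tf and log N/df for idf are assumed, since the post doesn't specify the exact variants; vocab is a term-to-column-index dictionary):

import math
import numpy as np
from collections import Counter
from scipy.sparse import lil_matrix, diags

def tfidf_matrix(docs, vocab):
    """docs: list of token lists; vocab: dict term -> column index.
    Returns a sparse matrix of length-normalised tf-idf document vectors."""
    N, V = len(docs), len(vocab)
    df = Counter(t for doc in docs for t in set(doc))        # document frequency of each term
    idf = {t: math.log(N / df[t]) for t in df}               # rarer term -> larger idf
    M = lil_matrix((N, V))
    for i, doc in enumerate(docs):
        for t, tf in Counter(doc).items():                   # tf = raw count of t in this doc
            M[i, vocab[t]] = tf * idf[t]
    M = M.tocsr()
    norms = np.sqrt(M.multiply(M).sum(axis=1)).A.ravel()     # L2 norm of each document vector
    norms[norms == 0] = 1.0
    return diags(1.0 / norms).dot(M)                         # length-normalise each row

A query is vectorized the same way, and ranking by cosine similarity then reduces to a dot product between the normalised query vector and each row.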

Context Based Query Expansion using word2vec model:


Even though the tf-idf scoring scheme gives results of modest relevance, it fails to take into account the context of the query. For example, if the query contains the term ‘good’ it will only search for that particular word and give zero weight to contextually similar words like ‘positive’, ‘better’, etc. For this I used the word2vec model to retrieve documents intuitively similar to the query.

Word2vec models come in two flavours: the continuous bag of words model and the skip gram model. These models deserve, and will get, a separate post of their own, but the high level idea is that they use neural networks to learn a vector for each word such that the vector captures the context of the word, where context is defined by a window of words surrounding that word. I used the gensim package, a python library that builds word embeddings using the skip gram and continuous bag of words models. After the model was trained on the corpus it could output the top similar words for each term in the corpus. So, along with the query terms, the system also appends the top two similar words for each query term. Since the Reuters data set is much smaller than what is required to train a really good model, the top similar words are not always accurate. So even if the expansion increases recall, precision might get reduced for some queries. Hence, in order to give more weight to the original terms in the query, the weights of the additional terms are halved.
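The post doesn't show the code; a hedged sketch of the training and expansion steps with gensim might look like this (parameter values are illustrative, and older gensim versions use size= instead of vector_size= and expose most_similar directly on the model):

from gensim.models import Word2Vec

# sentences: list of token lists from the Reuters corpus (tokenised and normalised as above)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1)   # sg=1 -> skip gram

def expand_query(query_terms, topn=2, extra_weight=0.5):
    """Return (term, weight) pairs: original terms with weight 1, expansions with half weight."""
    weighted = [(t, 1.0) for t in query_terms]
    for t in query_terms:
        if t in model.wv:                                    # skip out-of-vocabulary terms
            for similar, _score in model.wv.most_similar(t, topn=topn):
                weighted.append((similar, extra_weight))
    return weighted

print(expand_query(['bad', 'crops']))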
Here are a few results after training on my corpus:


Notice that these words are not exactly synonyms (which would have been much easier to get by cross referencing an online dictionary), but words that appear in similar contexts; more precisely, words that have a higher probability of occurring in the same document. So appending these words to our query increases our chances of getting higher recall and precision.
For example, see the difference in the top query result with (In [53]) and without (In [54]) the word2vec expansion for the test query 'bad crops'.






Sunday, August 28, 2016

A High Level Guide to Artificial Neural Networks



Today let's try to have a high level view of Artificial Neural Networks, one of the most powerful machine learning methods that exist and the rabbit hole that leads to the wonderland of deep learning, arguably the hottest area in computer science today.

So let’s get straight to it. A typical architecture of an ANN looks like this:



As you can see, it consists of stacked layers of perceptrons, each having a non-linear activation function. Each perceptron, or neuron, is a computational unit that takes in inputs $x_{1}, x_{2}, x_{3}, \dots$ and 1 (the bias input) and outputs $f(W^{T}X)$, where $W$ is its corresponding weight vector and $f$ is the chosen activation function.
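In code, a single neuron is just a weighted sum followed by the activation; a tiny sketch with a sigmoid activation chosen for illustration:

import numpy as np

def neuron(x, w, f):
    """One perceptron: weighted sum of the inputs (bias folded in as w[0]*1) passed through activation f."""
    x_with_bias = np.concatenate(([1.0], x))     # prepend the constant bias input
    return f(np.dot(w, x_with_bias))             # f(W^T X)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(neuron(np.array([0.5, -1.2, 3.0]), np.array([0.1, 0.4, -0.2, 0.05]), sigmoid))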




This is a crude simulation of a biological neuron, which receives inputs from various regions through its dendrites and fires when the activation value is greater than some threshold. The inputs are weighted at the synaptic junctions, and these weights, which are dynamically changing and adjusting, are the learnable parameters of the neuron.

The first layer is the input layer and the rightmost layer is the output layer. The layers in between are called 'hidden layers' and their number is a hyperparameter that we choose beforehand depending upon the complexity of the model we wish to build. Neural networks can form very complex decision boundaries as they consist of several layers of non-linearity in the form of these hidden layers.

All the neurons in one layer are connected to the ones in the next with some weights that we wish to learn. The learning procedure starts by initializing all the weights with small random numbers. The rest of the procedure consists of successive forward passes, in which the input layer takes in the training data, each neuron computes its respective value and feeds it to the next layer, and the process continues all the way up to the output layer. Once the output layer has its prediction values, the loss is computed using a suitable loss function and is backpropagated towards the input side. Backpropagation is basically an application of the chain rule across each neuron: the gradient of the objective function with respect to each weight vector is obtained by multiplying the local gradient with the incoming gradient. With these gradients the weight vectors are updated by the gradient descent method that we discussed in the earlier posts. With the new weight vectors forward propagation happens again and the same procedure repeats. This is a very high level and informal description of what goes on in neural nets. For a concrete mathematical formulation I recommend going through this link and also this.
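To make that loop concrete, here is a compact numpy sketch of the forward-backward-update cycle for a toy single-hidden-layer network with sigmoid activations and squared-error loss (biases omitted for brevity; this is an illustration, not the formulation from the linked notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = rng.randn(8, 3)                      # 8 training examples, 3 features
y = rng.randint(0, 2, size=(8, 1))       # binary targets
W1 = 0.01 * rng.randn(3, 4)              # input -> hidden weights (small random init)
W2 = 0.01 * rng.randn(4, 1)              # hidden -> output weights
lr = 0.1

for step in range(1000):
    # forward pass
    h = sigmoid(X.dot(W1))                       # hidden layer activations
    y_hat = sigmoid(h.dot(W2))                   # network output
    loss = 0.5 * np.sum((y_hat - y) ** 2)        # could be printed to monitor training
    # backward pass (chain rule, layer by layer)
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # gradient at the output pre-activation
    dW2 = h.T.dot(delta2)
    delta1 = delta2.dot(W2.T) * h * (1 - h)      # gradient propagated back to the hidden layer
    dW1 = X.T.dot(delta1)
    # gradient descent update
    W1 -= lr * dW1
    W2 -= lr * dW2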

Deep learning, which has shown mind-blowing results in the past few years, is a field built upon ANNs. It basically deals with modelling neural networks in ways that scale to a large number of hidden layers. I will try to cover some of these cool models in my future posts. :-)




Wednesday, August 17, 2016

Object Based Image Classification and Change Detection


In the course of my internship at Bhaskaracharya Institute of Space Application and Geoinformatics (BISAG) at Gandhinagar I had the chance to work on an interesting application of machine learning and image processing in remote sensing. The aim of our project was object based classification of multiband satellite images.
Let me first give you a high level understanding of what object based image classification means. The problem of land cover classification has been a very busy research area in GIS. Traditionally, it was tackled by an approach called 'pixel based classification'. While training, this method looks at individual pixels independently and classifies them purely on the basis of their spectral attributes. Though this method works modestly well, it is not as good for high resolution data because it completely ignores the spatial attributes. For example, a blue pixel from a water body and a blue pixel from a car will be looked at in the same way in this approach. A newer approach is 'object based image classification', which I will walk you through in this post.
The first and key step, after some pre-processing, was image segmentation. This step groups similar pixels into clumps called superpixels or segments in an unsupervised manner. We first tried K-means but moved on to the more sophisticated segmentation algorithms built into python's skimage library, like quickshift segmentation. This is how an image looks after segmentation:

With K-means:




With Felzenszwalb's and Quickshift:




These segments will be our objects and serve as a heuristic for grouping similar pixels of a single object. The labelled training data that we had was in the form of something called shape files. A shape file is basically vector data: lines, polygons, etc. So our shape files marked certain areas in the image and labelled each pixel in those regions. We mapped each labelled pixel to its corresponding segment and thus defined the training regions in the image. Another challenge was the data representation of segments, because each segment consisted of a variable number of pixels. We got around this by using the scipy package to compute six statistics for each segment and using these six values as the attributes representing that segment. So once the training regions are isolated and we have a concise representation for each segment, we can finally deploy our ML classifier to learn from them. We used SVMs with linear and RBF kernels, and random forests. The maximum accuracy of 97% was seen with the random forest classifier.
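A rough end-to-end sketch of the pipeline with skimage and scikit-learn; the shapefile rasterisation step is omitted (a labelled pixel mask is assumed to exist already), and the six per-segment statistics are assumed here to be the per-band mean and standard deviation, since the post doesn't list the ones actually used:

import numpy as np
from skimage.segmentation import quickshift
from sklearn.ensemble import RandomForestClassifier

# image: H x W x 3 array; labelled_mask: H x W array with a class id per labelled pixel, 0 elsewhere
segments = quickshift(image, kernel_size=3, max_dist=6, ratio=0.5)   # unsupervised segmentation

def segment_features(image, segments, seg_id):
    """Six statistics per segment: mean and standard deviation of each of the three bands."""
    pixels = image[segments == seg_id]                # all pixels belonging to this segment
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])

X_train, y_train = [], []
for seg_id in np.unique(segments):
    labels = labelled_mask[segments == seg_id]
    labels = labels[labels > 0]
    if labels.size:                                   # segment overlaps a labelled training region
        X_train.append(segment_features(image, segments, seg_id))
        y_train.append(np.bincount(labels).argmax())  # majority label of the segment

clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

# classify every segment in the image
all_ids = np.unique(segments)
predictions = clf.predict([segment_features(image, segments, s) for s in all_ids])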
The coolest thing about this project was that we built a machine learning model out of just a single image. I learnt a lot in this project and also got an opportunity to get a flavor of the remote sensing area and exposure to the GIS field, which I probably would not have gotten anywhere else.

Sunday, June 12, 2016

SUPERVISED CLASSIFICATION I: LOGISTIC REGRESSION



Today let's look at one of the most fundamental methods of classification in machine learning and statistics: logistic regression.

To understand a general supervised classification problem, let us take the example of spam classification. Our training data consists of a number of emails, and since it is a supervised problem, we also know whether each is spam (Y=1) or ham (not spam, Y=0). For an n dimensional feature vector X, we might have each dimension corresponding to a word, being 0 or 1 depending on whether or not that word is present in the mail. So a spam mail may have 1s in the dimensions corresponding to words like 'offer', 'free', 'Viagra', etc. Our aim will be to train our weight vector $\theta$ and decide some threshold t such that:

hypothesis: $h_{\theta}(x) > t$        for all spam
            $h_{\theta}(x) \leq t$     for all ham

The value of the threshold is determined by the particular application, depending on the tradeoff we want between false positives and false negatives. For concreteness let's assume the threshold to be 0.5. After training our weight vector we move on to the test data, model $P(Y|X,h_{\theta})$ and classify our samples.


HYPOTHESIS FUNCTION:


Since the hypothesis models the probability of a particular X falling in a category Y given $\theta$, it makes sense for it to be a value bounded between 0 and 1. The hypothesis that we used in linear regression was $\theta^{T}X$. To keep it bounded, in logistic regression we pass it through an activation function called the sigmoid function, which is defined as:

$sigmoid(z) = \frac{1}{1+e^{-z}}$

So our hypothesis becomes:

$h_{\theta}(x)=sigmoid(\theta^{T}X)= \frac{1}{1+e^{-\theta^{T}X}}$

Graphically it looks something like this:





COST FUNCTION:

In linear regression we used the average squared error between the hypothesis and the labels as our cost function. But since our hypothesis here is non-linear, that choice would not give us a nice convex cost function. To make it convex, the cost function $J(X|Y, h_{\theta}(X))$ is defined as follows in logistic regression:

When Y=1:

$J (X| Y, h_{\theta}(X))= -log(h_{\theta}(X))$ 


When Y=0:

$J (X| Y, h_{\theta}(X))= -log(1-(h_{\theta}(X))) $


This can be written in a more compact form like this:
$J(X|Y, h_{\theta}(X)) = -[Y\log(h_{\theta}(X)) + (1-Y)\log(1-h_{\theta}(X))]$

Plot the cost function separately for Y=1 and Y=0 with the hypothesis function and try to convince yourself why it should make sense.

Once you have the cost function, learning the weight vector boils down to a pure optimization problem. To solve it you can either use gradient descent, as discussed in the linear regression post, or avoid getting your hands dirty and piggyback on one of the off-the-shelf optimization solvers available.
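For the gradient descent route, a bare-bones vectorised sketch might look like the following (the off-the-shelf alternative would be something like scipy.optimize.minimize or sklearn's LogisticRegression):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, iters=5000):
    """X: (m, n) design matrix with a leading column of 1s; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X.dot(theta))                 # hypothesis for all examples at once
        grad = X.T.dot(h - y) / m                 # gradient of the cross-entropy cost
        theta -= alpha * grad                     # gradient descent step
    return theta

def predict(X, theta, threshold=0.5):
    return (sigmoid(X.dot(theta)) > threshold).astype(int)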


REGULARIZATION:

There is a problem with what we have discussed thus far. If the number of features is large, this can build a very complicated, convoluted model with very large weight values so as to perfectly fit the training data. This might sound good, but the problem is that our model will be tuned too closely to the training data and will not generalize well to the test data, which, as a matter of fact, is our main objective. To do away with this problem we use the method of regularization, which puts a penalty on the magnitude of the weight vector and does not let the model overfit.
This is done by modifying our cost function as follows:

$J(X|Y, h_{\theta}(X)) = -[Y\log(h_{\theta}(X)) + (1-Y)\log(1-h_{\theta}(X))] + \frac{\lambda}{2m}\sum_{j}\theta_{j}^{2}$

The last part is the regularization term and $\lambda$ is called the regularization parameter.

The value of $\lambda$ should be selected wisely. If it is too high you penalize the weights too much and the model does not fit even the training data properly (underfitting). On the flip side, if you are too cautious and select a very small value for $\lambda$, regularization hardly does what it is supposed to and you have overfitting.




Wednesday, May 25, 2016

Decision Trees


Decision trees, as the name suggests, have a hierarchical tree structure consisting of two types of nodes: internal decision nodes and prediction nodes (which are basically the leaf nodes of the tree). Graphically it looks something like this:







To get a flavor of how this works, assume that someone has built a tree for you. Now, how do you make a prediction? Your problem, like any generic machine learning problem, is to predict $y_{i}$ given $x_{i}$. So, to get your answer, you drop each $x_{i}$ down the tree (the path dictated by the decision nodes) until you finally hit a leaf/prediction node, which gives you your $y_{i}$.

Looks pretty straightforward? It is. But the real problem lies not in making predictions from an already constructed tree, but in actually building or learning the tree. The basic idea behind the construction of a decision tree is that you go on dividing your feature space into regions using the decision nodes till you are left with pure regions populated (almost) homogeneously with one type of label. So building a decision tree can be formulated as a recursive function in which we first check if our pre-formulated stopping criterion is met. If it is met, we make a prediction; if not, we use some method to split the node into two parts. Then we recurse on the left and right children of the node to do a similar analysis and eventually get our tree.

So the three things we need to know for building a decision tree are:

1) How to decide a stopping criterion:
This does not have as concrete an answer as you might be looking for. It varies from problem to problem, but there are two basic ideas that give us a heuristic:
·  When the leaf is pure, i.e. the variance of the target variable is less than some threshold $\epsilon$ that you decide.
·  When the number of examples coming to that node becomes lower than some value.

2) How to find the best split:
To answer this question we tend to use a different strategy while dealing with categorical and continuous variables:

·  Classification (INFORMATION GAIN):
In case of classification problems we use the concept of Information Gain. To understand this concept we first need to understand ‘entropy’. 
The entropy of X is basically a measure of how spread out and noisy its distribution is. If X is uniform it has a high entropy, while if the distribution of X is full of peaks and valleys it has a low entropy. Mathematically, entropy is defined as follows:

Entropy of X = $H(X)= -\sum_{j=1}^{m} p_{j}\log p_{j}$

where $p_{j}$ is the probability of X taking the $j^{th}$ value, with j iterating over all the distinct values X can take.

So physically the entropy of X is just a measure of how hard it is to predict the value of X.

Now that we have an idea of what entropy basically is let’s delve a little bit further into it and define the following:
§  Specific conditional entropy H(Y|X=v): the entropy of Y computed only over records where X has the value v.
§  Conditional entropy H(Y|X): the average of the specific conditional entropies of Y, weighted by P(X=v).
Now that we have all the tools let us finally get down to the business of information gain.

Information Gain = $IG(Y|X) = H(Y) - H(Y|X)$

So information gain basically tells us how many bits we would save while transmitting Y from one end to the other if the other end knew X ahead of time.
So, while splitting, we can compute the information gain for each feature $X_{i}$, rank the features in descending order, and prefer the ones with the highest information gain (a small code sketch of this computation follows after the regression case below).

·   Regression:
In the case of continuous target variables we use what we call the 'impurity'. A node D is split into subtrees $D_{L}$ and $D_{R}$ so as to maximize the drop in impurity, given by:

$|D|\cdot Var(D)-|D_{R}|\cdot Var(D_{R})-|D_{L}|\cdot Var(D_{L})$
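To make the classification criterion concrete, here is a small self-contained sketch of the entropy and information gain computations for a categorical feature (toy data, base-2 logs):

import math
from collections import Counter

def entropy(values):
    """H(X) = -sum_j p_j log2 p_j over the observed values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(y, x):
    """IG(Y|X) = H(Y) - H(Y|X) for a categorical feature x and labels y."""
    n = len(y)
    cond = 0.0
    for v in set(x):                                   # specific conditional entropies H(Y|X=v)
        subset = [yi for yi, xi in zip(y, x) if xi == v]
        cond += (len(subset) / n) * entropy(subset)    # weighted by P(X=v)
    return entropy(y) - cond

# toy example: 'outlook' is fairly informative about 'play'
play    = ['no', 'no', 'yes', 'yes', 'yes', 'no']
outlook = ['sunny', 'sunny', 'overcast', 'rain', 'overcast', 'rain']
print(information_gain(play, outlook))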

3) How to predict:

This one is more obvious than the rest. If you have categorical labels, just predict the value that the majority of the examples at that particular leaf node have. In the case of continuous variables you may take the average value of the examples. Another way would be to fit a small regression model to the examples at that node.

So decision trees are quite a handy large scale machine learning method, especially for multi-class classification. However, the downside is that they are prone to overfitting when the amount of training data is small. To mitigate this problem there is another method called 'Random Forests', which basically averages a number of randomized decision trees to reduce the variance.




Wednesday, April 13, 2016

NAÏVE BAYES


Now that we have seen linear regression, which is a discriminative model, let us look at Naïve Bayes to get a flavor of what a typical generative machine learning model looks like. So how is a generative model different from a discriminative one? To give a crude description: a generative model, while classifying, models the whole data distribution, unlike a discriminative model, which discriminates only on the basis of a decision boundary. The coolest thing that I find about a generative model is that it also allows us to generate 'synthetic observations'.

So coming to the Naïve Bayes classifier: it is a Bayesian probabilistic classifier and assigns a given x to the most probable class given the observation.

$\hat{y} = \arg\max_{y} P(y|x)$

Where P(y|x)= $\frac{P(x|y)P(y)}{\sum_{y’}P(x|y’)P(y’)}$

·         P(x|y) is called the class model and tells us how probable x is, assuming y to be the correct label.

·         P(y) is the prior and scales up or down P(y|x) according to the general probability of y happening.

·      The denominator does not have any role in classifying a single example, as it basically is a constant that normalizes the probabilities across the classes. It, however, becomes crucial if you wish to rank a set of x in their probabilities of belonging to a class.


INDEPENDENCE ASSUMPTION:

So far everything seems to be good. But notice that while calculating P(x|y) what we are really calculating is $P(x_{1},x_{2},\dots,x_{d}|y)$, where x consists of d features.
By chain rule this can be simplified as:

$P(x_{1},x_{2},\dots,x_{d}|y) = \prod_{i=1}^{d}P(x_{i}|x_{1},\dots,x_{i-1},y)$

Even for a modest sized feature vector these calculations explode, with so many possible combinations. The way we get around this is to make a slightly bold assumption: that the different features of x are conditionally independent given y. Therefore, our probability boils down to:

$P(x_{1},x_{2},\dots,x_{d}|y) = \prod_{i=1}^{d}P(x_{i}|x_{1},\dots,x_{i-1},y)$ ………..(chain rule)
                                        $\simeq \prod_{i=1}^{d} P(x_{i}|y)$.........(independence assumption)

As hinted earlier, the conditional independence can prove to be a costly assumption. Because of the assumption, if the only thing that separates two classes is the way in which the attributes are correlated, then Naïve Bayes will prove to be helpless. Feature dependence is a very important instrument in teasing out information from data, especially in applications like spam detection.
However, in examples in which the features are intuitively independent, the conditional independence assumption can actually prove to be an asset. For example, in cases where the value of one of the dimensions of an $x_{j}$ is not known, you don't need to scrap the whole x altogether. You can just ignore the missing dimension and go about training your model.

IMPLEMENTATION IN PYTHON:
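The implementation in the original post is shown as screenshots, which aren't reproduced here; the following is a minimal stand-in sketch using scikit-learn's Gaussian Naïve Bayes on the Iris data (my choice of data set and class model, not necessarily the one used in the post):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# each feature is modelled independently given the class (the naive assumption),
# here with a per-class Gaussian P(x_i | y)
clf = GaussianNB().fit(X_train, y_train)
print(clf.score(X_test, y_test))          # accuracy on held-out data
print(clf.predict_proba(X_test[:3]))      # per-class posteriors P(y | x)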









Saturday, March 26, 2016

Finding Similar Articles From A Massive Dataset


One of the most typical problems in data mining is that of finding similar sets in a huge data set. These can be similar text articles, web pages, movie ratings, or any other data for that matter. For the sake of concreteness let us consider similarity between text articles for this post. The process of finding similar articles can be broadly broken down into three steps:

1)      Shingling

2)      Minhashing

3)      Locality sensitive hashing

STEP 1: BREAKING DOWN THE ARTICLE INTO K-SHINGLES

A k-shingle is basically a substring of the article of length k. For example the set of 3-shingles for the string "Thug life" is {"Thu", "hug", "ug ", "g l", " li", "lif", "ife"}. The value of k should be large enough that the probability of a given shingle appearing in a random article is considerably small. Also, while selecting k one should take into account how long a typical article in the corpus is. For example if our corpus is something like e-mails then we might take k=5, while if we are dealing with research papers we might find it safe to take k=9.
It is not very convenient to manipulate and store raw strings, so the set of shingles is generally converted into a sparse matrix called the characteristic matrix, in which columns correspond to the documents we have and rows to all the shingles that appear anywhere in the corpus.
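A tiny sketch of the shingling step and of building the characteristic matrix for a toy corpus:

def shingles(text, k):
    """All length-k substrings of the document (its set of k-shingles)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

docs = ["a rose is a rose is a rose", "a rose is a flower"]
shingle_sets = [shingles(d, 5) for d in docs]

# characteristic matrix: rows = all shingles seen in the corpus, columns = documents
vocabulary = sorted(set.union(*shingle_sets))
characteristic = [[int(s in ss) for ss in shingle_sets] for s in vocabulary]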



There are a lot of other intricacies and tricks involved in selecting the shingles and then manipulating them so that they become easy to handle, which I may touch upon in future posts, but the general motivation behind this concept is that "DOCUMENTS THAT ARE INTUITIVELY SIMILAR WILL HAVE MANY SHINGLES IN COMMON".

STEP 2: MINHASHING

Before getting into minhashing, it is important to understand what the most convenient definition of similarity is while finding similar textual articles.

Jaccard similarity: It defines the similarity of two columns $C_{1}$ and $C_{2}$ as:
$Similarity(C_{1}, C_{2})= \frac{|C_{1} \cap C_{2}|}{|C_{1}\cup C_{2}|}$

The basic goal of minhashing is to compress the large characteristic matrix (obtained after shingling) to short signatures keeping the (Jaccard) similarity of columns intact.

The way this method works is as follows. Imagine the rows of the characteristic matrix to be permuted randomly. For illustration purposes, imagine that the rows of the following matrix are permuted such that the new serial number of each row is the number in purple adjacent to it.




The row of the signature matrix corresponding to this permutation is the one on the right. It is computed by going through the rows in the permuted order and, for each column that has a 1 and is not yet filled, writing down the current row number, until all the columns are filled. By choosing different hash functions to generate different random permutations, more rows (for example 100) can be added to the signature matrix as follows:
The most interesting thing about this technique is that it can be proved that the probability of two columns getting the same minhash value is equal to the Jaccard similarity of those columns. Thus the expected similarity of two columns of the signature matrix is equal to the Jaccard similarity of the corresponding columns in the input matrix, and the longer the signatures, the smaller the expected error.



All this looks good in theory. However, actually permuting data sets of even modest sizes (say 1 billion rows) would take too much time and space. So we try to be clever here and simulate permutations without actually permuting the rows. We pick a different hash function $h_{i}$ for each simulated permutation and pretend that the position of row r in that permutation is $h_{i}(r)$. We keep a slot M(i, c) for each column c and hash function $h_{i}$.

Observe the following code (the pseudo code written out in python) and notice that the intent is that M[i][c] will become the smallest value of $h_{i}(r)$ over the rows r in which column c has a 1 ($h_{i}(r)$ gives the position of row r in the $i^{th}$ permutation):

# M is initialised with every entry set to infinity
for r in range(num_rows):
    hashes = [h(r) for h in hash_funcs]          # compute h_i(r) for every hash function
    for c in range(num_cols):
        if characteristic[r][c] == 1:            # column c has a 1 in row r
            for i, h_r in enumerate(hashes):
                if h_r < M[i][c]:
                    M[i][c] = h_r

STEP 3: LOCALITY SENSITIVE HASHING (LSH)

Even after using minhashing to compute signatures, comparing all possible pairs of columns can be expensive. This is where LSH comes into the picture. What LSH basically does is generate, from all possible pairs of signatures, a small list of candidate pairs that have a higher chance of being similar.
This is done by dividing the matrix M into b bands of r rows each as shown below:



For each band, hash its portion of each column into a hash table with k buckets. Candidate column pairs are those that hash into the same bucket for at least one band. Of course there are bound to be a few false positives and false negatives, but we can always tune b and r to catch most of the similar pairs and few of the non-similar pairs.
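A rough sketch of the banding step, assuming a signature matrix sig stored as a list of n rows (n = b * r) with one value per document, and using Python's built-in tuple hashing in place of a per-band hash function:

from collections import defaultdict
from itertools import combinations

def lsh_candidates(sig, b, r):
    """Return pairs of document indices that land in the same bucket for at least one band."""
    num_docs = len(sig[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(num_docs):
            # the portion of column c falling in this band, used as the bucket key
            band_slice = tuple(sig[band * r + i][c] for i in range(r))
            buckets[band_slice].append(c)
        for cols in buckets.values():            # columns sharing a bucket in this band
            candidates.update(combinations(sorted(cols), 2))
    return candidates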


Friday, March 11, 2016

Linear regression and implementation in python

The problem statement of any machine learning application can be boiled down to this: we are given some data, with or without labels attached, and we have to formulate a hypothesis function that comes closest to the labels of any new samples that might be given to us after training our algorithm.
  • Supervised learning: The training data has labels attached to it and can be represented as tuples $(x_{i},y_{i})$, where i goes from 0 to the number of examples provided (say m) and each $x_{i}$ is an n dimensional vector with each dimension corresponding to a feature that influences the label y.
  • Unsupervised learning: The training data is unlabeled and represented as ($x_{i}$).
The way we go about building the model in linear regression and most supervised learning algorithms is that some weight $(t_{j})$ is assigned to each dimension of x and a hypothesis function 
$h_{t}(x_{i})$ is calculated. The aim is to solve our machine learning problem like an optimization problem where we evaluate the cost function ( J(t) ) for different values of t and find that particular t value that minimizes the cost function.

$J(t)=\frac{\sum_{i=1}^{m}[h_{t}(x_{i})-y_{i}]^{2}}{2m}$



The hypothesis function (which is fixed for a particular model) for linear regression is given by:


$h_{t}(x) = t_{0} + t_{1}x_{1} + t_{2}x_{2} + \dots + t_{n}x_{n}$

It is easier to deal with the vectorized form, which reduces the hypothesis function to:
$h_{t}(x)=t'*x$
where t and x are n+1 dimensional column vectors (with $x_{0}=1$) and t' is the transpose of t.

Gradient descent:

One of the optimization techniques that we might use for finding the optimum weight vector is gradient descent. The pseudo code for gradient descent for multivariate linear regression can be written as:








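The update rule described by that pseudo code, in its standard form (not reproduced from the original image), is applied to every $\theta_{j}$ simultaneously and repeated until convergence:

$\theta_{j} := \theta_{j} - \alpha \frac{\partial J(\theta)}{\partial \theta_{j}} = \theta_{j} - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_{\theta}(x_{i})-y_{i}\big)\,x_{i,j}$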
Here $\theta_{j}$ is nothing but the jth weight $(t_{j})$ and $\alpha$ is called the learning rate of the algorithm.

Intuitively, after each iteration the algorithm moves through the n dimensional space of the cost function in the direction of descent till it reaches a minimum. The magnitude of each step is proportional to the learning rate $\alpha$ and the derivative of the cost function with respect to t. Clearly, gradient descent converges when the derivative of the cost function becomes zero.



For effectively applying gradient descent two things should be kept in mind:
  • Feature scaling and mean normalization: By feature scaling we mean that all the features should take a similar range of values, and by mean normalization that each feature should have approximately zero mean. In practice both are done by replacing each $x_{i}$ by $(x_{i}-\mu)/\sigma$, where $\mu$ is the mean and $\sigma$ the standard deviation of that feature.
  • Choosing the learning rate parameter: If the learning rate parameter is too large the gradient descent might not settle at the global minimum while if it is too small the algorithm will take a lot of time to converge.

Implementation in python using sklearn library:





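The screenshots of the implementation aren't reproduced here; a sketch along the same lines, assuming the advertising data sits in a CSV with TV, Radio and Sales columns (file name hypothetical):

import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv('Advertising.csv')            # hypothetical path to the data set
X = data[['TV', 'Radio']]                        # the two features
y = data['Sales']                                # the target

lm = LinearRegression().fit(X, y)
print(lm.intercept_)                             # t0
print(lm.coef_)                                  # t1, t2 for TV and Radio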
Here 'TV' and 'Radio' are the two features and 'Sales' is what we are predicting in our data set 'data'. The weights t1 and t2 are stored by default in the attribute coef_ and t0 is stored in intercept_.

Now that we have trained the weights, let's go on and make predictions on the test data set.





The R2 score is a statistical measure that tells us how well the regression line fits the data; an R2 score of 1 indicates a perfect fit. We are getting an R2 score of 0.897, which is pretty decent. However, here we have taken the training set itself as the test set and are finding the RMSE between the predicted and the actual values. This does not give us a very accurate idea of our model, as it might be over-fitting and may not perform as well on new test samples given to us later.

A better way to build the model is to split the training data and reserve some portion of it as a test set.
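Continuing the earlier sketch, the split-and-evaluate step might look like this (test fraction and random seed are arbitrary):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# X and y as in the previous sketch (TV and Radio columns, Sales target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
lm = LinearRegression().fit(X_train, y_train)
print(r2_score(y_test, lm.predict(X_test)))      # R2 on data the model has not seen during training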



        
Notice that the R2 score has decreased a bit. But a score of 0.86 is still pretty good!