Saturday, March 26, 2016

Finding Similar Articles From A Massive Dataset


One of the most common problems in data mining is finding similar sets in a huge data set. The items can be text articles, web pages, movie ratings, or any other data for that matter. For the sake of concreteness, let us consider similarity between text articles in this post. The process of finding similar articles can be broadly broken down into three steps:

1) Shingling

2) Minhashing

3) Locality sensitive hashing

STEP 1: BREAKING DOWN THE ARTICLE INTO K-SHINGLES

A k-shingle is simply a substring of the article of length k. For example, the set of 3-shingles for the string “Thug life” is {“Thu”, “hug”, “ug ”, “g l”, “ li”, “lif”, “ife”}. The value of k should be large enough that the probability of any given shingle appearing in any given article is small. One should also take into account how long a typical document in the corpus is: if our corpus is something like e-mails we might take k = 5, while for research papers we might find it safe to take k = 9.
It is not very convenient to manipulate and store raw strings, so the set of shingles is generally converted into a sparse matrix called the characteristic matrix, in which each column corresponds to a document and each row to one of the possible shingles; the entry in row r and column c is 1 if document c contains shingle r, and 0 otherwise.
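As a rough illustration, here is a minimal Python sketch of extracting k-shingles from a string (the function name k_shingles is just a placeholder for this post):

def k_shingles(text, k):
    # every substring of length k, kept as a set so duplicates collapse
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(k_shingles("Thug life", 3))
# {'Thu', 'hug', 'ug ', 'g l', ' li', 'lif', 'ife'}  (in some order)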



There are a lot of other intricacies and tricks involved in selecting the shingles and then manipulating them so that they become easy to handle, which I may touch upon in future posts, but the general motivation behind this concept is that “DOCUMENTS THAT ARE INTUITIVELY SIMILAR TEND TO HAVE MANY SHINGLES IN COMMON”.

STEP 2: MINHASHING

Before getting into minhashing, it is important to understand the most convenient definition of similarity for comparing textual articles.

Jaccard similarity: The Jaccard similarity of two columns $C_{1}$ and $C_{2}$ (viewed as the sets of rows in which they have a 1) is defined as:
Similarity$(C_{1}, C_{2})= \frac{|C_{1} \cap C_{2}|}{|C_{1}\cup C_{2}|}$
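For example, if $C_{1}=\{a, b, c, d\}$ and $C_{2}=\{c, d, e\}$, then $C_{1}\cap C_{2}=\{c, d\}$ and $C_{1}\cup C_{2}=\{a, b, c, d, e\}$, so Similarity$(C_{1}, C_{2})=\frac{2}{5}=0.4$.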

The basic goal of minhashing is to compress the large characteristic matrix (obtained after shingling) into short signatures while preserving the (Jaccard) similarity of the columns.

The way this method works is as follows. Imagine that the rows of the characteristic matrix are permuted randomly. For illustration, imagine that the rows of the following matrix are permuted so that the serial number of each row is the number in purple adjacent to it.




The row of the signature matrix corresponding to this permutation is the one on the right. It is computed column by column: for each column, the signature entry is the number of the first row (in the permuted order) in which that column has a 1. By choosing different hash functions to generate different random permutations, more rows (for example 100) can be added to the signature matrix as follows:
The most interesting thing about this technique is that it can be proved that the probability (over a random permutation) that two columns get the same minhash value is exactly the Jaccard similarity of those columns. Thus the expected similarity of two signature columns equals the Jaccard similarity of the corresponding columns in the input matrix, and the longer the signatures, the smaller the expected error.



All this looks good in theory. However, actually permuting data sets of even modest size (say a billion rows) would take too much time and space. So we try to be clever and simulate the permutations without actually permuting the rows. We pick a number of hash functions, one per row of the signature matrix we want to build, and pretend that the position of row r in the $i^{th}$ permutation is $h_{i}(r)$. We keep a slot M(i, c) for each hash function $h_{i}$ and each column c.

Observe the following pseudo code and notice that the intent is that M(i, c) ends up as the smallest value of $h_{i}$(r) over the rows r in which column c has a 1 ($h_{i}$(r) gives the position of row r in the $i^{th}$ permutation):

Initialize M(i, c) := infinity for all i and c
For each row r do begin
    For each hash function $h_{i}$ do
        Compute $h_{i}$(r);
    For each column c
        If c has 1 in row r
            For each hash function $h_{i}$ do
                If $h_{i}$(r) < M(i, c) then
                    M(i, c) := $h_{i}$(r)
end
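The same idea as a rough Python sketch; the hash functions of the form (a·r + b) mod p used here are just one common way to simulate random permutations, and all names are placeholders:

import random

def minhash_signatures(columns, num_rows, num_hashes=100):
    # columns: dict mapping a column (document) id to the set of row indices where it has a 1
    p = 2147483647  # a large prime; (a*r + b) % p acts like a random permutation of the rows
    coeffs = [(random.randrange(1, p), random.randrange(0, p)) for _ in range(num_hashes)]
    # M[c][i] starts at infinity and ends up as the smallest h_i(r) over rows r where c has a 1
    M = {c: [float('inf')] * num_hashes for c in columns}
    for r in range(num_rows):
        h = [(a * r + b) % p for a, b in coeffs]
        for c, rows in columns.items():
            if r in rows:
                for i in range(num_hashes):
                    if h[i] < M[c][i]:
                        M[c][i] = h[i]
    return M

The estimated Jaccard similarity of two columns is then simply the fraction of positions at which their signatures agree.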

STEP 3: LOCALITY SENSITIVE HASHING (LSH)

Even after using minhashing to compute signatures, comparing all possible pairs of columns can be expensive. This is where LSH comes into the picture. From all possible pairs of signatures, LSH generates a small list of candidate pairs that have a higher chance of being similar.
This is done by dividing the matrix M into b bands of r rows each as shown below:



For each band, hash its portion of each column to a hash table with k buckets. Candidate column pairs are those that hash into the same bucket for at least one band. Of course there are bound to be a few false positives and false negatives, but we can tune b and r to catch most of the similar pairs and few of the dissimilar ones.
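A minimal sketch of the banding trick in Python, applied to the signatures from the previous step (here each distinct band value simply gets its own bucket instead of one of k fixed buckets; names are placeholders):

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    # signatures: dict mapping a column id to its list of b*r minhash values
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c, sig in signatures.items():
            # columns whose signatures agree on this whole band land in the same bucket
            buckets[tuple(sig[band * r:(band + 1) * r])].append(c)
        for cols in buckets.values():
            candidates.update(combinations(sorted(cols), 2))
    return candidates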


Friday, March 11, 2016

Linear regression and implementation in Python

The problem statement of any machine learning application can be boiled down to this: we are given some data with or without labels attached, and we have to formulate a hypothesis function that comes closest to the true labels of any other data samples that might be given to us after training our algorithm.
  • Supervised learning: The training data has labels attached to it and can be represented as tuples $(x_{i},y_{i})$, where i goes from 1 to the number of examples provided (say m) and each $x_{i}$ is an n-dimensional vector whose dimensions correspond to the features that influence the label $y_{i}$.
  • Unsupervised learning: The training data is unlabeled and represented as ($x_{i}$).
The way we build the model in linear regression, and in most supervised learning algorithms, is that a weight $t_{j}$ is assigned to each dimension of x and a hypothesis function $h_{t}(x_{i})$ is computed. The aim is to treat the machine learning problem as an optimization problem: we evaluate a cost function J(t) for different values of t and find the particular t that minimizes it.

$J(t)=\frac{1}{2m}\sum_{i=1}^{m}\left[h_{t}(x_{i})-y_{i}\right]^{2}$



The hypothesis function (which is fixed for a particular model) for linear regression is given by:


$h_{t}(x)=t_{0} + t_{1}x_{1} + t_{2}x_{2} + \dots + t_{n}x_{n}$

It is easier to deal with the vectorized form, which reduces the hypothesis function to:
$h_{t}(x)=t'x$
where t and x are (n+1)-dimensional column vectors (with $x_{0}=1$ for the intercept term) and t' is the transpose of t.
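For instance, with NumPy the vectorized hypothesis is just a dot product (the numbers below are made up purely for illustration):

import numpy as np

t = np.array([1.0, 2.0, 3.0])    # weights t0, t1, t2
x = np.array([1.0, 0.5, -1.0])   # features, with x0 = 1 to account for the intercept t0
h = np.dot(t, x)                 # h_t(x) = t'x = 1*1 + 2*0.5 + 3*(-1) = -1.0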

Gradient descent:

One of the optimization techniques that we might use for finding the optimum weight vector is gradient descent. The pseudo code for gradient descent for multivariate linear regression can be written as:








Here $\theta_{j}$ is nothing but the jth weight $t_{j}$, and α is called the learning rate of the algorithm.
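Written out, the update performed in each iteration (simultaneously for every j, repeated until convergence) is the standard rule:

$\theta_{j} := \theta_{j} - \alpha \frac{1}{m}\sum_{i=1}^{m}\left[h_{\theta}(x_{i})-y_{i}\right]x_{i,j}$

where $x_{i,j}$ is the jth feature of the ith training example.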

Intuitively, after each iteration the algorithm moves through the parameter space in the direction of descent of the cost function until it reaches the minimum. The size of each step is proportional to the learning rate α and to the derivative of the cost function with respect to t. Clearly, gradient descent converges at the minimum, where the derivative of the cost function becomes zero.



For effectively applying gradient descent two things should be kept in mind:
  • Feature scaling and mean normalization: By feature scaling we mean that all the features should take values in a similar range. Mean normalization means that each feature should have approximately zero mean. In practice this is done by replacing every value $x_{i}$ of a feature with $(x_{i}-\mu)/\sigma$, where µ is the mean and σ is the standard deviation of that feature (see the sketch after this list).
  • Choosing the learning rate: If the learning rate α is too large, gradient descent may overshoot and fail to converge to the minimum, while if it is too small the algorithm will take a long time to converge.
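A minimal NumPy sketch of this preprocessing (sklearn's StandardScaler does the same standardization; the numbers are made up):

import numpy as np

X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])       # each row is an example, each column a feature
mu = X.mean(axis=0)                 # per-feature mean
sigma = X.std(axis=0)               # per-feature standard deviation
X_scaled = (X - mu) / sigma         # every feature now has mean ~0 and unit variance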

Implementation in Python using the sklearn library:
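A minimal sketch of the training step, assuming the data set lives in a CSV file with columns 'TV', 'Radio' and 'Sales' (the file name Advertising.csv and the variable names below are placeholders):

import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.read_csv('Advertising.csv')    # assumed file name; any source with these columns works

X = data[['TV', 'Radio']]                # feature columns
y = data['Sales']                        # target column

linreg = LinearRegression()
linreg.fit(X, y)

print(linreg.intercept_)                 # t0
print(linreg.coef_)                      # t1, t2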





Here ‘TV’ and ‘Radio’ are the two features and ‘Sales’ is what we are predicting in our data set 'data'. The weights t1 and t2 are stored by default in the attribute coef_, and t0 is stored in intercept_.

Now that we have trained the weights, let's go on to make predictions on the test data set.
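A sketch of this evaluation step, using functions from sklearn.metrics (the training data X, y and the fitted model linreg from the sketch above are reused here):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = linreg.predict(X)                        # predictions on the training set itself
rmse = np.sqrt(mean_squared_error(y, y_pred))     # root mean squared error
r2 = r2_score(y, y_pred)                          # R2 score, reported as about 0.897 below
print(rmse, r2)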





The R2 score is a statistical measure of how well the regression line fits the data; an R2 score of 1 indicates a perfect fit. We are getting an R2 score of 0.897, which is pretty decent. However, here we have taken the training set itself as the test set and are computing the RMSE between the predicted and actual values. This does not give a very accurate picture of our model: it might be over-fitting and may not perform as well on new test samples given to us later.

A better way to build the model is to split the training data and reserve some portion of it as a test set.
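A sketch of that approach with sklearn's train_test_split (the import path below is the one used in current sklearn versions; older versions had it under sklearn.cross_validation):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# hold out a portion of the data as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

linreg = LinearRegression()
linreg.fit(X_train, y_train)
print(r2_score(y_test, linreg.predict(X_test)))   # around 0.86 in the run described below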



        
Notice that the R2 score has decreased a bit. But a score of 0.86 is still pretty good!