Saturday, March 26, 2016

Finding Similar Articles From A Massive Dataset


One of the most common problems in data mining is finding similar sets in a huge dataset. The sets could be text articles, web pages, movie ratings, or any other data for that matter. For the sake of concreteness, let us consider similarity between text articles in this post. The process of finding similar articles can be broadly broken down into three steps:

1) Shingling

2) Minhashing

3) Locality sensitive hashing

STEP 1: BREAKING DOWN THE ARTICLE INTO K-SHINGLES

A k-shingle is basically a substring of the article of length k, taken over every window of k consecutive characters. For example, the set of 3-shingles for the string “Thug life” is {“Thu”, “hug”, “ug ”, “g l”, “ li”, “lif”, “ife”}. The value of k should be large enough that the probability of a given shingle appearing in an arbitrary article is fairly small. One should also take the nature and length of typical documents in the corpus into account: if our corpus is something like e-mails we might take k = 5, while for research papers we might find it safe to take k = 9.
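As a quick illustration (a minimal Python sketch; the function name is mine, not from any particular library), the set of k-shingles can be computed with one set comprehension:

def shingles(text, k):
    """Return the set of all k-character substrings (k-shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(shingles("Thug life", 3))
# -> {'Thu', 'hug', 'ug ', 'g l', ' li', 'lif', 'ife'} (in some order)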
It is not very convenient to manipulate and store raw strings. So the set of shingles is generally converted into a sparse Boolean matrix, called the characteristic matrix, in which columns correspond to the documents we have and rows correspond to all the possible shingles a document might contain; entry (r, c) is 1 if document c contains shingle r.
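Continuing the sketch above (again, the helper name is illustrative), the characteristic matrix can be built by taking the union of all shingle sets as the row labels:

def characteristic_matrix(docs, k):
    """Rows = all k-shingles seen in the corpus, columns = documents."""
    shingle_sets = [shingles(doc, k) for doc in docs]
    vocab = sorted(set().union(*shingle_sets))          # one row per shingle
    matrix = [[1 if s in ss else 0 for ss in shingle_sets] for s in vocab]
    return vocab, matrix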



There are a lot of other intricacies and tricks involved in selecting shingles and then manipulating them so that they become easy to handle, which I may touch upon in future posts, but the general motivation behind this concept is that “DOCUMENTS THAT ARE INTUITIVELY SIMILAR WILL HAVE MANY SHINGLES IN COMMON”.

STEP 2: MINHASHING

Before getting into minhashing, it is important to understand the most convenient definition of similarity for finding similar textual articles.

Jaccard similarity: It defines the similarity of two columns $C_{1}$ and $C_{2}$ as:
$\mathrm{Similarity}(C_{1}, C_{2}) = \frac{|C_{1} \cap C_{2}|}{|C_{1} \cup C_{2}|}$
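For example, if $C_{1} = \{a, b, c\}$ and $C_{2} = \{b, c, d\}$, then $|C_{1} \cap C_{2}| = 2$ and $|C_{1} \cup C_{2}| = 4$, so the similarity is $2/4 = 0.5$.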

The basic goal of minhashing is to compress the large characteristic matrix (obtained after shingling) into short signatures while keeping the (Jaccard) similarity of the columns intact.

The way this method works is as follows. Imagine that the rows of the characteristic matrix are permuted randomly, so that each row gets a new serial number giving its position in the permutation.




The row of the signature matrix corresponding to a permutation is computed by going through the rows serially in permuted order and, for each column that has a 1 in the current row, filling in the current row number, until all the columns are filled. By choosing different hash functions to generate different random permutations, more rows (for example, 100) can be added to the signature matrix, one per permutation.
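As a small worked sketch in Python (the toy matrix and function name are mine, and explicit permutations are used here purely for illustration):

import random

def minhash_signature(char_matrix, n_perms, seed=0):
    """Signature matrix built from explicit random row permutations."""
    rng = random.Random(seed)
    n_rows, n_cols = len(char_matrix), len(char_matrix[0])
    signature = []
    for _ in range(n_perms):
        order = list(range(n_rows))
        rng.shuffle(order)                 # one random permutation of the rows
        # For each column, the position of the first row (in permuted order)
        # in which that column has a 1; assumes no all-zero columns.
        signature.append([min(pos for pos, r in enumerate(order)
                              if char_matrix[r][c] == 1)
                          for c in range(n_cols)])
    return signature

char_matrix = [[1, 0, 1],                  # toy example: rows = shingles,
               [0, 1, 1],                  # columns = 3 documents
               [1, 1, 0],
               [0, 0, 1]]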
The most interesting thing about this technique is that it can be proved that the probability (over a random permutation) that two columns get the same minhash value is exactly their Jaccard similarity. Thus, the expected fraction of positions on which two signature columns agree equals the Jaccard similarity of the corresponding columns in the input matrix, and the longer the signatures, the smaller the expected error.
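A quick empirical check of this claim, reusing the toy char_matrix and minhash_signature sketch above:

sig = minhash_signature(char_matrix, n_perms=1000)
agree = sum(row[0] == row[2] for row in sig) / len(sig)
print(agree)   # close to 0.25, the true Jaccard similarity of columns 0 and 2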



All this looks good in theory. However, actually permuting datasets of even modest size (say 1 billion rows) would take too much time and space. So we try to be clever here and simulate the permutations without actually permuting the rows. We pick a number of hash functions, one per simulated permutation, and pretend that the position of row r in the $i^{th}$ permutation is $h_{i}(r)$. We keep a slot M(i, c) for each hash function $h_{i}$ and column c.

Observe the following code and notice that the intent is that M(i, c) will become the smallest value of $h_{i}(r)$ over the rows r in which column c has a 1 (recall that $h_{i}(r)$ gives the position of row r in the $i^{th}$ permutation). Here is the pseudocode recast as a small Python function:

def compute_signature(char_matrix, hash_funcs):
    # M[i][c] ends up as the smallest h_i(r) over rows r where column c has a 1.
    n_cols = len(char_matrix[0])
    M = [[float('inf')] * n_cols for _ in hash_funcs]
    for r, row in enumerate(char_matrix):            # for each row r
        hashes = [h(r) for h in hash_funcs]          # compute each h_i(r) once
        for c in range(n_cols):                      # for each column c
            if row[c] == 1:                          # c has a 1 in row r
                for i, hr in enumerate(hashes):
                    if hr < M[i][c]:
                        M[i][c] = hr                 # keep the minimum
    return M
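For instance, the $h_{i}$ can be random linear functions modulo a large prime (the constants below are illustrative, not from the post); these simulated permutations feed compute_signature directly:

import random

def make_hash(p, rng):
    a, b = rng.randrange(1, p), rng.randrange(p)
    return lambda r: (a * r + b) % p

rng = random.Random(42)
p = 2**31 - 1                                         # prime larger than the row count
hash_funcs = [make_hash(p, rng) for _ in range(100)]  # 100 simulated permutations
M = compute_signature(char_matrix, hash_funcs)        # char_matrix from the toy example above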

STEP 3: LOCALITY SENSITIVE HASHING (LSH)

Even after using minhashing to compute signatures, comparing all possible pairs of columns can be expensive. This is where LSH comes into the picture. What LSH basically does is that, from all possible pairs of signatures, it generates a small list of candidate pairs that have a higher chance of being similar.
This is done by dividing the signature matrix M into b bands of r rows each.



For each band, hash its portion of each column to a hash table with k buckets. Candidate column pairs are those that hash into the same bucket for at least one band. Of course, there are bound to be a few false negatives (similar pairs that never share a bucket) as well as false positives, but we can tune b and r to catch most of the similar pairs while admitting few non-similar ones. A sketch of this banding step follows.
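A minimal sketch in Python (using a dict keyed on each band's slice in place of an explicit k-bucket hash table; the function name is mine):

from collections import defaultdict
from itertools import combinations

def lsh_candidates(M, b, r):
    """Candidate column pairs whose signatures agree on at least one band.
    Assumes the signature matrix M has exactly b * r rows."""
    n_cols = len(M[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            key = tuple(M[band * r + i][c] for i in range(r))   # this band's slice of column c
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

# e.g. with 100 signature rows: pairs = lsh_candidates(M, b=20, r=5)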

