One of the most typical problems in data mining is that of finding similar sets in a huge data set. These could be similar text articles, web pages, movie ratings, or any other data for that matter. For the sake of concreteness, let us consider similarity between text articles for this post. The process of finding similar articles can be broadly broken down into three steps:
1) Shingling
2) Minhashing
3) Locality sensitive hashing
STEP 1: BREAKING DOWN THE ARTICLE INTO K-SHINGLES
A k-shingle is basically a substring of the article of length k. For example, the set of 3-shingles for the string “Thug life” is {“Thu”, “hug”, “ug ”, “g l”, “ li”, “lif”, “ife”}: every substring of length 3, including the ones containing spaces. The value of k should be large enough that the probability of a given shingle appearing in a given article is considerably small. While selecting k one should also take into account how long a typical document in the corpus is. For example, if our data corpus is something like e-mails then we might take k=5, while if we are dealing with research papers we might find it safe to take k=9.
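As a quick sketch of this step (the function name and the use of a Python set are my own choices, not anything prescribed by the method), extracting the k-shingles of a string can look like this:

def shingles(text, k):
    # All substrings of length k; kept as a set, since order and
    # multiplicity do not matter for set similarity.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(shingles("Thug life", 3))
# -> the 7 shingles {'Thu', 'hug', 'ug ', 'g l', ' li', 'lif', 'ife'}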
It is not very convenient to manipulate and store raw strings. So the set of shingles is generally converted into a sparse matrix called the characteristic matrix, in which columns correspond to the documents we have and rows correspond to all the possible shingles a document might contain; entry (r, c) is 1 if document c contains shingle r, and 0 otherwise.
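As a minimal sketch (reusing the shingles function from above; in practice this matrix is huge and would be stored sparsely, not as dense lists):

def characteristic_matrix(docs, k):
    # Rows are the shingles seen anywhere in the corpus; columns are documents.
    doc_shingles = [shingles(d, k) for d in docs]
    all_shingles = sorted(set().union(*doc_shingles))
    matrix = [[1 if s in ds else 0 for ds in doc_shingles]
              for s in all_shingles]
    return matrix, all_shingles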
There are a lot of other intricacies and tricks involved in selecting the shingles and then manipulating them so that they become easy to handle, which I may touch upon in future posts, but the general motivation behind this concept is that “DOCUMENTS THAT ARE INTUITIVELY SIMILAR WILL HAVE MANY SHINGLES IN COMMON”.
STEP 2: MINHASHING
Before getting into minhashing, it is important to understand the most convenient definition of similarity for finding similar textual articles.
Jaccard similarity: It defines the similarity of two columns $C_{1}$ and $C_{2}$ (each column viewed as the set of rows in which it has a 1) as:
$\text{Similarity}(C_{1}, C_{2}) = \frac{|C_{1} \cap C_{2}|}{|C_{1} \cup C_{2}|}$
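For a quick concrete check (the example sets are made up for illustration):

def jaccard(a, b):
    # |A intersect B| / |A union B|
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3, 4}, {2, 3, 5}))  # 2 shared out of 5 total -> 0.4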
The basic goal of minhashing is to compress the large characteristic matrix (obtained after shingling) into short signatures while keeping the (Jaccard) similarity of columns intact.
The way this method works is as follows. Imagine the rows of the characteristic matrix to be permuted randomly. For illustration purposes, imagine that the rows of the following matrix are permuted such that the new serial number of each row is the number in purple adjacent to it.
The row of the signature matrix corresponding to this permutation is the one on the right. It is computed by going through the rows serially in permuted order and, for each column, recording the (permuted) number of the first row in which that column has a 1, until all the columns are filled. By choosing different hash functions to generate different random permutations, more rows (for example, 100) can be added to the signature matrix as follows:
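Since the original figure is not reproduced here, here is a small self-contained sketch of the same computation (the example matrix and permutation are made up for illustration):

# Characteristic matrix: rows are shingles, columns are 3 documents.
matrix = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
]
# A random permutation of the rows: perm[r] is row r's new position.
perm = [2, 0, 3, 1]

def minhash_row(matrix, perm):
    n_cols = len(matrix[0])
    sig = [None] * n_cols
    # Visit the rows in permuted order; the first 1 seen in a column
    # fixes that column's signature value.
    for new_pos, r in sorted((perm[r], r) for r in range(len(matrix))):
        for c in range(n_cols):
            if matrix[r][c] == 1 and sig[c] is None:
                sig[c] = new_pos
    return sig

print(minhash_row(matrix, perm))  # -> [2, 0, 0]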
The most interesting thing about this technique is that it can be proved that, for a random permutation, the probability that two columns get the same minhash value is equal to the Jaccard similarity of those columns. Thus the expected similarity of two columns of the signature matrix is equal to the Jaccard similarity of the corresponding columns of the input matrix, and the longer the signatures, the smaller the expected error.
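In code, this means the fraction of signature rows on which two columns agree is an estimate of their Jaccard similarity (a sketch, assuming the signature matrix is stored as a list of rows, one per permutation):

def estimated_similarity(sig, c1, c2):
    # Fraction of signature rows on which columns c1 and c2 agree.
    agree = sum(1 for row in sig if row[c1] == row[c2])
    return agree / len(sig)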
All this looks good in theory. However, actually permuting data sets of even modest size (say 1 billion rows) would take too much time and space. So we try to be clever here and simulate permutations without actually permuting the rows. We pick a number of hash functions, one per minhash function, and pretend that the position of row r in the $i^{th}$ permutation is $h_{i}(r)$. We keep a slot M(i, c) for each column c and hash function $h_{i}$.
Observe the following code (a runnable Python version of the usual pseudo code) and notice that the intent is that M[i][c] will become the smallest value of $h_{i}(r)$ for which column c has 1 in row r ($h_{i}(r)$ gives the position of row r in the $i^{th}$ permutation):
import math

def compute_signatures(matrix, hash_funcs):
    # matrix: the characteristic matrix as a list of 0/1 rows.
    # hash_funcs: one hash function h_i per simulated permutation.
    n_cols = len(matrix[0])
    # M[i][c] starts at infinity and ends as the smallest h_i(r)
    # over the rows r in which column c has a 1.
    M = [[math.inf] * n_cols for _ in hash_funcs]
    for r, row in enumerate(matrix):
        hashes = [h(r) for h in hash_funcs]  # compute each h_i(r) once
        for c in range(n_cols):
            if row[c] == 1:
                for i, h_r in enumerate(hashes):
                    if h_r < M[i][c]:
                        M[i][c] = h_r
    return M
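A typical choice for the $h_{i}$ (my assumption here; any family of independent hash functions works) is $h_{i}(r) = (a_{i} r + b_{i}) \bmod p$ for random $a_{i}, b_{i}$ and a prime p larger than the number of rows. Using the small example matrix from the earlier sketch:

import random

p = 4294967311  # a prime larger than the number of rows
random.seed(0)
hash_funcs = [
    (lambda a, b: lambda r: (a * r + b) % p)(
        random.randrange(1, p), random.randrange(p))
    for _ in range(100)
]
sig = compute_signatures(matrix, hash_funcs)  # 100 signature rows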
STEP 3: LOCALITY SENSITIVE HASHING (LSH)
Even after using minhashing to compute signatures, comparing all possible pairs of columns can be expensive. This is where LSH comes into the picture. What LSH basically does is that, out of all possible pairs of signatures, it generates a small list of candidate pairs that have a higher chance of being similar.
This is done by dividing the signature matrix M into b bands of r rows each, as shown below:
For each band, hash its portion of each column to a hash table with k buckets. Candidate column pairs are those that hash into the same bucket for at least one band. Of course, there are bound to be a few false negatives. But we can always tune b and r to catch most similar pairs and few non-similar pairs.
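A minimal sketch of the banding step (using each column's tuple of r values directly as the bucket key is my own shortcut; a real implementation would hash it into one of k buckets):

from collections import defaultdict
from itertools import combinations

def candidate_pairs(sig, b, r):
    # sig: signature matrix as a list of rows (b * r rows in total).
    n_cols = len(sig[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        # Hash each column's slice of r rows within this band.
        for c in range(n_cols):
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        # Columns sharing a bucket in at least one band become candidates.
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates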