As a part of the information retrieval
course that I am taking in my college this semester, I had the opportunity to study about the Vector Space Model (VSM) of
information retrieval systems, among other things. As a part of an assignment
and also to reinforce my own understanding I coded up my own search engine in
python. The corpus I used was the standard Reuters corpus available in NLTK,
consisting of about 10000 documents and 53000 unique tokens after normalization
(see case folding, stemming, lemmatization, etc).
INDEXING:
After tokenization and normalization (done using the inbuilt functions of the NLTK library), the
next step in the process of building an inverted index. I kept it very simple
here, using a python dictionary (implemented by a hash table) with key as each
term and corresponding value to be the list of doc ids that in which that
particular term appears. This index handles normal queries effectively, but proves helpless in the case of prefix and suffix wild card queries. To take
care of those I used an additional index built using 'Ternary Trie Data Structure'. For taking
care of spelling mistakes, I used the Levenshtein Distance technique of dynamic
programming. Unlike rest of the project which was done in python, the trie and spelling correction
modules were developed in C++.
The Vector Space Model (VSM) :
What VSM essentially does is that it uses
some method of feature extraction to represent each document as a vector. So
when a query is given by the user, the query also, is converted into a vector
in the same space. Cosine distances between the query and each document is
computed and the top K most similar documents are retrieved.
The vector representation of documents is
done by computing tf-idf scores of each term in the document. The document
vector is a V dimensional vector (where V is the size of vocabulary of the
corpus) and each term in the document weighted by the multiplication of its tf
and idf scores. Here, tf is the term frequency, and it is a measure of how many
times the word appears in that particular document. While idf is the inverse
document frequency, and it is a measure of how rare the word is in the corpus. For
example uncommon words like, say, ‘Pillsbury’ will have a high IDF while common
stop words like ‘a’, ‘the’, etc. will have a small idf. So basically by
multiplying the tf and idf wights, we get a measure of how many times a rare
word appears in a document. And of course, each vector is length normalized at
the end.
To store the idf and tf values I used a
dictionary and a dictionary of tuples respectively.
Also note that most documents don’t have
most words. So if at all you want to pre-compute the doc vectors, try using sparse
matrix representation using scipy.
Context Based Query Expansion using word2vec model:
Even though the tf-idf scoring scheme gives
results of modest relevance, it fails to take into account the context of the
query. For example, if the query contains the term ‘good’ it will only search
for that particular word and give zero weight to contextually similar words
like ‘positive’, ‘better’, etc. For this I used the word2vec model to retrieve
documents intuitively similar to the query.
Word2vec models come in two flavours-
continuous bag of words model and skip gram model. These models deserve and
will get a separate post of their own, but the high level idea is that they use
neural networks to learn the vectors of words such that the vector captures the
context of the word, where context of a word is defined by the a window of words
surrounding that word. I used the gensim package which is a python library that
uses deep learning to create word embeddings using the skip gram model and
continuous bag of words model. After the model was trained on the corpus it
could output top similar words for each term in the corpus. So, along with the
query terms it also appends top two similar words for each query term. Since
the reuters data set is relatively smaller than what is required to train a
very good deep learning model, the top similar words are not always acurate. So
even if it increases the recall, precision might get reduced for some queries.
Hence, inorder to give more weight to the original terms in the query, the
weights of the additional terms is halved.
Here are a few results after training on my
corpus:
Notice that these words are not exactly synonyms (which could have been much easier by cross referencing with an online dictionary), but words that appear in similar context, more precisely, words that have a higher probability of occurring in the same document. So appending these words to our query increases our chance of getting a higher recall and precision.
For example see the difference in top query result with (In [53]) and without (In [54]) the addition of word2vec feature for a test query 'bad crops'.