Saturday, October 29, 2016

Text Based Information Retrieval System

As part of the information retrieval course I am taking at my college this semester, I had the opportunity to study the Vector Space Model (VSM) of information retrieval, among other things. As part of an assignment, and also to reinforce my own understanding, I coded up my own search engine in Python. The corpus I used was the standard Reuters corpus available in NLTK, consisting of about 10,000 documents and 53,000 unique tokens after normalization (case folding, stemming, lemmatization, etc.).

INDEXING:


After tokenization and normalization (done using the inbuilt functions of the NLTK library), the next step was to build an inverted index. I kept it very simple here, using a Python dictionary (implemented as a hash table) with each term as the key and the list of doc ids in which that term appears as the corresponding value. This index handles normal queries effectively, but proves helpless in the case of prefix and suffix wildcard queries. To take care of those I used an additional index built on a ternary trie data structure. To take care of spelling mistakes, I used the Levenshtein distance technique of dynamic programming. Unlike the rest of the project, which was done in Python, the trie and spelling correction modules were developed in C++.
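As a rough illustration, here is a minimal Python sketch of the dictionary-based inverted index together with the Levenshtein distance DP (my actual trie and spelling correction modules were written in C++, so this only shows the idea, not the code I used):

from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> list of normalized tokens."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)   # posting list as a set of doc ids
    return index

def levenshtein(a, b):
    """Classic DP edit distance between two strings."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

# Toy example: index two documents and look up a term
docs = {1: ["wheat", "crop", "export"], 2: ["oil", "price", "crop"]}
index = build_inverted_index(docs)
print(index["crop"])                  # {1, 2}
print(levenshtein("crops", "corps"))  # 2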




The Vector Space Model (VSM):


What the VSM essentially does is use some method of feature extraction to represent each document as a vector. When the user issues a query, the query too is converted into a vector in the same space. The cosine similarity between the query and each document is then computed, and the top K most similar documents are retrieved.
The vector representation of a document is obtained by computing tf-idf scores for each term in the document. The document vector is a V-dimensional vector (where V is the size of the vocabulary of the corpus), and each term in the document is weighted by the product of its tf and idf scores. Here, tf is the term frequency, a measure of how many times the word appears in that particular document, while idf is the inverse document frequency, a measure of how rare the word is in the corpus. For example, uncommon words like, say, ‘Pillsbury’ will have a high idf, while common stop words like ‘a’, ‘the’, etc. will have a small idf. So by multiplying the tf and idf weights, we get a score that is large only when a rare word appears many times in a document. And of course, each vector is length normalized at the end.
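As a rough sketch of the weighting scheme (assuming the plain log-scaled idf; the exact tf-idf variant used in the project may differ slightly):

import math
from collections import Counter

def idf_scores(docs):
    """docs: dict doc_id -> list of tokens; returns term -> idf."""
    N = len(docs)
    df = Counter()
    for tokens in docs.values():
        for term in set(tokens):
            df[term] += 1
    return {term: math.log(N / df_t) for term, df_t in df.items()}

def tfidf_vector(tokens, idf):
    """Length-normalized tf-idf vector for one document, as a dict."""
    tf = Counter(tokens)
    vec = {term: count * idf.get(term, 0.0) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else vec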
To store the idf and tf values I used a dictionary and a dictionary of tuples, respectively.
Also note that most documents don’t contain most words, so the document vectors are extremely sparse. If you want to pre-compute the doc vectors, try using a sparse matrix representation from scipy.
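A minimal sketch of that pre-computation and the cosine ranking might look like the following (it reuses the illustrative tfidf_vector and idf helpers from the sketch above; all names here are hypothetical, not the project's actual API):

import numpy as np
from scipy.sparse import lil_matrix

def build_doc_matrix(docs, idf, vocab):
    """Pre-compute length-normalized tf-idf doc vectors as a sparse matrix."""
    term_to_col = {term: j for j, term in enumerate(vocab)}
    mat = lil_matrix((len(docs), len(vocab)))
    for i, (doc_id, tokens) in enumerate(docs.items()):
        for term, weight in tfidf_vector(tokens, idf).items():
            mat[i, term_to_col[term]] = weight
    return mat.tocsr(), term_to_col

def top_k(query_tokens, mat, term_to_col, idf, k=10):
    """Rank documents by cosine similarity (dot product of unit vectors)."""
    q = np.zeros(mat.shape[1])
    for term, weight in tfidf_vector(query_tokens, idf).items():
        if term in term_to_col:
            q[term_to_col[term]] = weight
    scores = mat.dot(q)          # rows are already unit length
    return np.argsort(scores)[::-1][:k]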

Context Based Query Expansion using word2vec model:


Even though the tf-idf scoring scheme gives results of modest relevance, it fails to take the context of the query into account. For example, if the query contains the term ‘good’, it will only search for that particular word and give zero weight to contextually similar words like ‘positive’, ‘better’, etc. To address this I used the word2vec model to retrieve documents intuitively similar to the query.

Word2vec models come in two flavours: the continuous bag of words (CBOW) model and the skip-gram model. These models deserve, and will get, a separate post of their own, but the high-level idea is that they use neural networks to learn word vectors that capture the context of each word, where the context is defined by a window of words surrounding it. I used the gensim package, a Python library that creates word embeddings using the skip-gram and continuous bag of words models. After the model was trained on the corpus, it could output the most similar words for each term in the corpus. So, along with the query terms, the system also appends the top two similar words for each query term. Since the Reuters data set is smaller than what is required to train a very good model, the top similar words are not always accurate. So even if this increases recall, precision might be reduced for some queries. Hence, in order to give more weight to the original terms in the query, the weights of the additional terms are halved.
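A hedged sketch of this expansion step using gensim (the parameter names follow recent gensim releases, e.g. vector_size instead of the older size, and the expand_query helper is my own illustration rather than the exact code from the project):

from gensim.models import Word2Vec
from nltk.corpus import reuters

# Train a skip-gram model on the Reuters sentences (hyperparameters are illustrative)
sentences = [[w.lower() for w in sent] for sent in reuters.sents()]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1)

def expand_query(query_terms, model, topn=2, extra_weight=0.5):
    """Append the top-n most similar words for each query term, at half weight."""
    weighted = {term: 1.0 for term in query_terms}
    for term in query_terms:
        if term in model.wv:
            for similar, _score in model.wv.most_similar(term, topn=topn):
                weighted.setdefault(similar, extra_weight)
    return weighted

print(expand_query(["bad", "crops"], model))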
Here are a few results after training on my corpus:


Notice that these words are not exactly synonyms (which would have been much easier to obtain by cross-referencing with an online dictionary), but words that appear in similar contexts; more precisely, words that have a higher probability of occurring in the same document. So appending these words to our query increases our chances of getting higher recall and precision.
For example, see the difference in the top query result with (In [53]) and without (In [54]) the word2vec feature for the test query 'bad crops'.