Retrieving relevant information from a large collection of data, for instance using a search engine, is a difficult task. On the one hand, we want to exploit the efficiency of computers to obtain results quickly; on the other hand, we would like the search results to be as relevant as possible. The discipline of topic modelling can be used to address this problem, and in this blog we will see how these mathematical models can be used to construct a topic-based search engine.

Conceptually, when searching for information, users are often more interested in results that match their query on a higher level than mere string matching, which is the method used in most classical search engines: they look for places in the data where the exact string of characters of the search query occurs. Matching the query at the level of topics, much as we do in everyday life, can yield more relevant results and allows us to gain deeper insight into the content and structure of the data under consideration.

The first step is to decide on a mathematical model for topics and their occurrence in the data – we choose Latent Dirichlet Allocation (LDA). To begin with, let us make precise what we understand by a topic in this context. Assume we are given data in the form of documents $${W}_d$$, each consisting of $$N$$ words ($$W_{d,1},\dots,W_{d,N}$$) from a given vocabulary. We then define a topic to be a probability distribution over our vocabulary. This means that if, for instance, $${\beta}_1,\dots,{\beta}_K$$ are our topics, we could find one topic of the form:

$$\beta_1 = \left[ \begin{matrix}0.04\\0.03\\ \vdots\\ 0.0\end{matrix} \right]\begin{matrix}\leftarrow & \text{equation}\\\leftarrow & \text{integral}\\& \vdots\\\leftarrow & \text{bread}\end{matrix}$$
This could correspond to what we call the topic “Mathematics” in the English language. This seems appropriate since the words “equation” and “integral” are of high relevance to the topic “Mathematics”, whereas the word “bread” is quite the opposite. Having defined this, LDA provides a probabilistic model of how a document of our data is created.

1. Choose topics $${\beta}_1,\dots,{\beta}_K\sim \textrm{Dir}(\eta)$$, where $$\eta\in\mathbb{R}^{+}$$ is a fixed parameter.
2. For $$d=1,\dots,D$$: Choose the topic distribution of document $$d$$ as $${\theta}_d\sim \textrm{Dir}(\alpha)$$, where $$\alpha\in\mathbb{R}^{+}$$ is a fixed parameter that does not depend on $$d$$. Then, for each word $$n=1,\dots,N$$:
   1. Choose the topic of the $$n^{\textrm{th}}$$ word, $$Z_{d,n}\sim\textrm{Multinomial}({\theta}_d)$$.
   2. Choose the $$n^{\textrm{th}}$$ word, $$W_{d,n}\sim \textrm{Multinomial}({\beta}_{Z_{d,n}})$$.

This is visualised in the figure below.
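The generative process above can be sketched in a few lines of Python. This is a toy simulation with NumPy; the vocabulary, sizes and parameter values are hypothetical, chosen only for readability:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (hypothetical, chosen small for readability)
V, K, D, N = 6, 2, 3, 8          # vocabulary size, topics, documents, words per document
eta, alpha = 0.1, 0.5            # fixed Dirichlet parameters
vocab = ["equation", "integral", "proof", "bread", "flour", "oven"]

# 1. Draw K topics, each a probability distribution over the vocabulary
beta = rng.dirichlet([eta] * V, size=K)          # shape (K, V)

documents = []
for d in range(D):
    # 2. Draw the topic proportions theta_d of document d
    theta_d = rng.dirichlet([alpha] * K)         # shape (K,)
    words = []
    for n in range(N):
        # 2a. Draw the topic Z_{d,n} of the n-th word
        z = rng.choice(K, p=theta_d)
        # 2b. Draw the word W_{d,n} itself from that topic
        w = rng.choice(V, p=beta[z])
        words.append(vocab[w])
    documents.append(words)
```

With a small $$\eta$$ each topic concentrates its mass on a few words, so the sampled documents tend to mix vocabulary from only one or two topics.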

This model allows us to use statistical inference algorithms to find estimates for the latent random variables which we are interested in. In particular we can find estimates $$\{\hat{{\beta}}_k\}_{k=1}^K$$ for the topics that are hidden in our data (i.e. for the hidden random variables $${\beta}_k$$), and $$\{\hat{{\theta}}_d\}_{d=1}^D$$ for the proportions to which each topic is present in each document respectively (i.e. for the hidden random variables $${\theta}_d$$).

In fact this is the information that enables us to match topics. If we now want to search our data for a given set of words, e.g.:

$${Q}=[\text{“Calculus”}, \text{“Variations”}, \dots]$$
We can proceed as follows: an inference algorithm similar to the one above allows us to estimate $$\hat{{\theta}}_Q$$ – the topic proportions in our query. Note that here we assumed the query has length $$N$$, but in fact we can easily generalise to shorter queries. We can then rank our documents by topical relevance to our query. A successful measure for this relevance is the (squared) Hellinger distance:

$$\textrm{distance}(\hat{{\theta}}_d, \hat{{\theta}}_Q) = \sum^K_{j=1} \left(\sqrt{\hat{{\theta}}_{d,j}}-\sqrt{\hat{{\theta}}_{Q,j}}\right)^2.$$
Here, documents at a small Hellinger distance from the query are highly related to it topically. This allows us to return the highest-ranking results, i.e. those that best match our query in topics.
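Ranking by this distance takes only a few lines with NumPy. The topic proportions below are hypothetical, purely for illustration:

```python
import numpy as np

def hellinger_ranking(theta_docs, theta_query):
    """Rank documents by squared Hellinger distance to the query;
    smaller distance means greater topical relevance."""
    dist = np.sum((np.sqrt(theta_docs) - np.sqrt(theta_query)) ** 2, axis=1)
    return np.argsort(dist), dist

# Hypothetical estimated topic proportions (each row sums to 1)
theta_docs = np.array([
    [0.9, 0.1],   # document 0: mostly topic 1
    [0.2, 0.8],   # document 1: mostly topic 2
    [0.5, 0.5],   # document 2: an even mixture
])
theta_query = np.array([0.85, 0.15])   # query close to topic 1

order, dist = hellinger_ranking(theta_docs, theta_query)
# order[0] == 0: document 0 is the closest topical match
```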

The outcomes of such a search can be surprisingly insightful, as was demonstrated in a research project conducted by a group forming part of the RIPS program 2015 at the University of California, Los Angeles. In this project LDA and a specifically developed extension, mLDA, were applied to build a topic-based search engine prototype for the USC Shoah Foundation’s Visual History Archive. As one striking example, the search engine was able to extract the names of war criminals solely from context in the archive’s data, without their having been explicitly marked beforehand.

A further advantage of topic-based search systems built in this way is that they can be used in a hybrid form. For instance, a “topic score” (i.e. the Hellinger distance to our query, as demonstrated above) and a “string-matching score” can be combined to output the most relevant results, depending on the specific requirements of the particular application.
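A minimal sketch of such a hybrid score, assuming the squared Hellinger distance from above (which lies in $$[0,2]$$) and a string-matching score in $$[0,1]$$; the linear weighting below is one possible choice, not necessarily the one used in the project:

```python
def hybrid_score(topic_distance, string_score, weight=0.5):
    """Combine a topic score with a string-matching score.
    (Hypothetical combination; the weighting is application-specific.)

    topic_distance: squared Hellinger distance in [0, 2] (smaller is better)
    string_score:   e.g. fraction of query terms found, in [0, 1]
    """
    topic_score = 1.0 - topic_distance / 2.0     # map to [0, 1], larger is better
    return weight * topic_score + (1 - weight) * string_score
```

A perfect topical match with full string overlap scores 1.0; a maximally distant document with no string overlap scores 0.0, with everything else in between.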

This Guest Blog was contributed by:
Georg Maierhofer, TakeAIM winner 2015

Afterword
This blog evolved out of the above-mentioned research project as part of RIPS 2015 at UCLA. The results of applying such topic-based search engines to the USC Shoah Foundation’s Visual History Archive, as well as the development of mLDA, are due to the work of Adam Foster (University of Cambridge), Hangjian Li (University of California, Los Angeles), Megan Shearer (University of Arizona) and Georg Maierhofer (University of Cambridge).