This post is a departure from my usual format. Instead of walking through a theoretical topic or recent academic paper, it's intended as a soft introduction to using Latent Semantic Analysis (LSA) to categorize documents. It's essentially an extension of the existing tutorial in sklearn, found here.
I’ll be using nlp_utilities for the walkthrough. Once you have it installed on your computer (and have updated your PYTHONPATH), continue to the section below, which gives a theoretical introduction to LSA.
What is LSA?
Latent Semantic Analysis, or LSA, is an approach to discovering the hidden “topics” that exist across a set of documents. The basic assumption is that words that co-occur in the same document are similar, and that documents containing the same or similar words are similar: thus, simply by relying on the distributional properties of words in the dataset (which words occur where), you can infer what a document is “about”.
There are two main steps to this process: building a term-document matrix, then reducing the dimensionality of that matrix.
Step 1: Build a term-document matrix
The first step to the LSA algorithm is constructing a term-document matrix. This sounds complicated, but it’s actually just a numerical representation of which terms occur in which documents. The end goal is a vector for each document, signifying the extent to which any given word in your dataset is represented in that document.
There are a few approaches to constructing a term-document matrix. The simplest is count-vectorization. As the name implies, count-vectorizing involves first collecting all the words that occur across your set of documents (let’s say there are 10,000 words in all), then counting the number of times that each word occurs in each document. Thus, for each document, you’ll end up with a vector of integers, where the length of that vector is the number of words (10,000).
[1, 0, 0, ..., 2]
This is a decent approach, but it may not reflect how important particular words are to particular documents. Another approach is tf-idf (term frequency-inverse document frequency), which attempts to control for how frequently a word is used overall, vs. how frequently a word is used in particular documents.
With tf-idf, we compare the term frequency (i.e., the number of times a word is used in a given document) to the inverse document frequency (the amount of “information” provided by that word). Note that term frequency can be scaled in different ways (e.g., using a log scale or Boolean frequency), but for our purposes we’ll simply use the raw count.
The inverse document frequency is obtained by dividing the total number of documents (N) by the number of documents that contain a term t, then taking the logarithm of this quotient: idf(t) = log(N / df(t)).
Finally, to obtain our tf-idf score for a given term-document relationship, we multiply term frequency by inverse document frequency. The logic here is that terms that are very frequent in a specific document, but that occur very rarely throughout the rest of a corpus, are probably more “representative” of that document’s meaning. For example, in a corpus of business mission statements, words like “customer” (and especially words like “the”) will probably occur across many documents, but words like “fruit” or “vegetable” will only occur in some (e.g., grocery stores or food vendors), as will words like “petroleum” and “oil” (e.g., oil companies).
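To make this concrete, here’s a minimal sketch of the raw-count tf-idf computation. The toy corpus and helper names are mine, not part of nlp_utilities:

```python
import math

# Toy corpus of tokenized documents (illustrative only).
docs = [
    "fruit vegetable fruit".split(),
    "petroleum oil".split(),
    "the customer the".split(),
    "customer fruit".split(),
]
N = len(docs)

def tf(term, doc):
    # Raw count of the term in one document.
    return doc.count(term)

def idf(term):
    # log of (total documents / documents containing the term).
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

# "fruit" occurs twice in the first document, and in 2 of the 4 documents:
print(tfidf("fruit", docs[0]))  # 2 * log(4/2) ≈ 1.386
```

Libraries like sklearn’s TfidfVectorizer add smoothing and normalization on top of this basic recipe, but the core idea is the same.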
Step 2: Use singular-value decomposition to discover latent topics
Obtaining a tf-idf matrix is a great first step to building out the semantics of our documents. These vectors can even be used as-is, e.g. for document classification.
But one problem with tf-idf vectors is that they are sparse. If we assume there are 10,000 words that occur across all of our documents, each document will likely only contain a small subset of those words; this means that the vector representation for each document will consist mostly of 0s. This may still prove informative, but it misses valuable relationships between words: by insisting that every word correspond to distinct elements of a vector, we fail to capture the fact that “fruit” and “vegetable” are both examples of food. Thus, if Document 1 only mentions “fruit”, and Document 2 only mentions “vegetable”, those documents would not be identified as similar (whereas if Document 3 mentions both “fruit” and “vegetable”, it will be identified as similar to both Document 1 and Document 2). In other words, our term-document vectors don’t capture the fact that different words can still be related (e.g. synonyms).
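To see the problem concretely, here’s a small sketch (the toy documents are mine): two documents that share no words get a cosine similarity of exactly zero under tf-idf, even though their words are related.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "fruit stand",                 # Document 1
    "vegetable stall",             # Document 2
    "fruit and vegetable market",  # Document 3
]
vectors = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(vectors)

print(round(sims[0, 1], 3))  # 0.0: no shared words, so zero similarity
print(sims[0, 2] > 0)        # True: Document 3 shares a word with Document 1
```

Document 3 is similar to both of the first two, but nothing connects Documents 1 and 2 directly, which is exactly the synonym problem described above.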
LSA attempts to remedy this problem by performing dimensionality reduction on the term-document matrix: given an M x N matrix (M=#terms, N=#documents), LSA constructs a low-rank k x N approximation (k=#topics, N=#documents). Put more simply, we want to turn our sparse term-document matrix into a denser one, where each document is represented as a vector over k semantic topics.
There are a few well-known approaches to dimensionality reduction. LSA uses singular value decomposition (SVD). In brief, SVD gives us a matrix with k dimensions (where k=#topics), based on the idea that terms with similar meanings will co-occur with other terms and in similar documents (and documents with similar meanings will contain similar terms).
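The two steps so far can be chained directly in sklearn. This is a plain-sklearn sketch of the pipeline (the walkthrough below uses nlp_utilities’ TopicModeler instead; the toy documents and k=2 here are mine):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = [
    "fruit and vegetable prices",
    "fresh fruit vendors",
    "petroleum and oil futures",
    "crude oil pipelines",
]

lsa = make_pipeline(
    TfidfVectorizer(),             # sparse term-document weights
    TruncatedSVD(n_components=2),  # dense k-dimensional topic space
)
doc_topics = lsa.fit_transform(docs)  # shape: (n_docs, k)
print(doc_topics.shape)  # (4, 2)
```

Each document comes out as a dense length-k vector, one coordinate per latent topic.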
Using LSA in Python
Let’s say you have a set of documents, and you want to classify them by their category (e.g., some are about religion, and some are about computers). One (relatively) easy approach to this problem is to use LSA to:
- Discover the k latent topics across your set of documents.
- Transform each document into a distribution over k topics.
You can then use these topic distributions as input to a machine learning classifier.
With nlp_utilities, it’s pretty straightforward to get started with LSA. First, let’s grab our documents. Fortunately, sklearn has a good dataset ready to go:
from sklearn.datasets import fetch_20newsgroups

categories = ['talk.religion.misc', 'comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories,
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories,
                                     remove=('headers', 'footers', 'quotes'))
Now, we want to discover our latent topics from our training data. Note that in this situation, it’s very important not to include your test data in the documents that you’re using to build your topic model, because the topics will then be influenced by your test data: a form of data leakage.
Here, you can use the TopicModeler class to obtain your topics. Under the hood, this uses sklearn to do everything described above: build a term-document matrix, then use singular value decomposition to discover the latent topics.
tm = TopicModeler(lemmatize=False, top_words=5)
topics_train = tm.fit_transform(newsgroups_train.data)
By looking at the words with the highest weight for each of our topics, we can get an informal sense of what each topic is capturing. Let’s look at the first two topics:
['image thanks dont graphics know', 'christian people bible jesus god']
It seems pretty obvious that our first topic is more likely to correspond to posts about computer graphics, and our second topic is more likely to correspond to posts about religion. (Note also, however, that these topic labels are really just the tip of the iceberg, in terms of what the topic model has learned. Each topic actually consists of coefficients for our m words; the five words in each of these labels are just the words with the biggest coefficients.)
Now let’s use the learned topic model to transform our test data (note that we aren’t fitting to our test data, just using the previously fit model):
topics_test = tm.transform(newsgroups_test.data)
We can now extract our training and test data. Remember that we want to use our topic distributions (X) to predict the category of a post (y):
X_train, y_train = topics_train.values, newsgroups_train.target
X_test, y_test = topics_test.values, newsgroups_test.target
Training and test data in hand, you can select your favorite sklearn classifier. Let’s use a gradient boosting classifier:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out test set
We see that our classifier correctly categorized documents in the test dataset 93% of the time. Just to make sure this isn’t because of a wildly unbalanced dataset, we can also extract the F1-score for our model:
from sklearn import metrics

print(metrics.f1_score(y_test, clf.predict(X_test)))
Not too bad.
Hopefully you now have a slightly improved idea of the utility of a topic model, what LSA in particular does, and how you could implement it in Python if you needed to.
LSA topics proved to be a very useful feature for our classifier in discriminating between posts about computer graphics and posts about religion. Given that LSA topics are, at their core, constructed from the patterns of word distributions, this isn’t that surprising: posts about computer graphics probably use very different words from posts about religion. A much more challenging problem would be to use LSA topics to discriminate between posts about computer graphics and posts about computer games, or posts about religion and posts about atheism. There, relying on which words occur in which documents may be less helpful.
This raises an important limitation of topic models like LSA. While the distributional properties of a word can be immensely informative about its meaning (see this post on the utility of using distributional semantic models in a research setting), an approach like LSA ignores other fundamental properties of language, such as the order in which words are combined (e.g., syntax). The “meaning” of a sentence is not simply the sum of the words that appear in it: the order of those words influences the meaning that emerges. As a simple example, Bob killed the lion has a different semantics than the lion killed Bob. Thus, representing the meaning of a document using a “bag of words” approach (which is essentially what tf-idf, and by extension LSA, does) elides any semantic differences that arise due to word order, and especially sentence order.
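The word-order point is easy to demonstrate: under a bag-of-words representation, the two example sentences above produce identical vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with opposite meanings but the same words.
sentences = ["bob killed the lion", "the lion killed bob"]
vectors = CountVectorizer().fit_transform(sentences).toarray()

print((vectors[0] == vectors[1]).all())  # True: the vectors are identical
```

Any model built on top of these vectors, LSA included, literally cannot tell the two sentences apart.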
Understanding this kind of order-dependent meaning requires encoding sequential information in your model, which makes things a good deal more complex, and more interesting.