Labels and Patterns

Written by David Mimno

I’ve been using this blog as a more philosophical platform, this is going to be about some new features in the machine learning package that I work on, Mallet. One of these, LabeledLDA, is some code that I’ve had lying around for a few years. The other, stop patterns, is a simple addition that may be useful in vocabulary curation for text mining. You’ll need to grab the latest development version from GitHub to run these.


The first tool, LabeledLDA, is a simplified topic model for use when documents have labels, and labels correspond to topics. When each document in a text collection has exactly one label, you can estimate a Naïve Bayes model by counting all the words in the subset of documents that have a given label value. When you have no labels, a topic model like Latent Dirichlet Allocation can find semantically coherent “topics”. But what if you have multiple labels per document? For example, consider a 5-star review of a Chinese restaurant. We can think of this as a document with three labels: 5-star, Chinese, and Restaurant. Some of the words in the review might be about being a great restaurant, others might be about Chinese cuisine, and others about restaurants in general. LabeledLDA (Ramage et al., EMNLP 2009; see also McCallum, Multi-label text classification…, 1999) is a way of measuring this kind of connections between words and labels.

LabeledLDA asserts a one-to-one relationship between labels and topics. Every document is a mixture of the words associated with its labels, and nothing else. This assumption simplifies inference because in a given document we only have to assign each word token to one of the labels for that document, rather than the entire set of topics. An additional advantage is that each topic has, by definition, an existing label, so we can skip the “squinting at it” stage of fully unsupervised topic model analysis. The cost of that assumption is that we may miss emergent patterns or language that’s not well modeled by the labels. There has been work on models that make looser connections between labels and topics, but that’s a question for another day.

Running LabeledLDA requires that we have documents with multiple labels. The easiest way to do this in Mallet is with the new “–label-as-features” option in the “import-file” command. The input for this command is a text file with one line per document and three columns: an ID, a set of labels, and the text. By default, Mallet interprets the first two whitespace-delimited fields of each line as the ID and label. That’s a problem because we’d like to use some whitespace to delimit our multiple labels, but it’s easy to fix by changing the line-parsing regular expression to separate columns with tabs, and thus allow spaces to exist in the label field. Here’s a full command line example:

bin/mallet import-file --input yelp-short.txt --output yelp-short.seq --stoplist-file yelp.stops --label-as-features --keep-sequence --line-regex '([^\t]+)\t([^\t]+)\t(.*)'

As an example I’m going to use the fourth round Yelp dataset — it’s freely available, but please go to the source. As labels or “features”, I’m using the tags, city and region for each business, and the review stars. From a subset of 100,000 reviews, this results in 553 labels. Here’s the command to run a model and save “topic keys” to a file:

bin/mallet run cc.mallet.topics.LabeledLDA --input yelp-short.seq --output-topic-keys yelp-llda.keys

The command supports most of the input and output options from standard Mallet LDA, in many cases with the exact same code. Use the “–help” option to find out more. Sharing code makes it easy to add functionality, but sometimes causes problems — for example the held-out probability evaluator and topics-for-new-documents inferencer don’t pay attention to document labels.

The “keys” file contains one line per label/topic. The first column is the topic ID number, corresponding to the original order that each label first appeared in the training data. The second column is the label string. The third column is the total number of tokens assigned to that topic at the particular Gibbs sampling state where we stopped. The last column is a space-delimited list of 20 words in descending order by probability in the topic. Topics with very few total words may have less than 20 words.

Here are some examples of high-frequency topics:

10      Restaurants     727089  salad sauce meal cheese table dinner wasn't side server bit both flavor taste took served tasted something tasty hot though 
232     Sushi_Bars      38441   sushi roll rolls fish happy hour tuna spicy sashimi salmon favorite sakana quality nigiri tempura chefs places japanese special rice 
29      Hotels_&_Travel 30017   room hotel stay desk rooms front stayed day bed marriott check pool property booked area clean airport car checked resort 
32      Hotels  29163   room hotel stay breakfast rooms free clean pool bed stayed comfortable bathroom area shower inn suites desk hotels hot water 
16      Chinese 51834   chinese rice fried soup beef egg sauce hot dishes pork shrimp sour spicy dish crab orange noodles delivery china mein 

Moderate-frequency topics are also pretty good, although there are some specific words mixed in like “Zoe’s” for “Southern”:

477     Dim_Sum 730     dim sum c-fu dimsum carts phoenix cart variety dim-sum feet palace balls items xmas tripe tradition leaf lotus steamed tarts 
298     Office_Equipment        728     office staples max ink officemax print printer cartridge toner printing cartridges supplies printed router copy computer file copies depot computers 
113     Videos_&_Video_Game_Rental      724     movies video movie blockbuster netflix films rent paradise releases rentals selection section games title rental foreign dvd titles film adult 
468     Apache_Junction_AZ      721     apache junction elvira's lake canyon patio bloody mechanic girlfriend mary gringos lglg&c trail hiking miles beautiful truck hacker's locos cantina 
155     Home_Decor      659     tile candles boone granite holland slab stone rocks kitchen slabs wood pewter minerals stones rock world decorate molding marling linen 
234     Southern        657     zoe's sandwich cake salad zoes feta chocolate slaw pasta grilled healthy tuna potato kabobs dry limeade chips pimento sandwiches roll-ups 
471     Gelato  648     gelato angel sweet flavors chocolate dark cotta panna italy hazelnut peanut flavor super texture taste creamy size dairy sweet's gelatos 

Rare topics are increasingly noisy:

423     Interior_Design 8       copehaigen whatzits registries adhd distain untrained charlotte step 
410     Airlines        8       unshaven sweatshirt csa eqipment scruffy hooded fbo vending 
380     Russian 8       fsu embarrasses semi-salted latvia selection.i yearning jewish 
552     Divorce_&_Family_Law    6       pontoni attornies results.there jea retained 
506     Motorcycle_Rental       6       registration breach brags corky harley 
504     Tutoring_Centers        6       peripherals really-really-smart clock-mini-speaker reassurances pro's 

The star-rating labels are particularly interesting.

4       5_star  255897  amazing years every recommend awesome family excellent highly everything favorite wonderful times prices many can't fantastic feel most everyone day 
6       4_star  204279  times prices usually area location bit its most price many every find years lot though favorite enjoy you're recommend excellent 
11      3_star  92213   decent nothing bad bit prices stars though times average its area overall price however lot location most wasn't probably give 
5       2_star  85203   minutes asked told another table took bad wasn't times server last manager finally waited not
7       1_star  213432  told asked minutes manager customer another rude called him took worst bad money business horrible left his call give finally 

To me these look great. Five-star reviews have lots of positive adjectives, one-star reviews are narratives about employees who were rude or slow, and three-star words seem to capture the essence of ho-hum averageness. But they’re also not that different from the overall sub-corpus frequencies. The top words for five-star reviews (minus stopwords) are amazing, years, every, recommend, awesome, favorite, everything, and day. The most significant difference seems to be family, which is the 19th most common word and occurs 25% less often than everything, but ranks higher within this five-star topic. This result is stable — running the algorithm again with a random initialization results in essentially the same order with a few word pairs swapped.

Stop Patterns

There was a request on the topic list for a way to remove patterns of words rather than just specific stopwords. I’ve added code to do that.

Here’s a sample regular expression file. Note that I’m using the Java matches function, so a pattern must account for the entire token — thus the trailing .*.


The first is an attempt to remove English adverbs, but may not be ideal for people named Kelly or my friends at Yummly. The second looks for a prefix. The third looks for multiple dots, which seems to be a common pattern in some Yelp reviews. If there is a problem with a regular expression, Mallet will print an error and will ignore all subsequent patterns, but will not stop processing. Here’s a sample file with examples:

1    X    this is a terribly useful test for....things involving photographs, photographers, and photography called test.txt

Here’s what happens when I run the standard method:

% bin/mallet import-file --input test.txt --print-output
name: 1
target: X
input: this(0)=1.0

Notice that the new default token pattern ignores one- and two-letter words (is, a), but includes tokens that contain punctuation, which picks up URLs, apostrophes, and hyphens. Keep in mind these interactions with the tokenization pattern when designing regular expressions. Here’s what the output looks like afterwards.

% bin/mallet import-file --input test.txt --stop-pattern-file stoplists/patterns.txt --print-output
name: 1
target: X
input: this(0)=1.0

Using stop patterns involves creating a Matcher object for each token and each pattern, and checking that pattern against the token string. In contrast, static stopword removal is a hash lookup. I haven’t measured performance yet, but I would guess that stop patterns should be used sparingly.