Word counting, squared

Written by David Mimno

The Google n-gram viewer has become a common starting point for historical analysis of word use. But it only tells us about individual words, with no indication of their context or meaning. Several months ago the Hathi Trust Research Center released a dataset of page-level word counts extracted from 250,000 out-of-copyright books. I’ve used it to build a word similarity tool that tracks word co-occurrence patterns from 1800 to 1923. In the default example we see that “lincoln” is a town in England until around 1859, when it becomes a politician. In this article I’ll describe how I made this tool, and what’s wrong with it.

What’s in the data?

Ted Underwood has a nice introduction to the data on his blog. The goal of this dataset is to find out how much we can learn from “non-expressive” representations of books, which cannot be used to reconstruct the original text.

For each of the 250,000 volumes, the data set contains a file in JSON format. The file begins with basic metadata, a link to a richer MARCXML metadata file, and some additional automatically generated information such as the predominant language and the page count. There is then a list containing one object for each page. Each page has been automatically divided into a header, a body, and a footer. For each of these sections we have the number of words, the number of sentences, a count of letters at the beginnings and ends of lines (designed to identify indexes, poetry, and other unusual pages), and word counts. All word tokens have been tagged with parts of speech using Penn tagger labels, so we get counts of word/POS pairs.
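
To make that layout concrete, here is a rough Python sketch of reading the body counts out of one volume file. The field names ("features", "pages", "body", "tokenCount", "sentenceCount", "tokenPosCount") are illustrative rather than the exact schema.

```python
import json

def page_body_counts(path):
    """Yield (token count, sentence count, word/POS counts) for each page body.

    The field names here are illustrative; consult the dataset documentation
    for the exact schema.
    """
    with open(path) as f:
        volume = json.load(f)
    for page in volume["features"]["pages"]:
        body = page["body"]
        # tokenPosCount maps each token to a dict of Penn POS tag -> count
        yield body["tokenCount"], body["sentenceCount"], body["tokenPosCount"]
```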

This representation has no information about word order, so we cannot say anything beyond the level of unigrams. The data make it highly likely that “United” and “States” have something to do with each other, but we can’t say whether they form a phrase, and even if we guess that they do, we can’t tell whether it’s “United States” or “States United”.

There are many sources of error: OCR (“tlie” is a frequent word), page segmentation, language identification, part of speech tagging, etc. Year of publication data is noisy for a variety of reasons (Julius Caesar was publishing extensively throughout the 19th century). Many “words” consist of only punctuation, like “……..”. I ignore pages with fewer than ten tokens and fewer than five sentences to focus on running text. I throw out headers and footers.
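
In rough Python, that filtering looks something like the sketch below, reusing the illustrative page_body_counts reader from above and treating the two thresholds as independent minimums.

```python
def usable_pages(path, min_tokens=10, min_sentences=5):
    """Yield word/POS counts only for page bodies that look like running text."""
    for token_count, sentence_count, counts in page_body_counts(path):
        if token_count < min_tokens or sentence_count < min_sentences:
            continue  # too short to be running text
        yield counts  # headers and footers are never read at all
```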

Yes, but what’s in the data?

We don’t really know.

The 250,000 volumes are, as far as I can tell, a random selection from the Hathi Trust collection, which was created in partnership with Google by driving trucks up to the loading docks of university libraries every day for several years. The first question any corpus linguist (or historian or anyone else) will ask is what the collection consists of. Who wrote these books? Why? What are they about? Who originally decided to buy these books? Who decided to scan them? What isn’t there?

How did you choose the vocabulary?

The total number of distinct words is in the millions, and most of those are extremely infrequent. I remove words that don’t occur on at least 2000 pages. On the other end of the frequency spectrum, I remove a hand-curated list of very frequent words. I lowercase all words and use the POS tags to filter as well, keeping only nouns, adjectives, and verbs. This leaves about 32,000 words.
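
Schematically, the vocabulary pass over the filtered pages looks like the sketch below; the stopword set is a stand-in for the hand-curated list, and the Penn tag prefixes are shorthand for the noun/adjective/verb filter.

```python
from collections import Counter

KEEP_POS = ("NN", "JJ", "VB")          # Penn Treebank prefixes: nouns, adjectives, verbs
STOPWORDS = {"said", "great", "time"}  # stand-in for the hand-curated frequent-word list

def build_vocabulary(pages, min_pages=2000):
    """pages yields dicts mapping token -> {POS tag: count} for one page body."""
    page_frequency = Counter()
    for counts in pages:
        kept = set()
        for token, pos_counts in counts.items():
            word = token.lower()
            if word in STOPWORDS:
                continue
            if not any(tag.startswith(KEEP_POS) for tag in pos_counts):
                continue
            kept.add(word)
        page_frequency.update(kept)
    # keep words that appear on at least min_pages distinct pages
    return {word for word, pages_seen in page_frequency.items() if pages_seen >= min_pages}
```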

What does “similarity” mean?

I represent each word as a histogram of co-occurring words, and measure approximate cosine similarity between these vectors. After selecting the vocabulary, I went through each volume published in a given year and counted the number of pages that contained each pair of words. So for the word “lincoln”, I record that it occurs on x pages with “abraham”, y pages with “cathedral”, and so on.
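
Before the hashing trick described below, the naive version of that counting step looks something like this: every pair of distinct in-vocabulary words on a page co-occurs once.

```python
from collections import defaultdict
from itertools import combinations

def count_cooccurrences(pages, vocabulary):
    """cooccurrence[a][b] = number of pages on which words a and b both appear."""
    cooccurrence = defaultdict(lambda: defaultdict(int))
    for counts in pages:
        words = sorted({token.lower() for token in counts} & vocabulary)
        for a, b in combinations(words, 2):
            cooccurrence[a][b] += 1
            cooccurrence[b][a] += 1
    return cooccurrence
```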

For each word I then have a vector in 32,000-dimensional space. The cosine of the angle between two such vectors is a standard measure of similarity in information retrieval. If each vector has been normalized to unit length, the cosine similarity of two vectors is just their dot product: multiply the 32,000 pairs of corresponding elements together and add them up.

If two words often occur on the same page as “cathedral”, the value for “cathedral” in both vectors will be large, so multiplying those numbers together will give you a large number. If two words never occur with any of the same words, their similarity will be zero.
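
With the histograms stored as sparse dictionaries, that computation is just a dot product over the shared keys, roughly:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse histograms (context word -> page count)."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```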

To save time and memory, rather than storing all these numbers I use locality-sensitive hashing (LSH). After counting the in-vocabulary words on a given page, I multiply those counts by 768 sparse random vectors and add the resulting short vector to the representation of each word on the page. In the end I store only the sign (positive or negative) of each value, which takes up a single bit. (I picked 768 because it is a large-ish multiple of both 8 and 6, so I can use 8-bit bytes in memory and 6-bit Base64 encodings to transfer data.) Counting the number of matching bits in two LSH vectors gives a good approximation of their cosine similarity.
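
Schematically, the projection looks something like the sketch below. The sparsity of the random vectors, the seed, and the conversion from bit agreement back to a cosine are illustrative choices, not the exact ones I used.

```python
from collections import Counter
import numpy as np

NUM_BITS = 768                      # a multiple of both 8 and 6, as noted above
rng = np.random.default_rng(13)

def make_projections(vocabulary, density=1.0 / 16):
    """One sparse random +/-1 vector per vocabulary word (mostly zeros)."""
    signs = rng.choice([-1.0, 1.0], size=(len(vocabulary), NUM_BITS))
    mask = rng.random((len(vocabulary), NUM_BITS)) < density
    return dict(zip(sorted(vocabulary), signs * mask))

def lsh_signatures(pages, projections):
    accumulators = {word: np.zeros(NUM_BITS) for word in projections}
    for counts in pages:
        # total count of each in-vocabulary word on this page
        weights = Counter()
        for token, pos_counts in counts.items():
            word = token.lower()
            if word in projections:
                weights[word] += sum(pos_counts.values())
        if not weights:
            continue
        # project the page's word counts once...
        page_vector = sum(count * projections[word] for word, count in weights.items())
        # ...and add the short vector to the representation of every word on the page
        for word in weights:
            accumulators[word] += page_vector
    # keep only the sign of each coordinate: one bit per dimension
    return {word: acc >= 0 for word, acc in accumulators.items()}

def approximate_cosine(bits_a, bits_b):
    # the fraction of matching bits estimates the angle between the full vectors
    agreement = np.mean(bits_a == bits_b)
    return np.cos(np.pi * (1.0 - agreement))
```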

Finally, I found and cached the 20 most similar words for each word in each year. What you see on the page is the ordered list, with the most similar words on the left and less similar words to the right. The query word itself is almost always its own nearest neighbor, so it shows up in the leftmost column. I indicate the level of similarity by changing the lightness of the text, but I’m deliberately not putting too much weight on the numeric values. To get the raw data, replace nearest.html with word.php in the URL.
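
The caching step is then just a nearest-neighbor scan over the signatures for a given year, along the lines of this illustrative helper (reusing approximate_cosine from the sketch above):

```python
def most_similar(signatures, query, k=20):
    """Return the k words whose LSH signatures agree most with the query word's."""
    scores = ((approximate_cosine(signatures[query], signature), word)
              for word, signature in signatures.items())
    return [word for score, word in sorted(scores, reverse=True)[:k]]
```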

As fascinating as the pre-Calvin Coolidge era is, one or two interesting things may have happened after 1923. But under common interpretations of US copyright law, it cannot be easily determined whether volumes published after this date are or are not under copyright. (Europe seems to be worse; HT wouldn’t show me anything after 1879 from a German IP address this summer.) Since there’s no efficient way to determine the copyright status of 11 million volumes, and institutions are reluctant to expose themselves to legal risk, full text access seems unlikely to be readily available after the magic date. (On the other hand, in 2004 the idea that we could scan and OCR 11 million volumes was patently ridiculous.)

Hathi Trust released the 250,000 volume data set to explore what we can do with representations that cannot be used to reconstruct the original text. There are plans to release similar information for the entire 11 million volume collection. LSH vectors are known to have privacy-preserving properties, although I cannot guarantee anything about the particular methods that I used.

This work is designed to provide an easy-to-use interface to statistics beyond simple word frequencies, statistics that require context-level information. This approach is, at best, only marginally better than word counts. My hope is that it will give you a sense of what is possible, and what we might hope for if and when the full word-count data set is released.