So, to paint the scene for my next series of posts, my current research involves using semantic indexing, combined with syntactic models, to look for analogies in nineteenth-century works. I’ll explain why at a later point — for now, I’d just like to lay out the software I’ve been using, and where I’m taking the project.
My current research, which I presented at ACA 2009 (a computational algebra conference) uses two main suites of tools. For the semantic indexing, I used the tools made available at CU-Boulder’s LSA lab. Semantic indexing proceeds by tokenizing a large database words, and getting the term-document frequency counts (counting how many times all of the words occur in each document). Then, using a technique called partial singular value decomposition, this matrix is reduced to a smaller matrix that effectively sifts through the co-occurrence statistics to try and sort out which relationships between the terms are most informative about the structure of the data set. Once you’ve got this index, you can come up with a rough representation of the meaning of a term or sentence by adding together the singular value vectors for each term. And you can describe differences in meaning in terms of the cosine of the angle between those vectors. The technique has proven very effective at, for instance, naive selection of synonyms.
The other tool I used was a part of speech tagger called Morphadorner, developed by the MONK project group. Morphadorner is just a small part of the software suite underpinning MONK, which includes a relational database and some built-in analysis tools derived from MEANDRE/SEASR. I like Morphadorner because it’s both trainable, and comes with a preset for tagging nineteenth-century fiction, which is largely what I’m interested in.
In the short term, I used these tools to do an analysis of the distribution of analogies in the 1859 text of On the Origin of Species, in order to investigate whether this approach could add support to some speculations about the role that analogy plays in that work and in scientific writing generally.
But there are several weaker aspects of this work. First, the semantic indexing tools at the CU Boulder site are limited, particularly by the training corpuses used for their singular value tables. I focused upon a general knowledge training set that covers several years of modern undergraduate course readings, because this seemed to include a better mix of both general and specialist knowledge for looking at scientific works. But it’s clearly problematic to use this library for analyzing nineteenth-century science, with its particular idioms, vocabulary, and habits of expression. What I need to do is create my own corpus of nineteenth-century works, preferably including a broad swathe of fictional, periodical, and scientific texts. Additionally, it would be nice if I could slice up that corpus in various ways, in order to examine the differences between, say, fictional and scientific corpuses, or earlier and later.
In addition, I need to do some additional training/verification of Morphadorner to make sure it’s tagging nineteenth-century scientific works properly, as well as the fiction. Hence the current project.