This is tongue-in-cheek, but during dinner last night we were wondering how long it would take Martin to finish the Song of Ice and Fire series. So I fit a polynomial to the publication dates and projected out. It looks like, if the erstwhile trilogy is limited to the seven books as currently projected, he should finish in early 2027, just in time for his 80th birthday. But if trends hold and either of the last two books is split in two, we’re looking at 2037, when he’s about to turn 90. Here’s hoping.
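For the curious, the projection is easy to reproduce. Here's a minimal sketch using numpy: the years are the actual publication dates of the five books out so far, but the choice of a quadratic fit is my assumption about how the curve was drawn.

```python
import numpy as np

# Publication years of the five books released to date (1996-2011);
# a book's "number" is just its position in the series.
books = np.array([1, 2, 3, 4, 5])
years = np.array([1996, 1998, 2000, 2005, 2011])

# Fit a degree-2 polynomial and extrapolate to books 6 and 7.
coeffs = np.polyfit(books, years, 2)
for n in (6, 7):
    print(f"Book {n}: projected {np.polyval(coeffs, n):.1f}")
```

A quadratic puts book 7 in the late 2020s; adding an eighth book to the same curve pushes the finish toward the late 2030s.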
Amazon “Academic AMIs” find a community
There’s been some broader interest in using Amazon AMIs to do academic research. I took my work installing the SEASR MEANDRE development infrastructure as an Amazon AMI instance to the folks over at SEASR/NCSA last Spring, in time for their visit to the U. Victoria DHSI (which I had the great pleasure of attending back in 2010). They shared the AMI work there, and my friend Jason Boyd (fellow DHSI class ’10) took it over to THATCamp Victoria the following week. This sparked the interest of James Smithies, who coined the term “Academic AMIs” and launched this site to try to support the use of various academic software packages using the AMI infrastructure. I think it’s a fabulous idea, and hope I can encourage some others to head over to James’ site and lend a hand.
It also brings up an infrastructure problem that I’ve been working with here at U Penn. I’ve been consulting with the excellent IT group here at the University on a few different projects, including MEANDRE, archive hosting, and demoing the use of TILE in the classroom. The challenge we’ve kept running into is that it’s much easier to get a net-based project up and running if it is hosted outside the university, because of various security concerns. For initial development, this isn’t so much of a problem, but when you start talking about longer-term projects (even if small), and University support, it gets more complicated.
Research Project
So, to paint the scene for my next series of posts, my current research involves using semantic indexing, combined with syntactic models, to look for analogies in nineteenth-century works. I’ll explain why at a later point — for now, I’d just like to lay out the software I’ve been using, and where I’m taking the project.
My current research, which I presented at ACA 2009 (a computational algebra conference), uses two main suites of tools. For the semantic indexing, I used the tools made available at CU-Boulder’s LSA lab. Semantic indexing proceeds by tokenizing a large database of documents and getting the term-document frequency counts (counting how many times each word occurs in each document). Then, using a technique called partial singular value decomposition, this matrix is reduced to a smaller matrix that effectively sifts through the co-occurrence statistics to sort out which relationships between the terms are most informative about the structure of the data set. Once you’ve got this index, you can come up with a rough representation of the meaning of a term or sentence by adding together the reduced vectors for each term. And you can describe similarity in meaning in terms of the cosine of the angle between those vectors. The technique has proven very effective at, for instance, naive selection of synonyms.
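The pipeline just described (term-document counts, then a truncated SVD, then cosine comparisons) can be sketched in a few lines of numpy. This is a toy illustration of the technique, not the CU-Boulder tooling; the four two-word “documents” are invented purely to show the mechanics.

```python
import numpy as np

# Toy corpus: four tiny "documents".
docs = ["cat pet", "dog pet", "stock market", "market trade"]
terms = sorted({w for d in docs for w in d.split()})

# Term-document frequency matrix: rows are terms, columns are documents.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# Partial (truncated) SVD: keep only the k strongest components, which
# sift the co-occurrence statistics down to the most informative
# relationships between terms.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * s[:k]          # reduced vector for each term
lookup = dict(zip(terms, term_vecs))

def meaning(text):
    """Rough meaning of a phrase: the sum of its term vectors."""
    return sum(lookup[w] for w in text.split())

def cosine(u, v):
    """Similarity as the cosine of the angle between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# "cat" and "dog" never co-occur, but both occur alongside "pet",
# so the reduced space places them close together, while "stock"
# lives in an unrelated region.
print(cosine(lookup["cat"], lookup["dog"]))    # high (near 1)
print(cosine(lookup["cat"], lookup["stock"]))  # low (near 0)
```

The same `cosine(meaning(a), meaning(b))` call extends the comparison from single terms to whole sentences, which is essentially how the CU-Boulder web tools report similarity scores.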
The other tool I used was a part-of-speech tagger called MorphAdorner, developed by the MONK project group. MorphAdorner is just a small part of the software suite underpinning MONK, which includes a relational database and some built-in analysis tools derived from MEANDRE/SEASR. I like MorphAdorner because it’s trainable and comes with a preset for tagging nineteenth-century fiction, which is largely what I’m interested in.
In the short term, I used these tools to do an analysis of the distribution of analogies in the 1859 text of On the Origin of Species, in order to investigate whether this approach could add support to some speculations about the role that analogy plays in that work and in scientific writing generally.
But there are several weaker aspects of this work. First, the semantic indexing tools at the CU Boulder site are limited, particularly by the training corpuses used for their singular value tables. I focused upon a general knowledge training set that covers several years of modern undergraduate course readings, because this seemed to include a better mix of both general and specialist knowledge for looking at scientific works. But it’s clearly problematic to use this library for analyzing nineteenth-century science, with its particular idioms, vocabulary, and habits of expression. What I need to do is create my own corpus of nineteenth-century works, preferably including a broad swathe of fictional, periodical, and scientific texts. Additionally, it would be nice if I could slice up that corpus in various ways, in order to examine the differences between, say, fictional and scientific corpuses, or earlier and later periods.
In addition, I need to do some additional training/verification of Morphadorner to make sure it’s tagging nineteenth-century scientific works properly, as well as the fiction. Hence the current project.
First Post
At the suggestion of a new-found friend and digital humanities compatriot (Matt Wilkins), I’m starting up a blog to keep track of my d.h. work, reading, and reflections. Over the next week or so, I’ll catch the blog up on the various projects I’ve been fiddling with over the past year, and suggest some of the avenues I’ll be pursuing. My goals are two-fold: to make the work (and the code associated with it) available to other researchers, and to hash out that work in a less formal environment. We’ll just have to see where all of this leads.