Talking TED (“Understanding Analogy: Theory and Method”)

A few months ago the Information Sciences Institute here at USC invited me to talk at one of their weekly Natural Language Seminars. They knew I’d been working on theorizing and analyzing analogies digitally, and wanted to hear more.

It was an exciting but daunting opportunity. How would I speak to an audience that thought about language, and the procedures for studying it, in such a radically different way? Several years ago I gave a talk like this at a conference of the Association of Computational Algebra. It didn’t go over well.

This time, I decided to experiment with a TED-style talk. There’s been a lot of criticism of the TED format, most of it centering on whether the talks are accurate and informative or simply entertainment. Some do seem to be the intellectual equivalent of cotton candy: tasty but evanescent. But they are also, I think, a model for how to talk to a wider audience and enlist interest across cultural, institutional, and disciplinary boundaries.

So I studied up. There’s Nancy Duarte’s TED talk on TED talks, and Chris Anderson, TED’s curator, has also shared his recipe. I think it boils down to three things. First, use biography (yours or another’s) to tell a coherent story that centers on the problem you work on. Second, have a clear transition from the problem to your answer. And finally, emphasize why that answer is powerful: what it changes about how we see the problem, and what it might mean for others. To put it differently, these talks rely on an analogy drawn between a personal narrative and a larger problem.

Put this way, it’s a recipe that applies to most of the good talks I’ve seen, except that TED talks are more personal and less complex. You have to put yourself forward and abandon qualifications, hedges, and the basic acknowledgement that others have been working on similar problems, often more successfully.

Despite my discomfort with the TED format, I’ve been trying to figure out how to get my scholarship out to a wider audience, especially to communities beyond academia. This seemed like a great opportunity to experiment.

So I sat down and hammered it out. Meg was out of town, which meant that most of the writing happened with my daughter in my lap, and we practiced with her in the BabyBjörn (she’s my biggest fan).

The final title: “Understanding Analogy: Theory and Method.” The folks at ISI posted it here. It doesn’t quite live up to the billing, but it worked. My auditors generally agreed that analogies are an important feature of new ideas and that I’d found a new way of looking at them. And since the talk, we’ve been discussing a collaboration on a machine-learning tool that finds analogies. I’m recruiting undergrads for some initial work this summer. It will be exciting to see where this leads.

Research Project

So, to set the scene for my next series of posts: my current research uses semantic indexing, combined with syntactic models, to look for analogies in nineteenth-century works. I’ll explain why at a later point; for now, I’d just like to lay out the software I’ve been using and where I’m taking the project.

My current research, which I presented at ACA 2009 (a computational algebra conference), uses two main suites of tools. For the semantic indexing, I used the tools made available at CU-Boulder’s LSA lab. Semantic indexing proceeds by tokenizing a large database of documents and computing term-document frequency counts (how many times each word occurs in each document). Then, using a technique called partial singular value decomposition, this matrix is reduced to a smaller one that effectively sifts the co-occurrence statistics to sort out which relationships between terms are most informative about the structure of the data set. Once you have this index, you can build a rough representation of the meaning of a term or sentence by adding together the reduced vectors for each of its terms, and you can describe how close two meanings are by the cosine of the angle between those vectors. The technique has proven very effective at, for instance, naive selection of synonyms.
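To make that pipeline concrete, here’s a minimal sketch in Python using scikit-learn rather than the CU-Boulder tools; the toy corpus and the query phrases are purely illustrative.

```python
# Minimal latent semantic indexing sketch: term-document counts,
# partial (truncated) SVD, then cosine similarity between summed
# term vectors. The four "documents" below are toy placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the struggle for existence among organic beings",
    "natural selection acts on variation under domestication",
    "the economy of nature and the division of labour",
    "variation and inheritance in domestic pigeons",
]

# 1. Tokenize and build the term-document frequency matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)      # shape: (documents, terms)

# 2. Partial SVD reduces the matrix to a low-rank index; real corpora
#    typically keep a few hundred dimensions, not two.
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)
term_vecs = svd.components_.T           # one reduced vector per term

def meaning(text):
    """Rough meaning of a term or sentence: sum its reduced term vectors."""
    vocab = vectorizer.vocabulary_
    idx = [vocab[w] for w in text.lower().split() if w in vocab]
    return term_vecs[idx].sum(axis=0)

def cosine(a, b):
    """Closeness in meaning as the cosine of the angle between vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(meaning("variation in pigeons"), meaning("selection and variation")))
```

The cosine runs from roughly −1 to 1, with higher values indicating closer meanings; the useful property of the reduced space is that terms which never co-occur directly can still come out similar if they keep similar company.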

The other tool I used was a part-of-speech tagger called MorphAdorner, developed by the MONK project group. MorphAdorner is just a small part of the software suite underpinning MONK, which includes a relational database and some built-in analysis tools derived from MEANDRE/SEASR. I like MorphAdorner because it’s trainable and comes with a preset for tagging nineteenth-century fiction, which is largely what I’m interested in.
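MorphAdorner itself is a Java suite with its own workflows, so rather than guess at its interface, here is the same step, part-of-speech tagging, sketched with NLTK as a stand-in; the output tags follow the Penn Treebank set, and the sentence is Darwin’s.

```python
# A rough stand-in for the tagging step, using NLTK's off-the-shelf
# tagger. A generic modern tagger like this is exactly what trainable,
# period-aware tools such as MorphAdorner aim to improve on.
import nltk

# Resource names can vary across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "There is grandeur in this view of life."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('There', 'EX'), ('is', 'VBZ'), ('grandeur', 'NN'), ...]
```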

In the short term, I used these tools to analyze the distribution of analogies in the 1859 text of On the Origin of Species, in order to investigate whether this approach could lend support to some speculations about the role analogy plays in that work and in scientific writing generally.

But this work has several weaknesses. First, the semantic indexing tools at the CU-Boulder site are limited, particularly by the training corpora behind their singular-value tables. I focused on a general-knowledge training set covering several years of modern undergraduate course readings, because it seemed to offer a good mix of general and specialist knowledge for looking at scientific works. But it’s clearly problematic to use this library for analyzing nineteenth-century science, with its particular idioms, vocabulary, and habits of expression. What I need to do is create my own corpus of nineteenth-century works, preferably including a broad swathe of fictional, periodical, and scientific texts. Additionally, it would be nice to slice up that corpus in various ways, in order to examine the differences between, say, fictional and scientific sub-corpora, or earlier and later periods.
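As a rough sketch of what that slicing might look like, here’s one way to group a corpus by metadata before building a separate semantic index for each slice; the records and field names ("genre", "year") are hypothetical, not an existing dataset.

```python
# Hypothetical metadata-driven slicing of a nineteenth-century corpus.
# Each slice could then be fed to the indexing pipeline sketched earlier.
from collections import defaultdict

corpus = [
    {"title": "On the Origin of Species", "genre": "science", "year": 1859, "text": "..."},
    {"title": "Bleak House", "genre": "fiction", "year": 1853, "text": "..."},
    # ... fictional, periodical, and scientific texts ...
]

def slice_corpus(records, key):
    """Group document texts by a metadata function, e.g. genre or period."""
    slices = defaultdict(list)
    for rec in records:
        slices[key(rec)].append(rec["text"])
    return slices

by_genre = slice_corpus(corpus, key=lambda r: r["genre"])
by_period = slice_corpus(corpus, key=lambda r: "early" if r["year"] < 1850 else "late")
```

Training a separate singular-value table on each slice would then let me compare, say, a fictional semantic space against a scientific one, rather than measuring everything against modern undergraduate readings.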

In addition, I need to do some further training and verification of MorphAdorner, to make sure it tags nineteenth-century scientific works as accurately as the fiction. Hence the current project.