Sampling the Novel (Dickens and Network “Connexion”)

A few weeks ago statistician Andrew Gelman posted an article that used Dickens’s social novels as an example of the perils of sampling networks (h/t to Jonathan Stray and Andrew Piper for tweeting about this). In conventional statistical practice, you can “sample” a larger diffuse or “atomistic” collection and get an accurate picture of what the larger group looks like; when you sample a few points in a large network, however, those samples give a very poor picture of the larger network structure. It’s a bit like the difference between picking a handful of M&M’s out of a bag and making an inference about the total color distribution (reasonably accurate), and sampling a handful of molecules within an M&M and making assumptions about what the larger shape, taste, paint pattern, etc. look like. The former doesn’t have much structure, but the latter does, and that structure matters.

Here’s how Gelman applies this to Dickens:

In traditional survey research we have been spoiled. If you work with atomistic data structures, a small sample looks like a little bit of the population. But a small sample of a network doesn’t look like the whole. For example, if you take a network and randomly sample some nodes, and then look at the network of all the edges connecting these nodes, you’ll get something much more sparse than the original. For example, suppose Alice knows Bob who knows Cassie who knows Damien, but Alice does not happen to know Damien directly. If only Alice and Damien are selected, they will appear to be disconnected because the missing links are not in the sample.

This brings us to a paradox of literature. Charles Dickens, like Tom Wolfe more recently, was celebrated for his novels that reconstructed an entire society, from high to low, in miniature. But Dickens is also notorious for his coincidences: his characters all seem very real but they’re always running into each other on the street (as illustrated in the map above, which comes from David Perdue) or interacting with each other in strange ways, or it turns out that somebody is somebody else’s uncle. How could this be, that Dickens’s world was so lifelike in some ways but filled with these unnatural coincidences?

My contention is that Dickens was coming up with his best solution to an unsolvable problem, which is to reproduce a network given a small sample. What is a representative sample of a network? If London has a million people and I take a sample of 100, what will their network look like? It will look diffuse and atomized because of all those missing connections. The network of this sample of 100 doesn’t look anything like the larger network of Londoners, any more than a disconnected set of human cells would look like a little person.

So to construct something with realistic network properties, Dickens had to artificially fill in the network, to create the structure that would represent the interactions in society. You can’t make a flat map of the world that captures the shape of a globe; any projection makes compromises. Similarly you can’t take a sample of people and capture all its network properties, even in expectation: if we want the network density to be correct, we need to add in links, “coincidences” as it were. The problem is, we’re not used to thinking this way because with atomized analysis, we really can create samples that are basically representative of the population. With networks you can’t.
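The sparseness Gelman describes is easy to see in a toy simulation. What follows is only a minimal sketch under stated assumptions: the "city" is a random graph with made-up sizes (10,000 people, each naming 20 acquaintances), not a model of Victorian London, and the helper names induced_edges and mean_degree are my own. It keeps only the links whose two endpoints both land in the sample, which is the sampling scheme described in the quoted passage.

```python
import random


def induced_edges(edges, sampled):
    """Keep only the edges whose two endpoints are both in the sample."""
    sampled = set(sampled)
    return [(a, b) for a, b in edges if a in sampled and b in sampled]


def mean_degree(n_nodes, edges):
    """Average number of links per node (each edge touches two nodes)."""
    return 2 * len(edges) / n_nodes


# Gelman's chain: Alice - Bob - Cassie - Damien.
chain = [("Alice", "Bob"), ("Bob", "Cassie"), ("Cassie", "Damien")]
# Sampling only the two ends loses every link between them.
chain_sample = induced_edges(chain, ["Alice", "Damien"])

# A toy "city" (illustrative numbers only): every person names 20
# acquaintances at random, and links are treated as mutual.
rng = random.Random(0)
n, k, m = 10_000, 20, 100
edges = set()
for person in range(n):
    for friend in rng.sample(range(n), k):
        if friend != person:
            edges.add((min(person, friend), max(person, friend)))

# Draw 100 people and look only at the links among them.
sample = rng.sample(range(n), m)
sample_edges = induced_edges(edges, sample)

full_mean = mean_degree(n, edges)
sample_mean = mean_degree(m, sample_edges)
print(f"mean degree, whole city: {full_mean:.1f}")
print(f"mean degree, sample of {m}: {sample_mean:.2f}")
```

Under uniform random sampling of nodes, the expected mean degree within the sample shrinks by a factor of about (m - 1) / (n - 1), so a hundredth of the city keeps only about a hundredth of each person's links. That is the gap that, on Gelman's reading, the "coincidences" are filling in.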

Gelman goes on to argue that all of the supposed “coincidences” of a Dickensian novel are an attempt to simulate network structure or “links” where the number of sampled nodes is too small to fill out a real map of the network’s structure. So coincidences simulate what would be major linkages in the actual network of London ca. 1850.

It’s a cool idea — and it gets right to the heart of the famous question posed by the narrator of Dickens’s Bleak House:

What connexion can there be between the place in Lincolnshire, the house in town, the Mercury in powder, and the whereabout of Jo the outlaw with the broom, who had that distant ray of light upon him when he swept the churchyard-step? What connexion can there have been between many people in the innumerable histories of this world who from opposite sides of great gulfs have, nevertheless, been very curiously brought together!

But for Dickens, “connexion” obviously means more than association between characters. It has moral, filial, and in Bleak House, even epidemiological dimensions. One of the questions that launched my book, The Age of Analogy, was to ask what connected the various discursive registers that operate in Bleak House: what connects the legal system of the Court of Chancery to the salvage economy of Krook’s Court; what links the virtue of Esther Summerson’s narrative position to the smallpox that sickens her? (Ultimately, I came to believe that one thing that links them is a new way of thinking about analogy, whether between characters, social formations, or discursive vocabularies, as a way to get at the sedimentary nature of history and social formations. I say “believe” because, along the way, Bleak House & Dickens fell out of the project.)

But whether or not this is true, Dickens’s characters do not operate “atomistically” or even as atoms linked by coincidence. What they do and how they interact displays a great deal of structure that is not pure invention. One way to get at this is to note that the network model has a dispersed physical and temporal dynamic that doesn’t lend itself to thinking about narrative. Narratives are not links, though narratives may feature interactions between characters (moments that would count as either “links” or “coincidences” in Gelman’s account). But they also convey important information about the transformation of individual characters, and their transit with respect to other conditions beyond the social: geographical and economic movement, maturation from youth to age, etc. And narratives, through their invocation of generic history, constantly invoke links to modes of thought and histories of representation that, in some sense, exceed the network of the novel and even the network of London at any given time.

Gelman himself brings up another kind of sampling in his paper that I think provides a better way of thinking about how Dickens attempts to get at larger social structures, something he terms “fractal sampling”:

When you do a survey, you want to learn at all levels. For example, if you’re studying politics, you’ll want to know what’s happening nationally, you’ll want a nationally representative sample. But you’ll also want to know what’s happening at the state level, the city level, and the neighborhood level. You can’t expect to get good estimates for all the neighborhoods in the country or all the cities or even all the states, but you’ll want some information at all these levels. That’s what fractal sampling is all about.

Basically, the point is that you can change the sampling methodology in order to capture specific kinds of group & scalar structure. I think this is a better description of what Dickens’s novels do. For a given social question (configured through a specific subset (or sub network?) within the larger world), each novel seems to seek out representative constellations of character that capture the key groups that operate within that network. So, to return to Bleak House, the key problem seems to have to do with poverty and responsibility, as configured by different social & class positions within the city, and as they interplay with legal, administrative, religious, medical, and domestic networks. And if we go back to Bleak House’s famous question, it basically samples along those lines: a country house, a townhome, a servant (the “Mercury in powder”), a street-sweeping urchin (“Jo”), a metaphysical visitation (the “distant ray of light”), the dirty churchyard step. I used to read this as an open question that assembled a more or less random collection to pose in extremis the problem of connection that underwrites all of what Henry James would later term “loose, baggy monsters.” Now I think there’s a fair case to be made that the question embeds a set of structural relations that underwrite the fractal sampling of a wider network of encounters: country and city, estate and town, servant (and master), poor and rich, church (and the secular government that will usher Jo from that stoop), and worldly infrastructure (the stone of the church) in its tenuous possible connection to divine revelation (the light from above).

Of course, as Jonathan Grossman has taught us, there are lots of different kinds of networks in Dickens’s novels. But it’s interesting to think about how single sentences, “What connexion can there be…,” can be important nodes in bringing them together and suggesting their analogies.

Gephi Network Visualization of Humphry Clinker

I’m still working on slides for my talk at the MLA on Stevenson and Oliphant, and Victorian reflections on the ’45 (force-directed network and Google map visualizations here and here). I’m also starting to experiment with Gephi, a powerful open source graph editor. I was blown away by Matthew Jockers’s “Nineteenth-Century Literary Genome” animation, and wanted to know how it was made. Apparently, they produced it one frame at a time as separate png files and then assembled them using QuickTime.

I’m still trying to figure out how to produce animations, but I like working in Gephi. It has a feature-rich interface, makes it easy to edit and remove nodes and to perform clustering and various forms of network analysis, and produces sharp images. Here is the location entity network from Humphry Clinker (1771), arranged into eight clusters, with nodes and edges colored by group:

Gephi makes beautiful static images and, as can be seen in the genome video, beautiful animations. On the other hand, unlike the Protovis graphs, finished visualizations are not dynamic or interactive. You can’t output a script-based visualization that the user can play with, or that could be embedded in a presentation. Not a problem for a presentation, really, but I like the activity that a Protovis graph can bring to web publishing.

I’m also evaluating these various visualization approaches in order to prepare for my historical fiction and fantasy seminar next semester, which will ask the students to help produce an online textual exhibit using Omeka. I’m going to ask them to look at what’s possible and then pitch paratextual visualizations & tools to package with the exhibit.

Google Fusion Tables

I’m still working on my Stevenson and Oliphant talk for the MLA, and I thought I’d try to map some of the location data that I’ve been collecting for that talk. My friend Mitch Fraas, a Bolinger Fellow here at Penn, has been using Fusion Tables to look at the geographic distribution of printed books from the records at Van Pelt. Basically, all you have to do is upload the location data as a table to Google Docs and select the visualization that you want. You can embed the visualization directly, or produce a Google Earth view that adds geographic images. I’ve done both below for Smollett’s Humphry Clinker. The interesting thing about such a visualization is that it helps to highlight the different imaginary spaces of geography. On the one hand, there are the physical locations. On the other, you can use tools like network analysis to figure out how closely associated those places are in the world of the book. (Zoom & click on icons for count. Count number is distinguished by color.)

Locations in Humphry Clinker (1771); Google Fusion Table Map

Differences between geomapping and other location-based visualizations can help to demonstrate how literary networks distort the geographic spaces of the novel. For instance, in the force-directed network graph at the bottom of this earlier post, Edinburgh is closer to England, and Scotland closer to France, due to the close proximity of these locations in terms of their citation in the novel.