Machine Grading


A friend of mine drew my attention to the NYTimes’ recent article on advances in essay-grading software. It’s technology that will raise hackles at campuses around the country. The claim is that such programs are becoming sophisticated enough to grade college-level writing. Of course, their effectiveness is widely debated. The article helpfully includes a link to a study by Les Perelman which critiques the data being used to support such claims (he argues that sample size problems, confusion between distinct kinds of essays and grading systems, and loose assertions undermine the argument). The software is getting better, but it still doesn’t look like it can quite replicate the scores produced by human graders.

But such criticism is an argument at the margins. There is now clearly room for debate on both sides: on standardized tests, machines are roughly comparable to human graders. The long-term trajectory is evident: if machines are roughly as effective as a force of part-time human graders, standardized tests will end up using the software to save money. They’ll keep some humans in the loop for cross-checking and validation, but the key incentives all point in the direction of greater automation. The reductive structures and simplistic arguments that we train students to replicate for these tests have laid the groundwork. We’ve already whittled essay writing into an algorithm.

As Stephen Ramsay noted, machines have been reading for some time now. That they should move on to grading makes sense. And it will produce new opportunities for educators and the massive for-profit side of the education industry. For example, there’s bound to be some asymmetry between the putative essay rubric and what the algorithm is actually looking for. I don’t imagine it will take too long before test prep groups hack it out and come up with tips to optimize essay performance that have little or nothing to do with what we might think of as strong, analytic writing. But that’s the game.

Things are considerably less certain for college coursework. Setting aside the perceived value to the student of having a human look over their papers (which is considerable enough), and the challenge of revising (what feedback do you give students for revisions?), it’s hard to see essay-grading software flexible enough to evaluate the field-specific features of writing within the disciplines. The problem only gets harder when you think about variation between individual assignments and between faculty preferences. We’re each looking for a distinct, sometimes widely divergent set of things when we give assignments. How do we tell the machine to look for them?

Part of the problem is with machine learning generally. I’ve been interested in it for a while. Latent Semantic Analysis (the tool I used to look for analogies in the first edition of On the Origin of Species a few years ago) was already being tested as a system for essay evaluation in the nineties. But such naive machine-learning techniques are a challenge because it’s hard to say precisely what they’re looking for. What are the basic structures and semantic relationships that constitute good writing for a given case? The programs can’t tell us.

Alternatively, structured machine learning approaches suffer from a problem of translation. Imagine there were a program that allowed you to weight the importance of, say, transitions, grammar, number and quality of citations, etc. for the assignment. Our individual sense of what we mean by these things varies widely. There is a ton of implicit judgement involved in grading and commenting. Often, priorities shift while grading. But if we plug these things into a program, it’s hard to know exactly how it understands these values and categories. And if we have to cross-check and validate extensively, the labor savings go out the window.
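
To make the translation problem a little more concrete, here is a minimal sketch in Python of the kind of weighting program imagined above. Everything in it is hypothetical: the `Rubric` class, the category names, and the per-category scores (which some upstream tool would have to produce, and which are simply typed in here) are assumptions for illustration, not a description of any real grading software.

```python
from dataclasses import dataclass


@dataclass
class Rubric:
    """Hypothetical instructor-set weights, one per rubric category."""
    weights: dict[str, float]

    def score(self, features: dict[str, float]) -> float:
        """Weighted average of per-category scores, each assumed to lie in 0..1."""
        total = sum(self.weights.values())
        if total <= 0:
            raise ValueError("rubric weights must sum to a positive number")
        weighted = sum(
            weight * features.get(name, 0.0)  # a missing category counts as 0
            for name, weight in self.weights.items()
        )
        return weighted / total


# Made-up per-category scores for a single essay (assumed, not measured).
essay = {"transitions": 0.5, "grammar": 1.0, "citations": 0.0}

# Two instructors with different priorities for the same assignment.
rubric_a = Rubric({"transitions": 1, "grammar": 2, "citations": 1})
rubric_b = Rubric({"transitions": 1, "grammar": 1, "citations": 2})

print(rubric_a.score(essay))  # 0.625
print(rubric_b.score(essay))  # 0.375
```

The arithmetic is the easy part. The same essay lands a quarter of the scale apart depending on whose weights are used, and nothing in the sketch says what a "transitions" score of 0.5 actually means or how the machine arrived at it.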

Then there’s the graduate labor problem. I’d prefer to have grad students teaching their own courses in their field with more or less supervision depending on how far advanced they are. But at many schools (including my own), this is difficult to impossible. TAing large courses is sometimes one of the only ways that grad students get experience in courses specific to their degree and get to work with a mentor on pedagogy. Take away the need for grading and it’s going to be that much harder to support graduate programs.

But I hate to make an argument for inefficiency. I can imagine, for instance, courses where machines help grade a series of short essay assignments, with some kind of robust feedback structure, leading up to a culminating paper graded by the prof. Or using such machine grading and feedback to run through 3 or 4 rounds of revision, perhaps for distinct qualities, rather than 1 or 2 omnibus redrafts. Don’t know if the students would like it, but I imagine you’d have better papers at the end of the day.