A Program Is Articulate

Rearing its head out of Helen’s corner of the twitter-sphere around the occasion of the great Austrian’s 122nd birthday (and sixty years since his death) comes the Tractatus Digito-Philosophicus, a recasting of Wittgenstein’s landmark first book into software terms.

2.0122 […] (It is impossible for words to appear in two different roles: by themselves, and in programs.)

There are several appealing elements to this self-described “odd venture”. One is that the translation is to a degree automatic, based on a simple search and replace table found at the end. It is logical positivism via sed. Another is that the Tractatus was written during and soon after Wittgenstein’s time working as an actual engineer – first as an aviation research engineer at Manchester University, and later supervising technicians in a supply depot in World War I. He was not temperamentally very well suited to engineering work. Biographers have traditionally downplayed this as an intellectual influence, though Susan Sterrett explores interesting parallels and possible influences around the idea of engineering models in the well-titled and readable Wittgenstein Flies A Kite.
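For the curious, a table-driven search and replace of the kind described can be sketched in a few lines of Python. This is a minimal sketch and not the Tractatus Digito's own code: the two table entries are my guesses, inferred only from proposition 3.141 as quoted further down, and the names TABLE and recast are hypothetical.

```python
import re

# Hypothetical replacement table. These two entries are guesses inferred
# from proposition 3.141; the real table lives at the end of the
# Tractatus Digito-Philosophicus and is not reproduced here.
TABLE = {
    "proposition": "program",
    "words": "instructions",
}


def recast(text, table=TABLE):
    """Swap whole words according to the table, keeping a leading capital."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in table) + r")\b",
        re.IGNORECASE,
    )

    def swap(match):
        word = match.group(0)
        new = table[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    return pattern.sub(swap, text)


print(recast("A proposition is not a blend of words."))
```

A blanket table like this would also rewrite "words" in 2.0122, where the published text keeps it, so the real translation is evidently not a pure one-pass substitution.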

The Tractatus Digito has the virtue of poetry (metaphor, simile, and so on) in presenting the same information from a different perspective and so firing different connections in the brain. But it’s more systematic than poetry as well. It’s not just a martial arts metaphor, as rhetorically useful as they can be. To contrast with an example close to hand, attempting to describe software in Confucian terms is a project fuelled as much by juxtaposition and analogy as correspondence. The mapping to that world will always be a partial one.

Ainsworth, rather, has noticed what every undergraduate programmer should know: that programs are sequences of logical propositions. So Wittgenstein is necessarily writing about software, or perhaps more specifically, because there is no social dimension, about programs. Our thinking about software is intertwined with its origins in the 1920s. This partial recasting is valuable in the same way a Turing Machine simulator is valuable. Sure, some of the resulting sentences don’t really make sense. Yet bringing registers and sorting algorithms into the book that invented truth tables feels less like visiting a foreign land, and more like hearing a friend talk excitedly on their return to the old family home.

3.141

A program is not a blend of instructions. – (Just as a theme in music is not a blend of notes.)

A program is articulate.

A Quantitative Exploration of the CRwM Stripper Genre Collision Hypothesis

We test the proposition that the number of movies in the same calendar year combining strippers and a specific subgenre cannot exceed one without the genre being fatally destroyed in popularity. We name this the CRwM Stripper Genre Collision Hypothesis after its originator, CRwM, writing at And Now The Screaming Starts. Simple computational analysis of the wikipedia English language corpus provides some, non-definitive, support.

There’s No Computational Sociology In The Champagne Room

CRwM writes:

When no systematic approach is available, then the best you can do is pick an arbitrary point and map out the trajectory of the waning trend from your specific frame of reference. So long as you admit your frame of reference is singular, you’ll allow others, each observing the same phenomenon from their distinct frames of reference, to make whatever calculations to sync up the observations. Applying this pragmatic solution to the problem, I’m defining my frame of reference thusly: A trend is creatively bankrupt when, within a single year, you get two films mashing-up the trend with strippers. — ibid

Though there may indeed be no objective measure for trend fizzlation, we can perhaps be more systematic in our investigation of any given frame of reference. Our approach is to leverage techniques from computational sociology to brute force search a partially structured film description corpus. Though IMDB is probably the best data source readily available to the public, its USD 15,000 license fee means using it will have to wait until my long awaited funding from the Institute of Piss Farting About comes through. In the meantime, wikipedia is actually a pretty decent source. Though the English language version does not have great coverage on, e.g., foreign films, I would hypothesise that the gaps are congruent with the focus on trendspotting in this exercise. Wikipedia is, in the words of Bruce Sterling, a kind of common sense engine. It is suited to sampling what we think we know.

Though the approach might be naive enough to be described as folk computational sociology, I prefer to think of it as punk rock.

Spins A Web, Any Size

Though there is a very active tools community around wikipedia, most of it seems to be focused on productivity scripts for editors. Things like auto classification and flagging scripts are popular, and no doubt very useful to the editorial group, if the robot history on my very occasional contributions is any guide. The search toolset seemed from a brief google literature search to be either very simple and widely available (use google to hit a single page) or sophisticated papers of vector based searches implemented on the server side. Our cinematic exploration seemed to fall between these extremes.

My first crack at getting movie data out of wikipedia was to hit the film category page for a year and script a primitive web spider to suck down all the data from that starting point. The top entry on stack overflow also happens to suggest this.

Though I did get this on the way to working, and it can be seen as movieSpider.py at the github repo mentioned later, it’s a lousy approach. Not only do you have to tool about with faked headers because wikipedia doesn’t really want you to do this, you hit the same pages over and over again while troubleshooting. You have to deal with the relatively unstructured format of HTML with embedded tags, which implies bucketfuls of heuristics to pull out anything meaningful. Plus if you get it working, you will want to expand the time range, and end up downloading a fair chunk of wikipedia anyway.

Takeaway Corpi

It turns out that wikipedia hosts backups of its entire database in convenient xml export formats. This includes partition by language and current version archives (without all the history and discussion). These data dumps are available here. At a couple of gig, compressed, even the fairly pathetic caps and bandwidth rates of say, Australian broadband, can deal with it in a day or so while you amuse yourself playing badminton. Once uncompressed, a recent version of English language wikipedia takes up around 30 GB, or in other words, can fit on an iPhone 4.

Once available locally, running searches is quicker, particularly while debugging a script. Extracting a subset also becomes much simpler. Pros seem to rebuild the entire database, including indexes. Indexes didn’t seem much use to me here, as I was hitting the full content of a page, but maybe I’ve underestimated the power of the word indexing in a basic local database. At any rate, the structured data was sufficient for me.

A short python script of a few hundred lines lets us pull out a particular subset of wikipedia according to a regular expression run as a search on the page content. If there is a hit, we save the entire page. This is found in movie.py and available at github. Building the subset file takes about seven hours on my machine. Using the regex [Category:[0-9]??? films], we can pull in any page that mentions films of a particular year. The resultant subset is a decent film corpus weighing in at a trim 292 MB.

This same script can be used with minor modification for searching other spaces that attract wikipedia editors of a particularly pedantic and taxonomical breed. Their painstaking sifting of the world into categories is what makes tricks like this possible. You could, for instance, use it to build a wiki subset of military battles with a regex like [Category:Battles involving.*].
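The extraction pass described above can be sketched roughly as follows. This is a minimal sketch and not the contents of movie.py: it assumes the dump follows the usual MediaWiki export layout (page elements holding a title and a revision text), the name matching_pages is hypothetical, and FILM_YEAR is my own stricter spelling of the category regex rather than the one quoted above.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed pattern for a film-year category link; the post's own regex differs.
FILM_YEAR = r"\[\[Category:\d{4} films\]\]"


def local(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def matching_pages(source, pattern):
    """Stream a dump and yield (title, text) for pages whose text matches."""
    regex = re.compile(pattern)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        if regex.search(text):
            yield title, text
        # Free the page's children so a multi-gigabyte dump is not held
        # in memory; the emptied page elements themselves still accumulate.
        elem.clear()


# Tiny stand-in for a real dump, for illustration only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Zombie Strippers</title>
    <revision><text>Horror comedy. [[Category:2008 films]]</text></revision></page>
  <page><title>Badminton</title>
    <revision><text>A racquet sport.</text></revision></page>
</mediawiki>"""

for title, _text in matching_pages(io.BytesIO(SAMPLE), FILM_YEAR):
    print(title)
```

Swapping FILM_YEAR for a pattern such as `Category:Battles involving.*` gives the battles subset; for a real dump, pass an open file object in place of the BytesIO.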

Once the subset database is built, we can run a similar expression search across it, but one aware of the structure of film pages – that they have a title, and a category indicating a year. We can therefore attempt a quantitative validation of the search CRwM did by pure pop culture brainpower:

$ python movie.py -e stripper -e zombie

The result of this is

Zombie Strippers -- 2008
Kiss the Bride (2008 film) -- 2008 # false positive
Zombies! Zombies! Zombies! -- 2007
I Am Virgin -- 2010
Big Tits Zombie -- 2010
Can't Hardly Wait -- 1998 # song by White Zombie on soundtrack
The Incredibly Strange Creatures Who Stopped Living and Became Mixed-Up Zombies -- 1964
Hide and Creep -- 2004
end 8
Hits: 8 Scanned: 60757

This is not a fully automated process – as I have annotated above, Kiss The Bride, though it possibly would have been enriched by either zombies or strippers, has neither. A review of Zombie Strippers is instead cited in its footnotes. Similarly this brute force text search is ignorant of synonyms – any great zombie burlesque films of the 1920s are liable to be skimmed over without comment.
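The second pass can be sketched in the same spirit. Again this is an assumption-laden sketch and not movie.py itself: I am guessing that repeated -e expressions must all match, that matching is case-insensitive, and that the year comes from the film-year category link; film_hits and YEAR_CATEGORY are hypothetical names.

```python
import re

# Assumed form of the category link that carries a film's year.
YEAR_CATEGORY = re.compile(r"\[\[Category:(\d{4}) films\]\]")


def film_hits(pages, terms):
    """Return (title, year) for film pages whose text matches every term."""
    patterns = [re.compile(term, re.IGNORECASE) for term in terms]
    hits = []
    for title, text in pages:
        year = YEAR_CATEGORY.search(text)
        if year is None:
            continue  # not a film page, as far as this heuristic can tell
        if all(pattern.search(text) for pattern in patterns):
            hits.append((title, int(year.group(1))))
    return hits


# Made-up page texts, for illustration only.
pages = [
    ("Zombie Strippers", "A stripper becomes a zombie. [[Category:2008 films]]"),
    ("Kiss Me Deadly", "Film noir. [[Category:1955 films]]"),
    ("Zombie walk", "A zombie parade with a stripper float."),
]

for title, year in film_hits(pages, ["stripper", "zombie"]):
    print(f"{title} -- {year}")
```

A plain text match like this is exactly what lets a footnote citing Zombie Strippers drag Kiss the Bride into the results, so the hand annotation step stays.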

We also find that the editorial consensus at wikipedia disagrees with CRwM on one crucial point – it asserts Zombie Strippers and Zombies! Zombies! Zombies! were not made in the same calendar year.
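Checking a list of hits against the hypothesis is then a matter of counting films per year. A minimal sketch, with the hypothetical name collision_years, fed the listing above minus the two entries annotated as false positives:

```python
from collections import defaultdict


def collision_years(hits, limit=1):
    """Group (title, year) hits by year; keep years with more than `limit` films."""
    by_year = defaultdict(list)
    for title, year in hits:
        by_year[year].append(title)
    return {
        year: titles
        for year, titles in sorted(by_year.items())
        if len(titles) > limit
    }


# Copied from the listing above, dropping the two annotated false positives.
VERIFIED_HITS = [
    ("Zombie Strippers", 2008),
    ("Zombies! Zombies! Zombies!", 2007),
    ("I Am Virgin", 2010),
    ("Big Tits Zombie", 2010),
    ("The Incredibly Strange Creatures Who Stopped Living and Became Mixed-Up Zombies", 1964),
    ("Hide and Creep", 2004),
]

for year, titles in collision_years(VERIFIED_HITS).items():
    print(year, titles)
```

Any year this returns is only a candidate for manual review: whether each film in it really counts as a mash-up of strippers and the subgenre is a judgment the script cannot make.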

Applications

Though the script should be a productivity boost to film subgenre scholars, it still requires a great deal of human insight to make its results valuable. It works best with very concrete and widely recognized subgenre identifiers. Any more complex critical viewpoint is obscured by the lack of a shared jargon. For instance, CRwM’s example of From Dusk till Dawn as being a pomo deadpan crime flick is hardly controversial, but insufficiently universal to appear in wikipedia entries across the subgenre.

Some other notable datapoints:

  • There are no stripper werewolf films. The blue movie scene near the end of American Werewolf In London is insufficiently focused to qualify.
  • This technique confirms no vampire stripper collision counts exceeding one in a calendar year.
  • Two films in 1964 were both musicals and featured strippers: Robin and the 7 Hoods and the aforementioned and new to the author The Incredibly Strange Creatures Who Stopped Living and Became Mixed-Up Zombies. Since The Sound of Music appeared in 1965, and the genre survived at least until the popularity of Cabaret, we posit that either Robin contains insufficient stripper content to qualify as a stripper musical, or that musicals, as a full-blown genre, are outside the scope of the CRwM Stripper Genre Collision Hypothesis.

Though we believe our results should be repeatable, keeping in mind the central role of a critical human eye in this endeavour, for the convenience of those cinephiles who are interested in the output, but not technically inclined, a sorted listing of every stripper film in wikipedia is provided. This paragraph seems a fair bit creepier now than when we first thought of it.

Conclusion and future work

Searching for “stripper zombie” on wikipedia yields 108 results of varying quality. Using the techniques above this can be narrowed to six films. A film subset database built from an expression seems like something someone else could use. Say, to pitch a zombie werewolf stripper musical. Perhaps one that’s incredibly mixed up.

The two-stripper-flicks-a-year thing isn’t meant as a value judgment. It’s simply a law of the universe. — TNCITCM, you know, the article this whole post is about.

XMORPG

Is there a market for a premium multiplayer online roleplaying game? Not massively multiplayer – exclusively multiplayer?

Consider it. You log straight in – no queues because it’s not oversubscribed. You notice there’s another mate online, but you want to put together a party to tackle an interesting new dungeon in Erewhat that’s just become available. You put your pick up group (PUG) note out, and while away five or ten minutes at the tavern or shophouse on the inevitable logistics of inventory management and planning that accompanies all RPGs. Suddenly your character’s sleeve is tugged by a young man. It’s an NPC, Bob the messenger. “Sir, the village of Erewhat is under assault! We have a team of heroes going to save it, will you aid us?” As usual, the adventure concierge is being played by a real person, and he feeds you context on the way, as you converse with him. Or in the case of team thump, when you are missing a really big clue. Many NPCs are from the English speaking developing world: it’s a growing job market in Manila and Mumbai. Robots are only used when the admins are particularly busy putting out fires somewhere.

You join the three other players outside the village and soon charge into the fray, throwing off waves of zombie skeleton wombats. You lose track of your young messenger Bob – did he die or slip away to other great deeds? Never mind, there are baddies to kill, and you happily hew your way through the dungeon. It’s good fun, but nothing you haven’t done a thousand times before, to be honest.

Then – wait. Was that arrow from behind? You swing around. A swarm of ninja koalas have snuck behind you, and have nearly sliced the squishies into mince already. It’s an ambush. Maybe even encirclement. Smeg! You’ve opted to lose gear and XP on death for a discount in fees and the extra pang of having skin in the game. You fall back to a doorway, making sure the priest is ok. Behind the scenes, and from the conversation with the messenger, a DM has noticed you’re an experienced band and would relish more of a challenge. As in NWN, roaming DMs might jump into a dungeon and make it more interesting. Unlike in NWN, they are paid staff.

You dig deep, and you’re holding the monsters at bay, but you’re in lousy shape. Some of the party weren’t quite as experienced as they said. Usually this would mean they die and respawn. Today’s a bit unusual though. “Hail fellows, well met!” – it’s Bob. He shouts out from behind you. It turns out you fell back into a mansion house, and Bob knows a secret tunnel. If there hadn’t been a tunnel, he wouldn’t have intervened, but this makes for an interesting moment. Barring the door for a minute to keep out the monsters, you slip into the tunnel, a rest area, and the start of a new underground zone …

Maybe my description is off – I’ve only newbed around on a few free online RPGs and never really got bitten by the bug. At the moment, though, WoW is so huge it’s necessarily a mass consumer product. It’s the McDonald’s of MMOs. And though I did eat McDonald’s when drunk the other night, if I want a nice meal out, I pay a bit more to go to a nice restaurant where I can enjoy the food and the social setting. It’s been noted many times that gamers are on average 30+ years old with stable lives and incomes. They’re only getting older and richer. Whether you like the class implications or not, the age of the online country club is coming our way.

Confucian Software For The Impatient

For those who have found themselves intrigued by the conceptual parallels between a group of underemployed, bookish Han Chinese civil servants from two and a half thousand years ago and the electrically engineered plumbing of our contemporary information age, but find my treatment of it rather longwinded, not to mention sullied somewhat by a Dickensian tendency to long, single-paragraph sentences with interstitial tangents, this short presentation by nein geist on Confucian Philosophy in Software Design may be more to taste.

Bayeux and Information History

Chris Mellor at El Reg has a nice digestible crumpet of a piece considering the Bayeux tapestry as a storage device.

It’s a member of a genre I think of as Information History. By analogy with the field of economic history, information history studies the information environment of past societies, using later scholarly techniques to analyze it and its impacts. Economic history sometimes gives us rich context, to estimate, say, grocery or property prices in ancient Rome, and give us knowledge the Romans had but we do not. It can also throw up startling and beautiful perspectives on the past and the future that have not been explored systematically before, as in this classic paper, The Colonial Origins of Comparative Development:

We exploit differences in the mortality rates faced by European colonialists to estimate the effect of institutions on economic performance. Our argument is that Europeans adopted very different colonization policies in different colonies, with different associated institutions. The choice of colonization strategy was, at least in part, determined by whether Europeans could settle in the colony.

Acemoglu et al. here use sophisticated stats techniques and painstaking colonial scholarship to accumulate evidence for their thesis.

Mellor wrote a newspaper article, not a paper, and we shouldn’t hold him to a scholarly standard he didn’t claim. Even given that, what distinguishes it as a document of information history, rather than just a neat analogy, is that he calculates the read / write rates. (2.168 bytes per hour write speed, on the back of Mellor’s envelope.) He puts it in systematic terms only given meaning in the last fifty years or so, and thereby enriches our understanding of their time, and ours.