A Quantitative Exploration of the CRwM Stripper Genre Collision Hypothesis

We test the proposition that the number of movies in the same calendar year combining strippers and a specific subgenre cannot exceed one without the genre being fatally destroyed in popularity. We name this the CRwM Stripper Genre Collision Hypothesis after its originator, CRwM, writing at And Now The Screaming Starts. Simple computational analysis of the wikipedia English language corpus provides some, non-definitive, support.

There’s No Computational Sociology In The Champagne Room

CRwM writes:

When no systematic approach is available, then the best you can do is pick an arbitrary point and map out the trajectory of the waning trend from your specific frame of reference. So long as you admit your frame of reference is singular, you’ll allow others, each observing the same phenomenon from their distinct frames of reference, to make whatever calculations to sync up the observations. Applying this pragmatic solution to the problem, I’m defining my frame of reference thusly: A trend is creatively bankrupt when, within a single year, you get two films mashing-up the trend with strippers. — ibid

Though there may indeed be no objective measure for trend fizzlation, we can perhaps be more systematic in our investigation of any given frame of reference. Our approach is to leverage techniques from computational sociology to brute force search a partially structured film description corpus. Though IMDB is probably the best data source readily available to the public, its USD $15000 license fee means using it will have to wait until my long-awaited funding from the Institute of Piss Farting About comes through. In the meantime, wikipedia is actually a pretty decent source. Though the English language version does not have great coverage on, e.g., foreign films, I would hypothesise that the gaps are congruent with the focus on trendspotting in this exercise. Wikipedia is, in the words of Bruce Sterling, a kind of common sense engine. It is suited to sampling what we think we know.

Though the approach might be naive enough to be described as folk computational sociology, I prefer to think of it as punk rock.

Spins A Web, Any Size

Though there is a very active tools community around wikipedia, most of it seems to be focused on productivity scripts for editors. Things like auto classification and flagging scripts are popular, and no doubt very useful to the editorial group, if the robot history on my very occasional contributions is any guide. From a brief google literature search, the search toolset seemed to be either very simple and widely available (use google to hit a single page) or very sophisticated (papers on vector based searches implemented on the server side). Our cinematic exploration seemed to fall between these extremes.

My first crack at getting movie data out of wikipedia was to hit the film category page for a year and script a primitive web spider to suck down all the data from that starting point. The top entry on stack overflow also happens to suggest this.

Though I did get this on the way to working, and it can be seen as movieSpider.py at the github repo mentioned later, it’s a lousy approach. Not only do you have to tool about with faked headers because wikipedia doesn’t really want you to do this, you hit the same pages over and over again while troubleshooting. You have to deal with the relatively unstructured format of HTML with embedded tags, which implies bucketfuls of heuristics to pull out anything meaningful. Plus if you get it working, you will want to expand the time range, and end up downloading a fair chunk of wikipedia anyway.

Takeaway Corpi

It turns out that wikipedia hosts backups of its entire database in convenient xml export formats. This includes partition by language and current version archives (without all the history and discussion). These data dumps are available here. At a couple of gig, compressed, even the fairly pathetic caps and bandwidth rates of, say, Australian broadband can deal with it in a day or so while you amuse yourself playing badminton. Once uncompressed, a recent version of English language wikipedia takes up around 30 GB, or in other words, can fit on an iPhone 4.

Once available locally, running searches is quicker, particularly while debugging a script. Extracting a subset also becomes much simpler. Pros seem to rebuild the entire database, including indexes. Indexes didn’t seem much use to me here, as I was hitting the full content of a page, but maybe I’ve underestimated the power of the word indexing in a basic local database. At any rate, the structured data was sufficient for me.

A short python script of a few hundred lines lets us pull out a particular subset of wikipedia according to a regular expression run as a search on the page content. If there is a hit, we save the entire page. This is found in movie.py and available at github. Building the subset file takes about seven hours on my machine. Using the regex [Category:[0-9]??? films], we can pull in any page that mentions films of a particular year. The resultant subset is a decent film corpus weighing in at a trim 292 MB.

This same script can be used with minor modification for searching other spaces that attract wikipedia editors of a particularly pedantic and taxonomical breed. Their painstaking sifting of the world into categories is what makes tricks like this possible. You could, for instance, use it to build a wiki subset of military battles with a regex like [Category:Battles involving.*].
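
For the curious, the subset-building step described above can be sketched roughly as follows. This is not movie.py itself, just a minimal illustration under assumptions: the function names are mine, the pattern FILM_YEAR is my reading of the category regex quoted above, and the two-page sample is invented so the block runs without the full dump.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed reading of the post's category pattern; not copied from movie.py.
FILM_YEAR = r"\[\[Category:[0-9]{4} films\]\]"


def local(tag):
    """Strip an XML namespace prefix: '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def matching_pages(source, pattern):
    """Yield (title, wikitext) for each <page> whose text matches pattern."""
    regex = re.compile(pattern)
    # iterparse streams the file, so the whole dump never sits in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        if regex.search(text):
            yield title, text
        elem.clear()  # drop the page's children once it has been handled


# Invented two-page sample standing in for the real export file.
SAMPLE = b"""<mediawiki>
  <page><title>Zombie Strippers</title>
    <revision><text>Stub. [[Category:2008 films]]</text></revision></page>
  <page><title>Badminton</title>
    <revision><text>Stub. [[Category:Racquet sports]]</text></revision></page>
</mediawiki>"""

hits = list(matching_pages(io.BytesIO(SAMPLE), FILM_YEAR))
for title, _text in hits:
    print(title)
```

Against a real dump you would hand matching_pages an open binary file instead of the sample, for instance one from bz2.open(path, "rb") if your copy is still bzip2-compressed. Swapping the pattern is all it takes to chase battles instead of films.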

Once the subset database is built, we can run a similar expression search across it, but one aware of the structure of film pages – that they have a title, and a category indicating a year. We can therefore attempt a quantitative validation of the search CRwM did by pure pop culture brainpower:

$ python movie.py -e stripper -e zombie

The result of this is

Zombie Strippers -- 2008
Kiss the Bride (2008 film) -- 2008 # false positive
Zombies! Zombies! Zombies! -- 2007
I Am Virgin -- 2010
Big Tits Zombie -- 2010
Can't Hardly Wait -- 1998 # song by White Zombie on soundtrack
The Incredibly Strange Creatures Who Stopped Living and Became Mixed-Up Zombies -- 1964
Hide and Creep -- 2004
end 8
Hits: 8 Scanned: 60757

This is not a fully automated process – as I have annotated above, Kiss The Bride, though it possibly would have been enriched by either zombies or strippers, has neither. A review of Zombie Strippers is instead cited in its footnotes. Similarly this brute force text search is ignorant of synonyms – any great zombie burlesque films of the 1920s are liable to be skimmed over without comment.
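
A rough sketch of that second pass, and of how the Kiss the Bride false positive sneaks in, might look like this. I am assuming that repeated -e expressions must all match and that matching is case-insensitive; the function name search_films and the stub page texts are invented for illustration and are not taken from movie.py.

```python
import re

# Film pages carry a category naming their year.
YEAR_CATEGORY = re.compile(r"Category:([0-9]{4}) films")


def search_films(pages, expressions):
    """Return (title, year) for film pages matching every expression.

    pages is an iterable of (title, wikitext) pairs. Matching is a plain
    case-insensitive text search over title and body, so footnotes count.
    """
    regexes = [re.compile(e, re.IGNORECASE) for e in expressions]
    hits = []
    for title, text in pages:
        year = YEAR_CATEGORY.search(text)
        if year is None:
            continue  # no film-year category, so not a film page
        haystack = title + "\n" + text
        if all(r.search(haystack) for r in regexes):
            hits.append((title, int(year.group(1))))
    return hits


# Invented stub pages; only the titles and years come from the listing.
PAGES = [
    ("Zombie Strippers", "Stub. [[Category:2008 films]]"),
    ("Kiss the Bride (2008 film)",
     "Stub. <ref>Review of Zombie Strippers</ref> [[Category:2008 films]]"),
    ("Badminton", "Stub. [[Category:Racquet sports]]"),
]

hits = search_films(PAGES, ["stripper", "zombie"])
for title, year in hits:
    print(f"{title} -- {year}")
```

Both film stubs come back as hits, the second purely on the strength of its footnote, which is exactly why a human still has to read the list.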

We also find that the editorial consensus at wikipedia disagrees with CRwM on one crucial point – it asserts Zombie Strippers and Zombies! Zombies! Zombies! were not made in the same calendar year.

Applications

Though the script should be a productivity boost to film subgenre scholars, it still requires a great deal of human insight to make its results valuable. It works best with very concrete and widely recognized subgenre identifiers. Any more complex critical viewpoint is obscured by the lack of a shared jargon. For instance, CRwM’s example of From Dusk till Dawn as being a pomo deadpan crime flick is hardly controversial, but insufficiently universal to appear in wikipedia entries across the subgenre.

Some other notable datapoints:

  • There are no stripper werewolf films. The blue movie scene near the end of American Werewolf In London is insufficiently focused to qualify.
  • This technique confirms no vampire stripper collision counts exceeding one in a calendar year.
  • Two films in 1964 were both musicals and featured strippers: Robin and the 7 Hoods and the aforementioned and new to the author The Incredibly Strange Creatures Who Stopped Living and Became Mixed-Up Zombies. Since The Sound of Music appeared in 1965, and the genre survived at least until the popularity of Cabaret, we posit that either Robin contains insufficient stripper content to qualify as a stripper musical, or that musicals, as a full-blown genre, are outside the scope of the CRwM Stripper Genre Collision Hypothesis.
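
The collision counts behind these datapoints come down to grouping hits by year and flagging any year with more than one. A minimal sketch, using the zombie listing from earlier with its two annotated false positives removed (the function name collisions and its limit parameter are my own):

```python
from collections import defaultdict


def collisions(hits, limit=1):
    """Map year -> titles for every year with more than `limit` hits."""
    by_year = defaultdict(list)
    for title, year in hits:
        by_year[year].append(title)
    return {y: ts for y, ts in sorted(by_year.items()) if len(ts) > limit}


# Titles and years as listed in the zombie search output above.
ZOMBIE_HITS = [
    ("Zombie Strippers", 2008),
    ("Zombies! Zombies! Zombies!", 2007),
    ("I Am Virgin", 2010),
    ("Big Tits Zombie", 2010),
    ("The Incredibly Strange Creatures Who Stopped Living "
     "and Became Mixed-Up Zombies", 1964),
    ("Hide and Creep", 2004),
]

flagged = collisions(ZOMBIE_HITS)
for year, titles in flagged.items():
    print(year, titles)
```

Run on that listing it flags 2010 as a candidate year. Whether both titles truly count as zombie stripper films remains a call for the critical human eye, not the script.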

Though we believe our results should be repeatable, keeping in mind the central role of a critical human eye in this endeavour, for the convenience of those cinephiles who are interested in the output, but not technically inclined, a sorted listing of every stripper film in wikipedia is provided. This paragraph seems a fair bit creepier now than when we first thought of it.

Conclusion and future work

Searching for “stripper zombie” on wikipedia yields 108 results of varying quality. Using the techniques above this can be narrowed to six films. A film subset database built from an expression seems like something someone else could use. Say, to pitch a zombie werewolf stripper musical. Perhaps one that’s incredibly mixed up.

The two-stripper-flicks-a-year thing isn’t meant as a value judgment. It’s simply a law of the universe. — TNCITCM, you know, the article this whole post is about.

XMORPG

Is there a market for a premium multiplayer online roleplaying game? Not massively multiplayer – exclusively multiplayer?

Consider it. You log straight in – no queues because it’s not oversubscribed. You notice there’s another mate online, but you want to put together a party to tackle an interesting new dungeon in Erewhat that’s just become available. You put your pick up group (PUG) note out, and while away five or ten minutes at the tavern or shophouse on the inevitable logistics of inventory management and planning that accompanies all RPGs. Suddenly your character’s sleeve is tugged by a young man. It’s an NPC, Bob the messenger. “Sir, the village of Erewhat is under assault! We have a team of heroes going to save it, will you aid us?” As usual, the adventure concierge is being played by a real person, and he feeds you context on the way, as you converse with him. Or in the case of team thump, when you are missing a really big clue. Many NPCs are from the English speaking developing world: it’s a growing job market in Manila and Mumbai. Robots are only used when the admins are particularly busy putting out fires somewhere.

You join the three other players outside the village and soon charge into the fray, throwing off waves of zombie skeleton wombats. You lose track of your young messenger Bob – did he die or slip away to other great deeds? Never mind, there are baddies to kill, and you happily hew your way through the dungeon. It’s good fun, but nothing you haven’t done a thousand times before, to be honest.

Then – wait. Was that arrow from behind? You swing around. A swarm of ninja koalas have snuck behind you, and have nearly sliced the squishies into mince already. It’s an ambush. Maybe even encirclement. Smeg! You’ve opted to lose gear and XP on death for a discount in fees and the extra pang of having skin in the game. You fall back to a doorway, making sure the priest is ok. Behind the scenes, and from the conversation with the messenger, a DM has noticed you’re an experienced band and would relish more of a challenge. As in NWN, roaming DMs might jump into a dungeon and make it more interesting. Unlike in NWN, they are paid staff.

You dig deep, and you’re holding the monsters at bay, but you’re in lousy shape. Some of the party weren’t quite as experienced as they said. Usually this would mean they die and respawn. Today’s a bit unusual though. “Hail fellows, well met!” – it’s Bob. He shouts out from behind you. It turns out you fell back into a mansion house, and Bob knows a secret tunnel. If there hadn’t been a tunnel, he wouldn’t have intervened, but this makes for an interesting moment. Barring the door for a minute to keep out the monsters, you slip into the tunnel, a rest area, and the start of a new underground zone …

Maybe my description is off – I’ve only newbed around on a few free online RPGs and never really got bitten by the bug. At the moment, though, WoW is so huge it’s necessarily a mass consumer product. It’s the McDonald’s of MMOs. And though I did eat McDonald’s when drunk the other night, if I want a nice meal out, I pay a bit more to go to a nice restaurant where I can enjoy the food and the social setting. It’s been noted many times that gamers are on average 30+ years old with stable lives and incomes. They’re only getting older and richer. Whether you like the class implications or not, the age of the online country club is coming our way.

Confucian Software For The Impatient

For those who have found themselves intrigued by the conceptual parallels between a group of underemployed, bookish Han Chinese civil servants from two and a half thousand years ago and the electrically engineered plumbing of our contemporary information age, but find my treatment of it rather longwinded, not to mention sullied somewhat by a Dickensian tendency to long, single-paragraph sentences with interstitial tangents, this short presentation by nein geist on Confucian Philosophy in Software Design may be more to taste.

Bayeux and Information History

Chris Mellor at El Reg has a nice digestible crumpet of a piece considering the Bayeux tapestry as a storage device.

It’s a member of a genre I think of as Information History. By analogy with the field of economic history, information history studies the information environment of past societies, using later scholarly techniques to analyze it and its impacts. Economic history sometimes gives us rich context, to estimate, say, grocery or property prices in ancient Rome, and give us knowledge the Romans had but we do not. It can also throw up startling and beautiful perspectives on the past and the future that have not been explored systematically before, as in this classic paper, The Colonial Origins of Comparative Development:

We exploit differences in the mortality rates faced by European colonialists to estimate the effect of institutions on economic performance. Our argument is that Europeans adopted very different colonization policies in different colonies, with different associated institutions. The choice of colonization strategy was, at least in part, determined by whether Europeans could settle in the colony.

Acemoglu et al here use sophisticated stats techniques and painstaking colonial scholarship to accumulate evidence for their thesis.

Mellor wrote a newspaper article, not a paper, and we shouldn’t hold him to a scholarly standard he didn’t claim. Even given that, what distinguishes it as a document of information history, rather than just a neat analogy, is that he calculates the read / write rates. (2.168 bytes per hour write speed, on the back of Mellor’s envelope.) He puts it in systematic terms only given meaning in the last fifty years or so, and thereby enriches our understanding of their time, and ours.

XIII.3 Name Oriented Software Development

子路曰:卫君待子而为政,子将奚先.子曰:必也正名乎. 子路曰:有是哉!子之迂也.奚其正.子曰:野哉由也.君子于其所不知.蓋阙如也.名不正,则言不顺.言不顺,则事不成.事不成,则礼乐不兴.礼乐不兴,则刑罚不中.刑罚不中,则民无所措手足.故君子名之必可言也.言之必可行也.君子于其言,无所苟而已矣. — 论语 十三:三

Tzu-lu said, ‘If the Lord of Wei left the administration (cheng) of his state to you, what would you put first?’ The Master said, ‘If something has to be put first, it is, perhaps, the rectification (cheng) of names.’ Tzu-lu said, ‘Is that so? What a roundabout way you take! Why bring rectification in at all?’ The Master said, ‘Yu, how boorish you are. Where a gentleman is ignorant, one would expect him not to offer any opinion. When names are not correct, what is said will not sound reasonable; when what is said does not sound reasonable, affairs will not culminate in success; when affairs do not culminate in success, rites and music will not flourish; when rites and music do not flourish, punishments will not fit the crimes; when punishments do not fit the crimes, the common people will not know where to put hand and foot. Thus when the gentleman names something, the name is sure to be usable in speech, and when he says something this is sure to be practicable. The thing about the gentleman is that he is anything but casual where speech is concerned.’ — Analects XIII.3 (Lau)

If something has to be put first in programming, it is, perhaps, the rectification of names. Names are what place the system and the not-system in the same reality. They are the bridge between the externalized machine without use and the internalized emotion without expression.

What in Confucian philosophy we call the rectification of names we can in Confucian software call Name Oriented Software Development. This does not yet exist, but it can be defined simply. Name Oriented Software Development uses a toolset that promotes the continuous rectification of names across all the interstices between natural languages and machine languages in the system.

The nominalist toolset should span package, class, variable and method names. It should span library and project names. It should cover layers of code around the core system, including message protocol definitions and protocol dictionaries, configuration entities, unit and performance tests, and run scripts. The aim in managing machine facing names is to enforce consistency and coherence of names while making precision in naming, including type restriction, easy. The same name appearing in different machine-facing contexts, e.g. a message protocol field in a script and a Java class, should be linked in an automated and machine verified way. This might at first seem to make the internal technical dialect (namespace, one could say, if it were not taken) too rigid. That is not the intent, nor, judging by the rename variable refactoring, is it the effect in smaller scopes. It is because we can execute that refactoring in a reliable, automatic way that renaming becomes viable as a low-risk change. The same holds across the entire internal technical dialect – enforced formal consistency allows mutability to be low risk.
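
As a small, hedged illustration of what "linked in an automated and machine verified way" could mean, here is a sketch of a check that a protocol dictionary and the class that carries the message still agree on names and types. Everything in it is hypothetical: the OrderMessage class, the field names and the name_mismatches helper are invented, and Python stands in for the Java of the example above.

```python
from dataclasses import dataclass, fields

# Hypothetical protocol dictionary as a script might see it: name -> type.
# Note the drift: the script says "qty", the class says "quantity".
ORDER_PROTOCOL = {"order_id": "str", "qty": "int", "price": "float"}


@dataclass
class OrderMessage:
    """Hypothetical class on the other side of the same message."""
    order_id: str
    quantity: int
    price: float


def type_name(annotation):
    """Annotations may be classes or strings; normalise to a name."""
    return annotation if isinstance(annotation, str) else annotation.__name__


def name_mismatches(protocol_fields, cls):
    """Return (missing_in_class, missing_in_protocol, type_clashes)."""
    declared = {f.name: type_name(f.type) for f in fields(cls)}
    missing_in_class = sorted(set(protocol_fields) - set(declared))
    missing_in_protocol = sorted(set(declared) - set(protocol_fields))
    type_clashes = sorted(
        name for name in set(declared) & set(protocol_fields)
        if declared[name] != protocol_fields[name]
    )
    return missing_in_class, missing_in_protocol, type_clashes


report = name_mismatches(ORDER_PROTOCOL, OrderMessage)
print(report)
```

A check like this could run in a build, so that a rename on one side either propagates or fails loudly. That is the sense in which enforced consistency is what makes the rename cheap.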

By contrast, the aim in managing people facing names is to allow a consensus jargon to emerge backed by a literature of interaction and aspiration which is still moored to the technical reality of the system as it exists. This covers specifications, use cases, test documents (even and especially automated acceptance tests), application messages or labels for users or external parties, internationalization, log messages (i.e., a UI for support), support and developer FAQs, and user manuals and documentation. Changes to this shared understanding should be reflected as immediately as possible and not tied to a long software release cycle. A label, for example, is essentially a piece of simple key-mapped static data; changes to it can therefore be routine and implicit. When treating this sort of text as static data, it is essential to keep it automatically linked to the running widgets themselves – otherwise it is merely the domain of style guides and moral exhortation. A true dialect is owned by a community – a folksonomy – so the entire community must be able to use and contribute to it. The editors for these documents should be as broad as possible within organisational constraints. Where editing constraints exist (organisational not technical), annotation on these documents should be easy. Cross referencing within the docset and outside it to external sources of expertise (for instance on domain jargon) should also be easy to add and maintain.
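
To make the label example concrete, here is a minimal sketch of labels treated as key-mapped static data, with an automatic check that the table and the widgets have not drifted apart. The widget ids, label strings and the label_drift helper are all invented for illustration; a real system would read the ids from its UI toolkit and the labels from wherever the community edits them.

```python
# Hypothetical ids declared by the running widgets.
WIDGET_IDS = {"login.submit", "login.username", "login.password"}

# Hypothetical label table: plain static data, editable without a release.
LABELS = {
    "login.submit": "Sign in",
    "login.username": "User name",
    "login.help": "Forgotten your password?",
}


def label_drift(widget_ids, labels):
    """Return (unlabelled widgets, labels with no widget), both sorted."""
    keys = set(labels)
    unlabelled = sorted(set(widget_ids) - keys)
    orphaned = sorted(keys - set(widget_ids))
    return unlabelled, orphaned


unlabelled, orphaned = label_drift(WIDGET_IDS, LABELS)
print("unlabelled:", unlabelled)
print("orphaned:", orphaned)
```

With the link checked by machine, anyone in the community can edit the wording of a label while the keys stay tied to what is actually on screen.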

Or, via Wittgenstein, Tractatus 7: Whereof one cannot speak clearly, one must damn well link to a wikipedia entry.

This approach owes more than a little to Knuth’s Literate Programming. A distinction is that instead of putting documentation with source code in a single artifact to be owned by a single philosopher-developer auteur, it puts editable, often executable content in the hands of a community of expertise.

Historically, the Confucians became increasingly sophisticated in their theory of names. I know of no record of Confucius addressing names changing over time, apart from a general openness to political reform (Analects IX.3). Indeed, as old Kongzi was often conservative, the quote above might reasonably be taken to advocate reverting to old names now in disuse. Xun Zi (荀子), who lived some two centuries after Confucius and was one of his major intellectual heirs, saw fit to reuse the Mohist (墨家) theory of names.

单足以喻则单,单不足以喻则兼; […] 名无固宜,约之以命,约定俗成谓之宜,异于约则谓之不宜。名无固实,约之以命实,约定俗成,谓之实名。名有固善,径易而不拂,谓之善名 — 荀子 – 正名 6-8

If a single name is enough to communicate, make it single; if not, combine. […] Names have no inherent appropriateness, we name by convention; when the convention is fixed and the custom established, we call them appropriate, and what differs from the convention we call inappropriate. No object belongs inherently to a name, we name by convention, and when the convention is fixed and the custom established, we call it the object’s name. Names do have inherent goodness; when straightforward, easy and not inconsistent, we call them good names. — Xun Zi – Rectification of Names 6-8 , AC Graham translation

AC Graham’s use of the word “combine” (兼) above is in the sense of “compound”. This is the term the Stanford Encyclopedia of Philosophy uses in preference. That entry points out Xunzi’s concern with names is also in rebuttal to the paradoxes of the Chinese sophists. Furthermore, Xunzi is rather more open to mutability and the pragmatic construction of language from a vernacular folksonomy. (The term 俗成, translated as convention, includes 俗 which now spans meanings including custom and vulgar.)

Computer science launched itself out of the western analytical tradition, though. A sometime description and criticism of Object Oriented design is that it is reheated Platonism. See, e.g., blogs by Vlad Tarko and Richard Farrar. (Should the frequent light analogical treatment of Platonism and software sound a note of alarm for this very project? Perhaps. But there is heavier academic firepower behind us as well. Are we not grappling with code, which Berry and Pawlik call the defining discourse of our postmodernity? In an article that has more code metaphors than a house full of crack-addled Java-monkeys at teatime.)

The platonic analogy has legs. We do construct a parallel world of sorts in code. It is also revealing to think of classes as ideal representations of some external physical element. Isn’t that why we often fail as programmers? We critique the world for being less perfect than our ideal programs, for failing to match up to their strict conditions, for making them ugly, for crashing them. Code, and OO, can have a kind of brittleness that happens when we stop thinking of software as a model of the world and start thinking of it as the true world. We fit our shape to the name.

My suggestion is not to stop doing taxonomy – we could not navigate the world and construct alternative worlds out of software without it. We couldn’t even speak without it. Instead, we should use Xunzi’s advice and fit our names to the shape. We name by convention. We continually rectify names. If a single name is enough to communicate, make it single. If not, combine.

‘RenameClass’ is the most powerful refactoring. — Michael Feathers