A Quantitative Exploration of the CRwM Stripper Genre Collision Hypothesis

We test the proposition that the number of movies in the same calendar year combining strippers and a specific subgenre cannot exceed one without the genre being fatally destroyed in popularity. We name this the CRwM Stripper Genre Collision Hypothesis after its originator, CRwM, writing at And Now The Screaming Starts. Simple computational analysis of the English-language Wikipedia corpus provides some non-definitive support.

There’s No Computational Sociology In The Champagne Room

CRwM writes:

When no systematic approach is available, then the best you can do is pick an arbitrary point and map out the trajectory of the waning trend from your specific frame of reference. So long as you admit your frame of reference is singular, you’ll allow others, each observing the same phenomenon from their distinct frames of reference, to make whatever calculations to sync up the observations. Applying this pragmatic solution to the problem, I’m defining my frame of reference thusly: A trend is creatively bankrupt when, within a single year, you get two films mashing-up the trend with strippers. — ibid

Though there may indeed be no objective measure for trend fizzlation, we can perhaps be more systematic in our investigation of any given frame of reference. Our approach is to leverage techniques from computational sociology to brute force search a partially structured film description corpus. Though IMDB is probably the best data source readily available to the public, its USD $15000 license fee means using it will have to wait until my long-awaited funding from the Institute of Piss Farting About comes through. In the meantime, Wikipedia is actually a pretty decent source. Though the English language version does not have great coverage on, e.g., foreign films, I would hypothesise that the gaps are congruent with the focus on trendspotting in this exercise. Wikipedia is, in the words of Bruce Sterling, a kind of common sense engine. It is suited to sampling what we think we know.

Though the approach might be naive enough to be described as folk computational sociology, I prefer to think of it as punk rock.

Spins A Web, Any Size

Though there is a very active tools community around Wikipedia, most of it seems to be focused on productivity scripts for editors. Things like auto classification and flagging scripts are popular, and no doubt very useful to the editorial group, if the robot history on my very occasional contributions is any guide. From a brief Google literature search, the search toolset seemed to be either very simple and widely available (use Google to hit a single page) or sophisticated papers on vector-based searches implemented on the server side. Our cinematic exploration seemed to fall between these extremes.

My first crack at getting movie data out of Wikipedia was to hit the film category page for a year and script a primitive web spider to suck down all the data from that starting point. The top entry on Stack Overflow also happens to suggest this.

Though I did get this on the way to working, and it can be seen as movieSpider.py at the GitHub repo mentioned later, it’s a lousy approach. Not only do you have to tool about with faked headers because Wikipedia doesn’t really want you to do this, you also hit the same pages over and again while troubleshooting. You have to deal with the relatively unstructured format of HTML with embedded tags, which implies bucketfuls of heuristics to pull out anything meaningful. Plus, if you get it working, you will want to expand the time range, and end up downloading a fair chunk of Wikipedia anyway.

Takeaway Corpi

It turns out that Wikipedia hosts backups of its entire database in convenient XML export formats. This includes partition by language and current version archives (without all the history and discussion). These data dumps are available here. At a couple of gig, compressed, even the fairly pathetic caps and bandwidth rates of, say, Australian broadband can deal with it in a day or so while you amuse yourself playing badminton. Once uncompressed, a recent version of English language Wikipedia takes up around 30 GB, or in other words, can fit on an iPhone 4.

Once available locally, running searches is quicker, particularly while debugging a script. Extracting a subset also becomes much simpler. Pros seem to rebuild the entire database, including indexes. Indexes didn’t seem much use to me here, as I was hitting the full content of a page, but maybe I’ve underestimated the power of the word indexing in a basic local database. At any rate, the structured data was sufficient for me.

A short Python script of a few hundred lines lets us pull out a particular subset of Wikipedia according to a regular expression run as a search on the page content. If there is a hit, we save the entire page. This is found in movie.py and available at GitHub. Building the subset file takes about seven hours on my machine. Using the regex [Category:[0-9]??? films], we can pull in any page that mentions films of a particular year. The resultant subset is a decent film corpus weighing in at a trim 292 MB.
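For the curious, the filtering step can be sketched in a few lines. This is not movie.py itself, just a minimal illustration under some assumptions: the dump is a MediaWiki XML export with page, title and text elements, the category tag looks like [[Category:2008 films]] in the wikitext, and the names matching_pages, FILM_YEAR and SAMPLE are made up for the example. The pattern is my guess at the intent of the regex above, not the exact expression used.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed pattern: a wikitext category tag such as [[Category:2008 films]].
FILM_YEAR = re.compile(r"\[\[Category:\d{4} films\]\]")


def local(tag):
    """Strip the XML namespace prefix that export files put on each tag."""
    return tag.rsplit("}", 1)[-1]


def matching_pages(source, pattern):
    """Yield (title, text) for each <page> whose wikitext matches pattern."""
    title, text = None, ""
    # iterparse streams the file, so the whole dump never sits in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if pattern.search(text):
                yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's contents


# A tiny stand-in for the real dump, invented for this example.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Zombie Strippers</title><revision><text>A comedy. [[Category:2008 films]]</text></revision></page>
<page><title>Badminton</title><revision><text>A racquet sport.</text></revision></page>
</mediawiki>"""

hits = list(matching_pages(io.BytesIO(SAMPLE), FILM_YEAR))
print([title for title, _text in hits])
```

Against a real dump you would pass an open file instead of the BytesIO object, for example one opened with bz2.open(path, "rb") if the archive is still compressed.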

This same script can be used with minor modification for searching other spaces that attract Wikipedia editors of a particularly pedantic and taxonomical breed. Their painstaking sifting of the world into categories is what makes tricks like this possible. You could, for instance, use it to build a wiki subset of military battles with a regex like [Category:Battles involving.*].

Once the subset database is built, we can run a similar expression search across it, but one aware of the structure of film pages – that they have a title, and a category indicating a year. We can therefore attempt a quantitative validation of the search CRwM did by pure pop culture brainpower:

$ python movie.py -e stripper -e zombie

The result of this is

Zombie Strippers -- 2008
Kiss the Bride (2008 film) -- 2008 # false positive
Zombies! Zombies! Zombies! -- 2007
I Am Virgin -- 2010
Big Tits Zombie -- 2010
Can't Hardly Wait -- 1998 # song by White Zombie on soundtrack
The Incredibly Strange Creatures Who Stopped Living and Became Mixed-Up Zombies -- 1964
Hide and Creep -- 2004
end 8 Hits: 8 Scanned: 60757

This is not a fully automated process. As I have annotated above, Kiss the Bride, though it possibly would have been enriched by either zombies or strippers, has neither. A review of Zombie Strippers is instead cited in its footnotes. Similarly, this brute force text search is ignorant of synonyms: any great zombie burlesque films of the 1920s are liable to be skimmed over without comment.
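To make the footnote problem concrete, here is a rough sketch of the kind of search involved. It is not the code in movie.py; the function name search, the YEAR_TAG pattern and the page texts are all invented for illustration, and I am assuming the year lives in a tag of the form [[Category:2008 films]].

```python
import re

# Assumed tag format; captures the four-digit year.
YEAR_TAG = re.compile(r"\[\[Category:(\d{4}) films\]\]")


def search(pages, terms):
    """Return (title, year) for pages whose text contains every term."""
    lowered = [term.lower() for term in terms]
    hits = []
    for title, text in pages:
        body = text.lower()
        year = YEAR_TAG.search(text)
        # Plain substring matching: no stemming, no synonyms, no context.
        if year and all(term in body for term in lowered):
            hits.append((title, int(year.group(1))))
    return hits


# Made-up page texts, just to exercise the function.
PAGES = [
    ("Example Film A", "Zombies menace a stripper. [[Category:2008 films]]"),
    ("Example Film B", "A wedding comedy.<ref>Review of a zombie stripper "
                       "film</ref> [[Category:2008 films]]"),
    ("Example Film C", "A zombie western. [[Category:2004 films]]"),
]

for title, year in search(PAGES, ["stripper", "zombie"]):
    print(f"{title} -- {year}")
```

Example Film B comes back as a hit purely because of the citation in its footnote, which is the same failure mode as the annotated false positive above. Filtering those out still takes a human.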

We also find that the editorial consensus at Wikipedia disagrees with CRwM on one crucial point: it asserts Zombie Strippers and Zombies! Zombies! Zombies! were not made in the same calendar year.

Applications

Though the script should be a productivity boost to film subgenre scholars, it still requires a great deal of human insight to make its results valuable. It works best with very concrete and widely recognized subgenre identifiers. Any more complex critical viewpoint is obscured by the lack of a shared jargon. For instance, CRwM’s example of From Dusk till Dawn as being a pomo deadpan crime flick is hardly controversial, but insufficiently universal to appear in Wikipedia entries across the subgenre.

Some other notable datapoints:

There are no stripper werewolf films. The blue movie scene near the end of An American Werewolf in London is insufficiently focused to qualify.
This technique confirms no vampire stripper collision counts exceeding one in a calendar year.
Two films in 1964 were both musicals and featured strippers: Robin and the 7 Hoods and the aforementioned, and new to the author, The Incredibly Strange Creatures Who Stopped Living and Became Mixed-Up Zombies. Since The Sound of Music appeared in 1965, and the genre survived at least until the popularity of Cabaret, we posit that either Robin contains insufficient stripper content to qualify as a stripper musical, or that musicals, as a full-blown genre, are outside the scope of the CRwM Stripper Genre Collision Hypothesis.
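The collision count behind datapoints like these is just a tally of vetted hits per calendar year. A minimal sketch, assuming the hits have already been checked by hand and reduced to (title, year) pairs; the function name collisions and the placeholder films below are hypothetical, not real search results.

```python
from collections import Counter


def collisions(hits, limit=1):
    """Map year -> film count for years with more than `limit` films."""
    per_year = Counter(year for _title, year in hits)
    return {year: count for year, count in per_year.items() if count > limit}


# Hypothetical, hand-vetted hits; titles and years are placeholders.
VETTED = [
    ("Example Film A", 1999),
    ("Example Film B", 2003),
    ("Example Film C", 2003),
    ("Example Film D", 2007),
]

print(collisions(VETTED))
```

An empty result means no year exceeded the limit for that subgenre. Whether a flagged year really marks the death of a trend remains a matter for critical judgment.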

We believe our results should be repeatable, keeping in mind the central role of a critical human eye in this endeavour. For the convenience of those cinephiles who are interested in the output but not technically inclined, a sorted listing of every stripper film in Wikipedia is provided. This paragraph seems a fair bit creepier now than when we first thought of it.

Conclusion and future work

Searching for “stripper zombie” on Wikipedia yields 108 results of varying quality. Using the techniques above, this can be narrowed to six films. A film subset database built from an expression seems like something someone else could use. Say, to pitch a zombie werewolf stripper musical. Perhaps one that’s incredibly mixed up.

The two-stripper-flicks-a-year thing isn’t meant as a value judgment. It’s simply a law of the universe. — TNCITCM, you know, the article this whole post is about.

2 thoughts on “A Quantitative Exploration of the CRwM Stripper Genre Collision Hypothesis”

CRwM

April 23, 2011 at 1:10 am

This is amazing. Absolutely amazing. I like to engage in these occasional thought experiments sometimes, but you’ve brought the game up to a whole other level. Thanks for this.

- Adam
  
  April 23, 2011 at 2:53 am
  
  Hey – The original post made my day. Just happy to reciprocate.

Conflated Automatons

Adam rambles
