Perhaps the most important new thing I've noticed is that Pro Football Reference now has play-by-play data, and ways to export those data in CSV format. Creating parsers for the data would be work, but it means that advanced stats are now accessible to the average fan.

In Ubuntu 16.04, PDL::Stats is now a standard Ubuntu package and so the standard PDL installation can be used with my scripts. About the only thing you need to use CPAN for, at this point, is installing Sport::Analytics::SimpleRanking.

At work I use a lot of Python these days. I have not had time to rethink all this into Pythonese. But I'm curious, as the curve-fitting tools in Python are different from, and arguably better than, those in Perl.

Football diagrams: Although the Perl module Graphics::Magick isn’t a part of CPAN, graphicsmagick and libgraphics-magick-perl are part of the Ubuntu repositories.


The Stathead blog is now defunct and so, evidently, is the Pro Football Reference blog. I’m not too sure what “business decision” led to that action, but it does mean one of the more neutral and popular meeting grounds for football analytics folks is now gone. It also means that Joe Reader has even less of a chance of understanding any particular change in PFR. Chase Stuart of PFR is now posting on Chris Brown’s blog, Smart Football.

The author of the Armchair Analysis blog, Jeff Cross, has tweeted me to say that a new play-by-play data set is available, one he says is larger than that of Brian Burke.

Early T formations, or not?

Currently Wikipedia claims that Bernie Bierman of the University of Minnesota was a T formation aficionado.

U Minnesota ran the T in the 1930s? Really?

I've been doing my best to confirm or deny that. I ordered a couple of books.

No mention of Bernie's T in this book.

I've skimmed this book, and haven't seen any diagrams with the T or any long discussion of the T formation. There are a lot of unbalanced single wing diagrams, though.

I also wrote Coach Hugh Wyatt, who sent me two nice letters, both of which state that Coach Bierman was a true blue single wing guy. In his book, “Winning Football”, I have yet to find any mention of the T, and in Rick Moore’s “University of Minnesota Football Vault”, there is no mention of Bernie’s T either.

I suspect an overzealous Wikipedia editor had a hand in that one. Given that Bud Wilkinson was one of Bernie’s players, a biography of Bud Wilkinson could be checked to see if the T formation was really the University of Minnesota’s major weapon.

Brian Burke has made available play by play data from 2002 to 2010 here, and it's available as .CSV files. The files are actually pretty small, about 5 megs for a year's worth of data. CSV is a convenient format, and the data themselves are well enough organized that an Excel or OpenOffice junkie can use the product, and so can those of us who work with SQL databases. The advantage of a SQL database is the query language you inherit. And what we're going to show is how to embed Brian's data into a small, simple SQLite database (see here for D. Richard Hipp's site, and here for the Wikipedia article).

SQLite is a tiny SQL engine, about 250 kilobytes in size. That's right, 250 kilobytes. It's intended to be embedded in applications, and so it doesn't have the overhead of a network service, the way MySQL and Postgres do. It is extensively used in things like browsers (Firefox), mail clients, and internet metrics applications (Unica's NetTracker). The code is in the public domain, and if you're into that kind of thing, Oracle, among others, sells a commercial derivative of this free product.

A SQLite database is a single file, so once you create it, you could move the file onto a USB stick and carry it around with you (or keep it on your Android phone). The database that results is about 55 megabytes, not much different in size from the cumulative .CSVs themselves.

Brian's data lack a primary key, which is fine for spreadsheets, but creates issues when managing walks through sequential data in a database. We'll create a schema file (we'll call it schema.sql) like so:
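The schema itself isn't reproduced here, but a minimal sketch might look like the following. The id column supplies the primary key the raw data lack; the remaining column names are my guesses, and should be checked against the headers of Brian's CSV files before use.

```sql
-- schema.sql: a sketch. Column names beyond those discussed in the
-- text (gameid, season, ydline, description) are assumptions --
-- match them to the actual CSV headers.
create table nfl_pbp (
    id          integer primary key autoincrement,  -- the key the CSVs lack
    gameid      text,
    season      integer,
    qtr         integer,
    min         integer,
    sec         integer,
    off         text,     -- team on offense
    def         text,     -- team on defense
    down        integer,
    togo        integer,
    ydline      integer,
    description text      -- the play-by-play text itself
);
```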

Use a text editor to create it. With the sqlite3 binary, create a database by saying:

sqlite3 pbp.db
sqlite> .read schema.sql

Once that's all done, we'll use Perl and the DBI module to load these data into our SQLite table. Loading is fast so long as you wrap the whole load in a single transaction, using the $dbh->begin_work and $dbh->commit statements.
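A sketch of such a loader follows; it is not the original script, and the column list is an assumption that should be matched to the real CSV headers before running it.

```perl
#!/usr/bin/perl
# load_pbp.pl -- a sketch of a DBI loader for Brian's CSVs.
# Usage: perl load_pbp.pl 2002.csv 2003.csv ...
use strict;
use warnings;
use DBI;
use Text::CSV;

my $dbh = DBI->connect( "dbi:SQLite:dbname=pbp.db", "", "",
    { RaiseError => 1 } );
my $csv = Text::CSV->new( { binary => 1 } );

# Column names here are guesses; adjust to the actual headers.
my $sth = $dbh->prepare( "insert into nfl_pbp "
      . "( gameid, qtr, min, sec, off, def, down, togo, ydline, description ) "
      . "values ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )" );

$dbh->begin_work;    # one big transaction keeps the load fast
for my $file (@ARGV) {
    open my $fh, '<', $file or die "can't open $file: $!";
    $csv->getline($fh);    # skip the header row
    while ( my $row = $csv->getline($fh) ) {
        $sth->execute(@$row);
    }
    close $fh;
}
$dbh->commit;
$dbh->disconnect;
```

The point of begin_work/commit is that SQLite otherwise wraps every insert in its own transaction, and a few hundred thousand individual transactions is what makes naive loads slow.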

Once loaded, you can begin using the data almost immediately:

sqlite> select count(*) from nfl_pbp;
sqlite> select distinct season from nfl_pbp;
sqlite> select count( distinct gameid ) from nfl_pbp;

As far as the data themselves go, I'll warn you that the ydline field is a little "lazy", in that if you score a touchdown from the 20, the extra point play and the ensuing kickoff also "occur" on the 20. So you end up with interesting SQL statements like these when you search the data:

sqlite> select count(*) from nfl_pbp where ydline = 1 and not description like "%extra point%" and not description like "%two-point%" and not description like "%kicks %";
sqlite> select count(*) from nfl_pbp where ydline = 1 and description like "%touchdown%" and not description like "%extra point%" and not description like "%two-point%" and not description like "%kicks %";

Using the DBI module, or whatever database interface your language supports, you can start crunching data toward game outcome probabilities in no time.
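For example, the two counts above can be folded into a single query that yields a rough touchdown probability with the ball on the 1. This is a sketch of the kind of query I mean, not a validated stat:

```sql
-- rough probability of a touchdown from the 1 yard line, filtering
-- out extra points, two-point tries, and kicks as before
select
    sum( case when description like "%touchdown%" then 1 else 0 end ) * 1.0
        / count(*) as td_prob
from nfl_pbp
where ydline = 1
    and not description like "%extra point%"
    and not description like "%two-point%"
    and not description like "%kicks %";
```

The multiply by 1.0 matters: both counts are integers, and without it SQLite performs integer division and returns 0.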