Brian Burke has made available play-by-play data from 2002 to 2010 here, as .CSV files. The files are pretty small, about 5 megs for a year’s worth of data. CSV is a convenient format, and the data themselves are well enough organized that an Excel or OpenOffice junkie can use them, and so can those of us who work with SQL databases. The advantage of a SQL database is the query language you inherit. And what we’re going to show is how to embed Brian’s data into a small, simple SQLite database (see here for D. Richard Hipp’s site, and here for the Wikipedia article).

SQLite is a tiny SQL engine, about 250 kilobytes in size. That’s right, 250 kilobytes. It’s intended to be embedded in applications, and so it doesn’t have the overhead of a network service, the way MySQL and Postgres do. It is extensively used in things like browsers (Firefox), mail clients, and internet metrics applications (Unica’s NetTracker). The code is in the public domain, and there are commercially licensed and supported versions of this free product you can buy, if you’re into that sort of thing. Oracle, among others, sells a commercial derivative of this free product.

A SQLite database is a single file, so once you create it, you can move the file onto a USB stick and carry it around with you (or keep it on your Android phone). The database that results is about 55 megabytes in size, not much different from the cumulative .CSVs themselves.

Brian’s data lack a primary key, which is fine for spreadsheets but creates issues when you want to walk through sequential plays in a database. We’ll create a schema file (we’ll call it schema.sql) like so:
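Something like the following works. The id column supplies the surrogate primary key the raw data lack (in SQLite, an integer primary key column is an alias for the built-in rowid, so it fills itself in); the remaining columns are my guess at Brian’s CSV headers, so check them against the files you actually download:

create table nfl_pbp (
    id integer primary key,  -- surrogate key; the CSVs don't have one
    gameid text,             -- game identifier
    qtr integer,             -- quarter
    min integer,             -- minutes left in the quarter
    sec integer,             -- seconds left
    off text,                -- team on offense
    def text,                -- team on defense
    down integer,
    togo integer,            -- yards to go for a first down
    ydline integer,          -- yards from the opponent's goal line
    description text,        -- the play-by-play text itself
    offscore integer,
    defscore integer,
    season integer
);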

Use a text editor to create it. With the sqlite3 binary, create a database by saying:

sqlite3 pbp.db
sqlite> .read schema.sql
sqlite> .tables
nfl_pbp
sqlite> .exit

Once that’s all done, we’ll use Perl and the DBI module to load these data into our SQLite table. Loading is fast so long as you handle the whole load as a single transaction, bracketed by $dbh->begin_work and $dbh->commit.
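Here is a sketch of such a loader, assuming the schema above and that the columns in Brian’s CSVs arrive in the same order; adjust the column list to match your files:

#!/usr/bin/perl
# load.pl -- load play-by-play CSVs into pbp.db
use strict;
use warnings;
use DBI;
use Text::CSV;

my $dbh = DBI->connect( "dbi:SQLite:dbname=pbp.db", "", "",
    { RaiseError => 1, AutoCommit => 1 } );

# Text::CSV copes with commas inside the description field.
my $csv = Text::CSV->new( { binary => 1 } );

my $sth = $dbh->prepare(
      "insert into nfl_pbp ( gameid, qtr, min, sec, off, def, down, togo,"
    . " ydline, description, offscore, defscore, season )"
    . " values ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )" );

$dbh->begin_work;    # one transaction for the whole load
for my $file (@ARGV) {
    open my $fh, '<', $file or die "cannot open $file: $!";
    $csv->getline($fh);    # skip the header row
    while ( my $row = $csv->getline($fh) ) {
        $sth->execute(@$row);
    }
    close $fh;
}
$dbh->commit;
$dbh->disconnect;

Run it as perl load.pl followed by the .CSV file names. The single transaction is the whole trick: left in autocommit mode, SQLite syncs to disk after every insert, and a 380,000 row load crawls.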

Once loaded, you can begin using the data almost immediately:

sqlite> select count(*) from nfl_pbp;
384809
sqlite> select distinct season from nfl_pbp;
2002
2003
2004
2005
2006
2007
2008
2009
2010
sqlite> select count( distinct gameid ) from nfl_pbp;
2381

As far as the data themselves go, I’ll warn you that the ydline field is a little “lazy”, in that if you score a touchdown from the 20, the extra point play and the ensuing kickoff also “occur” on the 20. So you end up with interesting SQL statements like this when you search the data:

sqlite> select count(*) from nfl_pbp where ydline = 1 and not description like '%extra point%' and not description like '%two-point%' and not description like '%kicks %';
3370
sqlite> select count(*) from nfl_pbp where ydline = 1 and description like '%touchdown%' and not description like '%extra point%' and not description like '%two-point%' and not description like '%kicks %';
1690

Using the DBI module, or whatever database interface your language supports, you can start crunching these data into game outcome probabilities in no time.
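For instance, a query along these lines (assuming the down column is populated on scrimmage plays) breaks those goal-to-go snaps from the 1 out by down and counts how many of them scored:

sqlite> select down, count(*) as plays, sum( description like '%touchdown%' ) as tds
   ...> from nfl_pbp
   ...> where ydline = 1
   ...>   and not description like '%extra point%'
   ...>   and not description like '%two-point%'
   ...>   and not description like '%kicks %'
   ...> group by down;

The ratio of tds to plays, by down, is the raw material for a touchdown probability model.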

Back in the bad old days, if we wanted data sets for some football analysis, we typed them in ourselves. Later, and perhaps somewhat smarter, we found out that there are tools called spiders that we can use to scrape data off web sites and then put into spreadsheets or databases. I have an example of such a web tool here.

Later we find that people change their web sites routinely, that they use Java and JavaScript to hide the data, so that it’s no longer part of the static HTML at all. Part of this new usage is driven by advertising: the people putting up the web site want to know there is a human looking at their stuff, and not a machine.

Sure would be nice if people would simply supply football data in a machine-readable form, wouldn’t it? Then you could get some of the advantages Jon Udell speaks about in his article, “Data should be free”:

First, obviously, you need data. Then, more interestingly, you need to figure out ways for people to create, share, and collaboratively refine interpretations of the data…. Where else can you find data for these kinds of tools and services to chew on?

Yes, if multiple eyes can look at a single data set, then you can also take advantage of the “Cathedral and Bazaar” effect, which suggests that almost any problem becomes easy if enough eyes look at it.

Now, if you’re more the pay-for-it sort, there are at least three good sources I suggest you look at, and another I’ve found recently that seems intriguing. The three are Football Outsiders, Pro Football Focus, and Advanced NFL Stats. Then there is NFL Data, a web site that appears to be a kind of data reseller. Their FAQ is here.

The truth is, the business of selling NFL data is a big one. Jaime Spacco, who in 2001 put up an interesting data analysis presentation, had this to say about NFL data online:

My Dataset is NFL football data for the 2000 season that ended in January, 2001. I gathered the data from ESPN.com and from NFL.com. Statistics for previous seasons are not readily available in digital form, and often are not available free-of-charge. This seems to be because gamblers and fantasy football enthusiasts will pay quite a lot of money for this type of information.

This, of course, was in a relatively innocent period of Internet usage.

Checking the Internet, I find that this Infochimps article really only shows one data set of interest, from Football Outsiders, and it costs $30.00 to buy. There are a number of stalled attempts at group projects to create the Great All-Encompassing Football Data Set. One such attempt, which lasted for one season, is here.

One of the more intriguing posts is yet another attempt to bring people together for an ambitious data project, and it was posted here. The important info in this link comes from the replies, which actually point to some really good-looking data sets.

This leads to the best downloadable data set I can locate, the old Pro Football Reference data set. They abandoned maintaining their own and now have a data feed from ESPN. But their old data are available, as a starting point.

Update: a more modern view of this whole topic is provided in this later article here.
