Perhaps the most important new thing I note is that Pro Football Reference now has play-by-play data, and ways to export those data in CSV format. Creating parsers for the data would be work, but it means that advanced stats are now accessible to the average fan.

In Ubuntu 16.04, PDL::Stats is now a standard Ubuntu package and so the standard PDL installation can be used with my scripts. About the only thing you need to use CPAN for, at this point, is installing Sport::Analytics::SimpleRanking.

At work I use a lot of Python these days. I have not had time to rethink all this into Pythonese. But I’m curious, as the curve fitting tools in Python are better/different than those in Perl.

Football diagrams: Although the Perl module Graphics::Magick isn’t a part of CPAN, graphicsmagick and libgraphics-magick-perl are part of the Ubuntu repositories.

Much as in the previous series, we’re going to analyze the playoff prospects of New Orleans and Detroit. We’re also going to post the code (very hacky) that I’ve been using to study playoff teams. The code (2 pics required) is as follows:

Now one thing about this code: because it’s using Getopt::Long, numbers have to be positive, or else the code will treat the number as an option. The simple fix is to find the value of the most negative SOS and add a positive number equal in magnitude to both SOSs. As the only important value is the difference, this is a valid form of data entry.
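For anyone who would rather see the shift than read about it, here is a minimal sketch. It is in Python, not the Perl the calculator is written in, and `shift_sos` is a hypothetical helper name that does not appear in my script. All it shows is that adding the same offset to both values leaves their difference untouched.

```python
def shift_sos(sos_a, sos_b):
    """Shift two SOS values so neither is negative, preserving their difference."""
    # Offset is the magnitude of the most negative value, or zero if both are fine.
    offset = max(0.0, -min(sos_a, sos_b))
    return sos_a + offset, sos_b + offset


# Example inputs are the Detroit and New Orleans SOS values quoted in this post.
detroit, new_orleans = shift_sos(0.63, -1.60)
print(round(detroit, 2), round(new_orleans, 2))
```

With the Detroit and New Orleans numbers, this gives 2.23 and 0.0, the same values you get by hand.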

Ok, the significant factors, plus Pythagoreans:

Detroit: No playoff exp, Away, SOS = 0.63, Pythagorean 62.9%

New Orleans: Won Super Bowl 2 years ago, Home, SOS = -1.60, Pythagorean 77.7%

Because NO’s SOS is negative, just let it equal zero and add 1.60 to the SOS of Detroit, yielding 2.23. That’s the info you would pump into the calculator above. And it gives you the following results:

New Orleans’ advantage due to playoff experience alone gives NO a 68% chance of winning.

Adding in home field advantage gives New Orleans a 76% chance of winning.

Adding in strength of schedule reduces New Orleans’ chances to 69%. New Orleans is heavily favored.

By comparison, after all is said and done, had Atlanta been slotted into this game, the playoff calculator gives Atlanta a 51% chance of winning. Atlanta has a slightly better SOS than Detroit, and it also has recent playoff experience.

Given how powerful the New Orleans offense is, should Atlanta have sought out a team with a weaker offense, such as New York? That’s one of the counterintuitive points of my previous playoff analysis. Offensive metrics tend to yield a p of 0.15, not 0.05. They’re suggestive, not etched-in-stone advantages. New Orleans’ powerful offense may come into play, but then again, it may not.

Brian Burke has made available play-by-play data from 2002 to 2010 here, and it’s available as .CSV files. The files are actually pretty small, about 5 megs for a year’s worth of data. CSV is a convenient format, and the data themselves are well enough organized that an Excel or OpenOffice junkie can use the product, and so can those of us who work with SQL databases. The advantage of a SQL database is the query language you inherit. And what we’re going to show is how to embed Brian’s data into a small, simple SQLite database (see here for D. Richard Hipp’s site, and here for the Wikipedia article).

SQLite is a tiny SQL engine, about 250 kilobytes in size. That’s right, 250 kilobytes. It’s intended to be embedded in applications, and so it doesn’t have the overhead of an Internet service, the way MySQL and Postgres do. It is extensively used in things like browsers (Firefox), mail clients, and internet metrics applications (Unica’s Nettracker). The code is in the public domain, and there are commercial versions of this free product you can buy, if you’re into that thing. Oracle, among others, sells a commercial derivative of this free product.

A SQLite database is a single file, so once you create it, you could move the file onto a USB stick and carry it around with you (or keep it on your Android phone). The database that results is about 55 megabytes in size, not much different in size from the cumulative .CSVs themselves.

Brian’s data lack a primary key, which is fine for spreadsheets, but creates issues in managing walks through sequential data in a database. We’ll create a schema file (we’ll call it schema.sql) as so:

Use a text editor to create it. With the sqlite3 binary, create a database by saying:

```
sqlite3 pbp.db
sqlite>.read schema.sql
sqlite>.tables
nfl_pbp
sqlite>.exit
```

Once that’s all done, we’ll use Perl and the DBI module to load these data into our SQLite table. Loading is fast so long as you handle the whole load as a single transaction, bracketing the inserts with the `$dbh->begin_work` and `$dbh->commit` statements.
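The same single-transaction idea can be sketched in Python with the standard `sqlite3` module. This is not the Perl/DBI loader described above. The column list is a trimmed-down assumption that uses only the fields mentioned in this post (`gameid`, `season`, `ydline`, `description`), the CSV header names are likewise assumed, and `load_pbp` is a hypothetical name. The `id` column stands in for the primary key that the CSVs lack.

```python
import csv
import io
import sqlite3

# Assumed, abbreviated schema: the real table has more columns than this.
SCHEMA = """
CREATE TABLE IF NOT EXISTS nfl_pbp (
    id          INTEGER PRIMARY KEY,  -- surrogate key the CSVs do not supply
    gameid      TEXT,
    season      INTEGER,
    ydline      INTEGER,
    description TEXT
)
"""


def load_pbp(conn, csv_file):
    """Insert every CSV row inside one transaction and return the row count."""
    conn.execute(SCHEMA)
    reader = csv.DictReader(csv_file)
    rows = [(r["gameid"], r["season"], r["ydline"], r["description"]) for r in reader]
    with conn:  # one commit at the end, rollback if anything raises
        conn.executemany(
            "INSERT INTO nfl_pbp (gameid, season, ydline, description) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    return len(rows)


# Tiny made-up sample standing in for one of the yearly CSV files.
sample = io.StringIO(
    "gameid,season,ydline,description\n"
    "G1,2010,20,Pass complete for 20 yards TOUCHDOWN\n"
    "G1,2010,20,Kicker extra point is GOOD\n"
)
conn = sqlite3.connect(":memory:")
print(load_pbp(conn, sample))
```

Committing once per file rather than once per row is what keeps the load fast, since SQLite otherwise treats every insert as its own transaction.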

Once loaded, you can begin using the data almost immediately:

```
sqlite> select count(*) from nfl_pbp;
384809
sqlite> select distinct season from nfl_pbp;
2002
2003
2004
2005
2006
2007
2008
2009
2010
sqlite> select count( distinct gameid ) from nfl_pbp;
2381
```

As far as the data themselves go, I’ll warn you that the ydline field is a little “lazy”, in that if you score a touchdown from the 20, the extra point play and the ensuing kick also “occur” on the 20. So you end up with interesting SQL statements like this when you search the data:

```
sqlite> select count(*) from nfl_pbp where ydline = 1 and not description like "%extra point%" and not description like "%two-point%" and not description like "%kicks %";
3370
sqlite> select count(*) from nfl_pbp where ydline = 1 and description like "%touchdown%" and not description like "%extra point%" and not description like "%two-point%" and not description like "%kicks %";
1690
```

Using the DBI module, or whatever database interface your language supports, you can start crunching data towards game outcome probabilities in no time.
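As one example of that kind of crunching, the two counts above can be folded into a single query that returns a rate. The sketch below uses Python’s `sqlite3` rather than DBI. `touchdown_rate` is a hypothetical helper, and the little in-memory table exists only so the example runs on its own.

```python
import sqlite3

# Same filters as the queries above: drop the extra point, two-point, and
# kickoff rows that inherit the yard line of the preceding scoring play.
RATE_SQL = """
SELECT COUNT(*),
       SUM(description LIKE '%touchdown%')
FROM nfl_pbp
WHERE ydline = ?
  AND description NOT LIKE '%extra point%'
  AND description NOT LIKE '%two-point%'
  AND description NOT LIKE '%kicks %'
"""


def touchdown_rate(conn, ydline):
    """Fraction of plays from `ydline` whose description mentions a touchdown."""
    plays, tds = conn.execute(RATE_SQL, (ydline,)).fetchone()
    if not plays:
        return None  # no plays recorded at this yard line
    return tds / plays


# Made-up rows, only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nfl_pbp (ydline INTEGER, description TEXT)")
conn.executemany(
    "INSERT INTO nfl_pbp VALUES (?, ?)",
    [
        (1, "Run up the middle for 1 yard, TOUCHDOWN"),
        (1, "Run left for no gain"),
        (1, "Kicker extra point is GOOD"),
    ],
)
print(touchdown_rate(conn, 1))
```

Run against the real table, the counts quoted above (1690 of 3370) work out to a touchdown on roughly half of the plays snapped from the 1.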

In the jpeg below, there are some useful 2010 NFL stats.

2010 NFL metrics

Median is the median point spread from 2010. HS is Brian Burke’s Homemade Sagarin metric. I’m not as fond of either of these as I was when I was implementing them. I think that an optimized Pythagorean expectation is a more predictive metric than either of those two. Pythagoreans are in the PRED column, expressed as a winning percentage. Multiply the percentage by 16 to get predicted wins for 2011. SRS, MOV, and SOS are Pro Football Reference’s simple ranking system metrics. SOS is a factor in playoff wins, along with previous playoff experience. Home field advantage is calculated from the Homemade Sagarin metric. Take it for what it’s worth. Other topside metrics are calculated with the Perl CPAN module Sport::Analytics::SimpleRanking, which I authored. The HS was implemented using Maggie Xiong’s PDL::Stats.
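The arithmetic behind the PRED column can be sketched as follows, in Python rather than Perl. The exponent of 2.37 is the commonly quoted football value and is an assumption here: the optimized exponent behind the chart is not stated in this post. Both function names are hypothetical.

```python
def pythagorean_pct(points_for, points_against, exponent=2.37):
    """Pythagorean expectation: PF^x / (PF^x + PA^x)."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)


def predicted_wins(pct, games=16):
    """Convert a winning percentage (0 to 1) into wins over a season."""
    return pct * games


# Hypothetical season totals, not taken from the chart.
pct = pythagorean_pct(400, 300)
print(round(pct, 3), round(predicted_wins(pct), 1))
```

A team that scores exactly as many points as it allows comes out at 0.500, or 8 predicted wins in a 16 game season.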

This is the third of a series on drawing football diagrams, and this time we’ll be talking about drawing the defensive side of the ball. For now, we’re going to have the offense going “up” the image and the defense going “down” the image. It’s easy enough to invert. Draw the offense the way we show in Part 2, rotate the result by 180 degrees, and then add your defensive players. In the old days, the defense was indicated with triangles. Most football bloggers, however, like to use fonts with names on them for the defense. The problem with fonts is that fonts are often tied to an operating system, so using them well requires some familiarity with font families. A good introduction to font families is here. And to note, Helvetica is installed as part of Image Magick, so if you want a no-nonsense solution that should just work, set your font to “Helvetica-Bold”.

Since we are using Image Magick to generate our graphics, we can add color at will to our diagrams, and so one convention we’re going to follow for now is to use shape and color to distinguish offense from defense. Offenses will be in white, defenses in yellow. Other conventions we could use are:

• Using different shapes for linemen, linebackers, and defensive backs.
• Tilting the defensive symbol to indicate a slanted lineman.
• Shading the offensive lineman to indicate a shaded orientation on the part of the defensive player.
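To make the color, shape, and tilt conventions concrete, here is a minimal sketch that builds the arguments for the Image Magick `convert` command line. It is Python, not the Perl Image::Magick code used for the images in this post. The file names, coordinates, and symbol sizes are made up for illustration, and the helper names are hypothetical. Only the standard `-fill`, `-stroke`, and `-draw` options with the `circle` and `polygon` primitives are used.

```python
import math


def triangle_points(cx, cy, size, tilt_deg=0.0):
    """Vertices of a downward-pointing triangle, rotated tilt_deg about its center."""
    # Image coordinates: y grows downward, so the apex has the largest y.
    base = [(0, size), (-size, -size), (size, -size)]
    t = math.radians(tilt_deg)
    pts = []
    for x, y in base:
        rx = x * math.cos(t) - y * math.sin(t)
        ry = x * math.sin(t) + y * math.cos(t)
        pts.append((round(cx + rx), round(cy + ry)))
    return pts


def defender_args(cx, cy, size=12, tilt_deg=0.0):
    """Yellow triangle for a defender; tilt_deg marks a slanting lineman."""
    pts = " ".join(f"{x},{y}" for x, y in triangle_points(cx, cy, size, tilt_deg))
    return ["-fill", "yellow", "-stroke", "black", "-draw", f"polygon {pts}"]


def offense_args(cx, cy, radius=12):
    """White circle for an offensive player (center point, then a perimeter point)."""
    return ["-fill", "white", "-stroke", "black",
            "-draw", f"circle {cx},{cy} {cx + radius},{cy}"]


# field.png and out.png are placeholder file names.
cmd = (["convert", "field.png"]
       + offense_args(200, 300)
       + defender_args(200, 260)
       + defender_args(240, 260, tilt_deg=20)
       + ["out.png"])
print(cmd)
```

The resulting list can be handed to `subprocess.run(cmd)` on a machine with Image Magick installed. Keeping the geometry in small functions is the same argument made for Perl subroutines elsewhere in this series: each new symbol becomes one more function, not another copy of the drawing string.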

For now, we’re going to use this image as the basis for our defenses. We’ve spoken about the Desert Swarm, a kind of double eagle defense, here.

Arizona versus Washington, 1992. I formation versus Desert Swarm. Whip (flex tackle) on TE side of formation.

And these are our attempts to duplicate that photograph. Obviously one corner and the free safety position are a product of speculation.

Defense in yellow, using symbols. Slant lineman denoted by tilt of triangles.

This graphic is a text based representation of the defense.

Helvetica-Bold is the font used here.

The images as displayed above are about 3/4 their actual size, so double click on them to see a full sized image (unless you’re using Chrome, in which case you’ll get a huge image).

Font notes:

To list the fonts that Image Magick can use by name, use the command (Win32/64 cmd window or Unix shell):

```
convert -list font | more
```

Fonts that are not listed here can be accessed by direct path to the font file itself. In Ubuntu/Linux, the Fontmatrix utility can be a big help in seeing which fonts are good and determining the font path.

In this article, the example code is going to be given in Perl, using the Image::Magick module.

Code samples:

In the first part of this series, we talked about creating football fields, and provided code that would create fields whose hash marks were at high school width, college width, and pro field width. We provided the code as a Windows Batch file that used the command line tool Image Magick to do the actual graphics manipulation. In this part, we’ll talk about taking a field and drawing offenses onto the canvas.

Most offensive players are drawn using circles (and were drawn that way even in the days of Dana Bible). Since the fields we have drawn are colored a light green, for contrast we’ll want the circles filled in white and with black as the paint color. There may be other circles you might want drawn, ones shaded on one side with black, and perhaps you want the offense going down the field instead of up. We’re not going to worry about orientation finesses, as you can use any number of graphics tools to flip and rotate the image however you want. But we will talk about ways to make other kinds of images.

Defensive code setups, to some extent, are going to be OS specific. That’s because people like to use fonts, and the fonts on Windows aren’t entirely mirrored by the fonts in MacOS or Linux.

We’re also going to start introducing some Perl into the mix of code we show. This is because Perl’s ability to create functions and subroutines will actually simplify the task of creating a graphics code library, for those skilled enough to use the approach.

This is a quickie post, as I’ve been working on a talk for the Atlanta Perl Mongers tonight. The topic is Chart::Clicker, the graphics software that Cory Watson has written. A lot of the graphs seen on this site were made with Chart::Clicker, and after learning a few new tricks, I now have this new plot of my winning versus draft picks chart.

Winning and draft picks per year are correlated.

Since Chart::Clicker doesn’t have an obvious labeling tool (that I can discover), I used Image::Magick’s annotate command (links here and here) to post process the plot.
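For reference, the command line equivalent of that post-processing step looks roughly like the sketch below. It builds the argument list in Python and is not the Perl I ran. The file names, label text, and coordinates are placeholders, and `annotate_args` is a hypothetical helper. The options used (`-font`, `-pointsize`, `-fill`, `-annotate`) are standard `convert` options.

```python
def annotate_args(text, x, y, font="Helvetica-Bold", pointsize=14, fill="black"):
    """Arguments that stamp `text` at integer offset (x, y) from the top-left corner."""
    # The :+d format always emits a sign, giving geometry such as +410+95.
    return ["-font", font, "-pointsize", str(pointsize), "-fill", fill,
            "-annotate", f"{x:+d}{y:+d}", text]


# plot.png and labeled.png are placeholder file names.
cmd = ["convert", "plot.png", *annotate_args("Team A", 410, 95), "labeled.png"]
print(cmd)
```

Each label is one more `annotate_args` call spliced into the list, so a whole chart can be labeled in a single pass over the image.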

I was, to some extent, inspired by the article by Benjamin Morris on his blog Skeptical Sports, where he suggests that to win playoff games in the NBA, three factors are most important: winning percentage, previous playoff experience, and pace, a measure of possessions. Pace translated into the NFL would be a measure that would count elements such as turnovers and punts. In the NBA, a number of elements such as rebounds + turnovers + steals would factor in.

I’ve recently captured a set of NFL playoff data from 2001 to 2010, which I analyzed by converting those games into a number. If the home team won, the game was assigned a 1. If the visiting team won, the game was assigned a 0. Because of the way the data were organized, the winner of the Super Bowl was always treated as the home team.

I tested a variety of pairs of regular season statistical elements to see which ones correlated best with playoff winning percentage. The test of significance was a logistic regression (see also here), as implemented in the Perl module PDL::Stats.
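To show the shape of the calculation, here is a bare-bones logistic regression with one predictor, written in Python rather than Perl. It is not PDL::Stats and it does not compute the significance test I relied on. It only fits an intercept and slope by plain gradient ascent on the log-likelihood. The data are invented for illustration, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed outcome minus predicted probability
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def win_probability(b0, b1, x):
    return sigmoid(b0 + b1 * x)


# Invented data: home-minus-visitor SOS difference, and 1 if the home team won.
sos_diff = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -1.0, 1.0, 0.0]
home_won = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0]

b0, b1 = fit_logistic(sos_diff, home_won)
print(b0, b1, win_probability(b0, b1, 1.0))
```

A positive slope means a larger difference in the predictor goes with a higher chance that the home team wins. Whether that slope is distinguishable from zero is the significance question, and that part is left to a real statistics package.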

Two factors emerge rapidly from this kind of analysis. The first is that playoff experience is important. By this we mean that a team has played any kind of playoff game in the previous two seasons. Playoff wins were not significant in my testing, by the way, only the experience of actually being in the playoffs. The second significant parameter was the SRS variable strength of schedule. Differences in SRS were not significant in my testing, but differences in SOS were. Playing tougher competition evidently increases the odds of winning playoff games.

I’ve been quiet a while, because I’ve been a little busy. A version of the simple ranking system, favored by Doug Drinen, is now coded as a CPAN module. CPAN, the Comprehensive Perl Archive Network, is a user contributed library, and thought to be Perl’s biggest strength.
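For readers who have not met the simple ranking system, the idea is that a team’s rating is its average margin of victory plus the average rating of its opponents. The sketch below is Python, not the CPAN module, and it makes no claim about how Sport::Analytics::SimpleRanking is implemented internally. It uses plain fixed-point iteration, which is an assumption on my part and is not guaranteed to converge for every schedule. The three-team schedule is invented.

```python
from collections import defaultdict


def simple_ranking(games, iterations=200):
    """games: list of (team_a, points_a, team_b, points_b).

    Returns (srs, mov, sos) dicts, where srs = mov + sos and sos is the
    average srs of a team's opponents.
    """
    margins = defaultdict(list)
    opponents = defaultdict(list)
    for a, pa, b, pb in games:
        margins[a].append(pa - pb)
        margins[b].append(pb - pa)
        opponents[a].append(b)
        opponents[b].append(a)

    mov = {t: sum(m) / len(m) for t, m in margins.items()}
    srs = dict(mov)  # first guess: rating equals margin of victory
    for _ in range(iterations):
        new = {
            t: mov[t] + sum(srs[o] for o in opponents[t]) / len(opponents[t])
            for t in mov
        }
        mean = sum(new.values()) / len(new)
        srs = {t: r - mean for t, r in new.items()}  # keep the league average at zero
    sos = {t: srs[t] - mov[t] for t in mov}
    return srs, mov, sos


# Invented round robin among three teams.
games = [("A", 20, "B", 10), ("B", 14, "C", 10), ("A", 17, "C", 10)]
srs, mov, sos = simple_ranking(games)
print(srs)
```

In this toy schedule team A has the best margin of victory but also a negative SOS, because its only opponents are the two weaker teams.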

The object that the SRS module creates can be used as the parent for other analysis, which is one reason for contributing it. A module that inherits function from the above also gets its game parsing functions for free. That’s one reason I went that route. Since I’m eventually wanting to think seriously about the “homemade Sagarin” technique in a reproducible way, this is a place to start.

We’ll start on a small, pretty blog called “Sabermetrics Research” and this article, which encapsulates nicely what’s happening. Back when sabermetrics was a “gosh, wow!” phenomenon and mostly the kind of thing that drove aficionados to their campus computing facility, the term “sabermetrics” was okay. Now that this kind of analysis is going in-house, it’s being called “analytics”. (A group of speakers, including Mark Cuban, are quoted here as saying that perhaps 2/3 of all basketball teams now have a team of analysts.) QM types, and even the older analysts, need a more dignified word to describe what they do.

The tools are different. There is the phrase logistic regression all over the place (such as here and here). I’ve been trying to rebuild a toolset quickly. I can code stuff in from “Numerical Recipes” as needed, and if I need a heavyweight algorithm, I recall that NL2SOL (John Dennis was a Rice prof, I’ve met him) is available as part of the R language. Hrm. Evidently, NL2SOL is also available here. PDL, as a place to start, has been fantastic. It has hooks to tons of things, as well as its own built-ins.

Logistic regression isn’t a part of PDL, but it is a part of PDL::Stats, a freely available add-on package, available through CPAN. So once I’ve gnawed on the techniques enough, I’d like to try and see if Benjamin Morris’s result, combining winning percentage and average point spread (which, omg, is now called MOV, for margin of victory) and showing that the combination is a better predictor of winning than either in basketball, carries over to football.

I suspect, given that Brian Burke would do a logistic regression as soon as tie his shoes, that it’s been done.

To show what PDL::Stats can do, I’ve implemented Brian Burke’s “Homemade Sagarin” rankings into a bit of code I published previously. The result? This simple technique had Green Bay ranked #1 at the end of the 2010 season.

There are some issues with this technique. I’ll be talking about that in another article.