It was yesterday that Nathan Oyler asked me on Twitter if I could rewrite my Perl code to calculate offensive SRS and defensive SRS. Nathan, I believe, is working on a game or a simulation and wanted to be able to calculate these values. I replied, “Do you know how to calculate these?” and, after playing around a little, I can only conclude that the best way to handle this calculation is going to be a matter of debate.

That said, I have a way to calculate these numbers, but first we need a little theory. It starts with Chase Stuart’s comment on the Smart Football blog that these values are related to points for and points against. Given that, and the definition of margin of victory:

MOV(team) = ( points_for(team) – points_against(team) ) / games_played(team) = point_spread(team) / games_played(team)

We now need to define an average score. This works:

AVG_SCORE = points_for(all teams) / games_played(all teams)

From these definitions and the hint Chase dropped, we define offensive MOV and defensive MOV this way.

OMOV(team) = ( points_for(team) – games_played(team)*AVG_SCORE ) / games_played(team)

DMOV(team) = ( games_played(team)*AVG_SCORE – points_against(team) ) / games_played(team)
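In code, those definitions take only a few lines. Here is a minimal sketch; the %teams data layout (points_for, points_against, and games_played per team) is my own choice for illustration, not the layout used in the actual module:

# Sketch of the OMOV/DMOV definitions above. Note that for each team,
# OMOV + DMOV recovers the plain MOV.
sub mov_splits {
    my (%teams) = @_;
    my ( $total_pf, $total_g ) = ( 0, 0 );
    for my $t ( values %teams ) {
        $total_pf += $t->{points_for};
        $total_g  += $t->{games_played};
    }
    my $avg_score = $total_pf / $total_g;
    for my $t ( values %teams ) {
        my $g = $t->{games_played};
        $t->{omov} = ( $t->{points_for} - $g * $avg_score ) / $g;
        $t->{dmov} = ( $g * $avg_score - $t->{points_against} ) / $g;
    }
    return %teams;
}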

So, rather than plugging MOV into an SRS linear equation solver, you can plug in offensive MOV and defensive MOV, and you get numbers that will help you calculate an OSRS and a DSRS.

I say they will help you because there’s a gotcha: whenever OSOS and DSOS are of opposite sign, there is no unique solution to the equation

SOS = OSOS + DSOS

as I can choose any constant c and the result

SOS = (OSOS + c) + (DSOS – c)

is also a solution. This kind of linear wandering around, the solver adding arbitrary constants to OSOS and DSOS, happens when you attempt to solve these equations. The issue is that there is no one obvious solution to this problem, unlike regular SRS, where the constraint “the sum of all SRS values must equal zero” applies. Now, if someone uncovers a constraint, let me know and I’ll be happy to code it. In the absence of such a rule so far, I’ve used this folk rule:

Reduce the magnitude of the OSOS and DSOS terms until the smaller of the two, in terms of absolute magnitude, is zero.
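In code, the folk rule amounts to shifting both terms by the same constant c, which leaves their sum, the SOS, untouched. A minimal sketch (the subroutine name is mine):

# Shift OSOS and DSOS by a constant so the smaller of the two (in
# absolute value) becomes zero. Since SOS = (OSOS + c) + (DSOS - c),
# the SOS itself is unchanged.
sub apply_folk_rule {
    my ( $osos, $dsos ) = @_;
    return ( $osos, $dsos ) if $osos * $dsos >= 0;   # same sign: leave alone
    my $c = abs($osos) < abs($dsos) ? -$osos : $dsos;
    return ( $osos + $c, $dsos - $c );
}

For example, apply_folk_rule( 0.8, -0.4 ) returns ( 0.4, 0 ), which is exactly the pair of values discussed for the 2007 Patriots below.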

This is straightforward to code. That my solution is not the same as Pro Football Reference’s is easy enough to show. If I go to this page, I get these values for the 2007 New England Patriots. If I calculate OMOV and DMOV using my code, we can extract the OSOS and DSOS values implied by this calculation.

2007 New England Patriots

SRS    OSRS   DSRS   OMOV   DMOV   OSOS   DSOS
20.1   15.9    4.2   15.1    4.6    0.8   -0.4


and while my code uses 0.4 and 0 for OSOS and DSOS respectively, the evident values that Pro Football Reference uses are 0.8 and -0.4. All that clear now?

I’m pretty sure my SOS calculation isn’t the same as PFR’s either, as I’ve seen differences in OSRS/DSRS that amount to a point or two. In some cases this happens when my calculation yields same-signed OSOS and DSOS values, in which case I don’t modify them at all.

The source code I’ve used to do these calculations is given here, as a Perl module. A “snapshot” of the code fragment I use to drive the Perl module is:

calc_osrs_and_dsrs

Typical output, for the 2007 season, is:

OSRS-DSRS-2007-First-Cut

And yes, there are plenty of unknowns at this point. PFR has never really given any details of their OSOS/DSOS calculations, or the normalization routines they use. OSRS and DSRS as implemented by them are a “black box”. This implementation may not, in the long run, be the best of them, but it is reasonably well documented.

Update: corrected DMOV definition. Rewritten slightly for clarity.

The recent success of DeMarco Murray has energized the Dallas fan base. Felix Jones is being spoken of as if he’s some kind of leftover (I know, a 5.1 YPC over a career is such a drag), and people are taking Murray’s 6.7 YPA for granted. That wasn’t what got to me in the fan circles, though. It’s that Julius Jones was becoming a whipping boy again, the source of every running back sin there is, and so I wanted to build some tools to help analyze Julius’s career and, at the same time, look at Marion Barber III’s numbers, since the two are historically linked.

We’ll start with this database, and a bit of sql, something to let us find running plays. The sql is:

select down, togo, description from nfl_pbp
    where season = 2007
    and gameid like '%DAL%'
    and description like '%J.Jones%'
    and not description like '%pass%'
    and not description like '%PENALTY on DAL%'
    and not description like '%kick%'
    and not description like '%sacked%'

It’s not perfect. I’m not picking up plays where a QB is sacked and the RB recovers the ball. A better bit of SQL might help, but that’s a place to start. We bury this SQL into a program that then parses the description string for the statement “for X yards”, or alternatively, “for no gain”, and adds them all up. From this, we could calculate yards per carry, but more importantly, we’ll calculate run success and we’ll also calculate something I’m going to call a failure rate.

For our purposes, a failure rate is the number of plays that gained 2 yards or less, divided by the total number of running attempts, multiplied by 100. The purpose of the failure rate is to investigate whether Julius, in 2007, became the master of the 1- and 2-yard run. One common fan conception of his style of play in his last year in Dallas is that “he had plenty of long runs but had so many 1 and 2 yard runs as to be useless.” I wish to investigate that.
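A minimal sketch of that parsing and of the failure rate calculation follows; the regular expressions are mine, and they are only as good as the description strings they match:

# Tally carries, yards, and failures from nfl_pbp description strings.
# Assumes @rows holds the description column returned by the SQL above.
my ( $carries, $yards, $failures ) = ( 0, 0, 0 );
for my $desc (@rows) {
    my $gain;
    if    ( $desc =~ /for (-?\d+) yards?/ ) { $gain = $1; }
    elsif ( $desc =~ /for no gain/ )        { $gain = 0; }
    else                                    { next; }    # not a parsed run
    $carries++;
    $yards += $gain;
    $failures++ if $gain <= 2;
}
printf "YPC = %.2f, failure rate = %.1f%%\n",
    $yards / $carries, 100 * $failures / $carries;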


Brian Burke has made play by play data from 2002 to 2010 available here, as .CSV files. The files are actually pretty small, about 5 megs for a year’s worth of data. CSV is a convenient format, and the data themselves are well enough organized that an Excel or OpenOffice junkie can use the product, and so can those of us who work with SQL databases. The advantage of a SQL database is the query language you inherit. And what we’re going to show is how to embed Brian’s data into a small, simple SQLite database (see here for D. Richard Hipp’s site, and here for the Wikipedia article).

SQLite is a tiny SQL engine, about 250 kilobytes in size. That’s right, 250 kilobytes. It’s intended to be embedded in applications, and so it doesn’t have the overhead of an Internet service, the way MySQL and Postgres do. It is extensively used in things like browsers (Firefox), mail clients, and internet metrics applications (Unica’s NetTracker). The code is in the public domain, and there are commercial derivatives of this free product you can buy, if you’re into that sort of thing; Oracle, among others, sells one.

A SQLite database is a single file, so once you create it,  you could move the file onto a USB stick and carry it around with you (or keep it on your Android phone). The database that results is about 55 megabytes in size, not much different in size from the cumulative .CSVs themselves.

Brian’s data lack a primary key, which is fine for spreadsheets but creates issues in managing walks through sequential data in a database, so we’ll add one. We’ll create a schema file (call it schema.sql) along these lines:
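This is a hypothetical reconstruction of the schema; only season, gameid, down, togo, ydline, and description are used later in this post, and the details should be adjusted to match the headers in Brian’s CSV files:

create table nfl_pbp (
    id          integer primary key autoincrement,  -- the key the CSVs lack
    season      integer,
    gameid      text,
    down        integer,
    togo        integer,
    ydline      integer,
    description text
);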

Use a text editor to create it. With the sqlite3 binary, create a database by saying:


sqlite3 pbp.db
sqlite>.read schema.sql
sqlite>.tables
nfl_pbp
sqlite>.exit

Once that’s all done, we’ll use Perl and the DBI module to load these data into our SQLite table. Loading is fast so long as you handle the transaction as a single unit, with the $dbh->begin_work and $dbh->commit statements.
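A minimal loading sketch follows. The choice of Text::CSV and the column order are mine, not necessarily those of the code behind this post:

#!/usr/bin/perl
# Load one of Brian's CSV files into the nfl_pbp table, wrapping all
# the inserts in a single transaction for speed.
use strict;
use warnings;
use DBI;
use Text::CSV;

my $dbh = DBI->connect( "dbi:SQLite:dbname=pbp.db", "", "",
    { RaiseError => 1, AutoCommit => 1 } );
my $csv = Text::CSV->new( { binary => 1 } );
my $sth = $dbh->prepare(
    "insert into nfl_pbp (season, gameid, down, togo, ydline, description)
     values (?, ?, ?, ?, ?, ?)" );

open my $fh, '<', $ARGV[0] or die "cannot open $ARGV[0]: $!";
$csv->getline($fh);            # skip the header row
$dbh->begin_work;              # one transaction for the whole file
while ( my $row = $csv->getline($fh) ) {
    $sth->execute(@$row);
}
$dbh->commit;
close $fh;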

Once loaded, you can begin using the data almost immediately:

sqlite> select count(*) from nfl_pbp;
384809
sqlite> select distinct season from nfl_pbp;
2002
2003
2004
2005
2006
2007
2008
2009
2010
sqlite> select count( distinct gameid ) from nfl_pbp;
2381

As far as the data themselves go, I’ll warn you that the ydline field is a little “lazy”,  in that if you score a touchdown from the 20, the extra point play and the ensuing kick also “occur” on the 20. So you end up with interesting sql statements like this when you search the data:


sqlite> select count(*) from nfl_pbp where ydline = 1 and not description like "%extra point%" and not description like "%two-point%" and not description like "%kicks %";
3370
sqlite> select count(*) from nfl_pbp where ydline = 1 and description like "%touchdown%" and not description like "%extra point%" and not description like "%two-point%" and not description like "%kicks %";
1690

Using the DBI module, or whatever database interface your language supports, you can start crunching data toward game outcome probabilities in no time.

“The value of a touchdown” is a phrase used in formulas like this one:

PASSER RANKING = (yards + 10*TDs – 45*Ints)/attempts

where the first thing that comes to mind is that the TD is worth 10 yards and the interception is worth 45 yards. But is it? A TD, after all, is worth about 7 points, and in The Hidden Game of Football formulation, a turnover is worth 4 points. A TD is therefore worth considerably more than a turnover, yet the formula values it less. How is that?

Well, let me reassure you that in the new passer rating of the Hidden Game of Football, the value of a touchdown is a constant, equal to 6.8 points or 85 yards. The interception of 4 points is usually valued at 45 yards instead of 50, because most interceptions don’t make it back to the line of scrimmage.
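To make those points-to-yards conversions explicit (my arithmetic, using THGF’s slope of 0.08 points per yard): 6.8 points ÷ 0.08 points/yard = 85 yards, and 4 points ÷ 0.08 points/yard = 50 yards, trimmed to 45 to account for the average interception return.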

The field itself is zero valued at the 25 yard line. That means once you get to the one yard line, you have one yard of field to go, and the TD is worth an additional 10 yards of value. That’s where the 10 comes from. It’s not the value of the touchdown, but the additional value of the touchdown not measured on the field itself.

But what does this additional term actually mean?

Figure 1. The basic linear scoring model of THGF. TD = 6, linear slope = 0.08 points/yard. The probability of a score goes to 1.0 as the goal line is approached.

Figure 2. The model of THGF's new passer rating. The difference between y value at 100 yards and TD equals 0.8 points or 10 yards. Maximum probability of a score approaches 75/85.

If you check out the figures above: Figure 1 is introduced in The Hidden Game of Football on page 102, and it features in just about all the book’s descriptions of worth up until page 186, where we run into this text. The authors appear to be carving out a new formula from the refactored NFL formula they introduce in their book.

Awarding an 80-yard bonus for a touchdown pass makes no sense either. It’s like treating every TD pass as though it were an 80-yard bomb. Yet the majority of touchdown passes are from inside the 25 yard line.

It’s not the bonus we’re objecting to (after all, the whole point of throwing a pass is to get the ball into the end zone) but the size of the bonus, which is way out of kilter. We advocate a 10 yard bonus for each touchdown pass. It’s still higher than the yardage on a lot of TD passes, but it allows for the fact that yardage is a lot harder to get once a team gets inside the opponent’s 25.

and without quite saying so, the authors introduce the model in Figure 2. To note, the value of the touchdown and the yardage value merge in Figure 1, but remain apart in Figure 2. This value, which I’ve called a barrier potential previously, is the product of a chance to score that’s less than a 1.0 probability as you reach the goal line. If your chances maximize at merely 80%, you’ll end up with a model with a barrier potential.

If I have an objection to the quoted argument, it’s that it encourages the whole notion of double counting the touchdown “yardage”. The appropriate way to figure out the slope of any linear scoring model is by counting all scoring at a particular yard line, or within a particular part of the field (red zone scoring, for example, which could be normalized to the 10 yard line). These are scoring models, after all, not touchdown models.

Where did 6.8 come from, instead of 7?

Whereas before I was thinking it was 6 points for the TD and 0.8 points for the extra point, I’m now thinking it came from the same notions that drove the score value of 6.4 for Romer and 6.3 for Burke: it’s 7 points less the value of the runback. I’ve used 6.4 points to derive scoring models for PFR’s aya and the NFL passer rating, but in retrospect, those aren’t appropriate uses. These linear models tend to zero in value around the 25 yard line, whereas the Romer model has much higher initial slopes and reaches positive values faster.

This value can be calculated, but not in closed form. It can be solved iteratively, though, with a pretty short piece of code.

Figure 3. Perl code to solve for slope, effective TD value and y value at 100 yards in linear scoring models.

Figure 4. Solving for barriers of 10 and 20 yards.
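The code in Figure 3 isn’t reproduced here, but the shape of the iteration is roughly as follows. Everything specific in this sketch is my own assumption, not taken from the figures: the model is zero valued at the 25, the effective TD value is 7 points less the field value of the ensuing kickoff, and the receiving team starts at its own 27.

#!/usr/bin/perl
# Fixed point iteration for a linear scoring model: the slope depends on
# the effective TD value, and the effective TD value (7 points less the
# runback) depends on the slope, so iterate until the two agree.
use strict;
use warnings;

my $barrier   = 10;   # barrier potential, in yards
my $kick_spot = 27;   # assumed field position after the ensuing kickoff
my $td        = 7;    # initial guess at the effective TD value

for ( 1 .. 100 ) {
    my $slope  = $td / ( 75 + $barrier );            # y(x) = slope * (x - 25)
    my $new_td = 7 - $slope * ( $kick_spot - 25 );   # 7 less the runback value
    last if abs( $new_td - $td ) < 1e-9;
    $td = $new_td;
}
my $slope = $td / ( 75 + $barrier );
printf "TD = %.2f points, slope = %.3f points/yard, y(100) = %.1f points\n",
    $td, $slope, 75 * $slope;

Under those assumptions, a 10 yard barrier lands within a few hundredths of 6.8 points and 0.08 points per yard, which is at least consistent with the numbers in the book.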

And the solution is close enough to 6.8 that it’s easy enough to ignore the difference. Plugging in 7 points for the touchdown, and 20 and 29.1 yards respectively for the barrier potential, yields almost no change in the touchdown value for the PFR aya model and the NFL passer rating formula, and we end up with these scoring model plots.

Figure 5. PFR aya amended model. TD = 7 points, slope = 0.075 points/yard, y at 100 = 5.5 points.

Figure 6. Amended NFL prf scoring model. TD = 7.05 points, slope = 0.07 points/yard, y at 100 = 5.0 points.

I’ve just started reading Malcolm Gladwell’s What the Dog Saw, and if only for the introduction, people need to take a look at this book. This quote is pretty important to folks who want to understand how football analytics actually works, as opposed to what people tell you:

The other trick in finding ideas is figuring out the difference between power and knowledge. Of all the people whom you’ll meet in this volume, very few of them are powerful or even famous. When I said I’m most interested in minor geniuses, that’s what I mean. You don’t start at the top if you want the story. You start in the middle, because it’s the people in the middle who do the actual work in the world…. People at the top are self-conscious about what they say (and rightfully so) because they have position and privilege to protect – and self-consciousness is the enemy of “interestingness”.

The more I read smaller blogs, the better I understand what I’m doing. To note, The Hidden Game of Football is also a worthwhile read. Those guys put a lot of effort into their work and into making it understandable, and a deeper read usually pays off in a deeper understanding of the concepts.

In Gladwell’s book, there is a discussion of Nassim Taleb, currently a darling because of his contrarian views about randomness and its place in economics. But more immediately useful as a metaphor is Malcolm’s discussion of ketchup. He makes a strong case that the old ketchup formula endures because it’s hard to improve on: it has just about the right amounts of everything in the flavor spectrum to make it work for most people. I’m thinking the old NFL passer rating formula is much like that, though the form of the equation is a little difficult for most people to absorb. I’ll be touching on ways to look at the passer rating in a much simplified form shortly.

Another story is in order here, the story of the sulfa drugs. To begin, recall that the late 19th century spawned a revolution in organic chemistry, which first manifested in new, colorful dyes, and not just clothing dyes, but also the art of tissue staining. The master of tissue staining back in the day was one Paul Ehrlich who, from his understanding of staining specific tissues, came up with the notion of the “magic bullet”: find a stain that binds specifically to pathogens, attach a poison to the stain, and thereby selectively kill bacteria and other pathogens. His drug Salvarsan was the first modern antibacterial, and his work set the stage for more sophisticated drugs.

Bayer found the first of the new drugs, Prontosil, by examining coal-tar dyes. However, it only worked in live animals. A French team later found that, in the body, the drug was cleaved into two parts: a medically inactive dye, and a medically active and colorless drug that later became known as sulfanilamide. The dye portion of the magic bullet was unnecessary. Color wasn’t needed to make the drug “stick”.

When dealing with formulas, you need to figure out ways to cut the dye out of the equation, to reduce formulas to their essence. Mark Bittman does that with recipes, and his Minimalist column in the Times is a delight to read. In football, needless complication just gets in the way. Figure it out, and then ruthlessly simplify it. I suspect that’s the best path to understanding why certain old formulas still have functional relevance in modern times.

Update: added link to new article. Fixed mixing of the phrases “silver bullet” and “magic bullet”.

Where did that Pythagorean exponent of 2.37 really come from?

Football Outsiders has published their latest annual. You can get it in PDF form, and whatever gripes I have about the particulars of their methods, I’d also say just buy it and enjoy the writing. I read something in the latest annual worth mentioning: the Pythagorean exponent of 2.37 that Pro Football Reference attributes to a blogger named Matt on a blog named Statistically Speaking (via a link that no longer exists) is actually a result from Houston Rockets GM and former STATS, Inc. employee Daryl Morey.

Not only does FO mention it in the 2011 annual, but Aaron Schatz mentions it in a pair of 2005 interviews (here and here) with Baseball Prospectus. The result is mentioned also in a 2005 New York Times article, and then in a 2003 article on the FO site itself, where he gives the link to Daryl Morey’s web site (the link no longer works). Chasing down the url http://morey.org leads to the MIT Sloan Analytics site (morey.org is now a redirect). If “morey.org” is used as a search term, then the search gives you a link to an article on the Harvard Business Review site by Daryl Morey, an important one.

The 2003 article, by the way, makes it clear that Daryl Morey’s Pythagorean formula dates to 1990 and is thus 21 years old. In the Pro Football Reference article, a Chase Stuart (whose name links back to the Footballguys site) says that the average Pythagorean exponent from 1990 to 2007 is 2.535, and I’ve posted results that show that no, it sure isn’t 2.37 over the last decade. If one were to average my exponents, calculated annually, from 2001 to 2010, they would be much closer to 2.5 as well.

Also, note that my code is now part of CPAN. You don’t need to believe me; get the data and do the calculation yourself.
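For reference, the calculation at the heart of all this is only a couple of lines. The 589 points scored and 274 allowed below are the 2007 Patriots’ totals as I recall them; check them against the data:

# Pythagorean expectation: predicted winning percentage from points
# for ($pf), points against ($pa), and an exponent $x.
sub pythag {
    my ( $pf, $pa, $x ) = @_;
    return $pf**$x / ( $pf**$x + $pa**$x );
}
printf "%.3f\n", pythag( 589, 274, 2.37 );    # prints 0.860
printf "%.3f\n", pythag( 589, 274, 2.535 );   # prints 0.874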

In short, the use of 2.37 is an outdated, 21-year-old trope.

I tend to like Pythagorean expectations because, of all the scoring stats I’ve tested for predicting NFL playoff wins, this one comes closest to being reliable (p = 0.17, where p = 0.05 or less is desired).

Bashing on DVOA

I’ve posted a complaint previously about proprietary formulas, some issues being that they aren’t verifiable and, further, they aren’t falsifiable. Some more gripes: back in the 2005 interviews with Baseball Prospectus, Aaron Schatz says that the average around which DVOA is based came from a single season. In the 2011 annual, it’s made clear that the average on which DVOA is based spans more than one year. In other words, DVOA isn’t a single well defined commodity at all; the definition is changing over time. Of course, we only have FO’s word for it, as (once again) the formula is proprietary. (For all its faults, the NFL QBR is well understood, verifiable, and falsifiable.)

It’s the data, stupid.

This is where Daryl Morey comes in. The argument in his recent article is that analysts are becoming more common and their skills are high, so the formulas and methods aren’t where the action is. Who cares about those? The important element is the data sets themselves.

With the Moneyball movie set to open next month, the world will once again be gaga over the power of smart analytics to drive success. While you are watching the movie, however, think about the fact that the high revenue teams, such as the Red Sox, went out and hired smart analysts and quickly eroded any advantage the Oakland A’s had. If there had been a proprietary data set that Oakland could have built to better value players than the competition, their edge may have been sustainable.

If data trumps formulas, why all these proprietary formulas? What’s the point?

These kinds of notions are one reason I’ve come to like Brian Burke and Advanced NFL Stats more and more. He tends to give out small but useful data sets. He tends to strip the mystery off the bases of various proprietary formulas. He tends to tell you how he does things. He’s willing to debunk nonsense.

I’m sure there are some cards hidden in Brian’s deck, but far fewer than with the other guys. I’m really of the opinion that formulas are meant to be verified and falsified. Data sets? Gather those, sell those; work was involved in collecting and creating them. Analysis based on those data sets? Sell that too. Formulas? Write them in Python or Perl or Ruby, to the standard required by the common language library (PyPI, CPAN, or RubyForge), and upload your code for all to use. Since the code then gets put through a stock test harness, the reliability of the code also becomes more transparent.

This is the third of a series on drawing football diagrams, and this time we’ll be talking about drawing the defensive side of the ball. For now, we’re going to have the offense going “up” the image and the defense going “down” the image. It’s easy enough to invert: draw the offense the way we show in Part 2, rotate the result by 180 degrees, and then add your defensive players. In the old days, the defense was indicated with triangles. Most football bloggers, however, like to use fonts with position names on them for the defense. The problem with fonts is that they are often tied to an operating system, so using them well requires some familiarity with font families. A good introduction to font families is here. And to note, Helvetica is installed as part of Image Magick, so if you want a no-nonsense solution that should just work, set your font to “Helvetica-Bold”.

Since we are using Image Magick to generate our graphics, we can add color at will to our diagrams, and so one convention we’re going to follow for now is to use shape and color to distinguish offense from defense: offenses will be in white, defenses in yellow. Other conventions we could use are:

  • Using different shapes for linemen, linebackers, and defensive backs.
  • Tilting the defensive symbol to indicate a slanted lineman.
  • Shading the offensive lineman to indicate a shaded orientation on the part of the defensive player.

For now, we’re going to use this image as the basis for our defenses. We’ve spoken about the Desert Swarm, a kind of double eagle defense, here.

Arizona versus Washington, 1992. I formation versus Desert Swarm. Whip (flex tackle) on the TE side of the formation.

And these are our attempts to duplicate that photograph. Obviously one corner and the free safety position are a product of speculation.

Defense in yellow, using symbols. Slant lineman denoted by tilt of triangles.

This graphic is a text-based representation of the defense.

Helvetica-Bold is the font used here.

The images as displayed above are about 3/4 their actual size, so double click on them to see a full sized image (unless you’re using Chrome, in which case you’ll get a huge image).

Font notes:

To list the fonts that Image Magick can use by name, use the command (Win32/64 cmd window or Unix shell):

convert -list font | more

Fonts that are not listed there can still be accessed by giving the direct path to the font file itself. In Ubuntu/Linux, the Fontmatrix utility can be a big help in seeing which fonts are good and in determining the font path.

In this article, the example code is going to be given in Perl, using the Image::Magick module.

Code samples:
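The original samples aren’t reproduced here, but the following is a minimal sketch of the approach; the field color, coordinates, and symbol sizes are all my own choices:

#!/usr/bin/perl
# Draw one offensive player, one slanted defensive lineman (a tilted
# yellow triangle), and one text-based defender, per the conventions above.
use strict;
use warnings;
use Image::Magick;

my $img = Image::Magick->new;
$img->Set( size => '200x200' );
$img->Read('xc:darkgreen');

# Offensive lineman: white circle (center point, then a perimeter point).
$img->Draw( primitive => 'circle', points => '100,160 100,148',
            fill => 'white', stroke => 'black' );

# Defensive lineman: yellow triangle. Note the rotation is about the
# image origin, so the points have to be chosen with that in mind.
$img->Draw( primitive => 'polygon', points => '100,80 85,110 115,110',
            fill => 'yellow', stroke => 'black', rotate => 15 );

# Text-based defender, using the font suggested above.
$img->Annotate( text => 'W', font => 'Helvetica-Bold', pointsize => 24,
                fill => 'yellow', x => 140, y => 100 );

$img->Write('defense.png');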

Previous parts of this article:
