### April 2011

I was, to some extent, inspired by the article by Benjamin Morris on his blog Skeptical Sports, where he suggests that three factors matter most for winning playoff games in the NBA: winning percentage, previous playoff experience, and pace, a measure of possessions. Pace translated into the NFL would be a measure that counts elements such as turnovers and punts. In the NBA, elements such as rebounds + turnovers + steals would factor in.

I’ve recently captured a set of NFL playoff data from 2001 to 2010, which I analyzed by coding each game as a number. If the home team won, the game was assigned a 1. If the visiting team won, the game was assigned a 0. Because of the way the data were organized, the winner of the Super Bowl was always treated as the home team.

I tested a variety of pairs of regular season statistical elements to see which ones correlated best with playoff winning percentage. The test of significance was a logistic regression (see also here), as implemented in the Perl module PDL::Stats.
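As a rough illustration of that kind of test, here is a minimal Python sketch of a one-predictor logistic regression fitted by plain gradient ascent. It is not the PDL::Stats implementation, it uses a single predictor rather than a pair, and the `sos_diff` and `home_win` values are made up purely for illustration; they are not the playoff data described above.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y=1) = sigmoid(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # observed outcome minus predicted probability
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical data: home-minus-visitor difference in a regular season stat,
# and the game outcome coded 1 (home team won) or 0 (visitor won).
sos_diff = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -1.0, 1.0]
home_win = [0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]

b0, b1 = fit_logistic(sos_diff, home_win)
print("intercept:", round(b0, 3), "slope:", round(b1, 3))
```

A positive slope says that a larger difference in the predictor goes with better odds of the home team winning. Judging whether that slope is significant takes a further step (standard errors or a likelihood ratio test), which a statistics package would normally supply.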

Two factors emerge rapidly from this kind of analysis. The first is that playoff experience is important. By this we mean that a team has played any kind of playoff game in the previous two seasons. Playoff wins were not significant in my testing, by the way, only the experience of actually being in the playoffs. The second significant parameter was the strength of schedule (SOS) component of SRS. Differences in SRS were not significant in my testing, but differences in SOS were. Playing tougher competition evidently increases the odds of winning playoff games.

On a wet April weekend, what better way to spend some time than looking for an exotic football front? And in this, Rob Ryan seldom disappoints.

We’ll be looking at some Rob Ryan fronts that can be found on NFL.com video of the week 14 game between Pittsburgh and Cleveland, 2009. This is when Cleveland began a 4-0 tear to end the season.

I’ve seen Rob Ryan stand up the defensive ends in what initially looks like a 4 man front but not the tackles, until now:

And in this front, you see a 2-4 nickel front, looking a bit like a 3-4 with the LDE of a 3-4 having been replaced with an extra defensive back.

And what would a Rob Ryan survey be without a couple of shots of no-down-lineman (cloud) defenses?

The Pythagorean expectation is a classic Bill James formula and yet another tool that points to scoring being a more important indicator of winning potential than actually winning. The formula goes:

win percent = (points scored)**2/((points scored)**2 + (points allowed)**2)
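For readers who would rather see it as code, here is a minimal Python sketch of the same formula. The function name and the `exponent` parameter are my own choices for illustration, and the point totals in the example are hypothetical.

```python
def pythagorean_pct(points_for, points_against, exponent=2.0):
    """Expected winning percentage from points scored and points allowed."""
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)

# A hypothetical team that scored 400 points and allowed 300:
# 160000 / (160000 + 90000) = 0.64
print(pythagorean_pct(400, 300))
```

Leaving the exponent as a parameter makes it easy to try values other than 2.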

Wikipedia writes about the formula here, and Pro Football Reference writes about it here. And, well, is it really true that the exponent in football is 2.37, and not 2? One of the advantages in having an object that calculates these things (i.e. version 0.2 of Sport::Analytics::SimpleRanking, which I’m testing) is that I can just test.

What my code does is compute the best fit exponent, in a least squares sense, against the actual winning percentage of each club. And as Doug Drinen has noted, the Pythagorean expectation translates better into next year’s winning percentage than does actual winning percentage. My code uses a golden section search to find the exponent.
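The fitting step can be sketched in a few lines of Python. This is an illustration of the technique and not the code in Sport::Analytics::SimpleRanking: the function names, the search bracket of 1 to 5, and the team records are all assumptions made for the example. Golden section search also assumes the squared error has a single minimum inside the bracket.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # 1/phi, roughly 0.618

def pythagorean_pct(points_for, points_against, exponent):
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)

def sum_squared_error(exponent, teams):
    """Squared difference between actual and predicted winning percentage."""
    return sum((pct - pythagorean_pct(pf, pa, exponent)) ** 2
               for pf, pa, pct in teams)

def golden_section_min(f, lo, hi, tol=1e-6):
    """Shrink the bracket [lo, hi] around the minimum of a unimodal f."""
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            # Minimum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = f(c)
        else:
            # Minimum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2

def fit_exponent(teams, lo=1.0, hi=5.0):
    return golden_section_min(lambda x: sum_squared_error(x, teams), lo, hi)

# Hypothetical (points scored, points allowed, winning percentage) records.
teams = [(430, 290, 0.750), (380, 340, 0.5625), (310, 360, 0.4375), (270, 400, 0.250)]
print(round(fit_exponent(teams), 3))
```

Each pass through the loop costs one new function evaluation and shrinks the bracket by a factor of about 0.618, which is why the method is a comfortable fit for a one-parameter problem like this.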

Real percentage versus the predicted percentages in 2010.

Anyway, the best fit exponent values I calculate for the years 2001 through 2010 are:

• 2001: 2.696
• 2002: 2.423
• 2003: 2.682
• 2004: 2.781
• 2005: 2.804
• 2006: 2.394
• 2007: 2.509
• 2008: 2.620
• 2009: 2.290
• 2010: 2.657

No, not quite 2.37, though I differ from PFR by about 0.02 in the year 2006. Just glancing at it and knowing how approximate these things are, 2.5 probably works in a pinch. The difference between an exponent of 2 and 2.37 for, say, the Philadelphia Eagles in 2007 amounts to about 0.2 games in predicted wins over the course of a season.

I’ve been quiet a while, because I’ve been a little busy. A version of the simple ranking system, favored by Doug Drinen, is now coded as a CPAN module. CPAN, the Comprehensive Perl Archive Network, is a user-contributed library, and is often thought to be Perl’s biggest strength.

The object that the SRS module creates can be used as the parent for other analysis, which is one reason for contributing it. A module that inherits from it also gets its game parsing functions for free. That’s one reason I went that route. Since I eventually want to think seriously about the “homemade Sagarin” technique in a reproducible way, this is a place to start.

We’ll start on a small, pretty blog called “Sabermetrics Research” and this article, which encapsulates nicely what’s happening. Back when sabermetrics was a “gosh, wow!” phenomenon and mostly the kind of thing that drove aficionados to their campus computing facility, the word “sabermetrics” was okay. Now that this kind of analysis is going in-house, it’s being called “analytics”. (A group of speakers, including Mark Cuban, are quoted here as saying that perhaps 2/3 of all basketball teams now have a team of analysts.) QM types, and even the older analysts, need a more dignified word to describe what they do.

The tools are different. The phrase “logistic regression” is all over the place (such as here and here). I’ve been trying to rebuild a toolset quickly. I can code stuff in from “Numerical Recipes” as needed, and if I need a heavyweight algorithm, I recall that NL2SOL (John Dennis was a Rice prof, I’ve met him) is available as part of the R language. Hrm. Evidently, NL2SOL is also available here. PDL, as a place to start, has been fantastic. It has hooks to tons of things, as well as its own built-ins.

Logistic regression isn’t a part of PDL but it is a part of PDL::Stats, a freely available add-on package, available through CPAN. So once I’ve gnawed on the techniques enough, I’d like to try and see if Benjamin Morris’s result carries over to football. He combined winning percentage and average point spread (which, omg, is now called MOV, for margin of victory) and showed that in basketball the combination is a better predictor of winning than either alone.

I suspect, given that Brian Burke would as soon do a logistic regression as tie his shoes, that it’s been done.

To show what PDL::Stats can do, I’ve implemented Brian Burke’s “Homemade Sagarin” rankings into a bit of code I published previously. The result? This simple technique had Green Bay ranked #1 at the end of the 2010 season.

There are some issues with this technique. I’ll be talking about those in another article.

This is a defensive front from the Pittsburgh-Atlanta game. Look at it for 2 seconds. Is it a 46 or not?

So what is it?

It’s easy to confuse until you see the DB lined up over the slot receiver. The linemen aren’t spaced the way a 46 would be, but I suspect you can get a 46 effect out of a 34 front by pinching the ends into the offensive guards.

I spent a lot of time looking at other teams and wasting that time. No fronts of interest to speak of. Now, Pittsburgh tends to show a lot of 34 looks, but there is so much motion in their linebackers that they tend to keep someone like me engaged. For example, what’s happening here?

Some things to note: the front is shifted to the weak side of the formation. The LDE is over the guard, the NT appears to be in the “A” gap, and the RDE is outside the LT. The result was that Matt Ryan ended up being intercepted by Troy Polamalu.

We’ve spoken about the simple ranking system before, and given code to calculate it. I want to set up a “More” mark, and talk about issues with the algorithm and more hardcore tech after the mark.

What we’re aiming for are Perl implementations of common predictive systems. We’re going to build them and then run them against an exhaustive grid of data and game categories. I want to see what kinds of games these models predict best, and which ones they work worst for. That’s what all this coding is heading for: methods to validate the predictive ability of simple models.

What I’m going to talk about now is an implementation of the Simple Ranking System in Perl. The Simple Ranking System is described on Pro Football Reference here. It’s important because it’s a simple – perhaps the simplest – model of the form

team strength = a(Point Spread) + b(Correction Factor)

where a and b are small positive real numbers. In SRS, a = 1 and b = 1/(total number of games played). The correction factor is the sum of the team strengths of all the team’s opponents.

The solution described by Doug Drinen on the Pro Football Reference page isn’t the matrix solution, but an iterative one. You simply do the calculation over and over again until you get close enough.
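Here is a minimal Python sketch of that iterative idea, for illustration only; it is not the Perl module’s code. The `(team, opponent, margin)` game layout, the function name, and the stopping tolerance are assumptions of mine, and the four-team schedule is invented so the answer is easy to check. Plain iteration like this is not guaranteed to settle down for every possible schedule, which is why the loop is capped.

```python
from collections import defaultdict

def simple_ranking(games, tol=1e-9, max_iter=10000):
    """Iterate SRS = average point spread + average opponent SRS.

    games: list of (team, opponent, margin), margin from the first team's view.
    """
    margins = defaultdict(list)
    opponents = defaultdict(list)
    for team, opp, margin in games:
        margins[team].append(margin)
        margins[opp].append(-margin)
        opponents[team].append(opp)
        opponents[opp].append(team)

    # Average point spread (margin of victory) per team.
    mov = {t: sum(m) / len(m) for t, m in margins.items()}

    srs = dict(mov)  # first guess: no schedule correction at all
    for _ in range(max_iter):
        new = {t: mov[t] + sum(srs[o] for o in opponents[t]) / len(opponents[t])
               for t in mov}
        change = max(abs(new[t] - srs[t]) for t in mov)
        srs = new
        if change < tol:  # "close enough"
            break
    return srs

# Invented round robin where each margin is the gap between "true" strengths
# of A=6, B=2, C=-2, D=-6, so the iteration should recover those numbers.
games = [("A", "B", 4), ("A", "C", 8), ("A", "D", 12),
         ("B", "C", 4), ("B", "D", 8), ("C", "D", 4)]
for team, rating in sorted(simple_ranking(games).items()):
    print(team, round(rating, 3))
```

In this toy schedule team A has an average point spread of 8 but faced opponents averaging -2, so its rating settles at 6.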

This is a book about the origins of baseball and right off, it shreds any notion you might have had about Abner Doubleday being the creator of America’s original big time sport.

John Thorn, who is also the editor of Total Football, has written an engaging account of the wide variety of games that were being played in the eighteenth and early nineteenth centuries. This is an era when things like the “Massachusetts game” and the “New York game” were in vogue, when things called “cat” and “town ball” were played, when cricket was so popular it might have been America’s game. It’s a long excursion into the variants of the day, the slow evolution towards a game that is recognizably modern baseball, and ruminations on how things like the quality of the ball affected the game on the field. This one falls into the “must read” category, because it’s a celebration of America’s history as a sporting nation.

The median is a more robust statistical measure than the mean. If you have one team that won by 50, won by 5, won by 3, and lost by 1, and another that won by 8, won by 5, won by 3, and lost by 100, both would have a record of 3-1 and a median point spread of 4. It’s a measure that doesn’t reward or punish blowouts. In the language of statistics, the median is less affected by outlying values.
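The two example teams above are easy to check in a few lines of Python, using the standard library’s `statistics.median` and a hand-written mean for comparison. The variable names are mine.

```python
from statistics import median

team_one = [50, 5, 3, -1]    # won by 50, 5, 3; lost by 1
team_two = [8, 5, 3, -100]   # won by 8, 5, 3; lost by 100

def mean(values):
    return sum(values) / len(values)

for name, margins in (("team one", team_one), ("team two", team_two)):
    print(name, "median:", median(margins), "mean:", mean(margins))
# Both medians are 4.0, while the means are 14.25 and -21.0.
```

The single blowout drags each mean a long way in its own direction, while the median stays where the typical game was.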

I’m introducing the idea to step up to some later ideas, more “out there”, but for now, let’s leave you with some stats from the 2010 season, and the notion that for a 10-6 team, Green Bay had a pretty nice median point spread.
