I’m doing a brief review of Python again, as it relates to things that draft fans might like, and note that the random and statistics modules both seem pretty useful.

So, the design goal here is: can we make a good enough simulation to tell us something about draft strategy? Can we learn something about BPA (best player available) versus need by using Python code? Right now I don’t have an answer, but I can show you some of the approach so far.

One thing I’ve found, if you’re moving from another language into Python, is that you can eliminate a lot of scope issues by doing substantial bits of work inside a Python class. The scope of self attributes is easy to reason about, and you’re not left wondering whether a plain variable in Python has exactly the same scope as, say, a lexical in Perl.

So for now, we present the Playa class, a “draftable” object.

import random
from statistics import mean
from pprint import pprint


class Playa:
    def __init__(self, oldid=0):
        self.value = random.randrange(1,101)
         = oldid + 1
        self.pos = self.getposition()
        self.drafted = False
        self.meanshift = -1000.0

    def __repr__(self):
        return "Playa id:{0:3d} pos:{1:s} val:{2:3d}".format(, self.pos, self.value )

    def out(self):
        return "id:{0:3d} pos:{1:s} val:{2:3d}".format(, self.pos, self.value )

    def getposition(self):
        poslist = ["QB","RB","WR","FL","SR","TE","LT","LG","RT","RG","OC"]
        return poslist[random.randrange(0,11)]

    def draft(self):
        self.drafted = True

This object will allow us to generate players and then associate them with teams. Players can be identified by their id, a draft value can be derived from their real value (1-100), and a logical variable shows whether they are drafted or not.

I’m only using offensive positions in this simulation. And since more and more teams use a slot receiver as opposed to a fullback, we have “SR” in our position charts.

With 32 teams, if you generate 320 players per draft, the values of 1 to 100 break down nicely: real values of 91 to 100 are first round talent, 81 to 90 are second round talent, and so on.
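
As a rough sketch of how the generation and that bucketing might be used (the round formula below is my own illustration, not part of the Playa class itself):

players = [Playa(i) for i in range(320)]   # 32 teams, 10 rounds of draftable players

def notional_round(player):
    # real value 91-100 -> round 1, 81-90 -> round 2, ..., 1-10 -> round 10
    return (100 - player.value) // 10 + 1

first_rounders = [p for p in players if notional_round(p) == 1]
print(len(first_rounders), "players grade out as first round talent")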


This question came up when I was looking up the last playoff appearance for seven probable NFC playoff teams. Both New Orleans and Philadelphia last played in the playoffs four years ago, in 2013. And then the thought came up in my head, “But Drew Brees is a veteran QB.” This seems intuitive, but actually creating such a definition, and then later testing it with a logistic regression, is where the rub lies.

There are any number of QBs a fan can point to and see that the QB mattered. Roger Staubach seemed a veteran in this context back in the 1970s, Joe Montana in the 1980s, Ben Roethlisberger in the 21st century, Eli Manning in 2011, and Aaron Rodgers last year. But plenty of questions abound. If a veteran QB is an independent variable whose presence or absence changes the odds of winning a playoff game, what tools do we use to define such a person? What tools would we use to eliminate entanglement, in this case between the team’s overall offensive strength and the QB himself?

The difference between a good metric and a bad metric can be seen when looking at the effect of the running game on winning. The correlation between rushing yards per carry and winning is pretty small. The correlation between run success rate and winning is larger. In short, being able to reliably convert on 3rd and 1 contributes more to success than averaging 5 yards a carry as opposed to 4.

At this point I’m just discussing the idea. With a definition in mind, we can do logistic regression tests with a single independent variable. Then, with a big enough data set – 15 years of playoff data should be enough – we can start testing three independent variable logistic models (QB + SOS + PPX).
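
Just to make the shape of that first test concrete, a single independent variable logistic regression in Python could look like the sketch below. The data here are made-up placeholders, and statsmodels is only one of several packages that will fit the model:

import numpy as np
import statsmodels.api as sm

# Hypothetical placeholder data: one row per playoff game, a win flag (1/0)
# and a yes/no "veteran QB" flag. Real data would come from the 15 years of
# playoff results mentioned above.
wins  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1])
vetqb = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0])

X = sm.add_constant(vetqb)        # intercept plus the single independent variable
result = sm.Logit(wins, X).fit()
print(result.summary())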

I’ve been curious, since I took on a new job and a new primary language at work, to what extent I could begin to add Python to the set of tools that I could use for football analytics. For one, the scientific area where the analyst needs the most help from experts is optimization theory and algorithms, and at this point in time, the developments in Python are more extensive than those in Perl.

To start you have the scipy and numpy packages, with scipy.optimize offering diverse tools for minimization and least squares fitting. Logistic regressions in Python are discussed here, and lmfit provides some enhancements to the fitting routines in scipy. But first we need to be able to read and write existing data, and from there write the SRS routines. The initial routines were to be based on my initial SRS Perl code, so don’t be surprised if code components look very familiar.
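
As a taste of those fitting tools, a least squares fit via scipy.optimize.curve_fit looks something like this (toy data, purely for illustration; nothing in the SRS code below depends on it):

import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    return a * x + b

xdata = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ydata = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

params, cov = curve_fit(line, xdata, ydata)
print(params)      # fitted slope and intercept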

This code will use an ORM layer, SQLAlchemy, to get to my existing databases, and to create the class used to fetch the data, we used a Python executable named sqlacodegen. We set up sqlacodegen in a virtual environment and tried it out. The output was:

# coding: utf-8
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
metadata = Base.metadata

class Game(Base):
    __tablename__ = 'games'

    id = Column(Integer, primary_key=True)
    week = Column(Integer, nullable=False)
    visitor = Column(String(80))
    visit_score = Column(Integer, nullable=False)
    home = Column(String(80))
    home_score = Column(Integer, nullable=False)

Which, with slight mods, can be used to read my data. The whole test program is here:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from pprint import pprint

def srs_correction(tptr = {}, num_teams = 32):
    # shift every team's srs and sos so that the league-wide srs sums to zero
    total = 0.0
    for k in tptr:
        total += tptr[k]['srs']
    total = total/num_teams
    for k in tptr:
        tptr[k]['srs'] -= total
        tptr[k]['sos'] -= total

def simple_ranking(tptr = {}, correct = True, debug = False):
    # seed each team's rating with its margin of victory (point spread per game)
    for k in tptr:
        tptr[k]['mov'] = tptr[k]['point_spread']/float(tptr[k]['games_played'])
        tptr[k]['srs'] = tptr[k]['mov']
        tptr[k]['oldsrs'] = tptr[k]['srs']
        tptr[k]['sos'] = 0.0
    delta = 10.0
    iters = 0
    # iterate srs = mov + sos until the strength of schedule terms stop changing
    while ( delta > 0.001 ):
        iters += 1
        if iters > 10000:
            return True
        delta = 0.0
        for k in tptr:
            sos = 0.0
            for g in tptr[k]['played']:
                sos += tptr[g]['srs']
            sos = sos/tptr[k]['games_played']
            tptr[k]['srs'] = tptr[k]['mov'] + sos
            newdelta = abs( sos - tptr[k]['sos'] )
            tptr[k]['sos'] = sos
            delta = max( delta, newdelta )
        for k in tptr:
            tptr[k]['oldsrs'] = tptr[k]['srs']
    if correct:
        srs_correction( tptr )
    if debug:
        print("iters = {0:d}".format(iters)) 
    return True     

year = "2001"
userpass = "user:pass"

nfl = "mysql+pymysql://" + userpass + "@localhost/nfl_" + year
engine = create_engine(nfl)

Base = declarative_base(engine)
metadata = Base.metadata

class Game(Base):
    __tablename__ = 'games'
    id = Column(Integer, primary_key=True)
    week = Column(Integer, nullable=False)
    visitor = Column(String(80))
    visit_score = Column(Integer, nullable=False)
    home = Column(String(80))
    home_score = Column(Integer, nullable=False)

Session = sessionmaker(bind=engine)
session = Session()
res = session.query(Game).order_by(Game.week).order_by(Game.home)

tptr = {}
for g in res:
#    print("{0:d} {1:s} {2:d} {3:s} {4:d}".format( g.week, g.home, g.home_score, g.visitor, g.visit_score ))
    if g.home not in tptr:
        tptr[g.home] = {}
        tptr[g.home]['games_played'] = 1
        tptr[g.home]['point_spread'] = g.home_score - g.visit_score
        tptr[g.home]['played'] = [ g.visitor ]
    else:
        tptr[g.home]['games_played'] += 1
        tptr[g.home]['point_spread'] += (g.home_score - g.visit_score)
        tptr[g.home]['played'] += [ g.visitor ]
    if g.visitor not in tptr:
        tptr[g.visitor] = {}
        tptr[g.visitor]['games_played'] = 1
        tptr[g.visitor]['point_spread'] = g.visit_score - g.home_score
        tptr[g.visitor]['played'] = [ g.home ]
    else:
        tptr[g.visitor]['games_played'] += 1
        tptr[g.visitor]['point_spread'] += ( g.visit_score - g.home_score )
        tptr[g.visitor]['played'] += [ g.home ]

simple_ranking( tptr )
for k in tptr:
    print("{0:10s} {1:6.2f} {2:6.2f} {3:6.2f}".format( k, tptr[k]['srs'],tptr[k]['mov'], tptr[k]['sos']))

The output is limited to two digits past the decimal, and to that precision my results are the same as those from my Perl code. The routines should look much the same. The only real issue is that you have to convert one of the numbers to a float when you calculate margin of victory, as the two inputs are integers. Python isn’t as promiscuous in type conversion as Perl is.

Last note. Although we included pprint, at this point we’re not using it. That’s because with the kind of old fashioned debugging skills I have, I use pprint the way a Perl programmer might use Data::Dumper, to look at data structures while developing a program.

Update: the original Doug Drinen post about the Simple Ranking System has a new url. You can now find it here.


Perhaps the most important new thing I note is that Pro Football Reference now has play by play data, and ways to display those data in CSV format. Creating parsers for the data would be work, but it means that advanced stats are now accessible to the average fan.

In Ubuntu 16.04, PDL::Stats is now a standard Ubuntu package and so the standard PDL installation can be used with my scripts. About the only thing you need to use CPAN for, at this point, is installing Sport::Analytics::SimpleRanking.

At work I use a lot of Python these days. I have not had time to rethink all this into Pythonese. But I’m curious, as the curve fitting tools in Python are better/different than those in Perl.

Football diagrams: Although the Perl module Graphics::Magick isn’t a part of CPAN, graphicsmagick and libgraphics-magick-perl are part of the Ubuntu repositories.


It was yesterday that Nathan Oyler asked me on Twitter if I could rewrite my Perl code to calculate offensive SRS and defensive SRS. Nathan, I believe, is working on a game or a simulation and wanted to be able to calculate these values. I replied, “Do you know how to calculate these?” and, after playing around a little, I can only conclude that the best way to handle this calculation is going to be a matter of debate.

That said, I have a way to calculate these numbers, but first we need a little theory. It starts with Chase Stuart’s comment on the Smart Football blog that these values are related to points for and points against. Given that, and the definition of margin of victory:

MOV(team) = ( points_for(team) – points_against(team) ) / games_played(team) = point_spread(team) / games_played(team)

We now need to define an average score. This works:

AVG_SCORE = points_for(all teams)/ games_played(all teams)

From these definitions and the hint Chase dropped, we define offensive MOV and defensive MOV this way.

OMOV(team) = ( points_for(team) – games_played(team)*AVG_SCORE ) / games_played(team)

DMOV(team) = ( games_played(team)*AVG_SCORE – points_against(team) ) / games_played(team)

So, rather than plugging MOV into an SRS linear equation solver, you can plug in offensive MOV and defensive MOV, and then you get numbers that will help you calculate an OSRS and a DSRS.
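
Expressed as code, those definitions are only a few lines. A minimal sketch, with function and variable names of my own choosing, assuming you already have points for, points against, and games played per team:

def league_average_score(points_for, games_played):
    # points_for and games_played are dicts keyed by team name
    return sum(points_for.values()) / float(sum(games_played.values()))

def off_def_mov(pf, pa, gp, avg_score):
    # OMOV and DMOV for a single team, per the definitions above.
    # Note that OMOV + DMOV recovers the ordinary MOV = (pf - pa) / gp.
    omov = (pf - gp * avg_score) / float(gp)
    dmov = (gp * avg_score - pa) / float(gp)
    return omov, dmov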

I say “will get you numbers” because there’s a gotcha, in that whenever you have OSOS and DSOS and they are of opposite sign, there is no unique solution to the equation

SOS = OSOS + DSOS

as I can choose any constant c and the result

SOS = (OSOS + c) + (DSOS – c)

is also a solution. This kind of linear wandering around, the solver adding arbitrary constants to OSOS and DSOS, happens when you attempt to solve these equations. The issue is, there is no one obvious solution to this problem, unlike regular SRS where the constraint “sum of all SRS must equal zero” applies. Now if someone uncovers a constraint, let me know and I’ll be happy to code it. In the absence of such a rule so far, I’ve used this folk rule:

Reduce the magnitude of the OSOS and DSOS terms until the smaller of the two, in terms of absolute magnitude, is zero.
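
In Python, my reading of that folk rule reduces to something like this (a sketch only; the working implementation is in the Perl module linked below):

def folk_rule(osos, dsos):
    # Only applies when OSOS and DSOS have opposite signs; otherwise leave them alone.
    if osos * dsos >= 0:
        return osos, dsos
    c = min(abs(osos), abs(dsos))    # shrink both terms by the smaller magnitude
    if osos > 0:
        return osos - c, dsos + c
    return osos + c, dsos - c

# Example: folk_rule(0.8, -0.4) returns (0.4, 0.0); the sum OSOS + DSOS is preserved.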

This is straightforward to code. That my solution is not the same as the one in Pro Football  Reference is easy enough to show. If I go to this page, I get these values for the 2007 New England Patriots. If I calculate OMOV and DMOV using my code, we can extract the DSOS and OSOS values for this calculation.

2007 New England Patriots
SRS    OSRS   DSRS   OMOV   DMOV   OSOS   DSOS
20.1   15.9    4.2   15.1    4.6    0.8   -0.4


and while my code uses 0.4 and 0 for OSOS and DSOS respectively, the evident values that Pro Football Reference uses are 0.8 and -0.4. All that clear now?

I’m pretty sure my SOS calculation isn’t the same as PFR’s either, as I’ve seen differences in OSRS/DSRS that amount to a point or two. In some cases this occurs when my calc yields same-signed OSOS and DSOS values, and in that case, I don’t modify them at all.

The source code I’ve used to do these calculations is given here, as a Perl module. A “snapshot” of the code fragment I use to feed the Perl module source is:


typical output is, for the 2007 season:


And yes, there are plenty of unknowns at this point. PFR has never really given any details of their OSOS/DSOS calculations, or the normalization routines they use. DSRS and OSRS as implemented by them are a “black box”. This implementation may not, in the long run, be the best of them, but it is reasonably well documented.

Update: corrected DMOV definition. Rewritten slightly for clarity.


The recent success of DeMarco Murray has energized the Dallas fan base. Felix Jones is being spoken of as if he’s some kind of leftover (I know, a 5.1 YPC over a career is such a drag), and people are taking Murray’s 6.7 YPA for granted. That wasn’t the thing that got me in the fan circles. It’s that Julius Jones was becoming a whipping boy again, the source of every running back sin there is, and so I wanted to build some tools to help analyze Julius’s career, and at the same time, look at Marion Barber III’s numbers, since these two are historically linked.

We’ll start with this database, and a bit of SQL, something to let us find running plays. The SQL is:

select down, togo, description from nfl_pbp where season = 2007 and gameid LIKE "%DAL%" and description like "%J.Jones%" and not description LIKE '%pass%' and not description LIKE '%PENALTY on DAL%' and not description like '%kick%' and not description LIKE '%sacked%'

It’s not perfect. I’m not picking up plays where a QB is sacked and the RB recovers the ball. A better bit of SQL might help, but that’s a place to start. We bury this SQL into a program that then parses the description string for the statement “for X yards”, or alternatively, “for no gain”, and adds them all up. From this, we could calculate yards per carry, but more importantly, we’ll calculate run success and we’ll also calculate something I’m going to call a failure rate.

For our purposes, a failure rate is the number of plays that gained 2 yards or less, divided by the total number of running attempts, multiplied by 100. The purpose of the failure rate is to investigate whether Julius, in 2007, became the master of the 1 and 2 yard run. One common fan conception of his style of play in his last year in Dallas is that “he had plenty of long runs but had so many 1 and 2 yard runs as to be useless.” I wish to investigate that.
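
To give a flavor of the parsing and the failure rate arithmetic, here is a stripped-down sketch in Python (function names are mine, and a real script would feed it the description strings returned by the SQL above):

import re

def run_yards(description):
    # Pull rushing yardage out of a play description: "for X yards" or "for no gain"
    if "for no gain" in description:
        return 0
    m = re.search(r"for (-?\d+) yards?", description)
    return int(m.group(1)) if m else None

def failure_rate(descriptions):
    # Percentage of carries that gained 2 yards or less
    gains = [y for y in (run_yards(d) for d in descriptions) if y is not None]
    failures = sum(1 for g in gains if g <= 2)
    return 100.0 * failures / len(gains)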



Brian Burke has made available play by play data from 2002 to 2010 here, and it’s available as .CSV files. The files are actually pretty small, about 5 megs for a year’s worth of data. CSV is a convenient format, and the data themselves are well enough organized that an Excel or OpenOffice junkie can use them, and so can those of us who work with SQL databases. The advantage of a SQL database is the query language you inherit. And what we’re going to show is how to embed Brian’s data into a small simple SQLite database (see here for D. Richard Hipp’s site, and here for the Wikipedia article).

SQLite is a tiny SQL engine, about 250 kilobytes in size. That’s right, 250 kilobytes. It’s intended to be embedded in applications, and so it doesn’t have the overhead of a network service, the way MySQL and Postgres do. It is extensively used in things like browsers (Firefox), mail clients, and internet metrics applications (Unica’s Nettracker). The code is in the public domain, and there are commercial versions of this free product you can buy, if you’re into that kind of thing. Oracle, among others, sells a commercial derivative of this free product.

A SQLite database is a single file, so once you create it,  you could move the file onto a USB stick and carry it around with you (or keep it on your Android phone). The database that results is about 55 megabytes in size, not much different in size from the cumulative .CSVs themselves.

Brian’s data lack a primary key, which is fine for spreadsheets, but creates issues in managing walks through sequential data in a database. We’ll create a schema file (we’ll call it schema.sql) as so:

Use a text editor to create it. With the sqlite3 binary, create a database by saying:

sqlite3 pbp.db
sqlite>.read schema.sql

Once that’s all done, we’ll use Perl and the DBI module to load these data into our SQLite table. Loading is fast so long as you handle the transaction as a single unit, with the $dbh->begin_work and $dbh->commit statements.
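
For what it’s worth, the same single-transaction idea sketched with Python’s sqlite3 module would look roughly like this (the file name and column list here are placeholders; Brian’s files carry more fields than I show):

import csv
import sqlite3

conn = sqlite3.connect("pbp.db")
with open("pbp_2010.csv") as fh:                    # placeholder file name
    rows = [(r["gameid"], r["description"]) for r in csv.DictReader(fh)]

# One transaction for the whole load keeps it fast, the same idea as
# $dbh->begin_work ... $dbh->commit on the Perl side.
with conn:                                          # commits on success
    conn.executemany("insert into nfl_pbp (gameid, description) values (?, ?)", rows)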

Once loaded, you can begin using the data almost immediately:

sqlite> select count(*) from nfl_pbp;
sqlite> select distinct season from nfl_pbp;
sqlite> select count( distinct gameid ) from nfl_pbp;

As far as the data themselves go, I’ll warn you that the ydline field is a little “lazy”,  in that if you score a touchdown from the 20, the extra point play and the ensuing kick also “occur” on the 20. So you end up with interesting sql statements like this when you search the data:

sqlite> select count(*) from nfl_pbp where ydline = 1 and not description like "%extra point%" and not description like "%two-point%" and not description like "%kicks %";
sqlite> select count(*) from nfl_pbp where ydline = 1 and description like "%touchdown%" and not description like "%extra point%" and not description like "%two-point%" and not description like "%kicks %";

Using the DBI module, or whatever database interface your language supports, you can start crunching data towards game outcome probabilities in no time.

