Code


I’ve been curious, since I took on a new job and a new primary language at work, about the extent to which I could add Python to the set of tools I use for football analytics. For one, the scientific area where the analyst needs the most help from experts is optimization theory and algorithms, and at this point in time the developments in Python are more extensive than those in Perl.

To start, you have the scipy and numpy packages, with scipy.optimize offering diverse tools for minimization and least squares fitting. Logistic regressions in Python are discussed here, and lmfit provides some enhancements to the fitting routines in scipy. But first we need to be able to read and write existing data, and from that, write the SRS routines. The initial routines were to be based on my original SRS Perl code, so don’t be surprised if code components look very familiar.

This code uses an ORM layer, SQLAlchemy, to get to my existing databases, and to create the class used to fetch the data, we used a Python executable named sqlacodegen. We set up sqlacodegen in a virtual environment and tried it out. The output was:

# coding: utf-8
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
metadata = Base.metadata

class Game(Base):
    __tablename__ = 'games'

    id = Column(Integer, primary_key=True)
    week = Column(Integer, nullable=False)
    visitor = Column(String(80))
    visit_score = Column(Integer, nullable=False)
    home = Column(String(80))
    home_score = Column(Integer, nullable=False)

This, with slight modifications, can be used to read my data. The whole test program is here:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from pprint import pprint

def srs_correction(tptr = {}, num_teams = 32):
    # Recenter the ratings so the league-average SRS is zero.
    total = 0.0
    for k in tptr:
        total += tptr[k]['srs']
    total = total/num_teams
    for k in tptr:
        tptr[k]['srs'] -= total
        tptr[k]['sos'] -= total

def simple_ranking(tptr = {}, correct = True, debug = False):
    # Iterate until each team's SRS equals its margin of victory (mov)
    # plus the average SRS of the opponents it has played (sos).
    for k in tptr:
        tptr[k]['mov'] = tptr[k]['point_spread']/float(tptr[k]['games_played'])
        tptr[k]['srs'] = tptr[k]['mov']
        tptr[k]['oldsrs'] = tptr[k]['srs']
        tptr[k]['sos'] = 0.0
    delta = 10.0
    iters = 0
    while ( delta > 0.001 ):
        iters += 1
        if iters > 10000:
            return True
        delta = 0.0
        for k in tptr:
            sos = 0.0
            for g in tptr[k]['played']:
                sos += tptr[g]['srs']
            sos = sos/tptr[k]['games_played']
            tptr[k]['srs'] = tptr[k]['mov'] + sos
            newdelta = abs( sos - tptr[k]['sos'] )
            tptr[k]['sos'] = sos
            delta = max( delta, newdelta )
        for k in tptr:
            tptr[k]['oldsrs'] = tptr[k]['srs']
    if correct:
        srs_correction( tptr )
    if debug:
        print("iters = {0:d}".format(iters))
    return True     

year = "2012"
userpass = "username:password"

nfl = "mysql+pymysql://" + userpass + "@localhost/nfl_" + year
engine = create_engine(nfl)

Base = declarative_base(engine)
metadata = Base.metadata

class Game(Base):
    __tablename__ = 'games'
    id = Column(Integer, primary_key=True)
    week = Column(Integer, nullable=False)
    visitor = Column(String(80))
    visit_score = Column(Integer, nullable=False)
    home = Column(String(80))
    home_score = Column(Integer, nullable=False)

Session = sessionmaker(bind=engine)
session = Session()
res = session.query(Game).order_by(Game.week).order_by(Game.home)

tptr = {}
for g in res:
#    print("{0:d} {1:s} {2:d} {3:s} {4:d}".format( g.week, g.home, g.home_score, g.visitor, g.visit_score ))
    if g.home not in tptr:
        tptr[g.home] = {}
        tptr[g.home]['games_played'] = 1
        tptr[g.home]['point_spread'] = g.home_score - g.visit_score
        tptr[g.home]['played'] = [ g.visitor ]
        tptr[g.visitor] = {}
        tptr[g.visitor]['games_played'] = 1
        tptr[g.visitor]['point_spread'] = g.visit_score - g.home_score
        tptr[g.visitor]['played'] = [ g.home ]

    else:
        tptr[g.home]['games_played'] += 1
        tptr[g.home]['point_spread'] += (g.home_score - g.visit_score)
        tptr[g.home]['played'] += [ g.visitor ]
        tptr[g.visitor]['games_played'] += 1
        tptr[g.visitor]['point_spread'] += ( g.visit_score - g.home_score )
        tptr[g.visitor]['played'] += [ g.home ]

simple_ranking( tptr )
for k in tptr:
    print("{0:10s} {1:6.2f} {2:6.2f} {3:6.2f}".format( k, tptr[k]['srs'],tptr[k]['mov'], tptr[k]['sos']))

The output is limited to two digits past the decimal, and to that precision my results are the same as my Perl code’s. The routines should look much the same. The only real issue is that you have to convert one of the numbers to float when you calculate margin of victory, as the two inputs are integers. Python isn’t as promiscuous about type conversion as Perl is.

Last note: although we import pprint, at this point we’re not using it. That’s because, with the kind of old-fashioned debugging skills I have, I use pprint the way a Perl programmer might use Data::Dumper, to look at data structures while developing a program.
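For example, a hypothetical debugging line (not part of the program above) would be:

from pprint import pprint

# Dump the whole team table to inspect the mov/srs/sos values mid-development.
pprint(tptr)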

Update: the original Doug Drinen post about the Simple Ranking System has a new URL. You can now find it here.

Perhaps the most important new thing I note is that Pro Football Reference now has play-by-play data, and ways to display those data in CSV format. Creating parsers for the data would be work, but it means that advanced stats are now accessible to the average fan.

In Ubuntu 16.04, PDL::Stats is now a standard Ubuntu package and so the standard PDL installation can be used with my scripts. About the only thing you need to use CPAN for, at this point, is installing Sport::Analytics::SimpleRanking.

At work I use a lot of Python these days. I have not had time to rethink all this in Pythonese. But I’m curious, as the curve fitting tools in Python are better than, or at least different from, those in Perl.

Football diagrams: Although the Perl module Graphics::Magick isn’t a part of CPAN, graphicsmagick and libgraphics-magick-perl are part of the Ubuntu repositories.

It was yesterday that Nathan Oyler asked me on Twitter if I could rewrite my Perl code to calculate offensive SRS and defensive SRS. Nathan, I believe, is working on a game or a simulation and wanted to be able to calculate these values. I replied, “Do you know how to calculate these?” and, after playing around a little, I can only conclude that the best way to handle this calculation is going to be a matter of debate.

That said, I have a way to calculate these numbers, but first we need a little theory. It starts with Chase Stuart’s comment on the Smart Football blog that these values are related to points for and points against. Given that, and the definition of margin of victory:

MOV(team) = ( points_for(team) – points_against(team) ) / games_played(team) = point_spread(team) / games_played(team)

We now need to define an average score. This works:

AVG_SCORE = points_for(all teams)/ games_played(all teams)

From these definitions and the hint Chase dropped, we define offensive MOV and defensive MOV this way.

OMOV(team) = ( points_for(team) – games_played(team)*AVG_SCORE ) / games_played(team)

DMOV(team) = ( games_played(team)*AVG_SCORE – points_against(team) ) / games_played(team)

So rather than plugging MOV into an SRS linear equation solver, you can plug in offensive MOV and defensive MOV, and you can get numbers that will help you calculate an OSRS and a DSRS.
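As a sketch of these definitions in Python (the dictionary mirrors the tptr structure from the earlier SRS code, with hypothetical points_for and points_against fields assumed to have been filled in during data loading):

def add_offensive_defensive_mov(tptr):
    # League-average score per team per game, over all teams.
    total_points = sum(t['points_for'] for t in tptr.values())
    total_games = sum(t['games_played'] for t in tptr.values())
    avg_score = total_points / float(total_games)

    for t in tptr.values():
        gp = float(t['games_played'])
        # OMOV: points scored relative to an average offense, per game.
        t['omov'] = (t['points_for'] - gp * avg_score) / gp
        # DMOV: points allowed relative to an average defense, per game.
        t['dmov'] = (gp * avg_score - t['points_against']) / gp

Note that OMOV and DMOV sum to MOV by construction.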

I say numbers that will help you because there’s a gotcha, in that whenever OSOS and DSOS are of opposite sign, there is no unique solution to the equation

SOS = OSOS + DSOS

as I can choose any constant c and the result

SOS = (OSOS + c) + (DSOS – c)

is also a solution. This kind of linear wandering around, with the solver adding arbitrary constants to OSOS and DSOS, happens when you attempt to solve these equations. The issue is that there is no one obvious solution to this problem, unlike regular SRS, where the constraint “sum of all SRS values must equal zero” applies. If someone uncovers such a constraint, let me know and I’ll be happy to code it. In the absence of one so far, I’ve used this folk rule:

Reduce the magnitude of the OSOS and DSOS terms until the smaller of the two, in terms of absolute magnitude, is zero.
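A minimal sketch of that rule (the osos and dsos arguments are hypothetical per-team values coming out of the solver, not the actual interface of my Perl module):

def apply_folk_rule(osos, dsos):
    # Same-signed values are left alone; the rule only applies to opposite signs.
    if osos * dsos >= 0:
        return osos, dsos
    # Shrink both magnitudes by the smaller magnitude, which drives the
    # smaller component to zero and leaves the sum osos + dsos unchanged.
    shift = min(abs(osos), abs(dsos))
    osos -= shift if osos > 0 else -shift
    dsos -= shift if dsos > 0 else -shift
    return osos, dsos

With the 2007 Patriots values quoted below, apply_folk_rule(0.8, -0.4) returns (0.4, 0.0).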

This is straightforward to code, as the sketch above suggests. That my solution is not the same as the one Pro Football Reference uses is easy enough to show. If I go to this page, I get these values for the 2007 New England Patriots, and if I calculate OMOV and DMOV using my code, we can extract the DSOS and OSOS values for this calculation.

2007 New England Patriots
 SRS   OSRS   DSRS   OMOV   DMOV   OSOS   DSOS
20.1   15.9    4.2   15.1    4.6    0.8   -0.4


and while my code uses 0.4 and 0 for OSOS and DSOS respectively, the evident values that Pro Football Reference uses are 0.8 and -0.4. All that clear now?

I’m pretty sure my SOS calculation isn’t the same as PFR’s either, as I’ve seen differences in OSRS/DSRS that amount to a point or two. In some cases this happens when my calculation yields same-signed OSOS and DSOS values, and in that case I don’t modify them at all.

The source code I’ve used to do these calculations is given here, as a Perl module. A snapshot of the code fragment I use to feed the Perl module is:

calc_osrs_and_dsrs

Typical output, for the 2007 season, is:

OSRS-DSRS-2007-First-Cut

And yes, there are plenty of unknowns at this point. PFR has never really given any details of their OSOS/DSOS calculations, or the normalization routines they use. OSRS and DSRS as they implement them are a “black box”. This implementation may not, in the long run, be the best one, but it is reasonably well documented.

Update: corrected DMOV definition. Rewritten slightly for clarity.

The recent success of DeMarco Murray has energized the Dallas fan base. Felix Jones is being spoken of as if he’s some kind of leftover (I know, a 5.1 YPC over a career is such a drag), and people are taking Murray’s 6.7 YPA for granted. That wasn’t the thing that got to me in the fan circles, though. It’s that Julius Jones was becoming a whipping boy again, the source of every running back sin there is, and so I wanted to build some tools to help analyze Julius’s career and, at the same time, look at Marion Barber III’s numbers, since these two are historically linked.

We’ll start with this database, and a bit of SQL, something to let us find running plays. The SQL is:

select down, togo, description from nfl_pbp where season = 2007 and gameid LIKE "%DAL%" and description like "%J.Jones%" and not description LIKE '%pass%' and not description LIKE '%PENALTY on DAL%' and not description like '%kick%' and not description LIKE '%sacked%'

It’s not perfect. I’m not picking up plays where a QB is sacked and the RB recovers the ball. A better bit of SQL might help, but it’s a place to start. We bury this SQL in a program that then parses the description string for the phrase “for X yards”, or alternatively “for no gain”, and adds them all up. From this we could calculate yards per carry, but more importantly, we’ll calculate run success and we’ll also calculate something I’m going to call a failure rate.

For our purposes, a failure rate is the number of plays that gained 2 yards or less, divided by the total number of running attempts, multiplied by 100. The purpose of the failure rate is to investigate whether Julius, in 2007, became the master of the 1 and 2 yard run. One common fan conception of his style of play in his last year in Dallas is that “he had plenty of long runs but had so many 1 and 2 yard runs as to be useless.” I wish to investigate that.
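A rough Python sketch of that counting, assuming the rows from the query above have been reduced to bare description strings (the regular expression and the parsing are mine, not the original Perl):

import re

def failure_rate(descriptions):
    # Percentage of carries gaining 2 yards or less.
    attempts = 0
    failures = 0
    for desc in descriptions:
        m = re.search(r'for (-?\d+) yards?', desc)
        if m:
            gain = int(m.group(1))
        elif 'for no gain' in desc:
            gain = 0
        else:
            continue  # not a play we can score
        attempts += 1
        if gain <= 2:
            failures += 1
    return 100.0 * failures / attempts if attempts else 0.0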


Brian Burke has made play-by-play data from 2002 to 2010 available here as .CSV files. The files are actually pretty small, about 5 megs for a year’s worth of data. CSV is a convenient format, and the data themselves are well enough organized that an Excel or OpenOffice junkie can use them, and so can those of us who work with SQL databases. The advantage of a SQL database is the query language you inherit. And what we’re going to show is how to embed Brian’s data into a small, simple SQLite database (see here for D. Richard Hipp’s site, and here for the Wikipedia article).

SQLite is a tiny SQL engine, about 250 kilobytes in size. That’s right, 250 kilobytes. It’s intended to be embedded in applications, so it doesn’t have the overhead of a network service the way MySQL and Postgres do. It is extensively used in things like browsers (Firefox), mail clients, and internet metrics applications (Unica’s NetTracker). The code is in the public domain, and there are commercial versions of the product you can buy, if you’re into that kind of thing. Oracle, among others, sells a commercial derivative.

A SQLite database is a single file, so once you create it, you can move the file onto a USB stick and carry it around with you (or keep it on your Android phone). The database that results is about 55 megabytes, not much different from the size of the cumulative .CSVs themselves.

Brian’s data lack a primary key, which is fine for spreadsheets but creates issues in managing walks through sequential data in a database. So we’ll create a schema file (we’ll call it schema.sql) whose CREATE TABLE statement defines an nfl_pbp table with an integer primary key added alongside the columns from Brian’s files.

Use a text editor to create it. With the sqlite3 binary, create a database by saying:


sqlite3 pbp.db
sqlite>.read schema.sql
sqlite>.tables
nfl_pbp
sqlite>.exit

Once that’s all done, we’ll use Perl and the DBI module to load these data into our SQLite table. Loading is fast so long as you handle the transaction as a single unit, with the $dbh->begin_work and $dbh->commit statements.
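The Perl loader itself isn’t reproduced here, but a minimal Python equivalent would look something like the sketch below. The column list is trimmed to the fields used in the queries later in this post, the CSV header names are assumptions, and the real files carry more columns than this:

import csv
import sqlite3

# Assumed subset of columns; Brian's .CSV files have more than these.
columns = ['gameid', 'season', 'down', 'togo', 'ydline', 'description']

insert = "insert into nfl_pbp ({0}) values ({1})".format(
    ", ".join(columns), ", ".join("?" for c in columns))

conn = sqlite3.connect('pbp.db')
with open('2010_nfl_pbp_data.csv') as f:   # hypothetical file name
    reader = csv.DictReader(f)
    rows = ([row[c] for c in columns] for row in reader)
    # One big transaction, the same trick as $dbh->begin_work / $dbh->commit.
    with conn:
        conn.executemany(insert, rows)
conn.close()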

Once loaded, you can begin using the data almost immediately:

sqlite> select count(*) from nfl_pbp;
384809
sqlite> select distinct season from nfl_pbp;
2002
2003
2004
2005
2006
2007
2008
2009
2010
sqlite> select count( distinct gameid ) from nfl_pbp;
2381

As far as the data themselves go, I’ll warn you that the ydline field is a little “lazy”, in that if you score a touchdown from the 20, the extra point play and the ensuing kick also “occur” on the 20. So you end up with interesting SQL statements like this when you search the data:


sqlite> select count(*) from nfl_pbp where ydline = 1 and not description like "%extra point%" and not description like "%two-point%" and not description like "%kicks %";
3370
sqlite> select count(*) from nfl_pbp where ydline = 1 and description like "%touchdown%" and not description like "%extra point%" and not description like "%two-point%" and not description like "%kicks %";
1690

Using the DBI module, or whatever database interface your language supports, you can start crunching these data toward game outcome probabilities in no time.

“The value of a touchdown” is a phrase that comes up in formulas like this one:

PASSER RATING = (yards + 10*TDs – 45*Ints)/attempts

where the first thing that comes to mind is that the TD is worth 10 yards and the interception is worth 45 yards. But is it? A TD, after all, is worth about 7 points, and in The Hidden Game of Football formulation a turnover is worth 4 points. Therefore a TD is worth considerably more than a turnover, yet the formula values the TD less. How is that?

Well, let me reassure you that in the new passer rating of The Hidden Game of Football, the value of a touchdown is a constant, equal to 6.8 points or 85 yards. The 4-point interception is usually valued at 45 yards instead of 50 because most interceptions don’t make it back to the line of scrimmage.

The field itself is zero valued at the 25 yard line. That means that once you reach the one yard line, you have one yard of field left, and the TD is worth an additional 10 yards of value beyond it. That’s where the 10 comes from. It’s not the value of the touchdown, but the additional value of the touchdown not measured on the field itself.
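In numbers, using the Figure 2 parameters (0.08 points per yard, zero value at the 25, a 6.8-point touchdown), the gap works out like this:

slope = 0.08          # points per yard
td_value = 6.8        # points
field_value_at_goal = slope * (100 - 25)   # 6.0 points of field value
barrier = td_value - field_value_at_goal   # 0.8 points
barrier_yards = barrier / slope            # 10 yards, the bonus in the formula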

But what does this additional term actually mean?

Figure 1. The basic linear scoring model of THGF. TD = 6, linear slope = 0.08 points/yard. The probability of a score goes to 1.0 as the goal line is approached.

Figure 2. The model of THGF's new passer rating. The difference between y value at 100 yards and TD equals 0.8 points or 10 yards. Maximum probability of a score approaches 75/85.

If you check out the figures above, Figure 1 is introduced in The Hidden Game  of Football on page 102, and features in just about all the descriptions of worth up until page 186, where we run into this text. The authors appear to be carving out a new formula from the refactored NFL formula they introduce in their book.

Awarding an 80-yard bonus for a touchdown pass makes no sense either. It’s like treating every TD pass as though it were an 80-yard bomb. Yet the majority of touchdown passes are from inside the 25 yard line.

It’s not the bonus we’re objecting to – after all, the whole point of throwing a pass is to get the ball into the end zone – but the size of the bonus is way out of kilter. We advocate a 10-yard bonus for each touchdown pass. It’s still higher than the yardage on a lot of TD passes, but it allows for the fact that yardage is a lot harder to get once a team gets inside the opponent’s 25.

and without quite saying so, the authors introduce the model in Figure 2. Note that the value of the touchdown and the value of the yardage merge in Figure 1, but remain apart in Figure 2. This difference, which I’ve previously called a barrier potential, is the product of a chance to score that is less than 1.0 as you reach the goal line. If your chances max out at merely 80%, you’ll end up with a model with a barrier potential.

If I have an objection to the quoted argument, it’s that it encourages the whole notion of double counting the touchdown “yardage”. The appropriate way to figure out the slope of any linear scoring model is by counting all scoring at a particular yard line, or within a particular part of the field (red zone scoring, for example, which could  be normalized to the 10 yard line). These are scoring models, after all, not touchdown models.

Where did 6.8 come from, instead of 7?

Whereas before I was thinking it was 6 points for the TD and 0.8 points for the extra point, I’m now thinking it came from the same notions that drove the score value of 6.4 for Romer and 6.3 for Burke: it’s 7 points less the value of the runback. I’ve used 6.4 points to derive scoring models for PFR’s aya and the NFL passer rating, but in retrospect those aren’t appropriate uses. These linear models tend to cross zero in value around the 25 yard line, whereas the Romer model has much higher initial slopes and reaches positive values faster.

This value can be calculated, but the formula that results can’t be solved directly. It can be solved iteratively, though, with a pretty short piece of code.

Figure 3. Perl code to solve for slope, effective TD value and y value at 100 yards in linear scoring models.

Figure 4. Solving for barriers of 10 and 20 yards.
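The Perl in Figure 3 survives only as an image, so the sketch below is a Python guess at the same kind of fixed-point loop. It assumes the line is zero at the 25, the barrier is given in yards, and the effective touchdown value is 7 points less the field value of the spot where the opponent takes over after the kickoff (taken here as their own 27); none of those specifics are guaranteed to match the original code.

def solve_linear_model(td_points=7.0, barrier_yards=10.0, kickoff_spot=27.0,
                       tol=1e-9, max_iter=1000):
    # Fixed-point solve for slope, effective TD value, and y value at 100 yards.
    slope = 0.08  # starting guess, points per yard
    for _ in range(max_iter):
        # Effective TD value: 7 points less the field value of the kickoff spot.
        effective_td = td_points - slope * (kickoff_spot - 25.0)
        # The line runs from 0 at the 25 to (effective_td - barrier) at 100.
        new_slope = effective_td / (75.0 + barrier_yards)
        if abs(new_slope - slope) < tol:
            slope = new_slope
            break
        slope = new_slope
    effective_td = td_points - slope * (kickoff_spot - 25.0)
    y_at_100 = slope * 75.0
    return slope, effective_td, y_at_100

With a 10-yard barrier this settles near a 6.84-point effective touchdown and a slope just over 0.08, which is where the “close enough to 6.8” below comes from.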

And the solution is close enough to 6.8 that it’s easy enough to ignore the difference. Plugging in 7 points for the touchdown, and 20 and 29.1 yards respectively for the barrier potential, yields almost no change in the touchdown value for the PFR aya model and the NFL passer rating formula, and we end up with these scoring model plots.

Figure 5. PFR aya amended model. TD = 7 points, slope = 0.075 points/yard, y at 100 = 5.5 points.

Figure 6. Amended NFL prf scoring model. TD = 7.05 points, slope = 0.07 points/yard, y at 100 = 5.0 points.

I’ve just started reading this book

and if only for the introduction, people need to take a look at it. This quote is pretty important to folks who want to understand how football analytics actually works, as opposed to what people tell you.

The other trick in finding ideas is figuring out the difference between power and knowledge. Of all the people whom you’ll meet in this volume, very few of them are powerful or even famous. When I said I’m most interested in minor geniuses, that’s what I mean. You don’t start at the top if you want the story. You start in the middle, because the people in the middle who do the actual work in the world… People at the top are self-conscious about what they say (and rightfully so) because they have position and privilege to protect – and self-consciousness is the enemy of “interestingness”.

The more I read smaller blogs, the more I understand and the better I understand what I’m doing. To note, the Hidden Game of Football is also a worthwhile read, as those guys put a lot of effort into their work, into making it understandable, and a deeper read usually pays off in deeper understanding of concepts.

In Gladwell’s  book, there is a discussion of Nassim Taleb, currently a darling because of his contrarian views about randomness and its place in economics. But more immediately useful as a metaphor is Malcolm’s discussion of ketchup. He makes a strong case that the old ketchup formula endures because it’s hard to improve on.  It has just about  the right amounts of everything in the flavor spectrum to make it work for most people. I’m thinking the old NFL passer rating formula is much like that, though the form of  the equation is a little difficult for most people to absorb. I’ll be touching on ways to look at the passer rating in a much simplified form shortly.

Another story is in order here, the story of the sulfa drugs. To begin, recall that the late 19th century spawned a revolution in organic chemistry, which first manifested in new, colorful dyes. And not just clothing dyes, but also the art of tissue staining. The master of tissue staining back in the day was one Paul Ehrlich, who from his understanding of staining specific tissues, came up with  the notion of the “magic bullet”. In other words, find a stain that binds specifically to pathogens, attach a poison to the stain, and thereby selectively kill bacteria and other pathogens. His drug Salvarsan was the first modern antibacterial and his work set the stage for more sophisticated drugs.

Bayer found the first of the new drugs, Prontosil, by examining coal-tar dyes. However, it only worked in live animals. A French team later found that in the body the drug was cleaved into two parts: a medically inactive dye, and a medically active, colorless compound that later became known as sulfanilamide. The dye portion of the magic bullet was unnecessary. Color wasn’t needed to make the drug “stick”.

When dealing with formulas, you need to figure out ways to cut  the dye out of the equation, reduce formulas to their essence. Mark Bittman does that with recipes, and his Minimalist column in the Times is a delight to read. And  in football, needless complication just gets in the way. Figure it out, and then ruthlessly simplify it. And I suspect that’s the best path to  understanding why certain old formulas still have functional relevance in modern times.

Update: added link to new article. Fixed mixing of the phrases “silver bullet” and “magic bullet”.
