Modeling


Summary: Replacing Wentz with Foles removes about 6.5 points of offense from the Philadelphia Eagles, turning a high-flying offense into something very average.

Last night the Atlanta Falcons defeated the LA Rams, so now we're faced with the prospect of the Falcons playing the Eagles. I have an idiosyncratic playoff model, one I treat as a hobby, based on three static factors: home field advantage, strength of schedule, and previous playoff experience. It values the Eagles at 0.444 and the Falcons at 1.322, so the difference is -0.878 (a win probability expressed in logits). The inverse logit of -0.878 is 0.294, which is the probability of the Eagles winning, and the estimated point spread is a 6.5 point advantage for the Falcons.
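For those who want to check the arithmetic, here is a minimal sketch of the conversion in Python. The scale of roughly 0.135 logits per point is inferred from the quoted 6.5 point spread, not a published constant.

import math

def inv_logit(x):
    """Convert a logit difference into a win probability."""
    return 1.0 / (1.0 + math.exp(-x))

diff = 0.444 - 1.322        # Eagles minus Falcons, in logits
print(inv_logit(diff))      # ~0.294, the Eagles' chance of winning

# the quoted 6.5 point line implies roughly 0.135 logits per point
print(diff / 0.135)         # ~ -6.5, i.e. Falcons by 6.5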

Another question a Falcons or Eagles fan might ask is how much Carson Wentz is worth as a QB, in points scored. We can use Pro Football Reference's adjusted yards per attempt (AYA) stat to estimate this, and also to estimate how much better Wentz is than Nick Foles. We have done these kinds of analyses before, for Matt Ryan and Peyton Manning.

Pro Football Reference says that Carson Wentz has an AYA of 8.3 yards per attempt, while Nick Foles has an AYA of 5.4. Now let's calculate the overall AYA for every pass thrown in the NFL, with stats from Pro Football Reference:

(114870 yards + 20*741 TDs - 45*430 INTs) / 17488 attempts
= (114870 yards + 14820 TD "yards" - 19350 INT "yards") / 17488 attempts
= 110340 net yards / 17488 attempts
= 6.31 yards per attempt, to three significant digits

So, about 6.3 yards per attempt. Carson Wentz is about 2 yards per attempt better than average; Nick Foles is 0.9 yards per attempt worse. The magic number is 2.25, which converts yards per attempt into points scored per thirty passes, roughly a game's worth of throws. So Wentz, compared to Foles, is worth 2.9 * 2.25 = 6.5 points per game, and 2.0 * 2.25 = 4.5 points more than the average NFL quarterback.
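The whole chain fits in a few lines of Python, using the totals quoted above:

# league-wide adjusted yards per attempt, from the PFR totals
league_aya = (114870 + 20 * 741 - 45 * 430) / 17488
print(round(league_aya, 2))                # 6.31

PTS_PER_YD = 2.25                          # points per thirty passes per yard of AYA
wentz, foles = 8.3, 5.4
print((wentz - foles) * PTS_PER_YD)        # ~6.5 points a game over Foles
print((wentz - league_aya) * PTS_PER_YD)   # ~4.5 points over the league average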

This doesn't completely encompass Carson Wentz's value. According to ESPN's QBR stat, he accounts for 10 expected points on the ground over 13 games, so he nets about 0.8 points a game rushing as well.

Now, back to some traditional stats. The offensive SRS that PFR assigns to Philadelphia is 7.0, with a defensive SRS of 2.5. If Carson Wentz is worth between 6.5 and 7.3 points per game, losing him reduces Philadelphia's offensive SRS to somewhere between +0.5 and -0.3. That high-flying offense is almost completely transformed by the loss of its quarterback into an average one.

Note: logits are to probabilities as logarithms are to multiplication. Just as logarithms let you add instead of multiply, logits let you add the factors affecting a win probability and convert the sum back to a probability at the end.


One of the ESPN folks posted FPI odds today, retweeted by Ben Alamar. The numbers are very different from those my playoff formula produces. The nature of those odds made me suspect that FPI is intrinsically an offensive stat, with the advantages and disadvantages of such a stat.

One of the issues I've had with offensive stats is that every one I've looked at, in terms of predicting playoff performance, fits only to confidence intervals on the order of 85%. Whatever the flaws of my formulas, they fit to confidence intervals of 95%. The effects they touch on are real.

But still, the purpose of this post is to compare FPI odds to the odds generated by some common offensive stats: Pythagorean expectation, as generated by my Perl code; SRS, also from my Perl code; and median point spread, likewise calculated by my code. A sketch of the Pythagorean piece follows.
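For concreteness, here is a minimal sketch of the Pythagorean piece in Python. The log5 step for turning two expectations into head-to-head odds is an assumption for illustration, not necessarily what my Perl code does, and in practice the exponent is best-fit rather than fixed at 2.37.

def pythag_wpct(pf, pa, x=2.37):
    """Pythagorean expectation: points for/against -> expected winning percentage."""
    return pf ** x / (pf ** x + pa ** x)

def log5(p, q):
    """Odds that a team of strength p beats a team of strength q."""
    return p * (1 - q) / (p * (1 - q) + q * (1 - p))

# hypothetical season totals: team A 450 for / 350 against, team B 380 / 400
print(log5(pythag_wpct(450, 350), pythag_wpct(380, 400)))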

Results are below.

FPI Odds versus Other Offensive Stats

| Game | FPI | Pythag | Simple Ranking | Median Pt Spread |
|------|-----|--------|----------------|------------------|
| Kansas City – Tennessee | 0.75 | 0.75 | 0.79 | 0.73 |
| Jacksonville – Buffalo | 0.82 | 0.89 | 0.86 | 0.73 |
| Los Angeles – Atlanta | 0.62 | 0.75 | 0.74 | 0.68 |
| New Orleans – Carolina | 0.70 | 0.73 | 0.74 | 0.78 |

The numbers correlate too well for FPI not to have a large offensive component in its character. In fact, Pythagorean odds correlate so well with FPI that I'm hard-pressed to know what advantage FPI gives the generic fan.

Note: the SRS link above points out that PFR has added a home field advantage component to its SRS calculations. I'll note that our SRS was calibrated against PFR's pre-2015 formula.


This question came up when I was looking up the last playoff appearance of seven probable NFC playoff teams. Both New Orleans and Philadelphia last played in the playoffs four years ago, in 2013. And then the thought popped into my head: "But Drew Brees is a veteran QB." This seems intuitive, but actually constructing such a definition, and then testing it with a logistic regression, is where the rub lies.

There are any number of QBs a fan can point to and see that the QB mattered. Roger Staubach seemed a veteran in this context back in the 1970s, Joe Montana in the 1980s, Ben Roethlisberger in the 21st century, Eli Manning in 2011, and Aaron Rodgers last year. But plenty of questions abound. If a veteran QB is an independent variable whose presence or absence changes the odds of winning a playoff game, what tools do we use to define such a person? What tools would we use to eliminate entanglement, in this case between the team's overall offensive strength and the QB himself?

The difference between a good metric and a bad metric can be seen when looking at the effect of the running game on winning. The correlation between rushing yards per carry and winning is pretty small. The correlation between run success rate and winning is larger. In short, being able to reliably convert on 3rd and 1 contributes more to success than gaining 5 yards a carry as opposed to 4.

At this point I'm just discussing the idea. With a definition in mind, we can run one-variable logistic regression tests, as sketched below. Then, with a big enough data set (15 years of playoff data should be enough), we can start testing three-variable logistic models (QB + SOS + PPX).
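As a sketch of what the eventual test could look like, assuming a table with one row per playoff game and hypothetical columns win (0/1), veteran_qb (0/1), sos, and ppx:

import pandas as pd
import statsmodels.api as sm

# hypothetical file and column names, one row per playoff game
games = pd.read_csv("playoff_games.csv")

# one independent variable: the veteran QB flag by itself
X1 = sm.add_constant(games[["veteran_qb"]])
print(sm.Logit(games["win"], X1).fit().summary())

# three independent variables: QB + strength of schedule + playoff experience
X3 = sm.add_constant(games[["veteran_qb", "sos", "ppx"]])
print(sm.Logit(games["win"], X3).fit().summary())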

I've been curious, since taking on a new job and a new primary language at work, about how far I could go in adding Python to the set of tools I use for football analytics. For one, the scientific area where the analyst needs the most help from experts is optimization theory and algorithms, and at this point in time the developments in Python are more extensive than those in Perl.

To start, you have the scipy and numpy packages, with scipy.optimize offering diverse tools for minimization and least squares fitting. Logistic regressions in Python are discussed here, and lmfit provides some enhancements to the fitting routines in scipy. But first we need to be able to read and write existing data, and from there write the SRS routines. The initial routines were based on my initial SRS Perl code, so don't be surprised if code components look very familiar.

This code uses an ORM layer, SQLAlchemy, to get to my existing databases. To create the class used to fetch the data, we used a Python executable named sqlacodegen, which we set up in a virtual environment and tried out.
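The invocation was something along these lines (the database URL matches the one in the test program below):

pip install sqlacodegen
sqlacodegen mysql+pymysql://user:pass@localhost/nfl_2001 > nfl_models.py

The output was: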

# coding: utf-8
from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
metadata = Base.metadata

class Game(Base):
    __tablename__ = 'games'

    id = Column(Integer, primary_key=True)
    week = Column(Integer, nullable=False)
    visitor = Column(String(80))
    visit_score = Column(Integer, nullable=False)
    home = Column(String(80))
    home_score = Column(Integer, nullable=False)

Which, with slight mods, can be used to read my data. The whole test program is here:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from pprint import pprint

def srs_correction(tptr = {}, num_teams = 32):
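    # subtract the league-average SRS from every team's SRS and SOS so ratings center on zero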
    sum = 0.0
    for k in tptr:
        sum += tptr[k]['srs']
    sum = sum/num_teams
    for k in tptr:
        tptr[k]['srs'] -= sum
        tptr[k]['sos'] -= sum 

def simple_ranking(tptr = {}, correct = True, debug = False):
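    # iterate SRS = margin of victory + average opponent SRS until the SOS values stop changing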
    for k in tptr:
        tptr[k]['mov'] = tptr[k]['point_spread']/float(tptr[k]['games_played'])
        tptr[k]['srs'] = tptr[k]['mov']
        tptr[k]['oldsrs'] = tptr[k]['srs']
        tptr[k]['sos'] = 0.0
    delta = 10.0
    iters = 0
    while ( delta > 0.001 ):
        iters += 1
        if iters > 10000:
            return True
        delta = 0.0
        for k in tptr:
            sos = 0.0
            for g in tptr[k]['played']:
                sos += tptr[g]['srs']
            sos = sos/tptr[k]['games_played']
            tptr[k]['srs'] = tptr[k]['mov'] + sos
            newdelta = abs( sos - tptr[k]['sos'] )
            tptr[k]['sos'] = sos
            delta = max( delta, newdelta )
        for k in tptr:
            tptr[k]['oldsrs'] = tptr[k]['srs']
    if correct:
        srs_correction( tptr )
    if debug:
        print("iters = {0:d}".format(iters)) 
    return True     


year = "2001"
userpass = "user:pass"

nfl = "mysql+pymysql://" + userpass + "@localhost/nfl_" + year
engine = create_engine(nfl)

Base = declarative_base(engine)
metadata = Base.metadata


class Game(Base):
    __tablename__ = 'games'
    id = Column(Integer, primary_key=True)
    week = Column(Integer, nullable=False)
    visitor = Column(String(80))
    visit_score = Column(Integer, nullable=False)
    home = Column(String(80))
    home_score = Column(Integer, nullable=False)

Session = sessionmaker(bind=engine)
session = Session()
res = session.query(Game).order_by(Game.week).order_by(Game.home)

tptr = {}
for g in res:
#    print("{0:d} {1:s} {2:d} {3:s} {4:d}".format( g.week, g.home, g.home_score, g.visitor, g.visit_score ))
    # initialize each team on first sight; with byes, a team's first game can
    # come after its opponent has already been seen, so check the two separately
    for team in (g.home, g.visitor):
        if team not in tptr:
            tptr[team] = { 'games_played': 0, 'point_spread': 0, 'played': [] }
    tptr[g.home]['games_played'] += 1
    tptr[g.home]['point_spread'] += (g.home_score - g.visit_score)
    tptr[g.home]['played'].append(g.visitor)
    tptr[g.visitor]['games_played'] += 1
    tptr[g.visitor]['point_spread'] += (g.visit_score - g.home_score)
    tptr[g.visitor]['played'].append(g.home)

simple_ranking( tptr )
for k in tptr:
    print("{0:10s} {1:6.2f} {2:6.2f} {3:6.2f}".format( k, tptr[k]['srs'],tptr[k]['mov'], tptr[k]['sos']))

The output was limited to two digits past the decimal, and to that precision, my results are the same as my Perl code's. The routines should look much the same. The only real issue is that you have to float one of the numbers when you calculate margin of victory, as the two inputs are integers. Python isn't as promiscuous about type conversion as Perl is.

Last note: although we imported pprint, at this point we're not using it. That's because, with the kind of old-fashioned debugging skills I have, I use pprint the way a Perl programmer might use Data::Dumper, to look at data structures while developing a program.
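For instance, a single call after the accumulation loop dumps the whole nested team dictionary legibly:

pprint(tptr)   # Data::Dumper-style dump of the accumulated team records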

Update: the original Doug Drinen post about the Simple Ranking System has a new URL. You can now find it here.

Odds for the 2015 NFL playoff final, presented from the AFC team’s point of view:

SuperBowl Playoff Odds

| Prediction Method | AFC Team | NFC Team | Score Diff | Win Prob | Est. Point Spread |
|---|---|---|---|---|---|
| C&F Playoff Model | Denver Broncos | Carolina Panthers | 2.097 | 0.891 | 15.5 |
| Pythagorean Expectations | Denver Broncos | Carolina Panthers | -0.173 | 0.295 | -6.4 |
| Simple Ranking | Denver Broncos | Carolina Panthers | -2.3 | 0.423 | -2.3 |
| Median Point Spread | Denver Broncos | Carolina Panthers | -5.0 | 0.337 | -5.0 |

Last week the system went 1-1, for a total record of 6-4. The system favors Denver more than any other team, and does not like Carolina at all. Understand, when a team makes it to the Super Bowl easily, and a predictive system gave them about a 3% chance to get there in the first place, it’s reasonable to assume that in that instance, the system really isn’t working.

So we're going to modify our table a little bit and give some other predictions and predictive methods. The first is the good old Pythagorean formula. We best-fit the Pythagorean exponent to the data for the year, so there is good reason to believe it is more accurate than the old 2.37; it favors Carolina by a little more than six points. SRS directly gives a point spread, which can be back-calculated into a 57.7% chance of Carolina winning. Likewise, using median point spreads to predict the Denver-Carolina game gives Carolina a 66.3% chance of winning. The exponent fit is sketched below.
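For the curious, here is a minimal sketch of fitting that exponent with scipy; it's an illustration, not my actual Perl fit, and the per-team arrays are hypothetical inputs.

import numpy as np
from scipy.optimize import minimize_scalar

def fit_pythag_exponent(pf, pa, wins, games):
    # least-squares fit of the Pythagorean exponent to actual winning percentage
    pf, pa = np.asarray(pf, float), np.asarray(pa, float)
    actual = np.asarray(wins, float) / np.asarray(games, float)

    def sse(x):
        expected = pf ** x / (pf ** x + pa ** x)
        return np.sum((expected - actual) ** 2)

    return minimize_scalar(sse, bounds=(1.0, 6.0), method="bounded").x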

Note that none of these systems predicted the outcome of the Carolina – Arizona game. Arizona played a tougher schedule and was more of a regular season statistical powerhouse than Carolina. Arizona, however, began to lose poise as it worked its way through the playoffs. And it lost a lot of poise in the NFC championship game.

Odds for the third week of the 2015 playoffs, presented from the home team’s point of view:

Conference Championship Playoff Odds

| Home Team | Visiting Team | Score Diff | Win Prob | Est. Point Spread |
|---|---|---|---|---|
| Carolina Panthers | Arizona Cardinals | -1.40 | 0.198 | -10.4 |
| Denver Broncos | New England Patriots | 1.972 | 0.879 | 14.6 |

Last week the system went 2-2, for a total record of 5-3. The system favors Arizona markedly, and Denver by an even larger margin. That said, the teams my system does not like have already won one game. There have been years when a team my system didn’t like much won anyway. That was the case in 2009, when my system favored the Colts over the Saints. The system isn’t perfect, and the system is static. It does not take into account critical injuries, morale, better coaching, etc.

Odds for the second week of the 2015 playoffs, presented from the home team’s point of view:

Second Round Playoff Odds

| Home Team | Visiting Team | Score Diff | Win Prob | Est. Point Spread |
|---|---|---|---|---|
| Carolina Panthers | Seattle Seahawks | -1.713 | 0.153 | -12.7 |
| Arizona Cardinals | Green Bay Packers | -0.001 | 0.500 | 0.0 |
| Denver Broncos | Pittsburgh Steelers | 0.437 | 0.608 | 3.2 |
| New England Patriots | Kansas City Chiefs | -0.563 | 0.363 | -4.2 |

Last week the system went 3-1, and perhaps would have gone 4-0 if, after the Burfict interception, Cincinnati had just killed three plays and kicked a field goal.

The system currently gives Seattle a massive advantage in the playoffs. It says that Green Bay/Arizona is effectively an even matchup, and that both AFC games are pretty close. It favors Denver in their matchup, and the Chiefs in theirs.

One last comment about last week's games. The Cincinnati-Pittsburgh game was the most depressing playoff game I've seen in a long time, both for the dirty play on both sides of the ball and for an ending decided by stupid play on Cincinnati's part. It took away from the good parts of the game: the tough defense when people weren't pushing the edges of the rules, and the gritty play of McCarron and Roethlisberger. There was some heroic play on both their parts, in pouring rain.

But for me, watching Ryan Shazier lead with the crown of his helmet and then listening to officials explain away what is obvious on video more or less took the cake. If, in any way, shape, or form, this kind of hit is legal, then the NFL rules system is busted.
