After watching one controversy or another break out during the 2011 season, I’ve become convinced that the average “analytics guy” needs a source of play-by-play data on a weekly basis. I’m at a loss at the moment to recommend a perfect solution. I can see the play-by-play data on NFL.com, but I can’t download it. Worst case, you would think you could save the page and get to the data, but that doesn’t work. I suspect the page uses AJAX or an equivalent technique to write the data into the page after the initial HTML has loaded, which would explain why a saved copy comes up empty. Good for business, I’m sure, but not good for Joe Analytics Guy.
One possible source is Pro Football Reference (PFR), which now has play-by-play data in its box scores and has tended to present its data in an AJAX-free, user-friendly fashion. Whether Joe Analytics Guy can do more than use those data personally, I doubt. PFR purchases its raw data from another source, and whatever restrictions the supplier puts on PFR’s data legally trickle down to us.
Further, PFR is now calculating expected points (EP) along with the play-by-play data. Thing is, which expected points model is Pro Football Reference actually using? Unlike win probabilities, which have one interpretation per data set, EP models are a class of related models whose values can differ quite a bit (discussed here, here, here). If you need independent verification, please note that Keith Goldner has now published 4 separate EP models (here and here): his old Markov chain model, the new Markov chain model, a response function model, and a model based on piecewise fits.
That’s question number one. Questions that have to be answered in order to answer question one are things like:
- How is PFR scoring drives?
- What is their value for a touchdown?
- If PFR were to eliminate down and distance as variables, what curve do they end up with?
This last would define how well Pro Football Reference’s own EP model supports their own AYA formula. After all, that’s what an AYA formula is: a linearized approximation of an EP model where down and to-go distance are ignored and yards to score is the only independent variable.
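To make the three questions above concrete, here is a minimal sketch in Python of how one might collapse an EP data set onto yards to score and then linearize it. Everything in it is an assumption for illustration: the play records are invented toy numbers (not PFR data), the function names are hypothetical, and each drive is scored only by what the offense does with it (touchdown, field goal, or nothing). Many EP models score drives differently, for example by crediting the opponent's next score as negative points, which is exactly why the assumptions matter.

```python
from collections import defaultdict

# Invented toy records, NOT real PFR data:
# (down, to_go, yards_to_score, drive_outcome)
TOY_PLAYS = [
    (1, 10, 80, "NONE"), (2, 7, 80, "FG"), (1, 10, 80, "TD"),
    (1, 10, 50, "FG"), (3, 4, 50, "TD"), (2, 10, 50, "NONE"),
    (1, 10, 20, "TD"), (2, 5, 20, "FG"), (3, 8, 20, "TD"),
]


def drive_value(outcome, td_value=7.0, fg_value=3.0):
    """Points credited to a drive. td_value is a modeling choice
    (6, 6.4, 7, ...), which is one reason EP models disagree."""
    return {"TD": td_value, "FG": fg_value}.get(outcome, 0.0)


def ep_by_yards_to_score(plays, td_value=7.0):
    """Ignore down and to-go distance; average the drive value
    over every play that shares the same yards to score."""
    totals = defaultdict(lambda: [0.0, 0])  # yards -> [sum, count]
    for _down, _to_go, yards, outcome in plays:
        cell = totals[yards]
        cell[0] += drive_value(outcome, td_value)
        cell[1] += 1
    return {yards: s / n for yards, (s, n) in totals.items()}


def fit_line(curve):
    """Ordinary least squares fit of EP = intercept + slope * yards."""
    xs = list(curve)
    ys = [curve[x] for x in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    for td in (6.4, 7.0):
        curve = ep_by_yards_to_score(TOY_PLAYS, td_value=td)
        slope, intercept = fit_line(curve)
        print(f"TD = {td}: slope {slope:.4f} pts/yd, intercept {intercept:.2f}")
```

The slope is the points-per-yard figure that a linearized model implies, and it moves when the touchdown value or the drive-scoring rule changes. That is the sense in which we can't judge how well an EP model supports an AYA formula without knowing those assumptions.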
|Representative Pro Football Reference EP Values|
|1 yard to go||99 yards to go|
My recommendation is that PFR clearly delineate their assumptions in the same glossary where they define their version of AYA. Make it a single click lookup, so Joe Analytics Guy knows what the darned formula actually means. Barring that, I’ve suggested to Neil Paine that they publish their EP model data separately from their play by play data. A blog post with 1st and ten, 2nd and ten, 3rd and ten curves would give those of us in the wild a fighting chance to figure out how PFR actually came by their numbers.
Update: the chart that features 99 yards to go clearly isn’t 1st and 99, 2nd and 99. Those are the values for 1st and 10, 2nd and 10, etc., at the team’s own 1 yard line. The only 4th down play of 2011 from 99 yards away is a 4th and 13, so that’s what is reported above.