CurveBall: Playing a Baseball Season on my Laptop

Even though this post will be of limited interest (at best), I need to document what I've done in my simulations and show that the calculations are robust and pass various tests. I'm going to document one straightforward test here.

The code works just like an actual game (only without pitching): each game has nine innings, each inning has three outs, and each batter can either reach base or make an out. Outs are either strikeouts or in-play outs. Reaching base can happen by walks, hits or errors. The rate at which a player does all these things depends on stats that are read in from an input file. Usually I use real players with stats taken from FanGraphs, but sometimes it's interesting to play around with things. Players on base advance on balls in play, depending on whether the out is a ground ball or a fly ball (rates for which are also part of the input statistics).
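As a rough illustration of the game loop described above, here is a minimal sketch in Python. It is not the actual simulator: the outcome probabilities in `EXAMPLE_BATTER` are made up, the function names are hypothetical, and the baserunning is deliberately crude (runners hold on every out, errors are treated like walks, and a hit moves every runner as many bases as the batter takes).

```python
import random

# Hypothetical per-plate-appearance probabilities for one batter.
# Illustrative numbers only (not real FanGraphs stats); they sum to 1.
EXAMPLE_BATTER = {
    "strikeout": 0.18,
    "groundout": 0.24,
    "flyout": 0.24,
    "walk": 0.08,
    "error": 0.01,
    "single": 0.16,
    "double": 0.05,
    "triple": 0.01,
    "home_run": 0.03,
}

OUTS = {"strikeout", "groundout", "flyout"}
HIT_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def sample_outcome(batter, rng):
    """Draw one plate-appearance outcome from the batter's probabilities."""
    outcomes = list(batter)
    weights = [batter[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def advance(bases, outcome):
    """Return (new_bases, runs) for an outcome that puts the batter on base.

    bases is a (first, second, third) tuple of booleans.
    """
    first, second, third = bases
    if outcome in ("walk", "error"):
        # Assumption: only forced runners move up.
        runs = 1 if (first and second and third) else 0
        return (True, second or first, third or (first and second)), runs
    n = HIT_BASES[outcome]
    # Position 0 is the batter; 1-3 are the occupied bases.
    runners = [0] + [i + 1 for i, on in enumerate(bases) if on]
    moved = [r + n for r in runners]
    runs = sum(1 for r in moved if r >= 4)
    return tuple(b in moved for b in (1, 2, 3)), runs


def simulate_game(lineup, rng, innings=9):
    """Runs scored by one lineup over a full game (no pitching)."""
    total_runs, batter_index = 0, 0
    for _ in range(innings):
        outs, bases = 0, (False, False, False)
        while outs < 3:
            batter = lineup[batter_index % len(lineup)]
            batter_index += 1
            outcome = sample_outcome(batter, rng)
            if outcome in OUTS:
                outs += 1  # simplification: runners hold on every out
            else:
                bases, runs = advance(bases, outcome)
                total_runs += runs
    return total_runs


rng = random.Random(2014)
print(simulate_game([EXAMPLE_BATTER] * 9, rng))
```

Running `simulate_game` many times and averaging gives runs per game for a lineup; the real code replaces the crude `advance` step with the ground-ball and fly-ball advancement described above.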

The code takes into account baserunning in four ways:

(1) Taking the extra base. There isn't a whole lot of player-to-player variance in the number of times a player takes the extra base relative to the league average, but it matters enough that it needs to be taken into account. There is a tail of bad baserunners out there who can have a significant impact on their teams.

(2) Relatedly, double plays are important. The league average for double plays per DP opportunity was 11% in 2013, but this depends a lot on ground ball rate and strikeouts, as well as the number of opportunities a player is given to hit into one. Mike Moustakas actually leads the Royals in staying out of the DP given the number of chances he's had, but that's because it's hard to double up the guy on first when fielding a pop-up. So the code uses the fraction of times a ground-ball out in a DP situation yields a GIDP (as well as the player's GB rate and the average BABIP for grounders). This is where I had to go to Retrosheet.

(3) Base stealing. Not that this really matters much, but players above a given threshold of stolen base attempts are flagged as "base stealers" and they attempt and succeed at their previous rate.

(4) Reaching base on error. (Not really baserunning per se, but I'll just stick it here anyway.) This is a statistically significant source of base runners, and it's frustrating that FanGraphs does not track it or include it in their wOBA calculation, even though Tango et al.'s The Book puts it in theirs. This is another thing that depends on ground ball rate (since this is where the errors happen) and speed of the batter. Right now I only have this depending on GB%, but I should add considerations for player handedness and overall speed.
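One way to read the double-play treatment in item (2) is as a product of three factors: the chance the ball in play is a grounder, the chance that grounder is an out rather than a hit, and the fraction of such outs that become a GIDP. The sketch below is my own reading of that description, not the simulator's actual formula; the function name, parameter names, and example numbers are all assumptions.

```python
def gidp_probability(gb_rate, babip_gb, dp_per_gb_out):
    """Chance that a ball in play in a DP situation becomes a double play.

    gb_rate        fraction of the batter's balls in play that are grounders
    babip_gb       average BABIP on ground balls (chance a grounder is a hit)
    dp_per_gb_out  fraction of ground-ball outs in DP situations that
                   turn into a GIDP
    All three inputs are assumed to be probabilities in [0, 1].
    """
    for name, value in (("gb_rate", gb_rate),
                        ("babip_gb", babip_gb),
                        ("dp_per_gb_out", dp_per_gb_out)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    # grounder * grounder is an out * that out is a double play
    return gb_rate * (1.0 - babip_gb) * dp_per_gb_out


# Illustrative inputs only, not measured values.
print(gidp_probability(gb_rate=0.45, babip_gb=0.24, dp_per_gb_out=0.30))
```

Under this reading, a fly-ball hitter gets a low `gb_rate` and therefore a low GIDP chance even if his ground-ball outs are doubled up as often as anyone else's, which is the Moustakas effect described above.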

The test I'm presenting here is that I reproduce the actual run expectancy (RE) of each of the 24 possible base/out states. This is presented in the very first chapter of The Book, and forms the basis of a lot of the analysis presented thereafter and of many advanced statistics. The bases can be empty or occupied in various ways:

Bases empty
Runner on first
Runner on second
Runner on third
First and second
First and third
Second and third
Bases loaded

Those 8 states can each happen with 0, 1 or 2 outs, giving 24 total states. The RE is defined as the average number of runs scored from that point in the inning until the end. The RE value for the first state (0 outs, bases empty) is simply the average number of runs scored per inning. The Book was written using statistics from the height of the steroid era (1999-2002), so their numbers are a bit higher than we see in MLB today. At that time the average team scored just under 5 runs a game, or 0.55 runs per inning. So that's the first RE value. If there are 0 outs and a man on first, the RE goes up. Man on second, goes up again. I think you get the idea, and you can see all the values from The Book (table 1) in the figure below.
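The RE definition above translates directly into a tally over simulated innings. Here is a minimal sketch, assuming each inning is logged as a list of hypothetical `(outs, bases, runs_on_play)` records, one per plate appearance, where `(outs, bases)` is the state before the plate appearance; that log format is my invention for the example, not the simulator's.

```python
from collections import defaultdict
from itertools import product

# 3 out counts x 8 base configurations = 24 base/out states.
# bases is a (first, second, third) tuple of booleans.
ALL_STATES = [(outs, bases)
              for outs in range(3)
              for bases in product((False, True), repeat=3)]


def run_expectancy(innings):
    """Average runs scored from each state until the end of the inning.

    innings: iterable of innings, each a list of (outs, bases, runs_on_play)
    records in the order the plate appearances happened.
    Returns a dict mapping (outs, bases) to RE for every state that occurred.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for inning in innings:
        remaining = sum(runs for _, _, runs in inning)
        for outs, bases, runs in inning:
            # Runs still to come include the runs scored on this play.
            totals[(outs, bases)] += remaining
            counts[(outs, bases)] += 1
            remaining -= runs
    return {state: totals[state] / counts[state] for state in counts}


EMPTY = (False, False, False)
ON_FIRST = (True, False, False)

# One made-up inning: single, two-run homer, then three outs.
example_inning = [
    (0, EMPTY, 0),
    (0, ON_FIRST, 2),
    (0, EMPTY, 0),
    (1, EMPTY, 0),
    (2, EMPTY, 0),
]
table = run_expectancy([example_inning])
print(len(ALL_STATES), table[(0, ON_FIRST)])
```

With many thousands of simulated innings every one of the 24 states gets visited often enough for the averages to settle, and the entry for 0 outs with the bases empty converges on mean runs per inning.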

To simulate this, I took all players' MLB numbers from 1999-2002 who had more than 200 plate appearances total over that span, created 10,000 random lineups (batting order sorted by wOBA; I have also tried to make "MLB-like" lineups, but it doesn't change the results enough to matter for this test) and ran them through the simulator. Essentially, there is one free parameter in the code, and it is set by getting the first RE number right: the mean runs scored per inning. This is where the cutoff in plate appearances comes in: the lowest-PA players are all replacement level (or worse). Bringing more of them into the simulated lineups lowers the overall offense of my virtual league. So once I figure out which players to use to get the right mean number of runs, I can look at the rest of the states.
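The lineup construction can be sketched as follows. This is an illustration under assumptions: the player records are dicts with hypothetical `pa` and `woba` keys, "sorted by wOBA" is taken to mean best hitter first, and the player pool here is synthetic rather than the 1999-2002 data.

```python
import random


def build_lineups(players, min_pa, n_lineups, rng):
    """Draw random nine-man lineups from players above a PA cutoff.

    players: list of dicts with at least "pa" and "woba" keys (assumed format).
    min_pa:  the cutoff; only players with more than this many PA are used.
    Each lineup is ordered by wOBA, highest first (an assumption about the
    sort direction).
    """
    pool = [p for p in players if p["pa"] > min_pa]
    lineups = []
    for _ in range(n_lineups):
        nine = rng.sample(pool, 9)  # nine distinct players
        nine.sort(key=lambda p: p["woba"], reverse=True)
        lineups.append(nine)
    return lineups


# Synthetic pool for illustration; not real player data.
rng = random.Random(1999)
fake_players = [
    {"name": f"player_{i}", "pa": rng.randint(50, 2500),
     "woba": round(rng.uniform(0.280, 0.420), 3)}
    for i in range(200)
]
lineups = build_lineups(fake_players, min_pa=200, n_lineups=10_000, rng=rng)
print(len(lineups), [p["name"] for p in lineups[0]])
```

Raising or lowering `min_pa` changes how many weak hitters enter the pool, which is the knob used to match the league's mean runs per inning.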

Results are quite encouraging. Additionally, I should point out that the code produces the right number of runs with the right number of hits, walks, errors, double plays, sacrifice flies, etc. This is an important check as well, since if my virtual league scored the same runs as MLB did in this era but had different batting statistics, something would be wrong. I'll give more rigorous tests in the future, once I am infused with the desire to write more boring posts.

Friday, January 10, 2014

Jeremy Tinker