
StatsDetective 3 | predicting the MLB regular season

  • Jonah Vega-Reid
  • Jun 19, 2024
  • 2 min read

This week we tried to improve on last week's abject predictive failure. To be fair, once the finals were over, my algorithm had finished at 57% accuracy, a bit better than I thought it would be. My success rate for MLB games has been improving steadily with each new test, which is promising. I can reach around 60% accuracy with a variety of variable combinations, and those results are proving consistent.


As detailed in the previous video, the set of variables available to me for prospective prediction is very limited. They fall into basically two groups: past performance (career or previous season) and performance up to that point in the current season. Both are somewhat tricky to conceptualize and put into usable form. Past performance has missing data for rookies and for guys who played few games or none at all. That raises a real question of imputation, and it is not obvious how best to handle the missing values.
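To make the imputation question concrete, here is a minimal sketch of one common option: fill a missing prior-season stat with the average of the observed values and keep a flag so the model can tell which rows were filled. This is an illustration rather than what my model does; the function name, the sample ERAs, and the choice of mean imputation are all assumptions.

```python
from statistics import mean


def impute_with_indicator(values):
    """Fill None entries with the mean of the observed values.

    Returns (filled, was_missing), where was_missing is 1 for every
    entry that had to be filled in (e.g. a rookie with no prior season).
    Hypothetical helper for illustration only.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("need at least one observed value to impute from")
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


# Made-up prior-season ERAs; None marks a pitcher with no prior data.
prior_era = [3.50, None, 4.50, None]
filled, was_missing = impute_with_indicator(prior_era)
print(filled)
print(was_missing)
```

Feeding the flag into the regression alongside the filled value lets the model learn whether "no track record" carries information of its own, instead of treating a rookie as an exactly average player.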


Current season performance, on the other hand, is an entirely different beast. Not only do you need to calculate running totals and proportions meticulously, but the question of where to start comes up. For instance, a pitcher might open the season with a good game and carry an unrealistically low ERA for a while. When he has the inevitable stinker, a prediction based on that ERA is totally wrong, even though he is probably just regressing to his mean. I have gone with a very basic rule of thumb: get rid of the first 10 games for every team. That seems to be about where winning percentages and run differentials normalize, but of course this approach could be throwing out useful data.
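A rough sketch of the running-total idea, under some assumptions: starts are listed in chronological order, innings are given as true fractions (6.333 rather than the box-score shorthand 6.1), and the function names are made up for this example. The key detail is that the ERA attached to a start uses only the starts before it, so the prediction never sees the game it is trying to predict.

```python
def running_era_before_each_start(starts):
    """ERA entering each start, using only earlier starts.

    starts: list of (earned_runs, innings_pitched) tuples in
    chronological order. Returns None where no innings exist yet.
    Hypothetical helper for illustration only.
    """
    eras = []
    total_er = 0
    total_ip = 0.0
    for earned_runs, innings in starts:
        # Record the ERA *before* adding this start to the totals.
        eras.append(9 * total_er / total_ip if total_ip > 0 else None)
        total_er += earned_runs
        total_ip += innings
    return eras


def drop_warmup_games(team_games, warmup=10):
    """Discard a team's first `warmup` games (the rule of thumb above)."""
    return team_games[warmup:]


# Made-up example: a gem, then a stinker, then an ordinary start.
starts = [(0, 7.0), (6, 3.0), (2, 6.0)]
eras = running_era_before_each_start(starts)
print(eras)

kept = drop_warmup_games(list(range(1, 15)))
print(kept)
```

In the made-up example the pitcher carries a 0.00 ERA into his second start and gives up six runs, which is exactly the early-season trap described above.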


Lastly, we have the breakthrough. Since I am using logistic regression, I can easily calculate predicted win probabilities, and by looking only at games with sufficiently high or low probabilities, I can get as high as 78% success. This sounds great and should theoretically be able to turn a profit when it comes to betting, but the games that fit the bill may turn out to be only heavy favorites, where the payout is ridiculously low. I foresee that this could be an issue, but for now I am hopeful the approach will work.
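Here is a small sketch of both halves of that idea: scoring only the confident games, and checking what hit rate a given price requires. The cutoffs (0.35 and 0.65), the sample probabilities, and the function names are assumptions for illustration, not the values from my model. The break-even formula is the standard conversion from American moneyline odds to implied probability.

```python
def accuracy_on_confident_games(probs, home_won, low=0.35, high=0.65):
    """Accuracy on games where the predicted home-win probability is
    at or beyond the cutoffs. Returns (accuracy, number_of_games).

    The cutoffs are hypothetical; tune them on held-out data.
    """
    picks = [(p, y) for p, y in zip(probs, home_won) if p <= low or p >= high]
    if not picks:
        return None, 0
    # Pick the home team when p >= 0.5, otherwise the away team.
    correct = sum((p >= 0.5) == bool(y) for p, y in picks)
    return correct / len(picks), len(picks)


def break_even_probability(american_odds):
    """Win rate needed to break even at a given American moneyline."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


# Made-up predicted probabilities and outcomes (1 = home team won).
probs = [0.72, 0.55, 0.30, 0.68, 0.48]
home_won = [1, 0, 0, 0, 1]
acc, n_games = accuracy_on_confident_games(probs, home_won)
print(acc, n_games)

needed = break_even_probability(-400)
print(needed)
```

A -400 favorite needs an 80% hit rate just to break even, so 78% accuracy would lose money if the confident games are all priced like that. Comparing the model's probability with the break-even probability game by game is one way to check whether the edge is real.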



