Introducing Beyond the Score's MLB Game Forecast Model
Voila! I present to you my personal model for individual MLB games.
I think I can speak for all statisticians and data scientists when I say that I love the art of making predictions. I’ve always been fascinated by the prospect of forecasting the outcome of sporting events, and have long tinkered with building predictive models. I like to think I am finally putting one to good use here. I have been working this season on building a model to forecast the win probability of MLB games and - for those who dabble in the sports betting scene - to compare the model’s win probability projections to the win probability implied by the odds set by sportsbooks. Doing so, assuming you have an effective model, can help you spot opportunities and potential overvaluations in the betting market. Without getting too far into the weeds, I will give a brief rundown of the model and the process I followed to arrive at the finished product.
About the Model
The model is based on a logistic regression procedure, with the included statistics chosen through a statistical variable selection process. The full methodology behind the model is included at the end of this post, and I strongly encourage you to read it. Below are the stats included in the model, along with their definitions from the FanGraphs library.1 All stats, except for run differential, include both home and away team numbers.
I calculated the variable importance of the model using the `varImp()` function from the caret package in R. Below you can see which variables are most and least impactful in making the model’s predictions.
Using the Model
The model has seven basic elements:
Predicted winner
Win probability for the predicted winner
Corresponding betting odds for the winner - based on modeled win probability (called “implied odds”)
Betting favorite (currently per DraftKings)
Moneyline odds for the betting favorite (also currently via DraftKings)
Win probability corresponding to the moneyline odds (called “implied probability”)
Value - representing the relative difference between the modeled implied odds and sportsbook moneyline odds.
All of these elements are displayed in the model’s output, which you can find on the new “MLB Predictions” tab of the Beyond the Score homepage. Below is what it looks like.
The ‘Value’ piece is an important element here, as it shows which games may be the most over- or undervalued in the betting markets. Games with a high value represent the best betting opportunities, while low or negative values flag games in which the winner predicted by the market may be overvalued.
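For readers who want to see the arithmetic behind “implied odds”, “implied probability”, and “Value”, here is a minimal Python sketch. The two conversion functions use the standard American moneyline formulas. The `value` function is only an assumption about how a relative difference could be computed (here on the probability scale); it is not necessarily the exact formula behind the published numbers, and all function names are hypothetical. The sportsbook probability also still includes the book’s margin (the “vig”).

```python
def american_to_prob(odds: int) -> float:
    """Implied win probability from American moneyline odds (vig not removed)."""
    if odds < 0:
        return -odds / (-odds + 100)   # favorite, e.g. -150 -> 150 / 250
    return 100 / (odds + 100)          # underdog, e.g. +150 -> 100 / 250


def prob_to_american(p: float) -> int:
    """American odds that correspond to a win probability p (0 < p < 1)."""
    if p >= 0.5:
        return round(-100 * p / (1 - p))
    return round(100 * (1 - p) / p)


def value(model_prob: float, book_odds: int) -> float:
    """ASSUMED definition: relative gap between the modeled win probability
    and the probability implied by the sportsbook's moneyline."""
    book_prob = american_to_prob(book_odds)
    return (model_prob - book_prob) / book_prob


# Hypothetical game: the model gives the favorite 60%, the book lists -130.
model_prob = 0.60
print(prob_to_american(model_prob))        # modeled "implied odds"
print(american_to_prob(-130))              # book "implied probability"
print(value(model_prob, -130))             # positive -> model likes it more than the market
```

Under this assumed definition, a positive value means the model rates the team more highly than the market does, and a negative value means the opposite.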
What’s Next?
While I simply do not have the capacity to keep this updated on a daily basis, I plan to update the model with game predictions several times a week, especially on weekends and days when there are many and/or high-profile games. Whenever I do update the model, I will give my top picks and best bets for the day, along with the number of units I’d wager on each game based on my confidence.2 To ensure you get all my updates and picks, be sure to subscribe! I will also be incorporating a performance tracker, where I will track the performance of the model and of my best-bet picks. This is not yet implemented, but you will be alerted when it is. A lot of work went into this, and I hope you find my model useful and interesting!3
The Methodology
Data
Data was collected from over 450 games starting near the beginning of the 2024 season. Some supplemental data from prior seasons was also used to examine trends. This data was obtained from a variety of sources, including FanGraphs, Baseball Reference, Retrosheet, and Baseball Prospectus. Over 50 statistics were either considered in modeling or used to construct a new statistic, including team-level batting, pitching, and fielding stats, home/road win rates, individual starting pitching, and result-based run differential. WAR statistics at the individual and team level, which have FanGraphs preseason and in-season projections via their proprietary ZiPS projection system, reflect a weighted average of projected and actual values. These stats are continuously adjusted with each game played to weight projections more heavily at the beginning of the season and realized numbers more heavily towards the end of the season. For example, through 54 games (one-third of a 162-game season), a team’s Batting WAR would be approximately 33% its actual Batting WAR and 67% what FanGraphs projected for them through 54 games. For pitchers, the calculation is similar, but instead of weighting by team games, games pitched (actual and projected) are used.
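The projected/actual WAR blend described above can be sketched in a few lines of Python. This assumes the weight on actual WAR is simply the share of the 162-game season already played, which matches the 54-game example (54/162 ≈ 33%) but may not be the exact weighting scheme used in the model. The function name and the sample WAR numbers are hypothetical.

```python
SEASON_GAMES = 162  # length of an MLB regular season


def blended_war(actual_war: float, projected_war: float,
                games_played: int, total_games: int = SEASON_GAMES) -> float:
    """Weighted average of actual and projected WAR to date.

    ASSUMPTION: the weight on actual WAR grows linearly with the share of
    games played, so projections dominate early and results dominate late.
    """
    w_actual = min(max(games_played / total_games, 0.0), 1.0)
    return w_actual * actual_war + (1.0 - w_actual) * projected_war


# Hypothetical team through 54 games: 9.0 actual Batting WAR,
# 6.0 projected Batting WAR through the same 54 games.
print(blended_war(actual_war=9.0, projected_war=6.0, games_played=54))
```

For a starting pitcher, the same function could be called with games pitched in place of team games and the pitcher’s projected games pitched as `total_games`, again as an assumption about the details.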
Analysis
The analysis was performed using the open-source statistical software R. I employed a stepwise regression procedure for variable selection to whittle down all the data to only what was most impactful. The regression was set up to predict the run differential of each game in the data (a positive run differential signaled a win by the home team, while a negative run differential meant the road team won). The stepwise regression gave a good set of predictor variables, which I then supplemented with a couple of additional and/or complementary variables based on personal baseball knowledge and to balance out home team and away team data. Then, using a 67% to 33% train-test split, I ran a logistic regression model to predict the outcome of the game. Because logistic regression requires a binary outcome, I created a new outcome variable, “Result”, which reflected a positive or negative run differential (home team won or lost, respectively). The model generates the probability of the home team winning each game in the data. I then incorporated some home-field advantage bias towards the home team, based on the fact that the home team statistically has a 53% chance of winning a given game (shameless plug for my recent article on home field advantage).
Testing the model on the hold-out set suggested approximately 64% accuracy, with a 95% confidence interval of 55.4%-71.3%. While it would be nice to see 90%+ accuracy, I consider 64% to be fairly decent in Major League Baseball, which has far and away the most parity of any major sport - especially this season. Hey, a winning percentage of 64% would be the best in baseball right now - beating out the Phillies by over two games! Also, this is likely the very worst the model will ever be. I will be continuously calibrating it as more games are played and new data comes in (frankly, a sample size of 450 games is not a whole lot in such a highly variable sport).
I will also make tweaks here and there, especially between seasons, in an effort to improve the model. I will keep you loyal readers informed of any model tweaks and/or re-calibrations.
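To give a feel for why the accuracy interval quoted above is so wide, here is a small Python sketch of an exact (Clopper-Pearson) binomial confidence interval, one standard way to put a 95% interval around a hold-out accuracy. The counts used (95 correct out of 149 hold-out games) are hypothetical: they are simply round numbers consistent with a roughly 33% hold-out from about 450 games and about 64% accuracy, not the actual test-set counts, and the post does not say which interval method was used.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for a proportion k / n."""

    def solve(f, target: float) -> float:
        # f is decreasing in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha / 2, i.e. CDF(k - 1) = 1 - alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# HYPOTHETICAL counts: 95 correct picks out of 149 hold-out games (~64%).
low, high = clopper_pearson(95, 149)
print(f"accuracy = {95 / 149:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only about 150 test games, the interval spans well over ten percentage points, which is why more games should tighten the estimate.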
Feel free to drop a comment or message me with any questions or thoughts on the model. I’d love to hear your input!
For an overview of unit-based betting, see: What Is A Unit In Sports Betting? How To Build A Bankroll – Forbes Betting
This is intended for entertainment purposes only. I claim no responsibility for any financial gains or losses that may be incurred based on information obtained from this Substack if used for gambling purposes.