Projecting The 2024 NCAA Tournament Field
Well folks, post number one, wish me luck!
We are entering the greatest time of year to be a sports fan: March Madness! It's the time of year when millions of people spend an unhealthy amount of time trying to fill out the perfect bracket and win their bracket pool, only to lose to the eighty-year-old office receptionist who picked teams based on mascot names. Naturally, being a data scientist and someone mildly obsessed with both statistics and sports, trying to predict the NCAA tournament and fill out brackets makes me more excited than a kid in a candy store. Even with my beloved Syracuse Orange more than likely watching this tournament from home, it will still be fun for me, as I will most certainly be nerding out on all the numbers and stats trying to fill out my bracket(s).
The fun begins before Selection Sunday though: it's trying to figure out WHO in fact will be among the 68 teams with a chance to cut down the nets (let's be honest though, fewer than 68 actually have a remotely reasonable chance...sorry Colgate). I have pulled together my projected field of 68 heading into selection weekend. Disclaimer: this was finalized on Saturday morning 3/16 (I may update it Sunday before the selection show, but no promises), so for conferences that have not yet crowned a champion, I have slotted in the highest-ranked team still in play.
How I came to my calculations
As you know by now, this is both a sports and statistics site (well, really the statistics OF sports), so naturally this was the result of statistical algorithms. Using data from 855 NCAA basketball games played this season, by teams of all calibers and not skewed toward the better or more competitive games, I built a multiple regression model using ridge regression to forecast game results (specifically, whether the home team won or lost). Neutral-site games were omitted so they would not skew the home-court advantage, as I did not have enough neutral-site games to create a variable flagging a neutral game and incorporate it into the modeling. Once the most important variables for game result were determined, I performed a relative importance calculation to get "weights," per se, for each of those variables on the expected outcome. This modeling was performed in R, and the relative importance calculation was done using the "calc.relimp()" function in the "relaimpo" package.
Without getting too far into the weeds here, the relative importance weights were used to calculate a power ranking for each team based on the z-scores of its values for each of the variables identified. Because most of the variables had differing numerical scales, using the z-score puts all the values on the same scale, with a mean of 0 and a standard deviation of 1. The teams were seeded based on their power rankings, taking into consideration that the automatic qualifiers from the lower-tier conferences HAD to be in even though most had a far lower power rating. Here is my projected field:
Now, I did not take into consideration WHERE teams would appear in the bracket, just seeding, because aside from the overall number one seed, I feel like there is really zero rhyme or reason to what region teams end up in (throwback to Syracuse in Salt Lake City and Chicago). So really all you need to pay attention to is the seed line given to each team.
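For anyone curious about the mechanics, here is a rough sketch of how a weighted z-score power rating can be turned into seed lines. To be clear, this is not my actual code (that was done in R), and everything specific in it is made up for illustration: the stat names, the weights, the choice to flip the sign on "lower is better" stats, and the simple four-teams-per-seed-line rule, which ignores the First Four. Treat it as a toy version of the idea, written in Python using only the standard library.

```python
from statistics import mean, pstdev

# Hypothetical relative-importance weights; NOT the real values from the model.
WEIGHTS = {"off_eff": 0.5, "def_eff": 0.3, "turnover_rate": 0.2}
# Assumption: stats where a lower value is better get their z-score flipped.
LOWER_IS_BETTER = {"def_eff", "turnover_rate"}


def z_scores(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def power_ratings(teams):
    """teams maps name -> {stat: value}; returns name -> weighted sum of z-scores."""
    names = list(teams)
    ratings = {name: 0.0 for name in names}
    for stat, weight in WEIGHTS.items():
        zs = z_scores([teams[name][stat] for name in names])
        sign = -1.0 if stat in LOWER_IS_BETTER else 1.0
        for name, z in zip(names, zs):
            ratings[name] += sign * weight * z
    return ratings


def pick_field(ratings, auto_qualifiers, field_size):
    """Auto qualifiers are in no matter what; the best-rated remaining
    teams fill the rest. The field is returned ordered best to worst."""
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    field = set(auto_qualifiers)
    for name in ranked:
        if len(field) >= field_size:
            break
        field.add(name)
    return [name for name in ranked if name in field]


def seed_lines(field, teams_per_line=4):
    """Assign seed lines in order: first group gets seed 1, next gets seed 2, ..."""
    return {name: i // teams_per_line + 1 for i, name in enumerate(field)}


if __name__ == "__main__":
    # Made-up teams and numbers, purely to show the flow.
    demo = {
        "Team A": {"off_eff": 121.0, "def_eff": 92.0, "turnover_rate": 14.0},
        "Team B": {"off_eff": 115.0, "def_eff": 95.0, "turnover_rate": 16.0},
        "Team C": {"off_eff": 108.0, "def_eff": 99.0, "turnover_rate": 17.5},
        "Team D": {"off_eff": 101.0, "def_eff": 104.0, "turnover_rate": 19.0},
    }
    ratings = power_ratings(demo)
    field = pick_field(ratings, auto_qualifiers={"Team D"}, field_size=3)
    print(seed_lines(field, teams_per_line=2))
```

The key design point is the order of operations in pick_field: the automatic qualifiers claim their spots first, and only then do the at-large spots get handed out by rating, which is why a low-rated conference champion can bump a better-rated team out of the field.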
I'll definitely be revisiting this Sunday night to see how close it came to the real committee selection, and how I did compared to the "Bracketologist" himself, Joe Lunardi (to be honest, is he really all that accurate anyway?).
If you've read this far, thank you for having absolutely nothing better to do. I hope you continue to read my blog.