FiveThirtyEight’s NBA Carmelo Predictions

I’ve been following FiveThirtyEight’s NBA predictions for the last few months, and I saw today that they make their data available for download. One problem I have with the table on their website is that it doesn’t have a way to get a nice East/West predicted finish table. Their predictions run on a model they call Carmelo, which you can read about on their website. Today I thought I’d take a look at their data, analyze their prediction’s accuracy, compare it to other websites, get their predicted finish in a nice table, and, last, run a nice long simulation of predicted finishes.

First thing is to download the data. The great thing is this data is constantly updated so this analysis can be rerun anytime you’d like.

# load packages
if (!is.element('pacman', installed.packages()[,1])) install.packages('pacman', dep = TRUE)
pacman::p_load(rio,tidyverse,rvest,formattable)
# download Carmelo data
carmelofull <- import("https://projects.fivethirtyeight.com/nba-model/nba_elo.csv")
carmelodf <- carmelofull %>% filter(season == 2019)  %>% 
  select(1,2,5,6,20:24) 

Now we can take a look at how often their predictions have been right so far this season.

# take a look at this season's prediction success
carmelo2019 <- carmelodf %>% 
  filter(!is.na(score1)) %>% 
  mutate(win1 = score1 > score2,
         win2 = score1 < score2,
         predicted = carmelo_prob1 > carmelo_prob2,
         correct = win1 == predicted)

# get the total
print(sum(carmelo2019$correct) / nrow(carmelo2019))
## [1] 0.6553229

So they’re right about two out of three games. That doesn’t seem too bad, but certainly doesn’t mean they’ll get everything right. What’s missing is perspective. Let’s take a look at how other predictions have performed.

Conveniently, thePredictionTracker.com has already done this for us. We can scrape the web table using the below code, and then take a look at the accuracies.

# download table of aggregated NBA predictions
url <- "http://www.thepredictiontracker.com/nbaresults.php"
xpath <- '//*[@id="content_wrapper"]/center/center/div[1]/div/div/table'
predictions <- url %>%
  read_html() %>%
  html_nodes(xpath = xpath) %>%
  html_table()
predictions <- as_tibble(predictions[[1]])
predictions <- predictions %>% 
  arrange(desc(`Straight Up`))
formattable(predictions)
Sytem Straight Up ATS Absolute Error Bias Mean SquareError Straight UpWins Straightup Losses ATS Wins ATS Losses
Opening Line 0.67582 0.47912 10.02710 -0.50000 167.470 738 354 413 449
Line 0.67037 . 9.82762 -0.45876 162.108 724 356 . .
System Average 0.66489 0.50515 10.14760 -0.26048 171.072 748 377 539 528
ESPN BPI 0.66154 0.51198 10.12720 0.02468 170.827 731 374 534 509
Talisman Red 0.66132 0.52399 10.28930 -0.24057 175.377 742 380 557 506
Sagarin Golden Mean 0.66044 0.49342 10.24120 -0.39212 172.603 743 382 525 539
Sagarin Rating 0.65956 0.49953 10.20820 -0.39724 172.173 742 383 532 533
Powerrankings.com 0.65689 0.50765 10.18820 1.08133 173.493 739 386 531 515
Sagarin Predictor 0.65689 0.50141 10.24240 -0.37231 172.768 739 386 535 532
StatFox 0.65600 0.50940 10.38400 -0.02081 178.780 738 387 515 496
RoundTable 0.65564 0.48845 10.43480 0.04183 178.491 674 354 444 465
Dokter Entropy 0.65422 0.48735 10.26950 -1.85500 174.807 736 389 520 547
Massey Ratings 0.65244 0.49577 10.28270 -0.02758 175.545 734 391 528 537
Sonny Moore 0.64978 0.50188 10.34360 -0.62353 178.033 731 394 535 531
ComPughter Ratings 0.64978 0.50470 10.44420 0.35868 178.318 731 394 537 527
D Ratings.com 0.64502 0.49570 10.29010 -0.03157 174.184 725 399 519 528
Logical Approach 0.64411 0.49324 10.32120 -1.87453 174.928 514 284 365 375
Sagarin Recent 0.64089 0.49343 10.63640 -0.46798 186.269 721 404 526 540
Dunkel Index 0.63780 0.51232 10.43450 0.03736 180.962 685 389 520 495
NutShell Sports 0.60836 0.48413 11.51040 -0.13290 219.774 553 356 427 455
print(summary(predictions$`Straight Up`))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6084  0.6486  0.6558  0.6531  0.6607  0.6758

FiveThirtyEight’s accuracy matches the median almost exactly, so not bad and not good, but the different between the top and the bottom isn’t a huge margin.

ESPN is near the top (if anyone knows how to get their raw data that would be awesome), so let’s take a look at their predicted standings, plus we can grab the conference affiliations.

# Get ESPN's predicted standings
url <- 'http://www.espn.com/nba/story/_/page/BPI-Playoff-Odds/espn-nba-basketball-power-index-playoff-odds'
xpath <- '//*[@id="article-feed"]/article[1]/div/div[2]/aside[1]/table'
ESPNStandingsEast <- url %>%
  read_html() %>%
  html_nodes(xpath = xpath) %>%
  html_table()
ESPNStandingsEast <- as_tibble(ESPNStandingsEast[[1]]) %>% 
  mutate(Conference = "East")
xpath <- '//*[@id="article-feed"]/article[1]/div/div[2]/aside[2]/table'
ESPNStandingsWest <- url %>%
  read_html() %>%
  html_nodes(xpath = xpath) %>%
  html_table()
ESPNStandingsWest <- as_tibble(ESPNStandingsWest[[1]]) %>%
  mutate(Conference = "West")
ESPNStandings <- bind_rows(ESPNStandingsEast,ESPNStandingsWest)
rm(ESPNStandingsEast,ESPNStandingsWest)
formattable(ESPNStandings)
TEAM WIN-LOSS PLAY-OFFS NBATITLE Conference
MIL 62-20 100.0% 43.7% East
TOR 57-25 100.0% 14.2% East
PHI 53-29 100.0% 2.5% East
BOS 48-34 100.0% 2.8% East
IND 48-34 100.0% 1.0% East
DET 42-40 95.8% <0.1% East
BKN 41-41 78.1% <0.1% East
ORL 40-42 74.8% <0.1% East
MIA 40-42 46.5% <0.1% East
CHA 38-44 4.8% 0.0% East
WSH 33-49 0.0% 0.0% East
ATL 28-54 0.0% 0.0% East
CHI 23-59 0.0% 0.0% East
CLE 20-62 0.0% 0.0% East
NY 16-66 0.0% 0.0% East
GS 56-26 100.0% 15.5% West
DEN 55-27 100.0% 5.5% West
HOU 52-30 100.0% 5.1% West
POR 52-30 100.0% 2.4% West
UTH 50-32 100.0% 5.6% West
LAC 49-33 100.0% 0.2% West
SA 48-34 100.0% 0.4% West
OKC 47-35 100.0% 1.1% West
SAC 39-43 0.0% 0.0% West
MIN 37-45 0.0% 0.0% West
LAL 36-46 0.0% 0.0% West
NO 35-47 0.0% 0.0% West
MEM 33-49 0.0% 0.0% West
DAL 32-50 0.0% 0.0% West
PHX 18-64 0.0% 0.0% West

There are a few differences between ESPN’s and FiveThirtyEight’s predictions, but this late in the season it would be hard to have many major differences.

While we’ve got the ESPN data, we can grab the conference affiliations to sort our standings later.

# compare names between ESPN and Fivethirtyeight
ComparisonTable <- tibble(Fivethirtyeight = sort(unique(carmelo2019$team1)),
                          ESPN = sort(unique(ESPNStandings$TEAM)))
formattable(ComparisonTable)
Fivethirtyeight ESPN
ATL ATL
BOS BKN
BRK BOS
CHI CHA
CHO CHI
CLE CLE
DAL DAL
DEN DEN
DET DET
GSW GS
HOU HOU
IND IND
LAC LAC
LAL LAL
MEM MEM
MIA MIA
MIL MIL
MIN MIN
NOP NO
NYK NY
OKC OKC
ORL ORL
PHI PHI
PHO PHX
POR POR
SAC SA
SAS SAC
TOR TOR
UTA UTH
WAS WSH

There’s a bit of a problem. The names don’t match. It’s probably fastest to manually fix this, but I’ll just give you the fix directly.

#manually fix
ComparisonTable <- structure(list(Fivethirtyeight = c("ATL", "BOS", "BRK", "CHI", 
"CHO", "CLE", "DAL", "DEN", "DET", "GSW", "HOU", "IND", "LAC", 
"LAL", "MEM", "MIA", "MIL", "MIN", "NOP", "NYK", "OKC", "ORL", 
"PHI", "PHO", "POR", "SAC", "SAS", "TOR", "UTA", "WAS"), ESPN = c("ATL", 
"BOS", "BKN", "CHI", "CHA", "CLE", "DAL", "DEN", "DET", "GS", 
"HOU", "IND", "LAC", "LAL", "MEM", "MIA", "MIL", "MIN", "NO", 
"NY", "OKC", "ORL", "PHI", "PHX", "POR", "SAC", "SA", "TOR", 
"UTH", "WSH")), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -30L), spec = structure(list(cols = list(
    Fivethirtyeight = structure(list(), class = c("collector_character", 
    "collector")), ESPN = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))

# add conference
ComparisonTable <- ESPNStandings %>% select(1,5) %>% rename(ESPN = TEAM) %>% 
  left_join(ComparisonTable, by = "ESPN") %>% select(1,3,2) %>% arrange(ESPN)

formattable(ComparisonTable)
ESPN Fivethirtyeight Conference
ATL ATL East
BKN BRK East
BOS BOS East
CHA CHO East
CHI CHI East
CLE CLE East
DAL DAL West
DEN DEN West
DET DET East
GS GSW West
HOU HOU West
IND IND East
LAC LAC West
LAL LAL West
MEM MEM West
MIA MIA East
MIL MIL East
MIN MIN West
NO NOP West
NY NYK East
OKC OKC West
ORL ORL East
PHI PHI East
PHX PHO West
POR POR West
SA SAS West
SAC SAC West
TOR TOR East
UTH UTA West
WSH WAS East

Great, now we can grab the conference affiliatons.

Next, let’s create a function to get the current NBA standings from the FiveThirtyEight data and print the standings.

# Compute current standings from fivethirtyeight data. Make a function
# to keep things cleaner
getStandings <- function(x){
  x1 <- x %>% select(team1,win1) %>% rename(team = team1, win = win1)
  x2 <- x %>% select(team2,win2) %>% rename(team = team2, win = win2) %>% 
    bind_rows(x1)
  x2 <- x2%>% group_by(team) %>% 
    summarize(wins = sum(win),
              total = n(),
              losses = total - wins) %>% 
    select(team,wins,losses, total)
  x2 <- ComparisonTable %>% rename(team = Fivethirtyeight) %>% 
    select(team, Conference) %>% left_join(x2, by = "team")%>%
    arrange(Conference,desc(wins))
  return(x2)
}

CarmeloStandings <- getStandings(carmelo2019)
formattable(CarmeloStandings)
team Conference wins losses total
MIL East 57 20 77
TOR East 54 23 77
PHI East 49 27 76
BOS East 45 32 77
IND East 45 32 77
BRK East 39 38 77
DET East 39 37 76
MIA East 38 38 76
ORL East 38 39 77
CHO East 35 40 75
WAS East 31 46 77
ATL East 28 49 77
CHI East 21 56 77
CLE East 19 58 77
NYK East 14 62 76
DEN West 51 24 75
GSW West 51 24 75
HOU West 49 28 77
POR West 48 28 76
LAC West 46 31 77
UTA West 46 30 76
OKC West 44 33 77
SAS West 44 32 76
SAC West 37 39 76
LAL West 34 42 76
MIN West 34 42 76
NOP West 32 45 77
MEM West 31 45 76
DAL West 30 46 76
PHO West 17 60 77

Whenever this code is run it should show the updated standings, as long as FiveThirtyEight has updated their data.

Now, let’s finally simulate the end of the regular season based exclusively on the the Carmelo model’s predicted wins and losses.

# take a look at the end of season NBA standings based on predicted future games
CarmeloStandingsPredicted <- carmelodf %>% 
  filter(is.na(score1)) %>% 
  mutate(win1 = carmelo_prob1 > carmelo_prob2,
         win2 = carmelo_prob2 > carmelo_prob1) %>%
  bind_rows(carmelo2019) %>% 
  getStandings() %>%
  group_by(Conference) %>%
  mutate(place = order(order(wins, Conference, decreasing=TRUE)))
formattable(CarmeloStandingsPredicted)
team Conference wins losses total place
MIL East 61 21 82 1
TOR East 59 23 82 2
PHI East 55 27 82 3
BOS East 50 32 82 4
IND East 48 34 82 5
DET East 43 39 82 6
BRK East 40 42 82 7
ORL East 40 42 82 8
CHO East 38 44 82 9
MIA East 38 44 82 10
WAS East 34 48 82 11
ATL East 28 54 82 12
CHI East 22 60 82 13
CLE East 19 63 82 14
NYK East 15 67 82 15
GSW West 58 24 82 1
DEN West 55 27 82 2
HOU West 53 29 82 3
POR West 52 30 82 4
UTA West 52 30 82 5
LAC West 48 34 82 6
OKC West 48 34 82 7
SAS West 48 34 82 8
SAC West 39 43 82 9
MIN West 36 46 82 10
LAL West 34 48 82 11
DAL West 33 49 82 12
NOP West 33 49 82 13
MEM West 32 50 82 14
PHO West 19 63 82 15

Great, we can now see the predicted finishes sorted by conference and wins. These standings don’t reflect tie breakers, but that’s another level I don’t want to work at today.

FiveThirtyEight says they run 50,000 simulations for their own predictions. Now, it might seem like that requires a supercomputer, but we can run our own simulation here , but, yes, it will take several minutes to run 50,000 simulations depending on your computer. I ran it and it took around 20 minutes or so, but in this code I’ll just run it 1,000 times. You can change the number of simulations in the function runSim. The end result is a table that shows the average finish for each team, although solely in terms of wins and losses.

# this code simulates the probabilities of victory and end of season rankings
runSim <- function(n){
  pb <- txtProgressBar(min = 0, max = n, style = 3)
  results <- NULL
  for(i in 1:n){
    tmp <- carmelodf %>% 
      filter(is.na(score1)) %>% 
      mutate(win1 = sapply(carmelo_prob1,function(p) rbinom(1,1,p)) == 1,
             win2 = win1 == FALSE) %>% 
      bind_rows(carmelo2019) %>% 
      getStandings() %>%
      group_by(Conference) %>%
      mutate(place = order(order(wins, Conference, decreasing=TRUE))) %>% 
      ungroup() %>% 
      select(team, place, Conference)
    results <- bind_rows(results,tmp)
    setTxtProgressBar(pb, i)
  }
  close(pb)
  finish <- results %>% group_by(team, Conference) %>% 
    summarize(AverageRank = mean(place)) %>% 
    arrange(Conference,AverageRank)
  return(finish)
}
finish <- runSim(1000)
formattable(finish)
team Conference AverageRank
MIL East 1.002
TOR East 2.000
PHI East 2.998
BOS East 4.272
IND East 4.728
DET East 6.280
BRK East 7.285
ORL East 8.107
MIA East 8.557
CHO East 9.771
WAS East 11.003
ATL East 11.997
CHI East 13.014
CLE East 13.986
NYK East 15.000
GSW West 1.144
DEN West 1.867
HOU West 3.104
POR West 4.179
UTA West 5.012
LAC West 6.008
SAS West 7.251
OKC West 7.435
SAC West 9.064
MIN West 10.389
LAL West 10.803
NOP West 12.626
DAL West 12.852
MEM West 13.266
PHO West 15.000

The FiveThirtyEight data goes back to 1947, so there’s a lot of other things we could do with it, but I’ll leave it here for now.