I’ve been following FiveThirtyEight’s NBA predictions for the last few months, and I saw today that they make their data available for download. One problem I have with the table on their website is that it doesn’t offer a nice East/West predicted-finish table. Their predictions run on a model they call Carmelo, which you can read about on their website. Today I thought I’d take a look at their data, analyze the accuracy of their predictions, compare it to other websites, get their predicted finish into a nice table, and, last, run a nice long simulation of predicted finishes.
First thing is to download the data. The great thing is this data is constantly updated so this analysis can be rerun anytime you’d like.
# load packages
if (!is.element('pacman', installed.packages()[,1])) install.packages('pacman', dep = TRUE)
pacman::p_load(rio,tidyverse,rvest,formattable)
# download Carmelo data
carmelofull <- import("https://projects.fivethirtyeight.com/nba-model/nba_elo.csv")
carmelodf <- carmelofull %>% filter(season == 2019) %>%
select(1,2,5,6,20:24)
Now we can take a look at how often their predictions have been right so far this season.
# take a look at this season's prediction success
carmelo2019 <- carmelodf %>%
filter(!is.na(score1)) %>%
mutate(win1 = score1 > score2,
win2 = score1 < score2,
predicted = carmelo_prob1 > carmelo_prob2,
correct = win1 == predicted)
# get the total
print(sum(carmelo2019$correct) / nrow(carmelo2019))
## [1] 0.6553229
So they’re right about two out of three games. That doesn’t seem too bad, but certainly doesn’t mean they’ll get everything right. What’s missing is perspective. Let’s take a look at how other predictions have performed.
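Before looking at other forecasters, a crude baseline helps put that number in context. The sketch below just counts how often team1 wins; as far as I can tell, team1 is the home team in FiveThirtyEight’s file, so this is roughly the accuracy of always picking the home side.
# crude baseline: how often does simply picking team1 win?
# (team1 appears to be the home team in FiveThirtyEight's file)
print(sum(carmelo2019$win1) / nrow(carmelo2019))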
Conveniently, ThePredictionTracker.com has already gathered the accuracy of a bunch of prediction systems for us. We can scrape the web table with the code below and then take a look at the accuracies.
# download table of aggregated NBA predictions
url <- "http://www.thepredictiontracker.com/nbaresults.php"
xpath <- '//*[@id="content_wrapper"]/center/center/div[1]/div/div/table'
predictions <- url %>%
read_html() %>%
html_nodes(xpath = xpath) %>%
html_table()
predictions <- as_tibble(predictions[[1]])
predictions <- predictions %>%
arrange(desc(`Straight Up`))
formattable(predictions)
System | Straight Up | ATS | Absolute Error | Bias | Mean Square Error | Straight Up Wins | Straight Up Losses | ATS Wins | ATS Losses |
---|---|---|---|---|---|---|---|---|---|
Opening Line | 0.67582 | 0.47912 | 10.02710 | -0.50000 | 167.470 | 738 | 354 | 413 | 449 |
Line | 0.67037 | . | 9.82762 | -0.45876 | 162.108 | 724 | 356 | . | . |
System Average | 0.66489 | 0.50515 | 10.14760 | -0.26048 | 171.072 | 748 | 377 | 539 | 528 |
ESPN BPI | 0.66154 | 0.51198 | 10.12720 | 0.02468 | 170.827 | 731 | 374 | 534 | 509 |
Talisman Red | 0.66132 | 0.52399 | 10.28930 | -0.24057 | 175.377 | 742 | 380 | 557 | 506 |
Sagarin Golden Mean | 0.66044 | 0.49342 | 10.24120 | -0.39212 | 172.603 | 743 | 382 | 525 | 539 |
Sagarin Rating | 0.65956 | 0.49953 | 10.20820 | -0.39724 | 172.173 | 742 | 383 | 532 | 533 |
Powerrankings.com | 0.65689 | 0.50765 | 10.18820 | 1.08133 | 173.493 | 739 | 386 | 531 | 515 |
Sagarin Predictor | 0.65689 | 0.50141 | 10.24240 | -0.37231 | 172.768 | 739 | 386 | 535 | 532 |
StatFox | 0.65600 | 0.50940 | 10.38400 | -0.02081 | 178.780 | 738 | 387 | 515 | 496 |
RoundTable | 0.65564 | 0.48845 | 10.43480 | 0.04183 | 178.491 | 674 | 354 | 444 | 465 |
Dokter Entropy | 0.65422 | 0.48735 | 10.26950 | -1.85500 | 174.807 | 736 | 389 | 520 | 547 |
Massey Ratings | 0.65244 | 0.49577 | 10.28270 | -0.02758 | 175.545 | 734 | 391 | 528 | 537 |
Sonny Moore | 0.64978 | 0.50188 | 10.34360 | -0.62353 | 178.033 | 731 | 394 | 535 | 531 |
ComPughter Ratings | 0.64978 | 0.50470 | 10.44420 | 0.35868 | 178.318 | 731 | 394 | 537 | 527 |
D Ratings.com | 0.64502 | 0.49570 | 10.29010 | -0.03157 | 174.184 | 725 | 399 | 519 | 528 |
Logical Approach | 0.64411 | 0.49324 | 10.32120 | -1.87453 | 174.928 | 514 | 284 | 365 | 375 |
Sagarin Recent | 0.64089 | 0.49343 | 10.63640 | -0.46798 | 186.269 | 721 | 404 | 526 | 540 |
Dunkel Index | 0.63780 | 0.51232 | 10.43450 | 0.03736 | 180.962 | 685 | 389 | 520 | 495 |
NutShell Sports | 0.60836 | 0.48413 | 11.51040 | -0.13290 | 219.774 | 553 | 356 | 427 | 455 |
print(summary(predictions$`Straight Up`))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6084 0.6486 0.6558 0.6531 0.6607 0.6758
FiveThirtyEight’s accuracy matches the median almost exactly, so neither bad nor good, though the difference between the top and the bottom isn’t a huge margin to begin with.
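As a rough cross-check (rough because the games tracked there won’t line up exactly with the Carmelo schedule), we can see what share of the listed systems FiveThirtyEight’s accuracy matches or beats; the carmelo_accuracy name below is just for this snippet.
# share of tracked systems at or below FiveThirtyEight's straight-up accuracy
carmelo_accuracy <- sum(carmelo2019$correct) / nrow(carmelo2019)
print(mean(predictions$`Straight Up` <= carmelo_accuracy))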
ESPN is near the top (if anyone knows how to get their raw data that would be awesome), so let’s take a look at their predicted standings, plus we can grab the conference affiliations.
# Get ESPN's predicted standings
url <- 'http://www.espn.com/nba/story/_/page/BPI-Playoff-Odds/espn-nba-basketball-power-index-playoff-odds'
xpath <- '//*[@id="article-feed"]/article[1]/div/div[2]/aside[1]/table'
ESPNStandingsEast <- url %>%
read_html() %>%
html_nodes(xpath = xpath) %>%
html_table()
ESPNStandingsEast <- as_tibble(ESPNStandingsEast[[1]]) %>%
mutate(Conference = "East")
xpath <- '//*[@id="article-feed"]/article[1]/div/div[2]/aside[2]/table'
ESPNStandingsWest <- url %>%
read_html() %>%
html_nodes(xpath = xpath) %>%
html_table()
ESPNStandingsWest <- as_tibble(ESPNStandingsWest[[1]]) %>%
mutate(Conference = "West")
ESPNStandings <- bind_rows(ESPNStandingsEast,ESPNStandingsWest)
rm(ESPNStandingsEast,ESPNStandingsWest)
formattable(ESPNStandings)
TEAM | WIN-LOSS | PLAYOFFS | NBA TITLE | Conference |
---|---|---|---|---|
MIL | 62-20 | 100.0% | 43.7% | East |
TOR | 57-25 | 100.0% | 14.2% | East |
PHI | 53-29 | 100.0% | 2.5% | East |
BOS | 48-34 | 100.0% | 2.8% | East |
IND | 48-34 | 100.0% | 1.0% | East |
DET | 42-40 | 95.8% | <0.1% | East |
BKN | 41-41 | 78.1% | <0.1% | East |
ORL | 40-42 | 74.8% | <0.1% | East |
MIA | 40-42 | 46.5% | <0.1% | East |
CHA | 38-44 | 4.8% | 0.0% | East |
WSH | 33-49 | 0.0% | 0.0% | East |
ATL | 28-54 | 0.0% | 0.0% | East |
CHI | 23-59 | 0.0% | 0.0% | East |
CLE | 20-62 | 0.0% | 0.0% | East |
NY | 16-66 | 0.0% | 0.0% | East |
GS | 56-26 | 100.0% | 15.5% | West |
DEN | 55-27 | 100.0% | 5.5% | West |
HOU | 52-30 | 100.0% | 5.1% | West |
POR | 52-30 | 100.0% | 2.4% | West |
UTH | 50-32 | 100.0% | 5.6% | West |
LAC | 49-33 | 100.0% | 0.2% | West |
SA | 48-34 | 100.0% | 0.4% | West |
OKC | 47-35 | 100.0% | 1.1% | West |
SAC | 39-43 | 0.0% | 0.0% | West |
MIN | 37-45 | 0.0% | 0.0% | West |
LAL | 36-46 | 0.0% | 0.0% | West |
NO | 35-47 | 0.0% | 0.0% | West |
MEM | 33-49 | 0.0% | 0.0% | West |
DAL | 32-50 | 0.0% | 0.0% | West |
PHX | 18-64 | 0.0% | 0.0% | West |
There are a few differences between ESPN’s and FiveThirtyEight’s predictions, but this late in the season there isn’t much room for major disagreement.
While we’ve got the ESPN data, we can grab the conference affiliations to sort our standings later.
# compare names between ESPN and Fivethirtyeight
ComparisonTable <- tibble(Fivethirtyeight = sort(unique(carmelo2019$team1)),
ESPN = sort(unique(ESPNStandings$TEAM)))
formattable(ComparisonTable)
Fivethirtyeight | ESPN |
---|---|
ATL | ATL |
BOS | BKN |
BRK | BOS |
CHI | CHA |
CHO | CHI |
CLE | CLE |
DAL | DAL |
DEN | DEN |
DET | DET |
GSW | GS |
HOU | HOU |
IND | IND |
LAC | LAC |
LAL | LAL |
MEM | MEM |
MIA | MIA |
MIL | MIL |
MIN | MIN |
NOP | NO |
NYK | NY |
OKC | OKC |
ORL | ORL |
PHI | PHI |
PHO | PHX |
POR | POR |
SAC | SA |
SAS | SAC |
TOR | TOR |
UTA | UTH |
WAS | WSH |
There’s a bit of a problem: the abbreviations don’t match. It’s probably fastest to fix this manually, so I’ll just give you the corrected lookup table directly.
#manually fix
ComparisonTable <- structure(list(Fivethirtyeight = c("ATL", "BOS", "BRK", "CHI",
"CHO", "CLE", "DAL", "DEN", "DET", "GSW", "HOU", "IND", "LAC",
"LAL", "MEM", "MIA", "MIL", "MIN", "NOP", "NYK", "OKC", "ORL",
"PHI", "PHO", "POR", "SAC", "SAS", "TOR", "UTA", "WAS"), ESPN = c("ATL",
"BOS", "BKN", "CHI", "CHA", "CLE", "DAL", "DEN", "DET", "GS",
"HOU", "IND", "LAC", "LAL", "MEM", "MIA", "MIL", "MIN", "NO",
"NY", "OKC", "ORL", "PHI", "PHX", "POR", "SAC", "SA", "TOR",
"UTH", "WSH")), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -30L), spec = structure(list(cols = list(
Fivethirtyeight = structure(list(), class = c("collector_character",
"collector")), ESPN = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
# add conference
ComparisonTable <- ESPNStandings %>% select(1,5) %>% rename(ESPN = TEAM) %>%
left_join(ComparisonTable, by = "ESPN") %>% select(1,3,2) %>% arrange(ESPN)
formattable(ComparisonTable)
ESPN | Fivethirtyeight | Conference |
---|---|---|
ATL | ATL | East |
BKN | BRK | East |
BOS | BOS | East |
CHA | CHO | East |
CHI | CHI | East |
CLE | CLE | East |
DAL | DAL | West |
DEN | DEN | West |
DET | DET | East |
GS | GSW | West |
HOU | HOU | West |
IND | IND | East |
LAC | LAC | West |
LAL | LAL | West |
MEM | MEM | West |
MIA | MIA | East |
MIL | MIL | East |
MIN | MIN | West |
NO | NOP | West |
NY | NYK | East |
OKC | OKC | West |
ORL | ORL | East |
PHI | PHI | East |
PHX | PHO | West |
POR | POR | West |
SA | SAS | West |
SAC | SAC | West |
TOR | TOR | East |
UTH | UTA | West |
WSH | WAS | East |
Great, now each FiveThirtyEight team code has its conference affiliation attached.
Next, let’s create a function to get the current NBA standings from the FiveThirtyEight data and print the standings.
# Compute current standings from fivethirtyeight data. Make a function
# to keep things cleaner
getStandings <- function(x){
# stack the team1 and team2 results into one long team/win table
x1 <- x %>% select(team1,win1) %>% rename(team = team1, win = win1)
x2 <- x %>% select(team2,win2) %>% rename(team = team2, win = win2) %>%
bind_rows(x1)
# tally wins and losses for each team
x2 <- x2 %>% group_by(team) %>%
summarize(wins = sum(win),
total = n(),
losses = total - wins) %>%
select(team,wins,losses, total)
# attach the conference and sort within conference by wins
x2 <- ComparisonTable %>% rename(team = Fivethirtyeight) %>%
select(team, Conference) %>% left_join(x2, by = "team") %>%
arrange(Conference,desc(wins))
return(x2)
}
CarmeloStandings <- getStandings(carmelo2019)
formattable(CarmeloStandings)
team | Conference | wins | losses | total |
---|---|---|---|---|
MIL | East | 57 | 20 | 77 |
TOR | East | 54 | 23 | 77 |
PHI | East | 49 | 27 | 76 |
BOS | East | 45 | 32 | 77 |
IND | East | 45 | 32 | 77 |
BRK | East | 39 | 38 | 77 |
DET | East | 39 | 37 | 76 |
MIA | East | 38 | 38 | 76 |
ORL | East | 38 | 39 | 77 |
CHO | East | 35 | 40 | 75 |
WAS | East | 31 | 46 | 77 |
ATL | East | 28 | 49 | 77 |
CHI | East | 21 | 56 | 77 |
CLE | East | 19 | 58 | 77 |
NYK | East | 14 | 62 | 76 |
DEN | West | 51 | 24 | 75 |
GSW | West | 51 | 24 | 75 |
HOU | West | 49 | 28 | 77 |
POR | West | 48 | 28 | 76 |
LAC | West | 46 | 31 | 77 |
UTA | West | 46 | 30 | 76 |
OKC | West | 44 | 33 | 77 |
SAS | West | 44 | 32 | 76 |
SAC | West | 37 | 39 | 76 |
LAL | West | 34 | 42 | 76 |
MIN | West | 34 | 42 | 76 |
NOP | West | 32 | 45 | 77 |
MEM | West | 31 | 45 | 76 |
DAL | West | 30 | 46 | 76 |
PHO | West | 17 | 60 | 77 |
Whenever this code is run it should show the updated standings, as long as FiveThirtyEight has updated their data.
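A quick way to check how current the data is: the first column of nba_elo.csv is the game date, and we kept it in the select() above, so the most recent completed game is simply the latest date among games with a score.
# date of the most recent completed game in the downloaded data
print(max(carmelo2019$date))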
Now, let’s project the end of the regular season by giving every remaining game to whichever team the Carmelo model favors.
# take a look at the end of season NBA standings based on predicted future games
CarmeloStandingsPredicted <- carmelodf %>%
filter(is.na(score1)) %>%
mutate(win1 = carmelo_prob1 > carmelo_prob2,
win2 = carmelo_prob2 > carmelo_prob1) %>%
bind_rows(carmelo2019) %>%
getStandings() %>%
group_by(Conference) %>%
mutate(place = order(order(wins, Conference, decreasing=TRUE)))
formattable(CarmeloStandingsPredicted)
team | Conference | wins | losses | total | place |
---|---|---|---|---|---|
MIL | East | 61 | 21 | 82 | 1 |
TOR | East | 59 | 23 | 82 | 2 |
PHI | East | 55 | 27 | 82 | 3 |
BOS | East | 50 | 32 | 82 | 4 |
IND | East | 48 | 34 | 82 | 5 |
DET | East | 43 | 39 | 82 | 6 |
BRK | East | 40 | 42 | 82 | 7 |
ORL | East | 40 | 42 | 82 | 8 |
CHO | East | 38 | 44 | 82 | 9 |
MIA | East | 38 | 44 | 82 | 10 |
WAS | East | 34 | 48 | 82 | 11 |
ATL | East | 28 | 54 | 82 | 12 |
CHI | East | 22 | 60 | 82 | 13 |
CLE | East | 19 | 63 | 82 | 14 |
NYK | East | 15 | 67 | 82 | 15 |
GSW | West | 58 | 24 | 82 | 1 |
DEN | West | 55 | 27 | 82 | 2 |
HOU | West | 53 | 29 | 82 | 3 |
POR | West | 52 | 30 | 82 | 4 |
UTA | West | 52 | 30 | 82 | 5 |
LAC | West | 48 | 34 | 82 | 6 |
OKC | West | 48 | 34 | 82 | 7 |
SAS | West | 48 | 34 | 82 | 8 |
SAC | West | 39 | 43 | 82 | 9 |
MIN | West | 36 | 46 | 82 | 10 |
LAL | West | 34 | 48 | 82 | 11 |
DAL | West | 33 | 49 | 82 | 12 |
NOP | West | 33 | 49 | 82 | 13 |
MEM | West | 32 | 50 | 82 | 14 |
PHO | West | 19 | 63 | 82 | 15 |
Great, we can now see the predicted finishes sorted by conference and wins. These standings don’t account for tiebreakers, but that’s a level of detail I won’t tackle today.
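One aside on the code above: the place column comes from the doubled order() call, which converts a vector of win totals into finishing positions within each conference. A tiny illustration of the trick:
# order(order(...)) turns raw win totals into a 1-through-n finishing place
wins <- c(50, 61, 55, 59)
order(order(wins, decreasing = TRUE))
## [1] 4 1 3 2
The 61-win team lands in first place and the 50-win team in fourth, which is exactly what we want.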
FiveThirtyEight says they run 50,000 simulations for their own predictions. It might seem like that requires a supercomputer, but we can run our own simulation here; it just takes a while. Depending on your computer, 50,000 simulations will take a good chunk of time (mine took around 20 minutes), so in this code I’ll only run 1,000. You can change the number of simulations in the runSim function. The end result is a table showing each team’s average finish, though solely in terms of wins and losses.
# this code simulates the probabilities of victory and end of season rankings
runSim <- function(n){
pb <- txtProgressBar(min = 0, max = n, style = 3)
results <- NULL
for(i in 1:n){
tmp <- carmelodf %>%
filter(is.na(score1)) %>%
# flip a weighted coin for each remaining game using its Carmelo win probability
mutate(win1 = sapply(carmelo_prob1,function(p) rbinom(1,1,p)) == 1,
win2 = win1 == FALSE) %>%
bind_rows(carmelo2019) %>%
getStandings() %>%
group_by(Conference) %>%
mutate(place = order(order(wins, Conference, decreasing=TRUE))) %>%
ungroup() %>%
select(team, place, Conference)
results <- bind_rows(results,tmp)
setTxtProgressBar(pb, i)
}
close(pb)
finish <- results %>% group_by(team, Conference) %>%
summarize(AverageRank = mean(place)) %>%
arrange(Conference,AverageRank)
return(finish)
}
finish <- runSim(1000)
formattable(finish)
team | Conference | AverageRank |
---|---|---|
MIL | East | 1.002 |
TOR | East | 2.000 |
PHI | East | 2.998 |
BOS | East | 4.272 |
IND | East | 4.728 |
DET | East | 6.280 |
BRK | East | 7.285 |
ORL | East | 8.107 |
MIA | East | 8.557 |
CHO | East | 9.771 |
WAS | East | 11.003 |
ATL | East | 11.997 |
CHI | East | 13.014 |
CLE | East | 13.986 |
NYK | East | 15.000 |
GSW | West | 1.144 |
DEN | West | 1.867 |
HOU | West | 3.104 |
POR | West | 4.179 |
UTA | West | 5.012 |
LAC | West | 6.008 |
SAS | West | 7.251 |
OKC | West | 7.435 |
SAC | West | 9.064 |
MIN | West | 10.389 |
LAL | West | 10.803 |
NOP | West | 12.626 |
DAL | West | 12.852 |
MEM | West | 13.266 |
PHO | West | 15.000 |
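If you do want the full 50,000 runs, most of the time in runSim goes into the per-game sapply() draw. Here’s a minimal sketch of a vectorized alternative that draws every remaining game in a single rbinom() call; the oneSim name is just for illustration, and it produces one simulated season the same way the loop body does.
# one simulated season, with the sapply() loop replaced by a single vectorized rbinom() draw
oneSim <- carmelodf %>%
filter(is.na(score1)) %>%
mutate(win1 = rbinom(n(), size = 1, prob = carmelo_prob1) == 1,
win2 = !win1) %>%
bind_rows(carmelo2019) %>%
getStandings()
Swapping those two mutate() lines into runSim should trim the run time noticeably without changing the logic.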
The FiveThirtyEight data goes back to 1947, so there’s a lot of other things we could do with it, but I’ll leave it here for now.