Why I like R — web scraping and the NBA playoffs

Learning R has been fun for a number of reasons for me. One is simply that I find programming fun. It’s like solving puzzles for me. Sure there’s a lot of frustration involved, but I find it worthwhile. R has been fun because I love numbers and it’s designed for statistics. Another passion is efficiency. I don’t like to do things by hand if I have to, and I love to automate things. Today I wanted to demonstrate some of the possibilities R has to offer. Instead of something archaeology related, I’ll talk about another passion of mine–sports.

The NBA playoffs are about to start and my beloved Utah Jazz are racing for the number one seed. I frequently check fivethirtyeight’s and ESPN’s playoff predictions. These are great, but they don’t show the odds of ending up in a particular playoff spot. I decided to sit down and see if I could calculate the odds before the game against Golden State ended tonight.

If you’re not interested in the code, then skip to the end, and I’ll tell you the numbers. This isn’t an introduction to R, it’s more me showing some of the capabilities of R. The great thing is, you can adapt this code to see the odds your team hits a certain number of wins.

First I love the tidyverse, which is a collection of packages that make R easier to use and understand. I also like some of the features of the magrittr package that aren’t loaded with the tidyverse packages. Rvest is a package to scrape the web, and lubridate is a great package for working with dates.

library(magrittr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::extract()   masks magrittr::extract()
## x dplyr::filter()    masks stats::filter()
## x dplyr::lag()       masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
library(rvest)
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

I decided to use ESPN’s BPI game odds. These are adjusted with lots of variables and are decently accurate.

First, I grabbed the Utah Jazz schedule.

There’s only one table on the page so it wasn’t hard to access, but I did have to clean up the data a little bit due to a postponed game and an extra header row. I originally filtered those values here, but I needed the original values for the next section.

Next, I had to get the links to each remaining game so I could pull the odds of winning.

The %<>% function is one of my favorites as it takes whatever is on the left, uses it in the function to the right and also assigns the result of the function back to that variable. Thus I can save the typing for schedule = schedule %>% and it makes it easy to test code before saving it to a variable by just using %>% for the test and then adding in the last < symbol once I have it right.

I won’t describe what xpaths are in detail, but they can be used to identify specific elements on a page. The table rows match up to the xpath so I can get all of the links to the individual games by just changing what is essentially the row number. This was easier to do before the game started, but afterwards the link disappeared, so I had to find a workaround that works more consistently.

url = "https://www.espn.com/nba/team/schedule/_/name/utah" %>% 
  read_html()

schedule = url %>%
  html_node("table") %>%
  html_table(header = T)

gameIDs = map(1:nrow(schedule),~{
  xpath = paste0('//*[@id="fittPageContainer"]/div[2]/div[5]/div/div/section/div/section/section/div/div/div/div[2]/table/tbody/tr[',.x+1,']/td[3]/span/a')
url %>% 
  html_elements(xpath = xpath) %>% 
  html_attr("href")
})
schedule %<>%
  mutate(gameID = gameIDs) %>% 
  filter(DATE != 'DATE',RESULT != 'Postponed')

Next, I identified the remaining games. There are a few ways to do this, but I decided to get complicated and convert the date in the schedule table to a real date so I could filter for games today and later. It might be better to just filter for games that are not completed, but this way shows how R can be used for time series.

schedule %<>% 
  mutate_at(vars(DATE),list(~.x %>% 
                              str_remove_all("^.*?,") %>%
                              trimws %>% 
                              parse_date_time("Om d"))) %>% 
  mutate(DATE = case_when(month(DATE) == 12~`year<-`(DATE, 2020),
                          TRUE~`year<-`(DATE, 2021)))

remaining = schedule %>% 
  filter(DATE >= today())

Next, I used a purrr map function to go through each game link and get the odds of winning for the home team. I then calculated the Jazz odds of winning by determining who was the home team and inverting the odds if necessary. Because I ran this during a game the game link was missing so I used the invalidate function from gtools to add in the odds for tonight. I haven’t used this function much but it seems a good catchall for values that are problematic (e.g., null,NA, or empty values).

remaining %<>% 
  mutate(HomePred = map_chr(gameID,~{
    if(gtools::invalid(.x)){
      result = '49%' # this was the odds pregame for Jazz vs Warriors
    } else {
    xpath = 
      '//*[@id="gamepackage-predictor"]/div/div/div[1]/div[1]/div/div/span[1]'
    result = .x %>% read_html() %>% html_node(xpath = xpath) %>% html_text()
    }
    return(result)
  })) %>% 
  mutate_at(vars(HomePred),list(~.x %>%
                                  sub("%","",.) %>%
                                  as.double())) %>% 
  mutate(WinPer = case_when(str_detect(OPPONENT,"@")~(100-HomePred)/100,
                            TRUE~HomePred/100))

Last, I created a function to simulate the remaining wins using 1 as a win and zero as a loss and replicated that function 10,000 times. The simulation went fairly quickly on my computer, and I used the nice and simple tictoc package to show the time elapsed.

predictWins = function(probs = remaining$WinPer){
  map_int(probs,~{
    sample(1:0,1,replace = T,prob = c(.x,1-.x)) %>% sum
  }) %>% sum
}

tictoc::tic()
sims = replicate(10000,predictWins(remaining$WinPer))
tictoc::toc()
## 1.14 sec elapsed

With these results I could look at the odds Utah wins its remaining games.

prop.table(table(sims))
## sims
##      0      1      2      3      4 
## 0.0085 0.0873 0.2913 0.4208 0.1921

The Phoenix Suns are closing in on the Jazz. I could run the above code again but substituting the Phoenix Suns schedule url for the Jazz url. As a general rule, copying and pasting code is a bad idea. If I have to fix something in the code, then I have to fix it in multiple places. Instead I can turn everything I did into a function.

predictRemaining = function(team){
  
url = glue::glue("https://www.espn.com/nba/team/schedule/_/name/{team}") %>% 
  read_html()

schedule = url %>%
  html_node("table") %>%
  html_table(header = T)

gameIDs = map(1:nrow(schedule),~{
  xpath = paste0('//*[@id="fittPageContainer"]/div[2]/div[5]/div/div/section/div/section/section/div/div/div/div[2]/table/tbody/tr[',.x+1,']/td[3]/span/a')
url %>% 
  html_elements(xpath = xpath) %>% 
  html_attr("href")
})
schedule %<>%
  mutate(gameID = gameIDs) %>% 
  filter(DATE != 'DATE',RESULT != 'Postponed')

schedule %<>% 
  mutate_at(vars(DATE),list(~.x %>% 
                              str_remove_all("^.*?,") %>%
                              trimws %>% 
                              parse_date_time("Om d"))) %>% 
  mutate(DATE = case_when(month(DATE) == 12~`year<-`(DATE, 2020),
                          TRUE~`year<-`(DATE, 2021)))

remaining = schedule %>% 
  filter(DATE >= today())

remaining %<>% 
  mutate(HomePred = map_chr(gameID,~{
    if(gtools::invalid(.x)){
      result = '49%' # this was the odds pregame for Jazz vs Warriors
    } else {
    xpath = 
      '//*[@id="gamepackage-predictor"]/div/div/div[1]/div[1]/div/div/span[1]'
    result = .x %>% read_html() %>% html_node(xpath = xpath) %>% html_text()
    }
    return(result)
  })) %>% 
  mutate_at(vars(HomePred),list(~.x %>%
                                  sub("%","",.) %>%
                                  as.double())) %>% 
  mutate(WinPer = case_when(str_detect(OPPONENT,"@")~(100-HomePred)/100,
                            TRUE~HomePred/100))

predictWins = function(probs = remaining$WinPer){
  map_int(probs,~{
    sample(1:0,1,replace = T,prob = c(.x,1-.x)) %>% sum
  }) %>% sum
}

sims = replicate(10000,predictWins(remaining$WinPer))
  
  return(list(schedule = schedule, remaining = remaining, sims = sims))
}

With this function I can simplify the code and compare the results.

utah = predictRemaining("utah")
sum(startsWith(utah$schedule$RESULT,"W"))
## [1] 50
prop.table(table(utah$sims))
## 
##      0      1      2      3      4 
## 0.0087 0.0856 0.3047 0.4127 0.1883
phoenix = predictRemaining("phoenix")
sum(startsWith(phoenix$schedule$RESULT,"W"))
## [1] 48
prop.table(table(phoenix$sims))
## 
##      0      1      2      3      4 
## 0.0094 0.0984 0.3198 0.4024 0.1700

If Utah wins zero more games then the Suns have a 0.89 chance of getting the one seed.

If Utah wins one more game then the Suns have a 0.57 chance of getting the one seed.

If Utah wins two more games then the Suns have a 0.17 chance of getting the one seed.

The odds of Utah winning two or more games is 0.91 chance of getting the one seed.

I can calculate the total odds of the Jazz getting the number one seed like by adding two to the Jazz total as that is the current lead. The Suns have the tiebreaker though.

adj = utah$sims + 2
round(1 - sum(phoenix$sims >= adj) / length(phoenix$sims),2)
## [1] 0.89

The result is that the Utah Jazz have a 0.89 chance of getting the number one seed. This is pretty close to what I found elsewhere and good news personally.

The great thing is I can rerun this code whenever I want and it should work. Of course, that doesn’t always work as intended but at least I learn some new skills every time I encounter a problem. I’ve used R for a few years now, but I still visited eleven different stack overflow questions just to write this.

Written on May 10, 2021