class: center, middle # Web scraping .class-info[ **Week 13** AEM 2850 / 5850 : R for Business Analytics<br> Cornell Dyson<br> Fall 2025 ] --- # Announcements This is our last full week of class! 😢 <!-- Group project graded, will discuss at end of class --> **Homework - Week 13** will be due Monday **Homework - Week 14** will not exist due to time constraints - But next Tuesday's material is still fair game for Prelim 2! **Prelim 2** is December 4 at 7:30pm - I will provide more information at the end of class today - I will post practice problems before Thanksgiving Break Questions before we get started? --- # Plan for today [Web scraping basics](#web-scraping-basics) [Web scraping with rvest](#rvest) - [Cornell sports](#cornell-sports) - [College rankings](#college-rankings) [robots.txt](#robots-txt) <!-- [Group project debrief](#group-project) --> [Prelim 2](#prelim-2) --- class: inverse, center, middle name: web-scraping-basics # Web scraping basics --- # What is web scraping? -- Getting data or "content" off the web and onto our computers -- We get content off the web all the time! - Copy and paste - Read and take notes - Screenshot -- The goal of web **scraping** is to write computer code to help us automate this process and store the results in a machine-readable format --- # Why would we want to scrape data? When is web scraping useful? -- - When the data is publicly available - When you can't get the data in a more convenient format -- When is web scraping not useful? -- - When data is publicly available in other formats (e.g., csv) - When the site owner offers a way to access data directly (e.g., via an API) -- Web scraping is time consuming and costly (for both you and "them") --- # Server-side vs client-side content ### 1. Server-side - Host server "builds" site and sends HTML code that our browser renders - All the information is embedded in the website's HTML -- ### 2. 
Client-side - Site contains an empty template of HTML and CSS - When we visit, our browser sends a *request* to the host server - The server sends a *response* script that our browser uses to populate the HTML template with information we want -- **We will focus on server-side web scraping due to time constraints** --- # What is HTML? -- HTML stands for "HyperText Markup Language" and looks like this: ``` html <html> <head> <title>Page title</title> </head> <body> <h1 id='first'>A heading</h1> <p>Some text & <b>some bold text.</b></p> <img src='myimg.png' width='100' height='100'> </body> ``` --- # What is HTML? HTML has a hierarchical structure formed by **elements** that consist of: 1. a start tag - optional attributes 2. contents 3. an end tag ??? Source: [https://rvest.tidyverse.org/articles/rvest.html](https://rvest.tidyverse.org/articles/rvest.html) --- # What is HTML? HTML has a hierarchical structure formed by **elements** that consist of: 1. a start tag (e.g., `<h1>`) - optional attributes (e.g., `id='first'`) 2. contents in between tags (e.g., `A heading`) 3. an end tag (e.g., `</h1>`) ``` html <html> <head> <title>Page title</title> </head> <body> * <h1 id='first'>A heading</h1> <p>Some text & <b>some bold text.</b></p> <img src='myimg.png' width='100' height='100'> </body> ``` ??? Source: [https://rvest.tidyverse.org/articles/rvest.html](https://rvest.tidyverse.org/articles/rvest.html) --- # What is HTML? **Elements** - There are over 100 HTML elements - Look up tags to learn about them as needed -- **Contents** - Most elements can have content in between start and end tags - Content can be text or more elements (as **children**) -- **Attributes** - Attributes like `id` and `class` are used with CSS to control page appearance - These attributes are useful for scraping data ??? Source: [https://rvest.tidyverse.org/articles/rvest.html](https://rvest.tidyverse.org/articles/rvest.html) --- # What is CSS? 
-- CSS stands for **C**ascading **S**tyle **S**heets - Tool for defining visual appearance of HTML **CSS selectors** help identify what we want to scrape We will learn by example using the extension/bookmarklet [SelectorGadget](https://selectorgadget.com) ??? Source: [https://rvest.tidyverse.org/articles/rvest.html](https://rvest.tidyverse.org/articles/rvest.html) --- class: inverse, center, middle name: rvest # Web scraping with rvest --- # The rvest package [rvest](https://rvest.tidyverse.org/index.html) (as in "harvest") is part of the tidyverse ``` r library(rvest) # installed with tidyverse but needs to be loaded ``` -- We will cover several functions that make it easy to scrape data from web pages: - `read_html` reads HTML, much like `read_csv` reads .csv files - `html_element(s)` find HTML elements using CSS selectors or XPath expressions - `html_text2` retrieves text from HTML elements - `html_table` parses HTML tables into data frames -- Let's learn these commands by working through two examples --- name: cornell-sports # Example 1: Cornell Big Red on Wikipedia How could we scrape a list of varsity sports? .center[ <figure> <a href="https://en.wikipedia.org/wiki/Cornell_Big_Red"> <img src="img/13/big-red.png" width="90%"> </a> </figure> ] ??? Source: https://en.wikipedia.org/wiki/Cornell_Big_Red --- # Option 1: use `dt` tag to get headings <figure> <img src="img/13/big-red-selector-text.png" width="100%"> </figure> ??? 
Source: https://en.wikipedia.org/wiki/Cornell_Big_Red --- # Scraping text using `dt` tag Step 1: use `read_html()` to read in html from the url of interest ``` r *big_red <- read_html("https://en.wikipedia.org/wiki/Cornell_Big_Red") big_red ``` ``` ## {html_document} ## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr"> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ... ## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ... ``` --- # Scraping text using `dt` tag Step 2: use `html_elements()` to extract every instance of a `dt` tag ``` r big_red <- read_html("https://en.wikipedia.org/wiki/Cornell_Big_Red") big_red |> * html_elements("dt") |> # dt tag is for terms in a description list head(8) ``` ``` ## {xml_nodeset (8)} ## [1] <dt>Baseball</dt> ## [2] <dt>Men's basketball</dt> ## [3] <dt>Women's basketball</dt> ## [4] <dt>Men's cross country</dt> ## [5] <dt>Women's cross country</dt> ## [6] <dt>Women's fencing</dt> ## [7] <dt>Football</dt> ## [8] <dt>Sprint football</dt> ``` --- # Scraping text using `dt` tag Step 3: use `html_text2()` to convert the sports to a character vector ``` r big_red <- read_html("https://en.wikipedia.org/wiki/Cornell_Big_Red") big_red_text <- big_red |> html_elements("dt") |> # dt tag is for terms in a description list * html_text2() # convert html to text head(big_red_text) # looks good! 
``` ``` ## [1] "Baseball" "Men's basketball" "Women's basketball" ## [4] "Men's cross country" "Women's cross country" "Women's fencing" ``` -- .less-left[ ``` r length(big_red_text) # hmm... ``` ``` ## [1] 78 ``` ] -- .more-right[ ``` r tail(big_red_text) # uh-oh... ``` ``` ## [1] "WFTDA" "MRDA" "USARL" "NARL" "USAR" "WTT" ``` ] -- That doesn't seem right... --- # What went wrong? -- .less-left[ 1. Got irrelevant data ] .more-right[ <figure> <img src="img/13/big-red-selector-external.png" width="100%"> </figure> ] --- # What went wrong? .less-left[ 1. Got irrelevant data 2. Didn't get relevant data ] .more-right.center[ <figure> <img src="img/13/big-red-selector-other.png" width="65%"> </figure> ] --- # Option 2: use `.wikitable` class to get table .center[ <figure> <img src="img/13/big-red-selector-table.png" width="80%"> </figure> ] ??? Source: https://en.wikipedia.org/wiki/Cornell_Big_Red --- # Scraping tables using `.wikitable` class Step 1: use `read_html()` to read in html from the url of interest ``` r *big_red <- read_html("https://en.wikipedia.org/wiki/Cornell_Big_Red") ``` -- Step 2: use `html_element()` to extract the first table element ``` r big_red |> * html_element(".wikitable") # extract the first .wikitable ``` ``` ## {html_node} ## <table class="wikitable" style=""> ## [1] <tbody>\n<tr>\n<th scope="col" style="background-color:#B31B1B;color:#FFF ... 
``` --- # Scraping tables using `.wikitable` class Step 3: use `html_table()` to convert the table into a data frame ``` r big_red_table <- big_red |> html_element(".wikitable") |> # extract the first .wikitable * html_table() # convert html to a data frame head(big_red_table, 8) ``` ``` ## # A tibble: 8 × 2 ## `Men's sports` `Women's sports` ## <chr> <chr> ## 1 Baseball Basketball ## 2 Basketball Cross country ## 3 Cross country Equestrian ## 4 Football Fencing ## 5 Golf Field hockey ## 6 Ice hockey Gymnastics ## 7 Lacrosse Ice hockey ## 8 Polo Lacrosse ``` --- # Scraped data frames are data frames ``` r tidy_big_red <- big_red_table |> pivot_longer(everything(), names_to = "gender", values_to = "sport") |> filter(sport != "" & !str_detect(sport, "^†")) # remove things that aren't sports tidy_big_red ``` ``` ## # A tibble: 35 × 2 ## gender sport ## <chr> <chr> ## 1 Men's sports Baseball ## 2 Women's sports Basketball ## 3 Men's sports Basketball ## 4 Women's sports Cross country ## 5 Men's sports Cross country ## 6 Women's sports Equestrian ## 7 Men's sports Football ## 8 Women's sports Fencing ## 9 Men's sports Golf ## 10 Women's sports Field hockey ## # ℹ 25 more rows ``` --- # Scraped data frames are data frames What function(s) could we use to determine how many gender category-sport pairs there are in `tidy_big_red`? -- .pull-left[ ``` r tidy_big_red |> count() ``` ``` ## # A tibble: 1 × 1 ## n ## <int> ## 1 35 ``` ] .pull-right[ ``` r tidy_big_red |> nrow() ``` ``` ## [1] 35 ``` ] -- (Or we could have gone back one slide to look at the tibble header...) --- # Scraped data frames are data frames What function(s) could we use to determine how many distinct sports there are in `tidy_big_red`? 
-- .pull-left[ ``` r tidy_big_red |> distinct(sport) |> count() ``` ``` ## # A tibble: 1 × 1 ## n ## <int> ## 1 25 ``` ] .pull-right[ ``` r tidy_big_red |> select(sport) |> n_distinct() ``` ``` ## [1] 25 ``` ] --- # Scraped data frames are data frames What function could we use to determine how many distinct sports there are for each gender category? -- ``` r tidy_big_red |> count(gender) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <chr> <int> ## 1 Men's sports 17 ## 2 Women's sports 18 ``` --- name: college-rankings # Example 2: College rankings on Wikipedia How could we scrape college rankings? .center[ <figure> <a href="https://web.archive.org/web/20220405170508/https://en.wikipedia.org/wiki/College_and_university_rankings_in_the_United_States"> <img src="img/13/college-rankings.png" width="75%"> </a> </figure> ] .tiny[ *The site has changed over time, so we will scrape an archive from [The Wayback Machine](https://web.archive.org/web/20220405170508/https://en.wikipedia.org/wiki/College_and_university_rankings_in_the_United_States). One of web scraping's many challenges!* ] ??? 
Source: https://en.wikipedia.org/wiki/College_and_university_rankings_in_the_United_States#U.S._News_&_World_Report_Best_Colleges_Ranking --- # Use `.wikitable` class to get the first table ``` r rankings <- read_html("https://web.archive.org/web/20220405170508/https://en.wikipedia.org/wiki/College_and_university_rankings_in_the_United_States") first_table <- rankings |> * html_element(".wikitable") |> # extract the first .wikitable * html_table() # convert html to a data frame first_table ``` ``` ## # A tibble: 21 × 5 ## Top national universit…¹ `2022 rank` `` Top liberal arts col…² `2022 rank` ## <chr> <int> <lgl> <chr> <int> ## 1 Princeton University 1 NA Williams College 1 ## 2 Columbia University 2 NA Amherst College 2 ## 3 Harvard University 2 NA Swarthmore College 3 ## 4 Massachusetts Institute… 2 NA Pomona College 4 ## 5 Yale University 5 NA Wellesley College 5 ## 6 Stanford University 6 NA Bowdoin College 6 ## 7 University of Chicago 6 NA United States Naval A… 6 ## 8 University of Pennsylva… 8 NA Claremont McKenna Col… 8 ## 9 California Institute of… 9 NA Carleton College 9 ## 10 Duke University 9 NA Middlebury College 9 ## # ℹ 11 more rows ## # ℹ abbreviated names: ¹​`Top national universities[13]`, ## # ²​`Top liberal arts colleges[14]` ``` --- # Scraped data frames are data frames How does Cornell stack up? -- How could we find it within a table with many other schools? -- ``` r first_table |> select(uni = 1, rank = 2) |> # select and rename the first two columns * filter(str_detect(uni, "Cornell")) # use pattern matching to find Cornell ``` ``` ## # A tibble: 1 × 2 ## uni rank ## <chr> <int> ## 1 Cornell University 17 ``` --- # What if CSS selectors match multiple tables? .pull-left[ <figure> <img src="img/13/college-rankings-us-news.png" width="100%"> </figure> ] .pull-right[ <figure> <img src="img/13/college-rankings-parents-dream.png" width="100%"> </figure> ] --- # What if CSS selectors match multiple tables? #### Multiple options: #### 1. 
Tweak CSS selectors to uniquely identify the element (if possible) #### 2. Scrape all of them, then use familiar R tools to extract data -- Let's try option 2 --- # Scrape all the tables Use `html_elements()` to extract all matching elements ``` r all_tables <- rankings |> * html_elements(".wikitable") |> # extract all the .wikitables html_table() # convert each table to a data frame ``` -- ``` r class(all_tables) # we get a list of tables ``` ``` ## [1] "list" ``` -- ``` r length(all_tables) # 11 tables, to be exact ``` ``` ## [1] 11 ``` --- # How could we extract individual tables? ``` ## # A tibble: 3 × 2 ## `Top national universities[13]` `2022 rank` ## <chr> <int> ## 1 Princeton University 1 ## 2 Columbia University 2 ## 3 Harvard University 2 ``` ``` ## # A tibble: 3 × 2 ## University `Students' Dream College Ranking` ## <chr> <int> ## 1 Stanford University 1 ## 2 Harvard University 2 ## 3 University of California, Los Angeles 3 ``` ``` ## # A tibble: 3 × 2 ## University `Parents' Dream College Ranking` ## <chr> <int> ## 1 Stanford University 1 ## 2 Princeton University 2 ## 3 Massachusetts Institute of Technology 3 ``` --- # String matching again! 
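Recall that `str_detect()` returns one `TRUE`/`FALSE` per element of a character vector, flagging whether the pattern appears anywhere in each string. A quick refresher on a plain character vector (the two strings below are made-up examples, not scraped data):

``` r
library(stringr)

# one logical per element; TRUE if the pattern appears anywhere in the string
str_detect(c("Parents' Dream College Ranking", "Top national universities"), "Parents")
# [1]  TRUE FALSE
```

The same idea applies to our list of scraped tables: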
``` r # use str_detect() to search for tables with "Parents" str_detect(all_tables, "Parents") ``` ``` ## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE ``` -- ``` r # or use str_which() to get position of matching object(s) str_which(all_tables, "Parents") ``` ``` ## [1] 8 ``` --- # You are fulfilling your parents' dreams ``` r # now extract table(s) with "Parents" # below we use `[]` syntax to extract the matching table(s) # this is because all_tables is a list, not a data frame all_tables[str_detect(all_tables, "Parents")] ``` ``` ## [[1]] ## # A tibble: 10 × 2 *## University `Parents' Dream College Ranking` ## <chr> <int> ## 1 Stanford University 1 ## 2 Princeton University 2 ## 3 Massachusetts Institute of Technology 3 ## 4 Harvard University 4 ## 5 New York University 5 ## 6 University of Pennsylvania 6 ## 7 University of Michigan 7 ## 8 Duke University 8 ## 9 University of California, Los Angeles 9 *## 10 Cornell University 10 ``` --- class: inverse, center, middle name: robots-txt # robots.txt --- # robots.txt So far we have scraped data from Wikipedia without much discussion of legality or ethics We acknowledged that scraping is costly to "us" and "them" But how do we know what we "should" and "should not" scrape? --- # The Robots Exclusion Protocol The Robots Exclusion Protocol is a *voluntary* standard for web scraping Site owners provide guidance via `robots.txt` files Let's take a look at some examples --- # Does Reddit want us to scrape data? Reddit's [robots.txt](https://www.reddit.com/robots.txt/) says: .center[ <figure> <a href="https://www.reddit.com/robots.txt/"> <img src="img/13/reddit-robots-txt.png" width="100%"> </a> </figure> ] ??? Source: https://www.reddit.com/robots.txt/ --- # Does Wikipedia want us to scrape data? https://en.wikipedia.org/robots.txt For users like us (`User-agent: *`), they say that: > "Friendly, low-speed bots are welcome viewing article pages, but not dynamically-generated pages please." 
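--- # Checking robots.txt from R

A `robots.txt` file is just plain text, so we can inspect it with tools we already know. A minimal sketch: the three rules below are a made-up miniature file for illustration, but swapping in `readLines("https://en.wikipedia.org/robots.txt")` would pull the real thing

``` r
library(stringr)

# a made-up, three-line robots.txt to illustrate the format
robots <- c(
  "User-agent: *",
  "Disallow: /private/",
  "Allow: /public/"
)

# lines starting with "Disallow" list paths the owner asks us not to scrape
str_subset(robots, "^Disallow")
# [1] "Disallow: /private/"
```

(The robotstxt package, a separate install, can automate checks like this)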
--- # Does Yahoo Finance want us to scrape data? https://finance.yahoo.com/robots.txt For scraping bots like `anthropic-ai`, they disallow the entire site For others, like us, they disallow specific pages We can safely scrape content from other pages within the site --- class: inverse, center, middle name: prelim-2 # Prelim 2 --- # Prelim 2 overview Prelim 2 will cover material from **Weeks 7 through 14** Format: paper exam, closed-book (-notes, -computer, etc.) It will stress concepts more than syntax (but will test both) We will have multiple question types, including but not limited to: 1. improve this data visualization 2. explain how you would approach a coding task 3. explain what this function call would return 4. explain whether/why this code would fail 5. write code snippets --- # Prelim 2 preparation I will provide practice questions before Thanksgiving Break We will host extra office hours leading up to the test: - Monday, Dec 1: the TAs will have regular open office hours - Tuesday, Dec 2: review session in class - Thursday, Dec 4: no class (optional TA office hours) Questions?
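--- # Appendix: Example 1 in one chunk

For review while studying: the pieces of Example 1, assembled into a single pipeline. This restates code from earlier slides and assumes the Wikipedia page still has the structure it had in class

``` r
library(tidyverse)
library(rvest)

# scrape Cornell's varsity sports table and tidy it, end to end
read_html("https://en.wikipedia.org/wiki/Cornell_Big_Red") |>
  html_element(".wikitable") |>  # extract the first .wikitable
  html_table() |>                # convert html to a data frame
  pivot_longer(everything(), names_to = "gender", values_to = "sport") |>
  filter(sport != "" & !str_detect(sport, "^†")) # remove things that aren't sports
```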