class: center, middle # Web scraping .class-info[ **Week 13** AEM 2850 / 5850 : R for Business Analytics<br> Cornell Dyson<br> Spring 2025 <!-- Acknowledgements: --> <!-- [Andrew Heiss](https://datavizm20.classes.andrewheiss.com) --> <!-- [Claus Wilke](https://wilkelab.org/SDS375/) --> <!-- [Grant McDermott](https://github.com/uo-ec607/lectures) --> <!-- [Jenny Bryan](https://stat545.com/join-cheatsheet.html), --> <!-- [Allison Horst](https://github.com/allisonhorst/stats-illustrations) --> <!-- [Ed Rubin](https://github.com/edrubin/EC524W22) --> ] --- # Announcements Only two full weeks left! <!-- Group project graded, will discuss at end of class --> Remaining deadlines: - Homework-13 will be due Monday - Homework-14 will be our example in class next Thursday - Prelim 2 on May 6 in class (two weeks from today) <!-- - I will give more guidance on this soon --> Questions before we get started? --- # Plan for today [Web scraping basics](#web-scraping-basics) [Web scraping with rvest](#rvest) - [Cornell sports](#cornell-sports) - [College rankings](#college-rankings) [Group project debrief](#group-project) [Prelim 2](#prelim-2) --- class: inverse, center, middle name: web-scraping-basics # Web scraping basics --- # What is web scraping? -- Getting data or "content" off the web and onto our computers -- We get content off the web all the time! - Copy and paste - Read and take notes - Screenshot -- The goal of web **scraping** is to write computer code to help us automate this process and store the results in a machine-readable format --- # Why would we want to scrape data? When is web scraping useful? -- - When the data is publicly available - When you can't get the data in a more convenient format -- When is web scraping not useful? -- - When data is publicly available in other formats (e.g., csv) - When the site owner offers a way to access data directly (e.g., via an API) -- Web scraping is time consuming and costly (for both you and "them") --- # Server-side vs client-side content ### 1. Server-side - Host server "builds" site and sends HTML code that our browser renders - All the information is embedded in the website's HTML -- ### 2. Client-side - Site contains an empty template of HTML and CSS - When we visit, our browser sends a *request* to the host server - The server sends a *response* script that our browser uses to populate the HTML template with information we want -- **We will focus on server-side web scraping due to time constraints** --- # What is HTML? -- HTML stands for "HyperText Markup Language" and looks like this: ``` r <html> <head> <title>Page title</title> </head> <body> <h1 id='first'>A heading</h1> <p>Some text & <b>some bold text.</b></p> <img src='myimg.png' width='100' height='100'> </body> ``` --- # What is HTML? HTML has a hierarchical structure formed by **elements** that consist of: 1. a start tag - optional attributes 2. contents 3. an end tag ??? Source: [https://rvest.tidyverse.org/articles/rvest.html](https://rvest.tidyverse.org/articles/rvest.html) --- # What is HTML? HTML has a hierarchical structure formed by **elements** that consist of: 1. a start tag (e.g., `<h1>`) - optional attributes (e.g., `id='first'`) 2. contents in between tags (e.g., `A heading`) 3. an end tag (e.g., `</h1>`) ``` r <html> <head> <title>Page title</title> </head> <body> * <h1 id='first'>A heading</h1> <p>Some text & <b>some bold text.</b></p> <img src='myimg.png' width='100' height='100'> </body> ``` ??? 
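A minimal sketch for the notes, assuming rvest (introduced a few slides from now) is loaded: the tag and its attributes are what selectors target, while the contents are what we typically extract.

``` r
library(rvest)

# a sketch: parse the snippet above from a string, then pull out the
# pieces described in the numbered list
page <- read_html("<h1 id='first'>A heading</h1><p>Some text</p>")

page |> html_element("h1") |> html_text2()    # contents: "A heading"
page |> html_element("h1") |> html_attr("id") # attribute value: "first"
```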
Source: [https://rvest.tidyverse.org/articles/rvest.html](https://rvest.tidyverse.org/articles/rvest.html) --- # What is HTML? **Elements** - There are over 100 HTML elements - Google tags to learn about them as needed -- **Contents** - Most elements can have content in between start and end tags - Content can be text or more elements (as **children**) -- **Attributes** - Attributes like `id` and `class` are used with CSS to control page appearance - These attributes are useful for scraping data ??? Source: [https://rvest.tidyverse.org/articles/rvest.html](https://rvest.tidyverse.org/articles/rvest.html) --- # What is CSS? -- CSS stands for **C**ascading **S**tyle **S**heets - Tool for defining visual appearance of HTML -- **CSS selectors** help identify what we want to scrape -- We will learn by example using the extension/bookmarklet [SelectorGadget](https://selectorgadget.com) ??? Source: [https://rvest.tidyverse.org/articles/rvest.html](https://rvest.tidyverse.org/articles/rvest.html) --- class: inverse, center, middle name: rvest # Web scraping with rvest --- # The rvest package [rvest](https://rvest.tidyverse.org/index.html) (as in "harvest") is part of the tidyverse ``` r library(rvest) # installed with tidyverse but needs to be loaded ``` -- We will cover several functions that make it easy to scrape data from web pages: - `read_html` reads HTML, much like `read_csv` reads .csv files - `html_element(s)` find HTML elements using CSS selectors or XPath expressions - `html_text2` retrieves text from HTML elements - `html_table` parses HTML tables into data frames -- Let's learn these commands by working through two examples --- name: cornell-sports # Example 1: Cornell Big Red on Wikipedia How could we scrape a list of varsity sports? .center[ <figure> <a href="https://en.wikipedia.org/wiki/Cornell_Big_Red"> <img src="img/13/big-red.png" width="90%"> </a> </figure> ] ??? Source: https://en.wikipedia.org/wiki/Cornell_Big_Red --- # Option 1: use `dt` tag to get headings <figure> <img src="img/13/big-red-selector-text.png" width="100%"> </figure> ??? Source: https://en.wikipedia.org/wiki/Cornell_Big_Red --- # Scraping text using `dt` tag Step 1: use `read_html()` to read in html from the url of interest ``` r *big_red <- read_html("https://en.wikipedia.org/wiki/Cornell_Big_Red") big_red ``` ``` ## {html_document} ## <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr"> ## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ... ## [2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ... 
``` --- # Scraping text using `dt` tag Step 2: use `html_elements()` to extract every instance of a `dt` tag ``` r big_red <- read_html("https://en.wikipedia.org/wiki/Cornell_Big_Red") big_red |> * html_elements("dt") |> # dt tag is for terms in a description list head(8) ``` ``` ## {xml_nodeset (8)} ## [1] <dt>Baseball</dt> ## [2] <dt>Men's basketball</dt> ## [3] <dt>Women's basketball</dt> ## [4] <dt>Men's cross country</dt> ## [5] <dt>Women's cross country</dt> ## [6] <dt>Women's fencing</dt> ## [7] <dt>Football</dt> ## [8] <dt>Sprint football</dt> ``` --- # Scraping text using `dt` tag Step 3: use `html_text2()` to convert the sports to a character vector ``` r big_red <- read_html("https://en.wikipedia.org/wiki/Cornell_Big_Red") big_red_text <- big_red |> html_elements("dt") |> # dt tag is for terms in a description list * html_text2() # convert html to text head(big_red_text) # looks good! ``` ``` ## [1] "Baseball" "Men's basketball" "Women's basketball" ## [4] "Men's cross country" "Women's cross country" "Women's fencing" ``` -- .less-left[ ``` r length(big_red_text) # hmm... ``` ``` ## [1] 83 ``` ] -- .more-right[ ``` r tail(big_red_text) # uh-oh... ``` ``` ## [1] "WFTDA" "MRDA" "USARL" "NARL" "USAR" "WTT" ``` ] -- That doesn't seem right... --- # What went wrong? -- .less-left[ 1. Got irrelevant data ] .more-right[ <figure> <img src="img/13/big-red-selector-external.png" width="100%"> </figure> ] --- # What went wrong? .less-left[ 1. Got irrelevant data 2. Didn't get relevant data ] .more-right.center[ <figure> <img src="img/13/big-red-selector-other.png" width="65%"> </figure> ] --- # Option 2: use `.wikitable` tag to get table .center[ <figure> <img src="img/13/big-red-selector-table.png" width="80%"> </figure> ] ??? Source: https://en.wikipedia.org/wiki/Cornell_Big_Red --- # Scraping tables using `.wikitable` tag Step 1: use `read_html()` to read in html from the url of interest ``` r *big_red <- read_html("https://en.wikipedia.org/wiki/Cornell_Big_Red") ``` -- Step 2: use `html_element()` to extract the first table element ``` r big_red |> * html_element(".wikitable") # extract the first .wikitable ``` ``` ## {html_node} ## <table class="wikitable" style=""> ## [1] <tbody>\n<tr>\n<th scope="col" style="background-color:#B31B1B;color:#FFF ... 
``` --- # Scraping tables using `.wikitable` tag Step 3: use `html_table()` to convert the table into a data frame ``` r big_red_table <- big_red |> html_element(".wikitable") |> # extract the first .wikitable * html_table() # convert html to a data frame head(big_red_table, 8) ``` ``` ## # A tibble: 8 × 2 ## `Men's sports` `Women's sports` ## <chr> <chr> ## 1 Baseball Basketball ## 2 Basketball Cross country ## 3 Cross country Equestrian ## 4 Football Fencing ## 5 Golf Field hockey ## 6 Ice hockey Gymnastics ## 7 Lacrosse Ice hockey ## 8 Polo Lacrosse ``` --- # Scraped data frames are data frames ``` r tidy_big_red <- big_red_table |> pivot_longer(everything(), names_to = "gender", values_to = "sport") |> filter(sport != "" & !str_detect(sport, "^†")) # remove things that aren't sports tidy_big_red ``` ``` ## # A tibble: 35 × 2 ## gender sport ## <chr> <chr> ## 1 Men's sports Baseball ## 2 Women's sports Basketball ## 3 Men's sports Basketball ## 4 Women's sports Cross country ## 5 Men's sports Cross country ## 6 Women's sports Equestrian ## 7 Men's sports Football ## 8 Women's sports Fencing ## 9 Men's sports Golf ## 10 Women's sports Field hockey ## # ℹ 25 more rows ``` --- # Scraped data frames are data frames What function(s) could we use to determine how many gender category-sport pairs there are in `tidy_big_red`? -- .pull-left[ ``` r tidy_big_red |> count() ``` ``` ## # A tibble: 1 × 1 ## n ## <int> ## 1 35 ``` ] .pull-right[ ``` r tidy_big_red |> nrow() ``` ``` ## [1] 35 ``` ] -- (Or we could have gone back one slide to look at the tibble header...) --- # Scraped data frames are data frames What function(s) could we use to determine how many distinct sports there are in `tidy_big_red`? -- .pull-left[ ``` r tidy_big_red |> distinct(sport) |> count() ``` ``` ## # A tibble: 1 × 1 ## n ## <int> ## 1 25 ``` ] .pull-right[ ``` r tidy_big_red |> select(sport) |> n_distinct() ``` ``` ## [1] 25 ``` ] --- # Scraped data frames are data frames What function could we use to determine how many distinct sports there are for each gender category? -- ``` r tidy_big_red |> count(gender) ``` ``` ## # A tibble: 2 × 2 ## gender n ## <chr> <int> ## 1 Men's sports 17 ## 2 Women's sports 18 ``` --- name: college-rankings # Example 2: College rankings on Wikipedia How could we scrape college rankings? .center[ <figure> <a href="https://web.archive.org/web/20220405170508/https://en.wikipedia.org/wiki/College_and_university_rankings_in_the_United_States"> <img src="img/13/college-rankings.png" width="75%"> </a> </figure> ] .tiny[ *The site has changed over time, so we will scrape an archive from [The Wayback Machine](https://web.archive.org/web/20220405170508/https://en.wikipedia.org/wiki/College_and_university_rankings_in_the_United_States). One of web scraping's many challenges!* ] ???
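A note on the archived link, with a minimal sketch (assuming rvest is loaded): a Wayback Machine snapshot URL embeds a timestamp plus the live URL, so pinning one snapshot keeps the scrape reproducible even after the live page changes.

``` r
# a sketch: build the snapshot URL used on the next slides from its parts
snapshot <- "20220405170508" # snapshot timestamp from the archive link above
live_url <- "https://en.wikipedia.org/wiki/College_and_university_rankings_in_the_United_States"
archive_url <- paste0("https://web.archive.org/web/", snapshot, "/", live_url)

rankings <- read_html(archive_url) # same result as read_html() on the full archive link
```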
Source: https://en.wikipedia.org/wiki/College_and_university_rankings_in_the_United_States#U.S._News_&_World_Report_Best_Colleges_Ranking --- # Use `.wikitable` tag to get the first table ``` r rankings <- read_html("https://web.archive.org/web/20220405170508/https://en.wikipedia.org/wiki/College_and_university_rankings_in_the_United_States") first_table <- rankings |> * html_element(".wikitable") |> # extract the first .wikitable * html_table() # convert html to a data frame first_table ``` ``` ## # A tibble: 21 × 5 ## Top national universit…¹ `2022 rank` `` Top liberal arts col…² `2022 rank` ## <chr> <int> <lgl> <chr> <int> ## 1 Princeton University 1 NA Williams College 1 ## 2 Columbia University 2 NA Amherst College 2 ## 3 Harvard University 2 NA Swarthmore College 3 ## 4 Massachusetts Institute… 2 NA Pomona College 4 ## 5 Yale University 5 NA Wellesley College 5 ## 6 Stanford University 6 NA Bowdoin College 6 ## 7 University of Chicago 6 NA United States Naval A… 6 ## 8 University of Pennsylva… 8 NA Claremont McKenna Col… 8 ## 9 California Institute of… 9 NA Carleton College 9 ## 10 Duke University 9 NA Middlebury College 9 ## # ℹ 11 more rows ## # ℹ abbreviated names: ¹`Top national universities[13]`, ## # ²`Top liberal arts colleges[14]` ``` --- # Scraped data frames are data frames How does Cornell stack up? -- How could we find it within a table with many other schools? -- ``` r first_table |> select(uni = 1, rank = 2) |> # select and rename the first two columns * filter(str_detect(uni, "Cornell")) # use pattern matching to find Cornell ``` ``` ## # A tibble: 1 × 2 ## uni rank ## <chr> <int> ## 1 Cornell University 17 ``` --- # What if CSS selectors match multiple tables? .pull-left[ <figure> <img src="img/13/college-rankings-us-news.png" width="100%"> </figure> ] .pull-right[ <figure> <img src="img/13/college-rankings-parents-dream.png" width="100%"> </figure> ] --- # What if CSS selectors match multiple tables? #### Multiple options: #### 1. Tweak CSS selectors to uniquely identify the element (if possible) #### 2. Scrape all of them, then use familiar R tools to extract data -- Let's try option 2 --- # Scrape all the tables Use `html_elements()` to extract all matching elements ``` r all_tables <- rankings |> * html_elements(".wikitable") |> # extract all the .wikitables html_table() # convert each html table to a data frame ``` -- ``` r class(all_tables) # we get a list of tables ``` ``` ## [1] "list" ``` -- ``` r length(all_tables) # 11 tables, to be exact ``` ``` ## [1] 11 ``` --- # How could we extract individual tables? ``` ## # A tibble: 3 × 2 ## `Top national universities[13]` `2022 rank` ## <chr> <int> ## 1 Princeton University 1 ## 2 Columbia University 2 ## 3 Harvard University 2 ``` ``` ## # A tibble: 3 × 2 ## University `Students' Dream College Ranking` ## <chr> <int> ## 1 Stanford University 1 ## 2 Harvard University 2 ## 3 University of California, Los Angeles 3 ``` ``` ## # A tibble: 3 × 2 ## University `Parents' Dream College Ranking` ## <chr> <int> ## 1 Stanford University 1 ## 2 Princeton University 2 ## 3 Massachusetts Institute of Technology 3 ``` --- # String matching again!
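Below we apply `str_detect()` directly to `all_tables`; that works because each tibble in the list is coerced to one long character string before matching. If that implicit coercion feels fragile, a more explicit alternative is to test each table's column names (a sketch, not what the code below does); purrr and stringr load with the tidyverse.

``` r
# a sketch: check each table's column names for "Parents" instead of
# relying on each tibble being collapsed into one long string
map_lgl(all_tables, \(tbl) any(str_detect(names(tbl), "Parents")))
```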
``` r # use str_detect() to search for tables with "Parents" str_detect(all_tables, "Parents") ``` ``` ## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE ``` -- ``` r # or use str_which() to get the position of matching object(s) str_which(all_tables, "Parents") ``` ``` ## [1] 8 ``` --- # You are fulfilling your parents' dreams ``` r # now extract table(s) with "Parents" # below we use `[]` syntax to extract the matching table(s) # this is because all_tables is a list, not a data frame all_tables[str_detect(all_tables, "Parents")] ``` ``` ## [[1]] ## # A tibble: 10 × 2 *## University `Parents' Dream College Ranking` ## <chr> <int> ## 1 Stanford University 1 ## 2 Princeton University 2 ## 3 Massachusetts Institute of Technology 3 ## 4 Harvard University 4 ## 5 New York University 5 ## 6 University of Pennsylvania 6 ## 7 University of Michigan 7 ## 8 Duke University 8 ## 9 University of California, Los Angeles 9 *## 10 Cornell University 10 ``` --- class: inverse, center, middle name: group-project # Group project --- # Overall feedback Good job! Overall we were pleased with everyone's work This assignment was meant to push you, and it was interesting to see the approaches different groups took --- # Group project highlights Many groups included things like executive summaries, a table of contents, etc. to tie the report together One group went above and beyond by providing a secondary visualization that they thought improved on the one we had asked for AEM 5850 groups produced clear slide decks with key results - Groups rose to the challenge of using our old tools to output something new - One group added some real polish in post-processing - Fun fact: Quarto can use PowerPoint templates --- # Grading Median grade was 90% We posted scores along with feedback on Canvas Please email me, Victor, and Xiaorui if you have any questions about grading - Do this in a single email so we all have access to the same information Questions? --- class: inverse, center, middle name: prelim-2 # Prelim 2 --- # Motivation Use prelims to assess what you have each learned this semester Benefit from using a mix of assessments that provide more signal, less noise Learn from the experience of other programming courses on campus --- # Prelim 2 last year Last year, prelim 2 was a mix of data visualization and programming concepts Roughly half of the points did not require writing any code Examples: 1. "Describe four changes you would make to improve this data visualization without losing any information." 2. "Describe three changes you could make to this visualization to better illustrate the ideas in the text above." --- # Prelim 2 this year This year, prelim 2 will emphasize the same content Format: paper exam, closed-book (-notes, -computer, etc.) It will stress concepts more than syntax (but will test both) We will have multiple question types, including but not limited to: 1. improve this data visualization 2. explain how you would approach a coding task 3. explain what this function call would return 4. explain whether/why this code would fail 5. write code snippets If we ask you to write code snippets, we will provide a function reference sheet --- # Prelim 2 preparation We will provide practice questions next week I will offer extra office hours by appointment The TAs will have regular open office hours on May 2 and May 5 We are working to schedule an extra review session on May 5 in the late afternoon Questions?