update slides
Tyler Hinrichs committed Feb 21, 2024
1 parent 18493d0 commit 01f1b8d
Showing 5 changed files with 205 additions and 16 deletions.
Binary file added images/document-object-model.png
Binary file added images/get-request.png
Binary file added images/pandas-df.png
137 changes: 129 additions & 8 deletions web-scraping-for-sports-data.html

Large diffs are not rendered by default.

84 changes: 76 additions & 8 deletions web-scraping-for-sports-data.rmd
@@ -35,24 +35,92 @@ output:

# Static Web Scraping

## Static Web Scraping
- Content is static
- All HTML is already loaded in
- We can parse through a static DOM object

**Python Libraries**

- **Requests**: HTTP requests
- **BeautifulSoup**: Parsing HTML content
    - Our main focus with static web scraping
- **Pandas**: Loading the dataset into a DataFrame
    - Used in the next steps of the data science pipeline
    - Useful for further data manipulation and analysis

## Using Requests

- HTTP requests
    - Simplified: a client asks a server to perform an action
    - GET, POST, DELETE, etc.
- Requests is one of the easiest ways to make HTTP requests in Python
- We use a **GET request** to retrieve static HTML content
    - The HTML retrieved is a snapshot of the page at request time
- The response also includes other metadata
    - Status code, content type, etc.

<img src="images/get-request.png" width="350"/>
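
The GET request described above can be sketched as follows. This is a minimal illustration, not the deck's own code: the tiny local server, `DemoHandler`, `PAGE`, and `fetch` are hypothetical names added only so the example runs without internet access. In practice you would pass the URL of the site you want to scrape to `requests.get`.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# Hypothetical page, served locally so the example runs offline.
PAGE = b"<html><body><h1>League Table</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the static PAGE above."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # silence per-request logging


def fetch(url):
    """Send a GET request and return the response object."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response


# Port 0 lets the operating system pick any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

response = fetch(f"http://127.0.0.1:{server.server_port}/")
status_code = response.status_code               # metadata: HTTP status
content_type = response.headers["Content-Type"]  # metadata: content type
html = response.text                             # snapshot of the HTML

server.shutdown()
server.server_close()
```

The `html` string is what gets handed to BeautifulSoup in the next step; `status_code` and `content_type` are examples of the extra metadata the response carries.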

## Using BeautifulSoup

- Output of Requests --> input of BeautifulSoup
- Used to **parse HTML and XML**
- Document Object Model (DOM)
    - Hierarchical structure of HTML
    - We use this hierarchy to access elements

<img src="images/document-object-model.png" width="350"/>
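
As a rough sketch of that hand-off, the snippet below parses an HTML string and walks the resulting hierarchy. The HTML is a made-up stand-in for the `response.text` of a GET request, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for `response.text` from a GET request.
html = """
<html>
  <head><title>League Table</title></head>
  <body>
    <h1>Standings</h1>
    <p class="note">Updated weekly</p>
  </body>
</html>
"""

# Build the parse tree using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string      # text inside <title>
heading = soup.body.h1.get_text()   # walk down the hierarchy: body -> h1
parent_name = soup.h1.parent.name   # walk back up: the parent of <h1>
```

Each attribute access follows one edge of the DOM tree, which is why knowing the page's hierarchy matters before writing the scraper.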

## Accessing Elements with BeautifulSoup
- Pass the HTML into BeautifulSoup with the `html.parser` parser
- `soup_object.tag`: the first element with that tag name
- Navigating down: accessing an element's children
    - `.contents`, `.children`
- Navigating up: accessing an element's parents
    - `.parent`, `.parents`
- `find_all(filters)`, where a filter can be a string, `True`, a function, a list, or a regular expression
    - Returns descendants matching the filters
- `select(css_selector)`: returns elements matching a CSS selector
- For full documentation: https://beautiful-soup-4.readthedocs.io/en/latest/#
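
The access patterns listed above might look like this in practice. The HTML fragment, team names, and class names are invented for illustration; only the BeautifulSoup calls themselves come from the library.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment; the ids, classes, and values are made up.
html = """
<div id="results">
  <ul class="teams">
    <li class="team home">Team A</li>
    <li class="team away">Team B</li>
  </ul>
  <p>Final score: <b>2</b>-<b>1</b></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# soup_object.tag returns the first matching element.
first_item = soup.li
team_list = soup.ul

# Navigating down: .contents is a list, .children is an iterator.
# Both include whitespace text nodes, so keep only real tags.
child_tags = [child.name for child in team_list.children if isinstance(child, Tag)]

# Navigating up: .parent is one level, .parents walks to the document root.
parent_name = first_item.parent.name
ancestor_names = [ancestor.name for ancestor in first_item.parents]

# find_all with each kind of filter.
by_string = soup.find_all("li")                               # tag name
by_list = soup.find_all(["li", "b"])                          # any name in the list
by_regex = soup.find_all(re.compile("^b"))                    # names starting with "b"
by_true = soup.find_all(True)                                 # every tag
by_function = soup.find_all(lambda tag: tag.has_attr("id"))   # custom test

# select takes a CSS selector.
away_teams = [li.get_text() for li in soup.select("ul.teams li.away")]
```

`find_all` and `select` overlap a lot; `select` tends to be more compact when the target is easiest to describe with classes and nesting.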

## Using Pandas

- We will use it only briefly here, but it is extremely important for the next steps
- DataFrame: 2-dimensional data structure that holds data much like an Excel sheet or a SQL table
- We will create one for each of our web-scraped datasets

<img src="images/pandas-df.png" width="400"/>
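
One way to get from parsed HTML to a DataFrame is sketched below. The table, its column names, and its values are placeholders invented for the example, not real results.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical scraped table; teams and numbers are placeholders.
html = """
<table id="standings">
  <tr><th>team</th><th>played</th><th>points</th></tr>
  <tr><td>Team A</td><td>3</td><td>7</td></tr>
  <tr><td>Team B</td><td>3</td><td>4</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", id="standings").find_all("tr")

# The first row holds the column names, the rest hold the data.
columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in rows[1:]
]

df = pd.DataFrame(records, columns=columns)

# Scraped values arrive as strings, so convert the numeric columns.
df[["played", "points"]] = df[["played", "points"]].astype(int)
```

Converting types right after loading keeps later sorting and aggregation from silently treating numbers as text.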

# Dynamic Web Scraping

## Dynamic Web Scraping
- The website may require user interaction
- JavaScript may change elements on the page at runtime in response to user interaction
- We need a way to automate user interaction with the site
- Image below: lazy loading, so not all data is present when the page first loads

<img src="images/pl-results.png" width="450"/>

## Selenium

- Uses the WebDriver API to launch a web browser instance
- Can interact with the web browser dynamically through code
- Can run the browser with its UI shown or hidden (headless)
    - Headless mode can change how some pages behave, but it saves resources
- After interacting with the page, we can retrieve its HTML content and parse it with BeautifulSoup

## Retrieving Data with Selenium
- `find_element()`
    - Pass in `By.METHOD`, where `METHOD` is the type of criterion to search by (e.g. `By.ID`, `By.CSS_SELECTOR`)
- Explicit and implicit waits
- Expected conditions
- ActionChains

## Resources
- requests: https://requests.readthedocs.io/en/latest/
- Beautiful Soup 4: https://beautiful-soup-4.readthedocs.io/en/latest/
- pandas: https://pandas.pydata.org/docs/
- Document Object Model (DOM): https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model
- HTTP: https://www.ibm.com/docs/en/cics-ts/5.3?topic=protocol-http-requests
- Selenium: https://selenium-python.readthedocs.io/
