Commit

update dynamic and static notebooks
Tyler Hinrichs committed Feb 21, 2024
1 parent f0b6f3f commit bdb7a3d
Showing 4 changed files with 398 additions and 243 deletions.
133 changes: 104 additions & 29 deletions dynamic_soccer_data.ipynb
@@ -1,5 +1,29 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dynamic Web Scraping for Sports Data\n",
"In this notebook, we will expand on the concepts used in the static web scraping notebook. We will still be using the Premier League's official site, this time scraping all of the current game results for this season. We have some added complexity this time that makes dynamic web scraping necessary to have access to the data we want.\n",
"\n",
"The site we are working with is shown below. This page uses lazy loading, a principle that ensures that content is only loaded when a user requests it. In this case, \"requesting\" the data means scrolling down enough so that the end of the currently loaded content is reached. However, we want all the content in our dataset, and we will automate web actions with Selenium to access it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/pl-results.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The library we will be using is Selenium. This library is great with web automation, making it an option for many use cases, from anything such as web scraping (our use case) to website testing. Install Selenium below."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -9,6 +33,15 @@
"%pip install selenium"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The key difference that sets Selenium apart from BeautifulSoup is the static vs. dynamic nature of the web scraping we are going to be doing. With BeautifulSoup, we retrieve HTML content from a page at a snapshot in time that we pass into a BeautifulSoup object, where it is then parsed. After we make the get request, that HTML object is not going to automatically change; it's static. On the other hand, with Selenium, we use a webdriver to start an instance of a web browser. This API gives us dynamic access to the browser, meaning that we can interact with the site as a human might in realtime. In our use case, we will be approaching a simple but crucial problem: our data will not all load until we click away several elements and scroll down on the page. Once we do this, we can simply retrieve the HTML of the page, and then use BeautifulSoup as we did in the last example.\n",
"\n",
"We will be using the webdriver and Service APIs from Selenium to start. "
]
},
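{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the static vs. dynamic distinction concrete, below is a minimal, hypothetical sketch (not part of the original notebook): the requests/BeautifulSoup approach captures one frozen snapshot of the HTML, while a Selenium driver stays attached to a live page whose DOM can keep changing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical comparison sketch -- illustrative only, not the notebook's own code\n",
"import requests\n",
"from bs4 import BeautifulSoup\n",
"from selenium import webdriver\n",
"\n",
"url = \"https://www.premierleague.com/results\"\n",
"\n",
"# Static: one snapshot of the HTML at request time; it never changes afterward\n",
"snapshot = BeautifulSoup(requests.get(url).text, \"html.parser\")\n",
"\n",
"# Dynamic: a live browser session we can keep interacting with\n",
"driver = webdriver.Chrome()\n",
"driver.get(url)\n",
"live_html = driver.page_source  # reflects the DOM right now, re-readable after interactions\n",
"driver.quit()"
]
},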
{
"cell_type": "code",
"execution_count": 2,
@@ -19,6 +52,13 @@
"from selenium.webdriver.chrome.service import Service"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will be using the Premier League's results page to make a simple dataset that has information on every match in the leauge so far this season."
]
},
{
"cell_type": "code",
"execution_count": 3,
@@ -28,6 +68,32 @@
"results_url = \"https://www.premierleague.com/results\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function below creates a webdriver, which gives us a programmatic entrypoint into the web browser. \n",
"\n",
"The code below uses Chrome, though other browsers can be used (for more information: https://selenium-python.readthedocs.io/installation.html#drivers)\n",
"\n",
"*Note: in order to run the code, your ChromeDriver version must be identical to the version of Chrome installed on your device.*\n",
"\n",
"To download ChromeDriver:\n",
"- Chrome version <=114>: https://chromedriver.chromium.org/downloads\n",
"- Chrome version >114: https://googlechromelabs.github.io/chrome-for-testing/\n",
"\n",
"In the repository, I currently have ChromeDriver version 121.X.XXXX.XX."
]
},
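{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check that your Chrome and ChromeDriver versions line up, you can inspect the capabilities a running driver reports. This is a hypothetical sketch, not part of the original notebook; it assumes chromedriver.exe sits in the project root."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical version sanity check -- assumes chromedriver.exe is in the project root\n",
"from selenium import webdriver\n",
"from selenium.webdriver.chrome.service import Service\n",
"\n",
"driver = webdriver.Chrome(service=Service(executable_path=\"./chromedriver.exe\"))\n",
"print(\"Browser version:\", driver.capabilities[\"browserVersion\"])\n",
"print(\"Driver version: \", driver.capabilities[\"chrome\"][\"chromedriverVersion\"])\n",
"driver.quit()"
]
},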
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This function returns a webdriver.Chrome instance using the ./chromedriver.exe path; therefore, place your chromedriver.exe file in the base directory of this project. An example chromedriver.exe file is already present, but ensure you use the correct version to avoid errors.\n",
"\n",
"We also have the concept of headless below: currently, when we run our webdriver, it will open with a visible and interactive version of the browser. Headless allows the chrome instance to run without a UI. This can cause some differences in functionality at times, and in this case we will keep headless off. However, it is a useful feature to allow web automation tasks to run in the background, and to save resources."
]
},
{
"cell_type": "code",
"execution_count": 4,
@@ -47,6 +113,13 @@
" return webdriver.Chrome(service=Service(executable_path=chromedriver_path), options=options)"
]
},
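{
"cell_type": "markdown",
"metadata": {},
"source": [
"The diff folds most of the cell above away, so here is a minimal sketch of what such a helper might look like. The headless parameter and the exact option flag are assumptions, not the notebook's verbatim code; only the return line is visible in the diff."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of a driver factory -- details here are assumptions, not verbatim code\n",
"# (relies on the webdriver and Service imports made earlier in the notebook)\n",
"def get_driver(headless=False):\n",
"    options = webdriver.ChromeOptions()\n",
"    if headless:\n",
"        options.add_argument(\"--headless=new\")  # run Chrome without a visible UI\n",
"    chromedriver_path = \"./chromedriver.exe\"  # driver binary kept in the project root\n",
"    return webdriver.Chrome(service=Service(executable_path=chromedriver_path), options=options)"
]
},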
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a simple instance of initializing a driver, making a get request, and then quitting the session."
]
},
{
"cell_type": "code",
"execution_count": 5,
@@ -66,15 +139,18 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### By Class\n",
"Using the \"By\" class, we can access elements by many different methods:\n",
"ID = \"id\"\n",
"NAME = \"name\"\n",
"XPATH = \"xpath\"\n",
"LINK_TEXT = \"link text\"\n",
"PARTIAL_LINK_TEXT = \"partial link text\"\n",
"TAG_NAME = \"tag name\"\n",
"CLASS_NAME = \"class name\"\n",
"CSS_SELECTOR = \"css selector\""
"- ID = \"id\"\n",
"- NAME = \"name\"\n",
"- XPATH = \"xpath\"\n",
"- LINK_TEXT = \"link text\"\n",
"- PARTIAL_LINK_TEXT = \"partial link text\"\n",
"- TAG_NAME = \"tag name\"\n",
"- CLASS_NAME = \"class name\"\n",
"- CSS_SELECTOR = \"css selector\"\n",
"\n",
"This class is used in conjunction with the find_element() and find_elements() APIs to give us different ways of specifying criteria in which to look for elements in the dynamic DOM."
]
},
{
@@ -86,13 +162,19 @@
"from selenium.webdriver.common.by import By"
]
},
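{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration, the same find_element()/find_elements() calls accept any of the By strategies. The selectors below are placeholders for this sketch, not real markup from the Premier League page."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical locator examples -- the selectors are placeholders, not real site markup\n",
"driver = get_driver()\n",
"driver.get(results_url)\n",
"\n",
"one_by_id = driver.find_element(By.ID, \"some-id\")                  # first match by id attribute\n",
"all_links = driver.find_elements(By.TAG_NAME, \"a\")                 # list of every matching element\n",
"by_css    = driver.find_element(By.CSS_SELECTOR, \"div.some-class\") # CSS selector syntax\n",
"\n",
"driver.quit()"
]
},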
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we will use the find_element() and click() APIs. The click() function does exactly what you might expect - given an element, it will perform a click on it. The code below closes two popups that show up when you enter the site without any cookies."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# using find_element() and click() APIs\n",
"driver = get_driver()\n",
"driver.get(results_url)\n",
"\n",
@@ -109,6 +191,7 @@
"metadata": {},
"source": [
"### Explicit and Implicit Waits\n",
"With dynamic websites, things load in from a variety of different sources. We can't expect every item we want to show up onscreen immediately. However, we might know a generally expected amount of time we will need to guarantee an element to exist in the DOM. Therefore, we can use waits to specify how much we want to pause before trying to see if an element is there.\n",
"\n",
"Explicit - Wait a specific amount of time to find a certain element\n",
"Implicit - When finding any element, wait a certain amount of time\n",
@@ -117,12 +200,14 @@
"\n",
"### Expected Conditions\n",
"\n",
"Can be used in conjuntion with waits - we wait EITHER for an expected condition to be true, or until the time limit is exceeded.\n",
"This API can be used in conjuntion with waits - we wait EITHER for an expected condition to be true, or until the time limit is exceeded.\n",
"\n",
"Examples of Expected Conditions (EC):\n",
"- title_is\n",
"- title_contains\n",
"- presence_of_element_located\n"
"- presence_of_element_located\n",
"\n",
"Below, we will use two waits, building on the last example. We simply wait for the presence of the elemnts we are trying to find, giving a 10 second buffer before an error is thrown. expected_conditions returns a boolean, and we can pass it into the .until() method of a WebDriverWait() object that if it evaluates to true, we will get the element returned, in which we can call a click().\n"
]
},
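{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a hedged sketch of the explicit-wait pattern just described; the class name is a placeholder rather than the notebook's actual locator (the real cell is folded in this diff)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of an explicit wait -- the locator is a placeholder\n",
"from selenium.webdriver.support.ui import WebDriverWait\n",
"from selenium.webdriver.support import expected_conditions as EC\n",
"\n",
"driver = get_driver()\n",
"driver.get(results_url)\n",
"\n",
"# Wait up to 10 seconds for the element to be present in the DOM, then click it\n",
"popup_close = WebDriverWait(driver, 10).until(\n",
"    EC.presence_of_element_located((By.CLASS_NAME, \"some-popup-close\"))\n",
")\n",
"popup_close.click()\n",
"\n",
"driver.quit()"
]
},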
{
@@ -161,15 +246,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using XPaths\n",
"We can also parse through the DOM with XPaths.\n",
"### Using XPaths\n",
"We can also parse through the DOM with XPaths. XPaths give us a different, but similar, way of parsing HTML data. XPaths are intended to be used with XML, but can often adequately work with HTML pages. We will not go extremely in depth to these concepts here, but you can read more about them below. We will use one XPath example in our next codeblock.\n",
"\n",
"More information: \n",
"- https://www.w3schools.com/xml/xpath_intro.asp\n",
"- https://scrapfly.io/blog/parsing-html-with-xpath/\n",
"\n",
"# Scrolling using ActionChains\n",
"We can use the ActionChains library for various actions in the browser."
"### Scrolling using ActionChains\n",
"We can use the ActionChains library for various actions in the browser. In this case, we will pass the object returned by WebDriverWait to ActionChains in order to scroll to it, in turn loading the rest of the content on the page.\n",
"\n",
"We will use XPath to check for the existence of an element containing the text of the date that we want to scroll down to. Once we confirm this is true, we can scrape the HTML data from the page, close the browser, and parse as we did previously with BeautifulSoup, eventually creating a Pandas dataframe that is printed below."
]
},
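{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a condensed, hypothetical sketch of the scroll-until-loaded pattern just described. The XPath text and date are placeholders; the real notebook cell (folded in this diff) also parses the resulting HTML with BeautifulSoup into a Pandas dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch of scrolling until a target date is loaded -- the XPath text is a placeholder\n",
"from selenium.webdriver.common.action_chains import ActionChains\n",
"\n",
"driver = get_driver()\n",
"driver.get(results_url)\n",
"\n",
"# Wait for an element whose text contains the earliest date we need\n",
"target = WebDriverWait(driver, 10).until(\n",
"    EC.presence_of_element_located((By.XPATH, \"//*[contains(text(), 'Saturday 12 August')]\"))\n",
")\n",
"\n",
"# Scroll the element into view, prompting the lazy loader to fetch more results\n",
"ActionChains(driver).scroll_to_element(target).perform()\n",
"\n",
"html = driver.page_source  # now contains the fully loaded results\n",
"driver.quit()"
]
},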
{
@@ -282,20 +369,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scrolling using"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Page Object Model Design Pattern"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selenium is a powerful library that we only scratched the surface of today. You can do other actions such as run JavaScript in the browser, simulate keyboard and mouse input, and interact with a wide variety of dynamic elements on the page. Your specific use case will determine what you need to use, but many of the concepts used in this notebook, although simple, can get you through most actions needed to retrieve data from a dynamic webpage.\n",
"\n",
"For more information:\n",
"- https://selenium-python.readthedocs.io/"
]
Binary file added images/pl-results.png
File renamed without changes
508 changes: 294 additions & 214 deletions static_soccer_data.ipynb

Large diffs are not rendered by default.
