Web Scraping of RealMoney Contributor Articles Using Python and Selenium

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Real Money is a website dedicated to investment news and blog articles written by financial professionals with a wide range of trading specialties and expertise. The script automatically traverses the news listing for a site contributor, captures the high-level metadata of each article, and stores the results in a CSV output file.

Starting URLs: https://realmoney.thestreet.com/author/269/jim-cramer/all.html
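A minimal sketch of that traversal, assuming Selenium with the Chrome driver; the CSS selectors and the "Load More" control below are guesses about the page layout, not the script's actual locators:

```python
import csv
import io

FIELDS = ["title", "date", "url"]  # column layout of the CSV output (illustrative)

def rows_to_csv(rows):
    """Serialize scraped article metadata into CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def scrape_contributor(start_url, max_pages=10):
    """Walk a contributor's article listing and collect metadata.

    Selenium is imported lazily so rows_to_csv stays usable without a
    browser installed.  Every selector here is an assumption.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    rows = []
    try:
        driver.get(start_url)
        for _ in range(max_pages):
            for card in driver.find_elements(By.CSS_SELECTOR, "div.news-list__item"):
                link = card.find_element(By.TAG_NAME, "a")
                rows.append({"title": link.text.strip(),
                             "date": card.find_element(By.TAG_NAME, "time").text,
                             "url": link.get_attribute("href")})
            more = driver.find_elements(By.LINK_TEXT, "Load More")
            if not more:          # no further pages to load
                break
            more[0].click()
    finally:
        driver.quit()
    return rows
```

The result of `scrape_contributor(...)` would then be passed through `rows_to_csv` and written to disk.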

The source code and HTML output can be found here on GitHub.

Web Scraping of AWS Open Data Registry Using Selenium

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: The Registry of Open Data on AWS makes datasets publicly available through AWS services. When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products. Sharing data in the cloud also lets data users spend more time on data analysis rather than data acquisition. The script automatically traverses the dataset listing, captures the descriptive data, and stores the results in a CSV output file.

Starting URLs: https://registry.opendata.aws/
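The card-to-record step might look like the sketch below; `div.dataset` is an assumed selector for the registry's dataset cards, and the first-line-is-the-name convention is likewise an assumption:

```python
import csv

def card_to_record(card_text):
    """Split a dataset card's visible text into a name and a description.

    Assumes the first non-empty line is the dataset name and the
    remaining lines are its blurb.
    """
    lines = [ln.strip() for ln in card_text.splitlines() if ln.strip()]
    if not lines:
        return {"name": "", "description": ""}
    return {"name": lines[0], "description": " ".join(lines[1:])}

def scrape_registry(url="https://registry.opendata.aws/", out_path="datasets.csv"):
    """Collect every dataset card on the registry page into a CSV file."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        records = [card_to_record(el.text)
                   for el in driver.find_elements(By.CSS_SELECTOR, "div.dataset")]
    finally:
        driver.quit()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "description"])
        writer.writeheader()
        writer.writerows(records)
```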

The source code and HTML output can be found here on GitHub.

Web Scraping of Books to Scrape Using Selenium Take 2

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Books to Scrape is a fictional bookstore that, according to its site owner, desperately wants to be scraped. It is a safe place for beginners learning web scraping, as well as for developers validating their scraping technologies. This iteration of the script automatically traverses the book listing and detail web pages, captures all the descriptive data about the books, and stores the results in a CSV output file.

Starting URLs: http://books.toscrape.com/
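The listing-to-detail traversal could be sketched as follows; the class names (`product_pod`, `price_color`, `star-rating`) follow the site's commonly documented markup but are best treated as assumptions:

```python
import re

# books.toscrape.com spells ratings out as words in a CSS class.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(text):
    """Convert a price string such as '£51.77' to a float."""
    return float(re.sub(r"[^0-9.]", "", text))

def scrape_book_details(start_url="http://books.toscrape.com/"):
    """Visit every detail page reachable from the listing pages."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    books = []
    try:
        driver.get(start_url)
        while True:
            detail_urls = [a.get_attribute("href") for a in driver.find_elements(
                By.CSS_SELECTOR, "article.product_pod h3 a")]
            listing_url = driver.current_url
            for url in detail_urls:
                driver.get(url)
                rating_cls = driver.find_element(
                    By.CSS_SELECTOR, "p.star-rating").get_attribute("class").split()[-1]
                books.append({
                    "title": driver.find_element(By.TAG_NAME, "h1").text,
                    "price": parse_price(
                        driver.find_element(By.CSS_SELECTOR, "p.price_color").text),
                    "rating": RATING_WORDS.get(rating_cls, 0),
                    "url": url,
                })
            driver.get(listing_url)  # return to the listing to find the next page
            nxt = driver.find_elements(By.CSS_SELECTOR, "li.next a")
            if not nxt:
                break
            driver.get(nxt[0].get_attribute("href"))
    finally:
        driver.quit()
    return books
```

The collected dictionaries map directly onto the rows of the CSV output.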

The source code and HTML output can be found here on GitHub.

Web Scraping of Books to Scrape Using Selenium Take 1

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Books to Scrape is a fictional bookstore that, according to its site owner, desperately wants to be scraped. It is a safe place for beginners learning web scraping, as well as for developers validating their scraping technologies. This iteration of the script automatically traverses the book listing web pages (about 50 pages and 1000 items), captures all the basic data about the books, and stores the results in a CSV output file.

Starting URLs: http://books.toscrape.com/
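This listing-only pass can be sketched with a simple page counter; the `page-{n}.html` URL pattern matches how the site paginates, while the selectors remain assumptions:

```python
BASE = "http://books.toscrape.com"

def listing_page_url(n):
    """URL of the n-th listing page; the site serves about 50 of them."""
    return f"{BASE}/catalogue/page-{n}.html"

def scrape_listing_pages(last_page=50):
    """Collect title and price from each listing page without
    visiting detail pages (this take captures only the basic data).
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    books = []
    try:
        for n in range(1, last_page + 1):
            driver.get(listing_page_url(n))
            for pod in driver.find_elements(By.CSS_SELECTOR, "article.product_pod"):
                books.append({
                    # full titles live in the link's title attribute
                    "title": pod.find_element(
                        By.CSS_SELECTOR, "h3 a").get_attribute("title"),
                    "price": pod.find_element(
                        By.CSS_SELECTOR, "p.price_color").text,
                })
    finally:
        driver.quit()
    return books
```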

The source code and HTML output can be found here on GitHub.

Web Scraping of Metro Ridership Statistics Using Python and Selenium

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Metro plans, coordinates, designs, builds, and operates transportation for Los Angeles County, one of the country's largest and most populous counties. More than 9.6 million people, nearly one-third of California's residents, live, work, and play within its 1,433-square-mile service area. This exercise gathers the bus ridership statistics from the agency's web pages. This iteration of the script automatically traverses the monthly web pages (from January 2009 to June 2020), captures all bus ridership entries, and stores the information in a CSV output file.

Starting URLs: http://isotp.metro.net/MetroRidership/Index.aspx
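Stepping through 138 monthly reports could look like the sketch below; the form-control IDs (`ddlYear`, `ddlMonth`, `btnSubmit`) are assumptions about the ASPX page, not the script's actual locators:

```python
def month_range(start, end):
    """Yield (year, month) pairs from start through end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1

def scrape_ridership(url="http://isotp.metro.net/MetroRidership/Index.aspx"):
    """Step through every month's report via the page's form controls."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select

    driver = webdriver.Chrome()
    rows = []
    try:
        for year, month in month_range((2009, 1), (2020, 6)):
            driver.get(url)
            Select(driver.find_element(By.ID, "ddlYear")).select_by_visible_text(str(year))
            Select(driver.find_element(By.ID, "ddlMonth")).select_by_value(str(month))
            driver.find_element(By.ID, "btnSubmit").click()
            for tr in driver.find_elements(By.CSS_SELECTOR, "table tr")[1:]:
                cells = [td.text for td in tr.find_elements(By.TAG_NAME, "td")]
                rows.append([year, month] + cells)  # tag each row with its month
    finally:
        driver.quit()
    return rows
```

January 2009 through June 2020 yields 138 monthly pages (11 full years plus 6 months).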

The source code and HTML output can be found here on GitHub.

Web Scraping of SAS Global Forum 2020 Proceedings Using BeautifulSoup

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the BeautifulSoup module.

INTRODUCTION: The SAS Global Forum covers the full range of topics in using SAS products and developing SAS solutions. This web scraping script automatically traverses the entire web page, collects all links to the PDF and PPTX documents, and downloads the documents as part of the scraping process. The Python script ran in the Google Colaboratory environment and can be adapted to run in any Python environment without the Colab-specific configuration.

Starting URLs: https://www.sas.com/en_us/events/sas-global-forum/program/proceedings.html
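The link-harvesting step for this entry (and the similar O'Reilly conference entry below) might be sketched as follows, assuming BeautifulSoup plus the standard-library `urllib` for downloads; the function names are illustrative:

```python
from urllib.parse import urljoin

def is_document_link(href):
    """True for hyperlinks pointing at PDF or PPTX files."""
    return href.lower().endswith((".pdf", ".pptx"))

def collect_document_links(page_url, html):
    """Extract absolute PDF/PPTX URLs from a proceedings page."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(page_url, a["href"])          # resolve relative links
            for a in soup.find_all("a", href=True)
            if is_document_link(a["href"])]

def download_documents(page_url):
    """Fetch the page, collect document links, and download each file."""
    from urllib.request import urlopen, urlretrieve

    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    for link in collect_document_links(page_url, html):
        urlretrieve(link, link.rsplit("/", 1)[-1])  # save under its own filename
```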

The source code and HTML output can be found here on GitHub.

Web Scraping of O’Reilly Software Architecture Conference 2020 New York Using Python

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the BeautifulSoup module.

INTRODUCTION: The Software Architecture Conference covers the full range of topics in the software architecture discipline, including leadership and business skills, product management, and domain-driven design. This web scraping script automatically traverses the entire web page, collects all links to the PDF and PPTX documents, and downloads the documents as part of the scraping process. The Python script ran in the Google Colaboratory environment and can be adapted to run in any Python environment without the Colab-specific configuration.

Starting URLs: https://conferences.oreilly.com/software-architecture/sa-ny/public/schedule/proceedings

The source code and HTML output can be found here on GitHub.

Web Scraping of Data.gov Dataset Catalog Using Python and Selenium

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Data.gov is a government data repository website managed and hosted by the U.S. General Services Administration. This exercise gathers the dataset entries from Data.gov's web pages. This iteration of the script automatically traverses the web pages, captures all dataset entries, and stores the captured information in a JSON output file.

Starting URLs: https://catalog.data.gov/dataset
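The paginated walk and JSON output might be sketched as below; the `?page=` URL scheme and the `h3.dataset-heading` selector are assumptions about the catalog's CKAN-based markup:

```python
import json

def catalog_page_url(n):
    """URL of the n-th page of the dataset catalog (assumed ?page= scheme)."""
    return f"https://catalog.data.gov/dataset?page={n}"

def write_json(records, path="datasets.json"):
    """Store the captured dataset entries as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

def scrape_catalog(max_pages=5):
    """Walk the paginated catalog and collect each dataset's heading."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    records = []
    try:
        for n in range(1, max_pages + 1):
            driver.get(catalog_page_url(n))
            headings = driver.find_elements(By.CSS_SELECTOR, "h3.dataset-heading a")
            if not headings:      # ran past the last page
                break
            records.extend({"name": a.text, "url": a.get_attribute("href")}
                           for a in headings)
    finally:
        driver.quit()
    return records
```

`write_json(scrape_catalog())` would then produce the JSON output file.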

The source code and HTML output can be found here on GitHub.