Web Scraping of Machine Learning Mastery Articles Using Python and BeautifulSoup

SUMMARY: This project aims to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: Dr. Jason Brownlee’s Machine Learning Mastery hosts its tutorial lessons at https://machinelearningmastery.com/blog. The purpose of this exercise is to practice web scraping by gathering the blog entries from Machine Learning Mastery’s web pages. This iteration of the script automatically traverses the web pages, captures all articles, and stores the captured information in a CSV output file for sorting and filtering.

Starting URL: https://machinelearningmastery.com/blog

The source code and HTML output can be found here on GitHub.
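
Below is a minimal sketch of the traversal described above, assuming the listing exposes article links under an `article h2 a` selector and a "next" pagination control; the selectors and the `articles.csv` filename are illustrative assumptions, not the script's actual values.

```python
# Hedged sketch: page through the blog listing with requests/BeautifulSoup
# and write article titles and links to a CSV file. The CSS selectors and
# the output filename are assumptions, not the actual script's values.
import csv

import requests
from bs4 import BeautifulSoup

START_URL = "https://machinelearningmastery.com/blog"

def scrape_blog(start_url):
    url, rows = start_url, []
    while url:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for link in soup.select("article h2 a"):  # assumed listing markup
            rows.append([link.get_text(strip=True), link["href"]])
        next_link = soup.select_one("a.next")  # assumed pagination control
        url = next_link["href"] if next_link else None
    return rows

if __name__ == "__main__":
    with open("articles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url"])
        writer.writerows(scrape_blog(START_URL))
```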

Web Scraping of Haodoo Backup Using Python and BeautifulSoup Take 2

SUMMARY: This project aims to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: Haodoo is a website that houses classic Chinese literature for its readers’ enjoyment. Haodoo in Chinese can be translated to “Good Reads” in English. It collects hard-to-find Chinese texts and books and makes them available for online reading. The Haodoo collection includes over 3,500 text and audiobook titles.

In the previous Take1 iteration, we scraped the website and obtained all the book titles and their assigned categories. In this Take2 iteration, we will use the information collected from Take1 and obtain the links for each book and file format.

Starting URL: https://haodoo.org

The source code and HTML output can be found here on GitHub.
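
As a rough sketch of the Take2 step, assume Take1 produced a `haodoo_books.csv` file with title, category, and URL columns; the link markup and the UTF-8 encoding below are also assumptions about the site, not its actual structure.

```python
# Hedged sketch: read the book list from Take 1 and collect a download link
# per file format from each book page. The CSV layout and selectors are
# assumptions for illustration only.
import csv

import requests
from bs4 import BeautifulSoup

def scrape_book_links(books_csv="haodoo_books.csv"):
    rows = []
    with open(books_csv, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the assumed header row
        for title, category, url in reader:
            resp = requests.get(url)
            resp.encoding = "utf-8"  # assumption; the site may serve Big5
            soup = BeautifulSoup(resp.text, "html.parser")
            # Assumed markup: each file format is exposed as a plain <a> link.
            for link in soup.find_all("a", href=True):
                rows.append([title, category,
                             link.get_text(strip=True), link["href"]])
    return rows
```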

Web Scraping of Haodoo Backup Using Python and BeautifulSoup Take 1

SUMMARY: This project aims to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: Haodoo is a website that houses classic Chinese literature for its readers’ enjoyment. Haodoo in Chinese can be translated to “Good Reads” in English. It collects hard-to-find Chinese texts and books and makes them available for online reading. The Haodoo collection includes over 3,500 text and audiobook titles.

In this Take1 iteration, we will scrape the website and obtain all the book titles and their assigned categories.

Starting URL: https://haodoo.org

The source code and HTML output can be found here on GitHub.
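
A minimal sketch of the Take1 step under assumed markup: fetch a category listing page and pair each linked title with the category it appears under. The selectors and the encoding are placeholders, not Haodoo's actual structure.

```python
# Hedged sketch: pull (title, category) pairs from one category listing page.
import requests
from bs4 import BeautifulSoup

def scrape_titles(category_url, category_name):
    resp = requests.get(category_url)
    resp.encoding = "utf-8"  # assumption; adjust if the site serves Big5
    soup = BeautifulSoup(resp.text, "html.parser")
    # Assumed markup: each book title is an <a> pointing at a detail page.
    return [(a.get_text(strip=True), category_name)
            for a in soup.find_all("a", href=True)]
```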

Web Scraping of NeurIPS Conference Proceedings 2020 Using Python and BeautifulSoup

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The web scraping Python code leverages the BeautifulSoup module.

INTRODUCTION: The Conference on Neural Information Processing Systems (NeurIPS) covers a wide range of topics in neural information processing, spanning biological, technological, mathematical, and theoretical research. Neural information processing is a field that benefits from a combined view of the biological, physical, mathematical, and computational sciences. This web scraping script automatically traverses the entire proceedings page and collects all links to the PDF and PPTX documents.

Starting URL: https://proceedings.neurips.cc/paper/2020

The source code and HTML output can be found here on GitHub.
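
A minimal sketch of the traversal, assuming paper pages are linked under a `/paper/2020/hash/` path on the proceedings index; document links are identified simply by their `.pdf` and `.pptx` extensions.

```python
# Hedged sketch: walk the 2020 proceedings listing, follow each paper page,
# and keep any link ending in .pdf or .pptx. The URL pattern for paper pages
# is an assumption about the site's structure.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE = "https://proceedings.neurips.cc"

def collect_documents(index_url=BASE + "/paper/2020"):
    soup = BeautifulSoup(requests.get(index_url).text, "html.parser")
    docs = []
    for paper in soup.select("a[href*='/paper/2020/hash/']"):  # assumed path
        page = BeautifulSoup(
            requests.get(urljoin(BASE, paper["href"])).text, "html.parser")
        for a in page.find_all("a", href=True):
            if a["href"].lower().endswith((".pdf", ".pptx")):
                docs.append(urljoin(BASE, a["href"]))
    return docs
```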

Web Scraping of RealMoney Contributor Articles Using Python and Selenium

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Real Money is a website dedicated to investment news and blog articles written by financial professionals. The website features numerous professionals with various trading specialties and expertise. The script automatically traverses the news listing for a site contributor and captures the high-level metadata of their blog posts, storing it in a CSV output file.

Starting URL: https://realmoney.thestreet.com/author/269/jim-cramer/all.html

The source code and HTML output can be found here on GitHub.
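
A minimal Selenium sketch of the capture step, assuming each entry sits in an `article` element; a real run would likely need waits, scrolling, or 'load more' handling that is omitted here.

```python
# Hedged sketch: load a contributor's article listing with Selenium and
# record each entry's headline and link in a CSV file. Selectors and the
# output filename are assumptions about the listing markup.
import csv

from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://realmoney.thestreet.com/author/269/jim-cramer/all.html"

driver = webdriver.Chrome()
try:
    driver.get(URL)
    rows = []
    for item in driver.find_elements(By.CSS_SELECTOR, "article"):  # assumed
        headline = item.find_element(By.CSS_SELECTOR, "a")
        rows.append([headline.text, headline.get_attribute("href")])
    with open("realmoney_articles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["headline", "url"])
        writer.writerows(rows)
finally:
    driver.quit()
```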

Web Scraping of AWS Open Data Registry Using Selenium

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: The Registry of Open Data on AWS makes datasets publicly available through AWS services. When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products. Sharing data in the cloud also lets data users spend more time on data analysis rather than data acquisition. The script automatically traverses the dataset listing and captures the descriptive data, storing it in a CSV output file.

Starting URL: https://registry.opendata.aws/

The source code and HTML output can be found here on GitHub.
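
A minimal Selenium sketch of the capture step; the `.dataset` and `h3` selectors and the output filename are assumptions about the registry's markup, not confirmed values.

```python
# Hedged sketch: render the registry page with Selenium and capture each
# dataset card's name and visible text into a CSV file.
import csv

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://registry.opendata.aws/")
    rows = []
    # ".dataset" and the inner "h3" are assumptions about the page markup.
    for card in driver.find_elements(By.CSS_SELECTOR, ".dataset"):
        name = card.find_element(By.CSS_SELECTOR, "h3").text
        rows.append([name, card.text])
    with open("aws_open_data.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "description"])
        writer.writerows(rows)
finally:
    driver.quit()
```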

Web Scraping of Books to Scrape Using Selenium Take 2

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Books to Scrape is a fictional bookstore that, according to its site owner, desperately wants to be scraped. It is a safe place for beginners learning web scraping, as well as for developers validating their scraping technologies. This iteration of the script automatically traverses the book listing and detail web pages to capture all the descriptive data about the books and store it in a CSV output file.

Starting URL: http://books.toscrape.com/

The source code and HTML output can be found here on GitHub.
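
A minimal Selenium sketch of the two-level traversal: collect detail-page links from each listing page, then visit each book page for its fields. The selectors follow Books to Scrape's published markup but should be treated as assumptions; links are gathered before navigating so element references do not go stale.

```python
# Hedged sketch: listing pages -> detail pages -> CSV of book fields.
import csv
from urllib.parse import urljoin

from selenium import webdriver
from selenium.webdriver.common.by import By

BASE = "http://books.toscrape.com/catalogue/"

driver = webdriver.Chrome()
rows = []
try:
    for page in range(1, 51):  # the site exposes roughly 50 listing pages
        driver.get(urljoin(BASE, f"page-{page}.html"))
        # Collect hrefs first; navigating away invalidates element handles.
        links = [a.get_attribute("href") for a in driver.find_elements(
            By.CSS_SELECTOR, "article.product_pod h3 a")]
        for link in links:
            driver.get(link)
            title = driver.find_element(By.CSS_SELECTOR, "h1").text
            price = driver.find_element(By.CSS_SELECTOR, "p.price_color").text
            rows.append([title, price, link])
finally:
    driver.quit()

with open("books_detail.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price", "url"])
    writer.writerows(rows)
```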

Web Scraping of Books to Scrape Using Selenium Take 1

SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The Python web scraping code leverages the Selenium module.

INTRODUCTION: Books to Scrape is a fictional bookstore that, according to its site owner, desperately wants to be scraped. It is a safe place for beginners learning web scraping, as well as for developers validating their scraping technologies. This iteration of the script automatically traverses the book listing web pages (about 50 pages and 1,000 items) to capture all the basic data about the books and store it in a CSV output file.

Starting URL: http://books.toscrape.com/

The source code and HTML output can be found here on GitHub.
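
A minimal Selenium sketch of the listing-only traversal, stepping through the catalogue pages and reading the title and price from each product card; the `page-{n}.html` URL pattern and selectors match the site's published markup but are stated here as assumptions.

```python
# Hedged sketch: step through ~50 listing pages (about 20 books each) and
# store the basic fields visible on each product card in a CSV file.
import csv

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
rows = []
try:
    for page in range(1, 51):
        driver.get(f"http://books.toscrape.com/catalogue/page-{page}.html")
        for pod in driver.find_elements(By.CSS_SELECTOR, "article.product_pod"):
            # The anchor's title attribute carries the full, untruncated title.
            title = pod.find_element(By.CSS_SELECTOR, "h3 a").get_attribute("title")
            price = pod.find_element(By.CSS_SELECTOR, "p.price_color").text
            rows.append([title, price])
finally:
    driver.quit()

with open("books_basic.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price"])
    writer.writerows(rows)
```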