SUMMARY: The purpose of this project is to practice web scraping by extracting specific pieces of information from a website. The scraping code is written in Python 3 and uses the Scrapy framework [https://scrapy.org/], which is maintained by Scrapinghub [https://scrapinghub.com/].
INTRODUCTION: A demo website created by Scrapinghub lists quotes from famous people. It exposes several endpoints that present the quotes in different ways, and each endpoint poses a different scraping challenge. For this Take3 iteration, the Python script scrapes the quote information displayed on an infinite-scrolling page.
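Starting URLs: http://quotes.toscrape.com/scroll (the spider itself starts from the JSON API behind this page, http://quotes.toscrape.com/api/quotes?page=1, and follows it page by page)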
import json

import scrapy


class ScrollSpider(scrapy.Spider):
    name = "scroll"
    api_url = 'http://quotes.toscrape.com/api/quotes?page={}'
    start_urls = [api_url.format(1)]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data['quotes']:
            yield {
                'author_name': quote['author']['name'],
                'text': quote['text'],
                'tags': quote['tags'],
                'author_url': quote['author']['goodreads_link'],
            }
        # follow pagination link
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(url=self.api_url.format(next_page), callback=self.parse)
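The parsing logic of the spider can be exercised without running a crawl. The following is a minimal sketch, assuming the API payload has the shape the spider reads (the keys `quotes`, `has_next`, and `page`). The helper names `extract_quotes` and `next_page_url` and the sample payload are hypothetical, invented for illustration and not captured from the site.

```python
API_URL = 'http://quotes.toscrape.com/api/quotes?page={}'


def extract_quotes(data):
    """Return one flat dict per quote, with the same fields the spider yields."""
    return [
        {
            'author_name': quote['author']['name'],
            'text': quote['text'],
            'tags': quote['tags'],
            'author_url': quote['author']['goodreads_link'],
        }
        for quote in data['quotes']
    ]


def next_page_url(data):
    """Return the URL of the next API page, or None on the last page."""
    if data['has_next']:
        return API_URL.format(data['page'] + 1)
    return None


# Hypothetical payload, invented for illustration (not real site data).
sample = {
    'page': 1,
    'has_next': True,
    'quotes': [
        {
            'text': 'Example quote.',
            'tags': ['example'],
            'author': {'name': 'Jane Doe', 'goodreads_link': '/author/jane-doe'},
        }
    ],
}

items = extract_quotes(sample)
print(items[0]['author_name'])
print(next_page_url(sample))
```

Keeping the extraction and pagination steps as plain functions is one possible design choice: they can be checked against a saved response before being wired into a spider's `parse` method.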
The source code and JSON output can be found here on GitHub.