Scrapy is a powerful, free and open-source Python web crawling framework. It’s used for extracting data from the web and is useful for a wide variety of applications, such as data mining and historical archiving.
It’s designed around “spiders”: classes that define how a site should be crawled and how data should be extracted from its pages. This structure makes it easy to build large, complex crawling projects.
Spiders send requests to the Scrapy Engine, which controls the flow of data between all components of the framework and triggers events when certain actions occur.
The Engine passes those requests to the Scheduler, which queues them and feeds them back to the Engine one at a time as crawling capacity allows.
The Engine then forwards each request to the Downloader, which fetches the page and returns a response object to the Engine; the Engine routes that response to the Spider’s callback, where it is parsed into scraped items and follow-up requests.
Responses can carry any content type the site serves, including plain text, XML, HTML, and JSON.
HTML is the format you’ll encounter most often when scraping, so it’s important that we make sure our Scrapy crawlers handle it well!
It’s also a good idea to check the scraper’s log file for errors, and to set up email notifications for when things go wrong.
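As a minimal sketch, the log file location and verbosity can be controlled from your project’s settings.py; the file name below is just a placeholder:

```python
# settings.py -- a minimal sketch; the file name is a placeholder.
LOG_FILE = "scrapy.log"   # write the crawl log to this file instead of stderr
LOG_LEVEL = "INFO"        # raise to "ERROR" to record only failures
```

For the notification side, Scrapy ships a scrapy.mail.MailSender helper that can be wired up (for example, from a signal handler) to send an email when a crawl errors out.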
Getting Started
The best place to start with Scrapy is by learning how to write simple crawlers that iterate through pages and extract useful data from them. This can be done in a number of ways, but one of the most efficient is to start with a basic web page.
To make a request in Scrapy, you need to specify the URL of the web page you want to scrape and a callback function that will be called when a response is received from it. This makes Scrapy work asynchronously and allows it to process multiple requests in parallel.
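Putting those two ideas together, here is a minimal sketch of a spider that requests a page, parses it in a callback, and follows pagination links. It targets quotes.toscrape.com, the sandbox site used in Scrapy’s own tutorial, and the selectors are specific to that page:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    """A minimal sketch of a spider that iterates through pages."""

    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # The callback receives the Response for each request.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow the pagination link, re-using this method as the callback.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Each request yielded from the callback re-enters the same request/response cycle, which is how a single spider works its way through an entire site without blocking on any one page.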
Selectors are the core of Scrapy’s crawling capabilities, and are built on top of a powerful XPath expression system. Using selectors is a great way to quickly and efficiently find elements of a web page that you can then extract data from.
XPath is an incredibly versatile query language for selecting nodes in XML and HTML documents, with implementations in many languages, including Python. It’s an extremely flexible way of navigating and interpreting HTML, making it ideal for scraping websites that are often very complicated or dynamic.
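As a quick illustration, here is the same extraction written as a CSS selector and as an equivalent XPath expression; the HTML snippet is made up for the example:

```python
from scrapy.selector import Selector

# A tiny, self-contained document for demonstration purposes.
html = '<div id="main"><a href="/about" class="nav">About</a></div>'
sel = Selector(text=html)

# CSS selector style.
sel.css("div#main a.nav::attr(href)").get()                 # -> "/about"

# Equivalent XPath expression.
sel.xpath('//div[@id="main"]/a[@class="nav"]/@href').get()  # -> "/about"
```

Under the hood, Scrapy translates CSS selectors into XPath, so the two styles can be mixed freely within the same spider.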
When creating a CSS selector in Scrapy, be aware that class names and ID strings sometimes change. Modern front-end frameworks often generate hashed class names that differ from build to build, and it’s important to keep that in mind when writing selectors for your scrapers.
If you do happen to encounter this type of issue, it’s a good idea to switch to XPath expressions that anchor on stable attributes or visible text rather than on generated class names, as in the sketch below.
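A small sketch of that approach; the HTML, the hashed class name, and the data-testid attribute are all illustrative assumptions:

```python
from scrapy.selector import Selector

# Hypothetical markup from a framework that hashes its class names.
html = '<a class="css-1dbjc4n" data-testid="next-page" href="/page/2">Next</a>'
sel = Selector(text=html)

# Brittle: hashed class names change whenever the front end is rebuilt.
sel.css("a.css-1dbjc4n::attr(href)").get()              # -> "/page/2"

# More stable: anchor on a data attribute or on the visible link text.
sel.xpath('//a[@data-testid="next-page"]/@href').get()  # -> "/page/2"
sel.xpath('//a[contains(text(), "Next")]/@href').get()  # -> "/page/2"
```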
Aside from CSS selectors and XPath expressions, you can also apply regular expressions to the text they match, which is handy for pulling patterns like prices or dates out of a selected element.
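Scrapy selectors expose this through their .re() and .re_first() methods; the HTML below is made up for the example:

```python
from scrapy.selector import Selector

html = "<p>Price: $19.99 (incl. tax)</p>"
sel = Selector(text=html)

# Narrow down with XPath first, then extract just the number with a regex.
price = sel.xpath("//p/text()").re_first(r"\$([\d.]+)")  # -> "19.99"
```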
These are all useful tools that make scraping easier and less painful, but be careful not to overuse them!