A web crawler, also known as a web spider, is an application that scans the World Wide Web and extracts information automatically. The web holds a huge amount of data, and web crawlers provide access to the useful and relevant parts of it by browsing as many web pages as possible. For this reason, search engines use web crawlers to discover available pages and keep their indexes up to date. Web crawlers can also be used to maintain websites automatically, by scanning and validating HTML code or checking links, or to extract information from websites with web scraping techniques.
A web crawler begins with a list of URLs to visit. When it fetches a page, it analyzes the page, tries to identify all the hyperlinks in it, and adds them to the list of URLs to visit. This process continues recursively as long as new resources are found. In this way, a web crawler can detect new URLs and, depending on its goal, index them so that the information can be found again later, store information in a file or database, or download useful data, images, objects, etc. Indexing methods collect, parse and store the URLs found on each page, together with the connections between them, to allow quick information retrieval; they generally produce a very large lookup table of the pages covered during the crawl. Googlebot and Bingbot are the web crawlers of two well-known search engines.
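The loop described above can be sketched in a few lines of plain Python. This is only an illustration, not how Scrapy is implemented: the FAKE_WEB dictionary, the example.com URLs and the crawl function are assumptions introduced here so that the sketch runs without network access.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(start_urls, fetch_links):
    """Visit every URL reachable from start_urls exactly once.

    Returns a lookup table mapping each visited URL to its outgoing links.
    """
    to_visit = deque(start_urls)   # URLs still waiting to be fetched
    seen = set(start_urls)         # URLs already queued, to avoid loops
    index = {}

    while to_visit:
        url = to_visit.popleft()
        links = fetch_links(url)
        index[url] = links
        for link in links:
            if link not in seen:   # only queue newly discovered URLs
                seen.add(link)
                to_visit.append(link)
    return index


index = crawl(["http://example.com/"], lambda url: FAKE_WEB.get(url, []))
print(sorted(index))
```

The seen set is what stops the crawl from looping forever when pages link back to each other.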
To create the scraper I used Python, because it is a dynamic, portable and performant language, combined with an open source web crawling framework called Scrapy. Scrapy is one of the most popular tools for web crawling written in Python. It is simple and powerful, with lots of features and possible extensions. Scrapy uses selectors based on XPath to extract data from a web page. Selectors are components that select part of the source code of a web page and extract data from the HTML source. XPath is a language for selecting nodes in XML or HTML documents, using expressions to navigate a document and extract information. XPath treats an HTML document as a tree of nodes, so any element can be addressed by its position in the hierarchy. Finally, a Python IDE or text editor is needed to write the code.
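To give a feeling for XPath expressions without installing anything, here is a minimal sketch that uses the standard library module xml.etree.ElementTree as a stand-in for Scrapy selectors. ElementTree supports only a subset of XPath and needs well-formed markup, so this is an approximation; the PAGE snippet and its contents are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this illustration.
PAGE = """
<html>
  <body>
    <h1 class="headline">Blue shoes</h1>
    <ul>
      <li><a href="/shoes">Shoes</a></li>
      <li><a href="/hats">Hats</a></li>
    </ul>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# Every <a> under a <li> under a <ul>, anywhere in the document.
links = [(a.text, a.get("href")) for a in root.findall(".//ul/li/a")]

# A predicate on an attribute narrows the match to one element.
headline = root.find(".//h1[@class='headline']").text

print(links)
print(headline)
```

Each expression is a path through the document tree, optionally narrowed by a predicate in square brackets.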
Be sure to have the latest version of Python installed, then install IPython, a more powerful and feature-rich shell for the Python language, with the pip command:
pip install iPython
Scrapy can easily be installed in the same way:
pip install Scrapy
Now Scrapy can be used by typing:
scrapy shell 'URL_path'
With the above command, Scrapy opens a shell that holds a selector for the target web page. It is very useful for developing and debugging the crawler, because it lets you execute commands and navigate the 'URL_path' document without running the crawler. When IPython is installed, the shell uses the IPython console instead of the standard Python console, which brings many advantages such as auto-completion, line numbers, colored output for better readability, help functions, advanced editing, object exploration, etc. The result in the shell is the following:
Fig 1: iPython shell
First of all we should create the scraper project with:
scrapy startproject myProject
This command creates the folder myProject with the following project structure inside it:
myProject/
    scrapy.cfg
    myProject/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            mySpider1.py
            mySpider2.py
            ...
The first myProject folder is the project root directory. It contains scrapy.cfg, the project configuration file, which holds the name of the Python module that defines the project settings, such as:
[settings]
default = myProject.settings
items.py is the file that lists the item attributes we want to fill with the scraper. An example:
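from scrapy.item import Item, Field


class ProductsItem(Item):
    # One Field per attribute; the keyword arguments are stored as metadata.
    url = Field()
    website = Field(required=True)
    name = Field(required=True)
    description = Field()
    price = Field(type='decimal')
    currency = Field()
    images_urls = Field(type='list', default=list)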
Here Field() declares an item attribute together with any metadata associated with it. The arguments can state, for example, the type of data (decimal, bool, list), whether the attribute is required, or its default value. Scrapy stores these arguments as metadata only; it is up to the project code, typically the pipelines, to act on them.
pipelines.py is the file used to act on an item after it has been scraped by the spider. It is typically used to clean and validate data, check for duplicates, store data in a database, etc. The pipelines file is also a good place to share common methods among spiders and to perform global actions on the items found.
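As a rough illustration of such a pipeline, the sketch below checks required fields, drops duplicate URLs and cleans the name. The class name, the list of required fields and the plain-dict items are assumptions made for this example. To keep it runnable without Scrapy it raises its own exception; in a real project one would normally raise scrapy.exceptions.DropItem instead.

```python
class RejectedItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem in this self-contained sketch."""


class CleanAndDeduplicatePipeline(object):
    """Hypothetical pipeline: checks required fields and drops duplicate URLs."""

    REQUIRED_FIELDS = ("website", "name")   # assumed, mirrors required=True

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        # Validation: every required field must be present and non-empty.
        for field in self.REQUIRED_FIELDS:
            if not item.get(field):
                raise RejectedItem("missing field: %s" % field)
        # Duplicate check based on the item URL.
        if item.get("url") in self.seen_urls:
            raise RejectedItem("duplicate url: %s" % item["url"])
        self.seen_urls.add(item.get("url"))
        # Cleaning: remove surrounding whitespace from the name.
        item["name"] = item["name"].strip()
        return item


pipeline = CleanAndDeduplicatePipeline()
first = pipeline.process_item(
    {"url": "http://example.com/p1", "website": "spiderName", "name": " Blue shoes "},
    spider=None,
)
print(first["name"])
```

The process_item(self, item, spider) method is the entry point Scrapy calls for every item; returning the item passes it on to the next pipeline.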
settings.py is the project settings file and contains, for example, the bot name, the item pipelines to enable (ITEM_PIPELINES), the path indicating where to put the output, the log level, and many other settings, some of which we will discuss later in this article.
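A settings file along these lines might look as follows. The setting names are standard Scrapy settings, but every value is an illustrative assumption, and CleanItemsPipeline is a hypothetical class name.

```python
# settings.py (sketch): the values below are illustrative assumptions.
BOT_NAME = "myProject"

# Where Scrapy looks for spider modules inside the project package.
SPIDER_MODULES = ["myProject.spiders"]

# Pipelines to enable, mapped to their run order (lower numbers run first).
# "CleanItemsPipeline" is a hypothetical class name.
ITEM_PIPELINES = {
    "myProject.pipelines.CleanItemsPipeline": 300,
}

LOG_LEVEL = "INFO"    # one of CRITICAL, ERROR, WARNING, INFO, DEBUG
DEPTH_LIMIT = 3       # 0 means no limit
```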
The spiders folder contains all the spiders we have created.
Now we can start to write the scraper. First, we need to import the libraries we will use and set the global variable BASE_URL of the spider:
try:
    from urllib import parse as urlparse  # Python 3
except ImportError:
    import urlparse  # Python 2

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import Spider  # scrapy.spider in releases before 1.0

from myProject.items import ProductsItem

BASE_URL = 'http://www.target_site.com/'
The urlparse module (urllib.parse on Python 3) provides functions for splitting URLs into their component parts and, vice versa, for combining the components back into a URL string. scrapy.http handles Request/Response messages. scrapy.selector is used to create a selector on a Response document so that data can be extracted from it through XPath expressions. Spider is the simplest spider class and the one that every other spider must inherit from. It doesn't provide any special functionality: it requests the start URLs and calls the parse method for each of the resulting responses. BASE_URL holds the first part of all the URLs found by the spider and is used in the later functions to build absolute URLs.
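The following short sketch shows both directions, splitting a URL and completing a relative link against the base URL. It assumes Python 3, where these functions live in urllib.parse; the paths and the other.example host are made up for the example.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "http://www.target_site.com/"

# Splitting a URL into its components.
parts = urlparse("http://www.target_site.com/test1?page=2")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Completing relative links found on a page against the base URL.
relative = urljoin(BASE_URL, "category/shoes")
rooted = urljoin(BASE_URL, "/category/shoes")
print(relative)
print(rooted)

# An already absolute link is returned unchanged.
absolute = urljoin(BASE_URL, "http://other.example/item")
print(absolute)
```

Because urljoin leaves absolute links untouched, it is safe to apply it to every href found on a page.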
The spider starts with the definition of the variables containing the name of the spider, the allowed domains in which the spider may run, and the start_urls list that indicates where to begin the scan:
class MySpider(Spider):
    name = 'spiderName'
    allowed_domains = ['target_site.com']
    start_urls = ['http://www.target_site.com/test1',
                  'http://www.target_site.com/test2']
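The effect of allowed_domains can be approximated with a small helper: a link is followed only if its host is one of the allowed domains or a subdomain of one. This is a simplified sketch of the idea, not Scrapy's actual filtering code, and is_allowed is a name invented here.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["target_site.com"]
print(is_allowed("http://www.target_site.com/test1", allowed))
print(is_allowed("http://other.example/page", allowed))
```

Matching on "." + domain rather than on the bare suffix keeps unrelated hosts that merely end with the same characters out of the crawl.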
Then we define the first method of the spider, usually called parse, which parses the response and returns the data found or more URLs to follow. If we intend to scrape an e-commerce site, this method should handle the main categories into which the items are divided:
    def parse(self, response):
        sel = Selector(response)
        requests = []
        # Each matched <a> node is one category link.
        for link in sel.xpath('//ul/li/a'):
            # extract() returns a list of strings; take the first match.
            name = link.xpath('text()').extract()[0].strip()
            url = link.xpath('@href').extract()[0].strip()
            requests.append(Request(
                url=urlparse.urljoin(BASE_URL, url),
                callback=self.parse_subcategory))
        return requests
sel is the selector of the page; it gives access to all the data of the web page, so it will be used to extract data from the HTML code. It is possible to create the same selector in the terminal, as indicated previously, with the command:
scrapy shell 'URL_path'
Having a selector available on a target page can be extremely useful for debugging or for ad hoc operations on that URL, and for testing expressions before you write the code.
requests is a list that will contain the HTTP requests of the spider. sel.xpath is a method that uses XPath syntax to extract data from the selector and returns a list of selectors, each representing one of the nodes identified by the expression. We iterate over the returned list and, for each selector, pick the piece of HTML code that contains the name and the URL of the current category. extract() then returns the selected data as a list of Unicode strings, from which we take the first element, and strip() removes leading and trailing whitespace. Finally, a Request object for the next web page to scan is created and appended to requests. url is the target URL of the request, composed of BASE_URL + the relative URL, and callback is the method that will be called to handle the response to this request.
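Because the extracted data arrives as a list, taking the first element fails with an IndexError when the expression matched nothing. A small helper can make that case explicit. The function below is a hypothetical convenience written for this article, not part of Scrapy; plain lists stand in for the result of extract().

```python
def first_stripped(values, default=None):
    """Return the first extracted string without surrounding whitespace.

    `values` stands in for the list returned by extract(); an empty list
    means the XPath expression matched nothing.
    """
    try:
        return values[0].strip()
    except IndexError:
        return default


print(first_stripped(["  Shoes \n"]))
print(first_stripped([], default=""))
```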
Depending on the structure of the website we want to scan, several chained callback methods, one for each sub-category level in which items are grouped, might be required to model the structure of the web pages accurately. For brevity, these intermediate methods are not shown in this article: they follow the same schema as parse and differ only in practical details. Only the last one is described, which saves the information from each single item into the item fields listed in the items.py file:
    def parse_items(self, response):
        sel = Selector(response)
        # [0] raises IndexError when nothing matched, which triggers the
        # fallback expression. split() leaves price as a list of the text
        # around the euro sign.
        try:
            price = sel.xpath('//ul/li/text()').extract()[0].strip().split(u'€')
        except IndexError:
            price = sel.xpath('//a/text()').extract()[0].strip().split(u'€')
        item = ProductsItem(
            url=sel.xpath('//a[@class="url"]/@href').extract()[0].strip(),
            website=self.name,
            name=sel.xpath('//h1[@class="headline"]/text()').extract()[0].strip(),
            description=sel.xpath('//div[@class="description"]').extract()[0].strip(),
            price=price,
            currency=u'euro',
            # Image paths are completed into absolute URLs.
            images_urls=[urlparse.urljoin(BASE_URL, x.strip())
                         for x in sel.xpath('//div[@id="id"]/img/@src').extract()]
        )
        return [item]
In this method we use XPath expressions to extract information from the selector of the web page into the fields declared in items.py, and we return the filled item.
Finally, each item returned by the spider is passed to the functions in the pipelines.py file, where the information it contains can be validated, cleaned, modified, saved, etc.
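As one example of such a cleaning step, the raw price text could be converted into a decimal number. The sketch below is only a guess at what this might look like: parse_price is a hypothetical helper, and it assumes a European number format ('.' for thousands, ',' for decimals), which may not match the target site.

```python
from decimal import Decimal, InvalidOperation


def parse_price(raw):
    """Turn a string such as ' 1.299,50 € ' into a Decimal, or None if it fails.

    Assumes a European format: '.' for thousands and ',' for decimals.
    """
    number = raw.replace(u"€", "").strip()
    number = number.replace(".", "").replace(",", ".")
    try:
        return Decimal(number)
    except InvalidOperation:
        # The text was not a number, for example 'n/a' or an empty string.
        return None


print(parse_price(u" 1.299,50 € "))
print(parse_price(u"n/a"))
```

Returning None instead of raising lets the pipeline decide whether an item without a valid price should be kept or dropped.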
When the scraper is finished we can run and test it in multiple ways.
One way to test the methods of the spider is to use the parse command:
scrapy parse --spider='my_spider' -c 'parse_item' -d 'level' -v 'item_url'
In this way, it is possible to fetch the given item_url and parse it with the specified spider, called my_spider, using the method passed in parse_item and with a depth expressed in level. The verbose mode is selected with -v and shows the results for each level. This test can be very useful to verify each method of the crawler and its results individually, speeding up the testing process, especially during preliminary tests or when errors are often detected.
Fig 2: Scrapy parse function result
Another way to run the spider is with the command crawl. The simplest test is using the command without parameters:
scrapy crawl 'my_spider'
The spider will scan all the items in all the web pages found, starting from the start_urls list specified inside the spider. The crawl command can be customized with many parameters. Some of the better-known and more useful options are:
- Log file. If logs are required, it is possible to write them to a file with the expression:
scrapy crawl 'my_spider' -s LOG_FILE=my_scraper.log
In settings.py, the variable LOG_LEVEL defines which log level is required (CRITICAL, ERROR, WARNING, INFO, DEBUG). Inside the spider code it is also possible to write further log messages, in addition to the system logs, to make its behavior easier to follow.
- Depth limit. It is possible to specify a maximum depth beyond which the spider will not go during the crawl, through the variable DEPTH_LIMIT in the settings.py file or with the command:
scrapy crawl 'my_spider' -s DEPTH_LIMIT='level'
- Write results. Another customization of the crawl command dumps the results into a file called file_name.extension, placed in the folder given by the variable output_path of the settings.py file:
scrapy crawl 'my_spider' -o 'file_name.extension' -t 'extension'
- Start and stop. To enable persistence of the spider state and, thus, the ability to stop and resume a crawl, run the spider with the command:
scrapy crawl 'my_spider' -s JOBDIR=crawls/my_spider-1
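The log-file option above mentions writing extra log messages from the spider code. The sketch below shows the general pattern with Python's standard logging module. It is self-contained rather than tied to Scrapy: the logger name and the messages are invented, and the output goes to an in-memory buffer instead of a file. Inside a Scrapy spider, self.logger plays the role of the logger object.

```python
import io
import logging

# Hypothetical logger name; inside a Scrapy spider, self.logger plays this role.
logger = logging.getLogger("my_spider")
logger.setLevel(logging.INFO)      # comparable to LOG_LEVEL = 'INFO'
logger.propagate = False           # keep the example output in one place

# Write to an in-memory buffer here; a FileHandler would write to a log file.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)

logger.debug("not recorded at INFO level")
logger.info("parsed %d category links", 12)
logger.warning("no price found on %s", "http://www.target_site.com/item/1")

print(buffer.getvalue())
```

Messages below the configured level are discarded, which is why the debug line does not appear in the output.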
There are many other features that have not been discussed here, but are available to model, build and test a web crawler with the Scrapy framework.