Tutorial

In this tutorial, we will scrape information of popular movies from IMDB

Code is avaliable at Examples.

Start Requests

First, we start our script with rewritting start_requests():

from acrawler import Crawler, Request
class IMDBCrawler(Crawler):
    config = {'MAX_REQUESTS': 6}

    async def start_requests(self):
        yield Request('https://www.imdb.com/chart/moviemeter')

Here we don’t explictly pass callback parameter to Request because the default parse() will automatically be binded as callback function to it for any request yielded from start_requests().

First Callback Parse

Then we rewrite parse() to parse the response:

class IMDBCrawler(Crawler):

    def parse(self, response):
        for tr in response.sel.css('.lister-list tr'):
            link = tr.css('.titleColumn a::attr(href)').get()
            if link:
                yield Request(response.urljoin(link), callback=self.parse_movie)

        # or using a shortcut method
        yield from response.follow(
            ".lister-list tr .titleColumn a::attr(href)", callback=self.parse_movie
        )

During parsing, the most important attribute is acrawler.http.Response.sel. It is a Parsel Selector. In this callback function, we also yield many new tasks Request and we explictly pass callback parameter to them.

Define MovieItem

Then we need to define a new ParselItem to store results:

 from acrawler import ParselItem
 from pprint import pprint

 def process_time(value):
     # a self-defined field processing function
     # process time to minutes
     # '3h 1min' -> 181
     if value:
         res = 0
         segs = value.split(' ')
         for seg in segs:
             if seg.endswith('min'):
                 res += int(seg.replace('min',''))
             elif seg.endswith('h'):
                 res += 60*int(seg.replace('h',''))
         return res
     else:
         return value

class MovieItem(ParselItem):
   log = True
   css = {
      # just some normal css rules
      # see Parsel for detailed information
      "date": ".subtext a[href*=releaseinfo]::text",
      "rating": "span[itemprop=ratingValue]::text",
      "rating_count": "span[itemprop=ratingCount]::text",
      "metascore": ".metacriticScore span::text",

      # if you provide a list with additional functions,
      # they are considered as field processor function
      "title": ["h1::text", str.strip],
      "time": [".subtext time::text", process_time],

      # the following four fules is for getting all matching values
      # the rule starts with [ and ends with ] comparing to normal rules
      "genres": "[.subtext a[href*=genres]::text]",
      "director": "[h4:contains(Director) ~ a[href*=name]::text]",
      "writers": "[h4:contains(Writer) ~ a[href*=name]::text]",
      "stars": "[h4:contains(Star) ~ a[href*=name]::text]",
   }

     def custom_process(self):
         pprint(self)

Parse Movie Page

Then we write our callback function for movie page:

class IMDBCrawler(Crawler):

    async def parse_movie(self, response):
        url = response.url_str
        yield MovieItem(response.sel, extra={'url': url.split('?')[0]})

Here in this callback function, we yield a new task MovieItem, which will execute and collect all information from the page.

We also pass a dictionary to extra. During initialing, item’s content will be updated from extra at first.

Start Crawling

To start crawling, simply write:

if __name__ == "__main__":
    IMDBCrawler().run()

Here is one of the items:

{'date': '26 April 2019 (USA)',
'director': ['Anthony Russo', 'Joe Russo'],
'genres': ['Action', 'Adventure', 'Sci-Fi'],
'metascore': '78',
'rating': '8.8',
'rating_count': '407,691',
'stars': ['Robert Downey Jr.', 'Chris Evans', 'Mark Ruffalo'],
'time': 181,
'title': 'Avengers: Endgame',
'url': 'https://www.imdb.com/title/tt4154796/',
'writers': ['Christopher Markus', 'Stephen McFeely']}

Register a Handler

We can define a dummy handler to send a warning if the movie is a horror movie:

@register()
class HorrorHandler(Handler):
    family = 'MovieItem'
    logger = get_logger('horrorlogger')

    async def handle_after(self, item):
        if item['genres'] and 'Horror' in item['genres']:
            self.logger.warning(
                "({}) is a horror movie!!!!".format(item['title']))

In this case, handler is register to MovieItem with a specific family provided:

2019-05-24 18:37:22,888 acrawler.horrorlogger WARNING  (Midsommar) is a horror movie!!!!

Periodical & Persistent

If we want the crawler supports keyboard interupt(Ctrl-C) and resumes crawling next time, the config PERSISTENT should be set.

If we want to recrawl the index page every 4 hour starting from a specific time, we can provide recrawl and exetime parameters:

import time
class IMDBCrawler(Crawler):
    config = {
        'MAX_REQUESTS': 6,
        'PERSISTENT': True,
        'PERSISTENT_NAME': 'IMDBv0.1'
    }

    async def start_requests(self):
        yield Request('https://www.imdb.com/chart/moviemeter',
                      exetime=time.mktime((2019,5,24,18,30,0,0,0,0)),
                      recrawl=4*60*60)