Basic Concepts

About Tasks

Everything in aCrawler is a Task. A Task runs its execute() method and may then yield new Tasks.

There are several basic Tasks defined here.

  • A Request task executes its default fetch() method to make an HTTP request, then automatically yields a corresponding Response task. You can pass a function as the callback argument and provide a family; both are passed on to the Response task.
  • A Response task executes its callback() method, which calls every function in callbacks with the HTTP response and may yield new tasks. A Response may have several callback functions, which come from the @callback() decorator or from the corresponding request’s parameter.
  • An Item task executes its custom_process() method, which you can override.
  • ParselItem extends Item. It accepts a Selector and uses Parsel to parse content.
  • Any new Task yielded during an existing Task’s execution is caught and delivered to the scheduler.
  • Any dictionary yielded during an existing Task’s execution is caught and treated as a DefaultItem.
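
The flow described in the list above can be sketched with a toy model. This is not aCrawler's code and does not import it: the classes below only borrow the names Task, Request, Response and DefaultItem, and the run() loop, the faked response body, and the parse() callback are assumptions made for illustration.

```python
from collections import deque


class Task:
    """Toy stand-in for a task: executing it may yield follow-up work."""

    def execute(self):
        return iter(())  # by default, a task yields nothing


class DefaultItem(Task):
    """Toy wrapper for a plain dictionary yielded by another task."""

    def __init__(self, content):
        self.content = content


class Response(Task):
    def __init__(self, body, callbacks):
        self.body = body
        self.callbacks = callbacks

    def execute(self):
        # Call every callback with the response; each one may yield new work.
        for cb in self.callbacks:
            yield from cb(self)


class Request(Task):
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

    def execute(self):
        # A real request would go over HTTP; here the body is faked.
        yield Response(body=f"<html>{self.url}</html>", callbacks=[self.callback])


def run(first_task):
    """Hypothetical scheduler loop: execute tasks until the queue is empty."""
    queue = deque([first_task])
    finished = []
    while queue:
        task = queue.popleft()
        for produced in task.execute():
            if isinstance(produced, dict):
                # Plain dictionaries are wrapped so they can be scheduled too.
                produced = DefaultItem(produced)
            queue.append(produced)
        finished.append(task)
    return finished


def parse(response):
    # A callback that yields a dictionary instead of a Task.
    yield {"length": len(response.body)}


done = run(Request("http://example.com", callback=parse))
print([type(t).__name__ for t in done])
```

In this sketch the request yields a response, the response runs its callback, and the dictionary the callback yields ends up in the queue as a DefaultItem.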

About Families

  • Each Handler has exactly one family. If a handler’s family is in a task’s families, the handler matches the task, and some functions will be called before and after the task.
  • Each task has families, which default to the names of its own class and all of its base classes. If you pass family to a task, it is appended to the task’s families. In particular, a family that the user passes to a Request is passed on to the families of its corresponding Response.
  • family is also used by the decorators @callback() and @register():
    • You can use the decorator @register() to add a handler to the crawler. You can also register a function, but then you must provide family and position as parameters. If a handler’s family is in a task’s families, the handler matches the task.
    • You can use the decorator @callback(family='') to add a callback to a response. If the family given to @callback() is in a response’s families, the callback is attached to that response.
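
The matching rule above can be illustrated with a small stand-alone sketch. It does not use aCrawler itself: the classes only mirror the names in the text, the matches() method and the "news" family are hypothetical, and storing families as a set is an assumption made here for brevity.

```python
class Task:
    def __init__(self, family=None):
        # Default families: the class's own name plus its base classes' names.
        self.families = {
            cls.__name__ for cls in type(self).__mro__ if cls is not object
        }
        # A user-passed family is added on top of the defaults.
        if family is not None:
            self.families.add(family)


class Request(Task):
    pass


class Handler:
    family = "Task"  # each handler has exactly one family

    def matches(self, task):
        # A handler matches when its family is among the task's families.
        return self.family in task.families


class NewsHandler(Handler):
    family = "news"


tagged = Request(family="news")
plain = Request()

print(sorted(tagged.families))         # ['Request', 'Task', 'news']
print(NewsHandler().matches(tagged))   # True
print(NewsHandler().matches(plain))    # False
print(Handler().matches(plain))        # True
```

A handler whose family is "Task" matches every task in this model, while one with a custom family only matches tasks that were given that family explicitly.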