Main Interface

Crawler

class acrawler.crawler.Crawler(config=None, middleware_config=None, request_config=None)

This is the base crawler, from which all crawlers that you write yourself must inherit.

Attributes:

start_urls = []

URLs consumed by the vanilla start_requests().

parsers = []

Shortcuts for parsing responses.

Crawler will automatically append acrawler.parser.Parser.parse() to response’s callbacks list for each parser in parsers.

config = {}

Config dictionary for this crawler. See available options in the Setting/Config section.

middleware_config = {}

Key-value pairs mapping handler names to priorities.

Examples

A handler with higher priority handles the task earlier. Priority 0 disables the handler:

{
    'some_old_handler': 0,
    'myhandler_first': 1000,
    'myhandler': 500
}
request_config = {}

Key-value pairs that will be passed as keyword arguments to every aiohttp request that is sent.

Acceptable keywords:

  • params – Dictionary or bytes to be sent in the query string of the new request
  • data – Dictionary, bytes, or file-like object to send in the body of the request
  • json – Any JSON-compatible Python object
  • headers – Dictionary of HTTP headers to send with the request
  • cookies – Dict object to send with the request
  • allow_redirects – If set to False, do not follow redirects
  • timeout – Optional ClientTimeout settings structure; 5 min total timeout by default
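
A minimal sketch of a request_config, assuming aiohttp is installed; the header value and the 60-second timeout are illustrative:

import aiohttp
from acrawler import Crawler

class PoliteCrawler(Crawler):
    request_config = {
        "headers": {"User-Agent": "my-crawler/0.1"},   # sent with every request
        "timeout": aiohttp.ClientTimeout(total=60),    # override the 5-minute default
        "allow_redirects": True,
    }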

middleware = None

Singleton object acrawler.middleware.middleware

run()

Core method of the crawler. Usually called to start crawling.

web_add_task_query(query)

This method handles web requests if you enable the web service. New tasks should be yielded from this method, and the crawler will finish them before sending the response. Should be overwritten.

Parameters:query (dict) – a multidict.
web_action_after_query(items)

Action to be done after the web service finishes the query and tasks. Should be overwritten.
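
A hedged sketch of overriding the web hook; it assumes web_add_task_query() may be an async generator receiving the query multidict, that Crawler and Request are importable from the package root (otherwise use acrawler.crawler.Crawler and acrawler.http.Request), and the 'url' query key is illustrative:

from acrawler import Crawler, Request

class WebCrawler(Crawler):
    config = {"WEB_ENABLE": True}

    async def web_add_task_query(self, query):
        # Yield new tasks built from the HTTP query parameters.
        url = query.get("url")
        if url:
            yield Request(url, callback=self.parse)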

add_task(new_task, dont_filter=False, ancestor=None, flag=1)

Interface to add new Task to scheduler.

Parameters:new_task (Task) – a Task, or a dictionary which will be caught as a DefaultItem task.
Return type:bool
Returns:True if the task is successfully added.
manager()

Create multiple workers to execute tasks.

next_requests()

This method will be bound to the event loop as a task. You can add tasks manually in this method.
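
A hedged sketch of overriding next_requests(); it assumes add_task() is a coroutine (it hands tasks to the scheduler), and the extra URL is illustrative:

from acrawler import Crawler
from acrawler.http import Request

class ManualCrawler(Crawler):
    async def next_requests(self):
        # Runs on the event loop as a background task; push extra work manually here.
        await self.add_task(Request("https://example.com/extra"))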

parse(response)

Default callback function for Requests generated by default start_requests().

Parameters:response (Response) – the response task generated from corresponding request.
start_requests()

Should be rewritten for your custom spider.

Otherwise it will yield every URL in start_urls. Any Request yielded from start_requests() will have parse() added to its callbacks and pass all callbacks to its Response.
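
Putting the pieces together, a minimal custom crawler might look like the following sketch; it assumes Crawler is importable from the package root and that parse() may be an async generator yielding items:

from acrawler import Crawler

class MyCrawler(Crawler):
    start_urls = ["http://quotes.toscrape.com/page/1/"]   # consumed by the vanilla start_requests()
    config = {"MAX_REQUESTS": 2}                          # see Setting/Config below

    async def parse(self, response):
        # Default callback for Requests generated by start_requests().
        # A plain dict yielded here is caught as a DefaultItem task.
        yield {"title": response.sel.css("title::text").get()}

if __name__ == "__main__":
    MyCrawler().run()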

Base Task

class acrawler.task.Task(dont_filter=False, ignore_exception=False, priority=0, meta=None, family=None, recrawl=0, exetime=0)

Task is scheduled by crawler to execute.

Parameters:
  • dont_filter (bool) – if True, every instance of the Task will be considered a new task. Otherwise its fingerprint will be checked to prevent duplication.
  • ignore_exception (bool) – if True, any exception caught during the task’s execution will not cause the task to be retried.
  • fingerprint_func – A function that receives the Task as a parameter.
  • priority (int) – Tasks are scheduled in priority order (higher first). If priorities are the same, the one initialized earlier executes first (FIFO).
  • meta (Optional[dict]) – additional information about a task. It can be used with fingerprint, execute() or middleware’s methods. If a task’s execution yields a new task, the old task’s meta should be passed to the new one.
  • family – used to distinguish the task’s type. Defaults to __class__.__name__.
tries = None

Every execution increases it by 1. If a task’s tries is larger than the scheduler’s max_tries, it will fail. Defaults to 0.

init_time = None

The timestamp of task’s initializing time.

last_crawl_time = None

The timestamp of the task’s last execution time.

exceptions = None

A list storing exceptions that occur during execution.

score

Implements the task’s real priority based on its expected execution time and priority.

fingerprint

Returns the value of _fingerprint().

execute(**kwargs)

Main entry point for a task to start working.

Parameters:
  • middleware – needed to call custom functions before or after executing work.
  • kwargs (Any) – additional keyword args will be passed to _execute()
Return type:

AsyncGenerator[Task, None]

Returns:

an async generator that yields Task objects.
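
For illustration only, a hedged sketch of a custom Task; it assumes, based on the description of execute() above, that subclasses implement the private _execute() as an async generator whose yielded dictionaries are caught as DefaultItem tasks:

import time
from acrawler.task import Task

class HeartbeatTask(Task):
    # Hypothetical task that just records when it ran.
    def __init__(self, **kwargs):
        # 'recrawl' is assumed here to re-schedule the task (see the constructor signature).
        super().__init__(priority=10, recrawl=60, **kwargs)

    async def _execute(self, **kwargs):
        yield {"heartbeat_at": time.time()}  # caught as a DefaultItem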

HTTP Task

class acrawler.http.Request(url, callback=None, method='GET', request_config=None, status_allowed=None, encoding=None, links_to_abs=True, dont_filter=False, ignore_exception=False, meta=None, priority=0, family=None, recrawl=0, exetime=0, **kwargs)

Request is a Task that executes the fetch() method.

url
callback

Should be a callable or a list of callables. It will be passed to the corresponding response task.

family

This family will be appended to families and also passed to the corresponding response task.

status_allowed

A list of allowed status code integers. Otherwise any response task with status != 200 will fail and retry.

meta

a dictionary to deliver information. It will be passed to Response.meta.

request_config

A dictionary that will be passed as keyword arguments to aiohttp.ClientSession.request().

Acceptable keywords:

  • params – Dictionary or bytes to be sent in the query string of the new request
  • data – Dictionary, bytes, or file-like object to send in the body of the request
  • json – Any JSON-compatible Python object
  • headers – Dictionary of HTTP headers to send with the request
  • cookies – Dict object to send with the request
  • allow_redirects – If set to False, do not follow redirects
  • timeout – Optional ClientTimeout settings structure; 5 min total timeout by default

fetch()

Sends a request and returns the response as a task.

send()

This method is used for using a Request independently, without a Crawler.

class acrawler.http.Response(url, status, cookies, headers, request, body, encoding, links_to_abs=False, callbacks=None, **kwargs)

Response is a Task that executes the parse function.

status

HTTP status code of response, e.g. 200.

url

The URL as a yarl URL.

url_str

The URL as a str.

sel

a Selector. See Parsel for parsing rules.

doc

a PyQuery object.

meta

a dictionary to deliver information. It comes from Request.meta.

ok

True if status == 200 or the status is allowed by Request.status_allowed.

cookies

HTTP cookies of response (Set-Cookie HTTP header).

headers

A case-insensitive multidict proxy with HTTP headers of response.

history

Preceding requests (earliest request first) if there were redirects.

body

The whole response’s body as bytes.

text

Read the response’s body and return it as a decoded str.

request

Points to the corresponding request object that generated this response.

callbacks

list of callback functions

ok

Whether the response is allowed by the request’s configuration.

Return type:bool
update_sel(source=None)

Update response’s Selector.

Parameters:source – can be a string or a PyQuery object. If it is None, self.pq is used as the source by default.
open(path=None)

Open the response in the default browser.

urljoin(a)

Accepts a str (which can be a relative URL) or a Selector that has an href attribute.

Return type:str
Returns:an absolute URL.
paginate(css, limit=0, pass_meta=False, **kwargs)

Follow links and yield requests with the same callback functions as this response. Additional keyword arguments will be used for constructing requests.

Parameters:
  • css (str) – css selector
  • limit (int) – max number of links to follow.
follow(css, callback=None, limit=0, pass_meta=False, **kwargs)

Yield requests from the current page using a CSS selector. Additional keyword arguments will be used for constructing requests.

Parameters:
  • css (str) – css selector
  • callback (callable, optional) – Defaults to None.
  • limit – max number of links to follow.
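
A hedged sketch of using follow() and paginate() inside a callback; it assumes callbacks may be plain generator functions (with an async-generator callback you would yield each request from the returned iterator instead of using yield from), and the selectors are illustrative:

def parse_listing(response):
    # Queue detail pages with their own callback.
    yield from response.follow("a.detail-link", callback=parse_detail, limit=20)
    # Follow the "next page" link, keeping the same callbacks as this response.
    yield from response.paginate("li.next a")

def parse_detail(response):
    yield {"url": response.url_str, "title": response.sel.css("h1::text").get()}
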
spawn(item, divider=None, pass_meta=True, **kwargs)

Yield items from the current page. Additional keyword arguments will be used for constructing items.

Parameters:
  • divider (str) – css divider
  • item (ParselItem) – item class
parse()

Parse the response (call all callback functions) and return the list of yielded results. This method should be used for independent work without a Crawler.

Return type:list
class acrawler.http.FileRequest(url, *args, fdir=None, fname=None, skip_if_exists=True, callback=None, method='GET', request_config=None, dont_filter=False, meta=None, priority=0, family=None, **kwargs)

A derived Request to download files.

class acrawler.http.BrowserRequest(url, *args, page_callback=None, callback=None, method='GET', request_config=None, dont_filter=False, meta=None, priority=0, family=None, **kwargs)

A derived Request using pyppeteer to crawl pages.

There are two ways to directly deal with pyppeteer.page.Page: you can rewrite the method operate_page() or pass page_callback as a parameter. The callback function accepts two parameters: page and response.

fetch()

Sends a request and returns the response as a task.

operate_page(page, response)

Can be rewritten for custom operations on the page. Should be an async generator that yields new tasks.
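
A hedged sketch of a BrowserRequest with a page_callback; it assumes pyppeteer is installed and that the callback, like operate_page(), may be an async generator receiving page and response. The screenshot path is illustrative:

from acrawler.http import BrowserRequest

async def snap(page, response):
    # page is a pyppeteer.page.Page
    await page.screenshot({"path": "example.png"})
    yield {"url": response.url_str, "screenshot": "example.png"}

req = BrowserRequest("https://example.com", page_callback=snap)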

Item Task

class acrawler.item.Item(extra=None, extra_from_meta=False, log=None, store=None, family=None, **kwargs)

Item is a Task that executes custom_process() work. It extends MutableMapping, so it provides a dictionary interface. You can also use Item.content to access the content directly.

extra

During initialization, content will first be updated from extra.

content

Item stores information in the content, which is a dictionary.

custom_process()

Can be rewritten for custom further processing of the item.

class acrawler.item.ParselItem(selector=None, extra=None, css=None, xpath=None, re=None, default=None, inline=None, inline_divider=None, bindmap=None, **kwargs)

The item working with Parsel.

The item receives a Parsel selector and several rules. The selector will process the item’s fields with these rules. Finally, it will call processors to process each field.

classmethod bind(field=None, map=False)

Bind a field processor.
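
A sketch of a ParselItem with field rules; the selectors target quotes.toscrape.com-style markup, and defining the rules as class attributes (with a processor listed after the selector) follows the upstream examples and is an assumption here:

from acrawler.item import ParselItem

class QuoteItem(ParselItem):
    default = {"type": "quote"}
    css = {"author": "small.author::text"}
    xpath = {"text": ['.//span[@class="text"]/text()', lambda s: s.strip()]}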

class acrawler.item.DefaultItem(extra=None, extra_from_meta=False, log=None, store=None, family=None, **kwargs)

Any Python dictionary yielded from a task’s execution will be caught as a DefaultItem.

It’s the same as Item, but its families has one more member: ‘DefaultItem’.

class acrawler.item.Processors

Processors are used to spawn field processing functions for ParselItem. All the methods are static.

static first()

Get the first element from the values.

static strip()

Strip every string in the values.

classmethod map(func)

Apply the function to every item of the field’s values list.

static filter(func=<class 'bool'>)

Pick those elements of the values list for which the function returns true.

static drop(func=<class 'bool'>)

If func returns false, drop the field.

static drop_item(func=<class 'bool'>)

If func returns false, drop the Item.

static to_datetime(error_drop=False, error_keep=False, with_time=False, regex=None)

Extract a datetime; return None if not matched.

Parameters:
  • error_drop (bool, optional) – drop the field if not matched, defaults to False
  • error_keep (bool, optional) – keep the original value if not matched, defaults to False
  • with_time (bool, optional) – regex with time parsing, defaults to False
  • regex (str, optional) – a custom regex to use, defaults to None
static to_date(error_drop=False, error_keep=False, regex=None)

Extract a date; return None if not matched.

Parameters:
  • error_drop (bool, optional) – drop the field if not matched, defaults to False
  • error_keep (bool, optional) – keep the original value if not matched, defaults to False
  • regex (str, optional) – a custom regex to use, defaults to None
static to_float(error_drop=False, error_keep=False, regex=None)

Extract a float; return None if not matched.

Parameters:
  • error_drop (bool, optional) – drop the field if not matched, defaults to False
  • error_keep (bool, optional) – keep the original value if not matched, defaults to False
static to_int(error_drop=False, error_keep=False, regex=None)

Extract an int; return None if not matched.

Parameters:
  • error_drop (bool, optional) – drop the field if not matched, defaults to False
  • error_keep (bool, optional) – keep the original value if not matched, defaults to False
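
A hedged sketch of combining Processors with ParselItem rules; it assumes the Processors methods shown above are factories that return field-processing functions, and that processors can be listed after the selector as in the earlier ParselItem sketch. The selectors are illustrative:

from acrawler.item import ParselItem, Processors

class PriceItem(ParselItem):
    css = {
        "title": ["h1::text", Processors.map(str.strip)],
        "price": ["span.price::text", Processors.to_float(error_drop=True)],
    }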

Parser

class acrawler.parser.Parser(in_pattern='', follow_patterns=None, css_divider=None, item_type=None, extra=None, selectors_loader=None, callbacks=None)

A basic parser.

It is a shortcut class for parsing responses. If there are parsers in Crawler.parsers, the crawler will call each Parser’s parse method with the response to yield new Request tasks or Item tasks.

Parameters:
  • in_pattern (str) – a string as a regex pattern or a function.
  • follow_patterns (Optional[List[str]]) – a list containing strings as regex patterns or a function.
  • item_type (Optional[ParselItem]) – a custom item class to store results.
  • css_divider (Optional[str]) – You may have many pieces in one response. Yield them in different selectors by providing a css_divider.
  • selectors_loader (Optional[Callable]) – a function accepts selector and yield selectors. Default one deals with css_divider.
  • callbacks (Optional[List[Callable]]) – additional callbacks.

Follow new links and yield Requests from the response.

parse_items(response)

Get items from all selectors in the loader.

parse(response)

Main function to parse the response.
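
A sketch of wiring parsers into a crawler; it assumes Crawler and Parser are importable from the package root, and the regex patterns, selectors, and divider are illustrative:

from acrawler import Crawler, Parser
from acrawler.item import ParselItem

class QuoteItem(ParselItem):
    css = {"author": "small.author::text"}

class QuoteCrawler(Crawler):
    start_urls = ["http://quotes.toscrape.com/page/1/"]
    main_page = r"quotes.toscrape.com/page/\d+"
    parsers = [
        Parser(
            in_pattern=main_page,          # responses whose URL matches this pattern are parsed
            follow_patterns=[main_page],   # matching links become new Request tasks
            item_type=QuoteItem,           # one QuoteItem per css_divider block
            css_divider=".quote",
        )
    ]

if __name__ == "__main__":
    QuoteCrawler().run()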

Handlers

class acrawler.middleware.Handler(family=None, func_before=None, func_after=None, func_start=None, func_close=None)

A handler wraps functions for a specific task.

priority = 500

A handler with higher priority will be checked against the task earlier. A handler with priority 0 will be disabled.

family = '_Default'

Associated with Task’s families. One handler only has one family. If a handler’s family is in a task’s families, this handler matches the task and some functions will be called before and after the task.

handle_after(task)

The function called after the execution of the task.

handle_before(task)

The function called before the execution of the task.

on_close()

When Crawler closes, this method will be called.

on_start()

When the Crawler starts (before start_requests()), this method will be called.
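
A hedged sketch of a concrete handler. It assumes a Request task's families include 'Request' (the class name) and that handle_before() may be a plain method; the header value is illustrative:

from acrawler.middleware import Handler, register

@register()
class AddUserAgent(Handler):
    family = "Request"   # match Request tasks
    priority = 600

    def handle_before(self, task):
        # task is a Request here; inject a default header before it is fetched.
        task.request_config.setdefault("headers", {})["User-Agent"] = "my-crawler/0.1"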

acrawler.middleware.register(family=None, position=None, priority=None)

Shortcut for middleware.register()

acrawler.handlers.callback(family)

The decorator to add callback function.

class acrawler.middleware._Middleware
register(family=None, position=None, priority=None)

The factory method for creating decorators to register handlers to middleware. Single-dispatched for different types of targets.

If you register a function, you must give position and family. If you register a Handler class, you can register it without explicit parameters:

@register(family='Myfamily', position=1)
def my_func(task):
    print("This is called before execution")
    print(task)

@register()
class MyHandler(Handler):
    family = 'Myfamily'

    def handle(self, task):
        print("This is called before execution")
        print(task)
append_func(func, family=None, position=None, priority=None)

Construct a handler class from the given function and register it.

Parameters:
  • func (callable) – the function to wrap as a handler
  • family (str, optional) – defaults to None
  • position (int, optional) – 0, 1, 2, or 3; defaults to None
  • priority (int, optional) – defaults to None
class acrawler.handlers.ItemToRedis(family=None, func_before=None, func_after=None, func_start=None, func_close=None)
family = 'Item'

Family of this handler.

address = 'redis://localhost'

An address to connect to. Can be one of the following:

  • a Redis URI — "redis://host:6379/0?encoding=utf-8";
  • a (host, port) tuple — ('localhost', 6379);
  • or a unix domain socket path string — "/path/to/redis.sock".
maxsize = 10

Maximum number of connections to keep in the pool.

items_key = 'acrawler:items'

Key of the list into which the item’s content is inserted.

handle_after(item)

The function called after the execution of the task.

on_close()

When Crawler closes, this method will be called.

on_start()

When the Crawler starts (before start_requests()), this method will be called.

class acrawler.handlers.ItemToMongo(family=None, func_before=None, func_after=None, func_start=None, func_close=None)
family = 'Item'

Family of this handler.

address = 'mongodb://localhost:27017'

A full MongoDB URI, in addition to a simple hostname.

db_name = ''

Name of the target database.

col_name = ''

Name of the target collection.

handle_after(item)

The function called after the execution of the task.

on_close()

When Crawler closes, this method will be called.

on_start()

When the Crawler starts (before start_requests()), this method will be called.
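
To use this handler you typically subclass it with your connection details and register it; a hedged sketch, with illustrative database and collection names:

from acrawler.handlers import ItemToMongo
from acrawler.middleware import register

@register()
class QuotesToMongo(ItemToMongo):
    address = "mongodb://localhost:27017"
    db_name = "scraping"
    col_name = "quotes"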

class acrawler.handlers.ResponseAddCallback(family=None, func_before=None, func_after=None, func_start=None, func_close=None)

(Before execution) Add Parser.parse() to Response.callbacks.

handle_before(response)

The function called before the execution of the task.

class acrawler.handlers.ExpiredWatcher(*args, **kwargs)

Maintain an expired Event.

You can set this event and then custom_expired_worker() will be woken up to do the bypassing work. You should overwrite custom_on_start() if needed, rather than the default one.

Parameters:
  • expired – a Event to tell the worker that your token is expired.
  • last_handle_time – a timestamp when the last work happened.
  • ttl – if the set signal is sent at a time less than last_handle_time + ttl, it will be ignored.
on_start()

When the Crawler starts (before start_requests()), this method will be called.

Setting/Config

There are default settings for aCrawler.

You can provide settings by writing a new setting.py in your working directory or by writing them in the Crawler’s attributes.
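
For example, a project-level setting.py could look like the following sketch; the values are illustrative and the option names are the ones documented below:

# setting.py, placed in your working directory
MAX_REQUESTS = 8
DOWNLOAD_DELAY = 1
DOWNLOAD_DELAY_SPECIAL_HOST = {"example.com": 5}
LOG_LEVEL = "DEBUG"
MAX_REQUESTS_SPECIAL_HOST = {"example.com": 2}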

acrawler.setting.DOWNLOAD_DELAY = 0

Every Request worker will delay this many seconds before sending a new Request.

acrawler.setting.DOWNLOAD_DELAY_SPECIAL_HOST = {}

Every Request worker for a specific host will delay this many seconds before sending a new Request.

acrawler.setting.LOG_LEVEL = 'INFO'

Default log level.

acrawler.setting.LOG_TO_FILE = None

Redirect the log to a file path.

acrawler.setting.LOG_TIME_DELTA = 60

How many seconds between logging new crawling statistics. 0: disabled.

acrawler.setting.STATUS_ALLOWED = None

A list of integers representing allowed status codes other than 200.

acrawler.setting.MAX_TRIES = 3

A task will try to execute up to MAX_TRIES times before failing completely.

acrawler.setting.MAX_REQUESTS = 4

A crawler will send at most MAX_REQUESTS requests concurrently.

acrawler.setting.MAX_REQUESTS_PER_HOST = 0

Limit simultaneous connections to the same host.

acrawler.setting.MAX_REQUESTS_SPECIAL_HOST = {}

Limit simultaneous connections with a host-limit dictionary.

acrawler.setting.REDIS_ENABLE = False

Set to True if you want distributed crawling support. If it is True, the crawler will obtain crawler.redis and always lock itself (keep running after finishing tasks).

acrawler.setting.REDIS_START_KEY = None

If it is not None, the crawler will try to get URLs from the Redis list REDIS_START_KEY and send Requests (binding crawler.parse as their callback function).

acrawler.setting.REDIS_QUEUE_KEY = None
acrawler.setting.REDIS_DF_KEY = None
acrawler.setting.REDIS_ADDRESS = 'redis://localhost'
acrawler.setting.WEB_ENABLE = False

Set to True if you want web service support. If it is True, the crawler will always lock itself.

acrawler.setting.WEB_HOST = 'localhost'

Host for the web service.

acrawler.setting.WEB_PORT = 8079

Port for the web service.

acrawler.setting.LOCK_ALWAYS = False

Set to True if you don’t want the crawler to exit after finishing tasks.

acrawler.setting.PERSISTENT = False

Set to True if you want stop-resume support. If you enable distributed support, this config will be ignored.

acrawler.setting.PERSISTENT_NAME = None

A name tag for the file storage used by persistence support.

Utils

This module provides utility functions that are used by aCrawler. Some are used for external consumption.

acrawler.utils.merge_dicts(a, b)

Merges b into a

acrawler.utils.check_import(name, allow_import_error=False)

Safely import a module only if it’s not already imported.

acrawler.utils.open_html(html, path=None)

A helper function to debug your response. Usually called with open_html(response.text).

acrawler.utils.get_logger(name='user')

Get a logger which has the same configuration as crawler’s logger.

acrawler.utils.redis_push_start_urls(key, url=None, address='redis://localhost')

When you are using Redis-based distributed crawling, you can use this function to feed start_urls to Redis.
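
For example (the key must match your REDIS_START_KEY setting; the values are illustrative):

from acrawler.utils import redis_push_start_urls

redis_push_start_urls(
    key="acrawler:start_urls",
    url="http://quotes.toscrape.com/page/1/",
    address="redis://localhost",
)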

acrawler.utils.sync_coroutine(coro, loop=None)

Run a coroutine in a synchronous way.

acrawler.utils.redis_push_start_urls_coro(key, url=None, address='redis://localhost')

Coroutine version of redis_push_start_urls()