Scrapy components

Item pipelines

class zyte_common_items.pipelines.AEPipeline

Replace standard items with matching items that use the old Zyte Automatic Extraction schema.

This item pipeline is intended to help in the migration from Zyte Automatic Extraction to Zyte API automatic extraction.

In the simplest scenarios, it can be added to the ITEM_PIPELINES setting in migrated code to ensure that the schema of output items matches the old schema.
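As a minimal sketch, enabling the pipeline in a project's settings module might look like the following. The priority value 400 is an arbitrary assumption, not a documented recommendation; pick one that fits the order of your other pipelines.

```python
# settings.py (sketch; the priority 400 is an assumed, arbitrary value)
ITEM_PIPELINES = {
    "zyte_common_items.pipelines.AEPipeline": 400,
}
```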

In scenarios where page object classes were being used to fix, extend or customize extraction, it is recommended to migrate page object classes to the new schemas, or move page object class code to the corresponding spider callback.

If you have callbacks with custom code based on the old schema, you can either migrate that code, and ideally move it to a page object class, or use zyte_common_items.ae.downgrade at the beginning of the callback, e.g.:

from scrapy_poet import DummyResponse
from zyte_common_items import Product, ae

...


def parse_product(self, response: DummyResponse, product: Product):
    product = ae.downgrade(product)
    ...

class zyte_common_items.pipelines.DropLowProbabilityItemPipeline(crawler)

Item pipeline that drops items whose probability falls below a configured threshold.

The ITEM_PROBABILITY_THRESHOLDS setting determines the probability thresholds. By default, items with probability < 0.1 are dropped.

dict objects whose values are items are also supported. For those, the probability of each item is evaluated, and items with a low probability are removed from the dict. If the dict ends up empty, it is dropped entirely.
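The dict handling described above can be illustrated with a small, self-contained sketch. It is not the pipeline's actual implementation: FakeItem and filter_dict are hypothetical names, and returning None stands in for dropping the dict entirely.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeItem:
    """Hypothetical stand-in for an item that carries a probability."""

    name: str
    probability: float


def filter_dict(
    items: Dict[Any, FakeItem], threshold: float = 0.1
) -> Optional[Dict[Any, FakeItem]]:
    """Keep only the values whose probability is not below the threshold.

    Returns None when no value survives, which models the whole dict
    being dropped.
    """
    kept = {
        key: item
        for key, item in items.items()
        if item.probability >= threshold
    }
    return kept or None


nested = {"a": FakeItem("good", 0.9), "b": FakeItem("doubtful", 0.05)}
filtered = filter_dict(nested)  # only the "a" entry remains
```

In this sketch an item whose probability equals the threshold is kept, matching the "probability < 0.1" wording for the default.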

ITEM_PROBABILITY_THRESHOLDS

Default: {"default": 0.1}

Allows defining a threshold for each item class and a default threshold for any other item class.

Thresholds for item classes can be defined using either an import path of the item class or directly using the item class itself.

For example:

from zyte_common_items import Article

ITEM_PROBABILITY_THRESHOLDS = {
    Article: 0.2,
    "zyte_common_items.Product": 0.3,
    "default": 0.15,
}
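To show how class keys, import-path keys and the "default" key could be resolved together, here is a hedged sketch. The helpers load_object, normalize and threshold_for are hypothetical and not part of the library, and the standard-library classes Decimal and Fraction merely stand in for item classes so that the example runs without extra dependencies.

```python
import importlib
from decimal import Decimal
from fractions import Fraction
from typing import Any, Dict


def load_object(path: str) -> Any:
    """Import the object named by a dotted path such as 'pkg.module.Name'."""
    module_path, _, name = path.rpartition(".")
    return getattr(importlib.import_module(module_path), name)


def normalize(thresholds: Dict[Any, float]) -> Dict[Any, float]:
    """Turn import-path keys into classes, leaving 'default' untouched."""
    normalized = {}
    for key, value in thresholds.items():
        if isinstance(key, str) and key != "default":
            key = load_object(key)
        normalized[key] = value
    return normalized


def threshold_for(item_cls: type, thresholds: Dict[Any, float]) -> float:
    """Return the class-specific threshold, else the default one."""
    normalized = normalize(thresholds)
    return normalized.get(item_cls, normalized.get("default", 0.1))


thresholds = {Decimal: 0.2, "fractions.Fraction": 0.3, "default": 0.15}
print(threshold_for(Decimal, thresholds))   # 0.2
print(threshold_for(Fraction, thresholds))  # 0.3
print(threshold_for(int, thresholds))       # 0.15
```

Normalizing the keys once means a class and its import path are interchangeable when looking up a threshold.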

Log formatters

class zyte_common_items.log_formatters.ZyteLogFormatter

Log formatter that implements support for InfoDropItem.
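A log formatter is normally selected through Scrapy's LOG_FORMATTER setting. Assuming that is how this class is meant to be enabled, a settings sketch could be:

```python
# settings.py (sketch; assumes the formatter is enabled via LOG_FORMATTER)
LOG_FORMATTER = "zyte_common_items.log_formatters.ZyteLogFormatter"
```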

class zyte_common_items.log_formatters.InfoDropItem(message: str, log_level: str | None = None)

DropItem subclass for items that should be dropped with an INFO message (instead of the default WARNING message).

It is used, for example, by DropLowProbabilityItemPipeline.