Scrapy Pipelines

class zyte_common_items.pipelines.AEPipeline

Replace standard items with matching items that use the old Zyte Automatic Extraction schema.

This item pipeline is intended to help in the migration from Zyte Automatic Extraction to Zyte API automatic extraction.

In the simplest scenarios, it can be added to the ITEM_PIPELINES setting in migrated code to ensure that the schema of output items matches the old schema.
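As a minimal sketch of that simplest scenario, a project's settings module might enable the pipeline as shown here. The priority value 100 is an arbitrary assumption for illustration; choose whatever order suits the other pipelines in your project.

```python
# settings.py (sketch): enable AEPipeline so that output items use the old schema.
# The priority value 100 is an illustrative assumption, not a documented requirement.
ITEM_PIPELINES = {
    "zyte_common_items.pipelines.AEPipeline": 100,
}
```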

In scenarios where page object classes were being used to fix, extend or customize extraction, it is recommended to migrate page object classes to the new schemas, or move page object class code to the corresponding spider callback.

If you have callbacks with custom code based on the old schema, you can either migrate that code, and ideally move it to a page object class, or use zyte_common_items.ae.downgrade at the beginning of the callback, e.g.:

from scrapy_poet import DummyResponse
from zyte_common_items import Product, ae

...


def parse_product(self, response: DummyResponse, product: Product):
    product = ae.downgrade(product)
    ...

class zyte_common_items.pipelines.DropLowProbabilityItemPipeline(crawler)

This pipeline drops an item if its probability is lower than the threshold defined in the settings.

By default, a threshold of 0.1 is used, i.e. items with probability < 0.1 are dropped.

You can customize thresholds with the ITEM_PROBABILITY_THRESHOLDS setting, which lets you define a threshold for each Item class separately and set a default threshold for all other item classes.

Thresholds for Item classes can be defined using either the path to the Item class or directly using the Item classes themselves.

An example of using ITEM_PROBABILITY_THRESHOLDS:

from zyte_common_items import Article

ITEM_PROBABILITY_THRESHOLDS = {
    Article: 0.2,
    "zyte_common_items.Product": 0.3,
    "default": 0.15,
}
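
To illustrate how such a mapping could be interpreted, the following sketch resolves a threshold for an item and decides whether to drop it. It is not the pipeline's actual implementation: resolve_threshold, should_drop and class_path are hypothetical helpers, the item classes are local stand-ins, and string keys are matched by comparing dotted paths rather than by importing them, which is a simplification. Keeping items that have no probability is also an assumption of this sketch.

```python
from typing import Any, Dict, Optional, Union

DEFAULT_THRESHOLD = 0.1  # default threshold stated in the text above


class Article:  # stand-in item class for this sketch
    pass


class Product:  # stand-in item class for this sketch
    pass


class JobPosting:  # stand-in item class with no explicit threshold
    pass


def class_path(cls: type) -> str:
    """Return the dotted path of a class (module plus qualified name)."""
    return f"{cls.__module__}.{cls.__qualname__}"


def resolve_threshold(item: Any, thresholds: Dict[Union[type, str], float]) -> float:
    """Hypothetical helper: pick the threshold that applies to an item."""
    cls = type(item)
    if cls in thresholds:  # key given as the class itself
        return thresholds[cls]
    path = class_path(cls)
    if path in thresholds:  # key given as a dotted path string
        return thresholds[path]
    # Fall back to the "default" key, then to the built-in default.
    return thresholds.get("default", DEFAULT_THRESHOLD)


def should_drop(probability: Optional[float], threshold: float) -> bool:
    """Drop only when a probability exists and is strictly below the threshold."""
    # Assumption: items without a probability are kept.
    return probability is not None and probability < threshold


thresholds = {
    Article: 0.2,
    class_path(Product): 0.3,
    "default": 0.15,
}
```

With this mapping, an Article with probability 0.19 would be dropped, while a JobPosting with probability 0.15 would be kept, because the comparison is strict and the "default" entry applies to it.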