Field processors

Overview

This library provides field processors (see the web-poet documentation on field processors) and complementary mixins. Built-in page object classes and extractor classes use them by default for the corresponding fields.

By design, the processors enabled by default are “transparent”: they do not change a field's output when it is already of the expected final type. For example, if an item attribute is of type str and the field returns a str value, the default processor returns the value as-is.

Usually, to engage a built-in field processor, a field must return a Selector, SelectorList, or HtmlElement object. Then the field processor takes care of extracting the right data.
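
The behavior described above can be pictured with a minimal sketch. It is not the library's implementation: FakeSelector is a hypothetical stand-in for a parsel Selector, and text_processor is an illustrative name. It only shows the idea that a value of the final type passes through untouched, while a selector-like value triggers extraction.

```python
from typing import Any


class FakeSelector:
    """Hypothetical stand-in for a parsel Selector; holds raw text only."""

    def __init__(self, raw_text: str) -> None:
        self.raw_text = raw_text


def text_processor(value: Any) -> Any:
    """Illustrative transparent processor for a str field."""
    # Values already of the final type (or None) pass through unchanged.
    if value is None or isinstance(value, str):
        return value
    # Only selector-like inputs engage the extraction logic.
    if isinstance(value, FakeSelector):
        text = " ".join(value.raw_text.split())
        return text or None
    # Any other type is left untouched.
    return value


# A str is returned as-is, even with surrounding whitespace.
print(repr(text_processor("  Some Brand ")))
# A selector-like object is converted into clean text.
print(repr(text_processor(FakeSelector("\n  Some Brand\n"))))
```

In this sketch the field author keeps full control: returning a plain str bypasses any cleanup, and returning a selector-like object delegates the extraction to the processor.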

Field mapping

The following table indicates which fields use which processors by default in built-in page object classes and extractor classes:

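================================  ============================
Field                             Default processor
================================  ============================
aggregateRating                   rating_processor()
brand                             brand_processor()
breadcrumbs                       breadcrumbs_processor()
description (excluding articles)  description_processor()
descriptionHtml                   description_html_processor()
gtin                              gtin_processor()
images                            images_processor()
metadata                          metadata_processor()
price                             price_processor()
regularPrice                      simple_price_processor()
================================  ============================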

Examples

Here are examples of inputs and matching field implementations that work on built-in page object and extractor classes:

Each example lists the input HTML fragment first, followed by the field implementation and its output.

Input HTML fragment:

<span class="reviews">
  3.8 (7 reviews)
</span>

Field implementation:

@field
def aggregateRating(self):
    return self.css(".reviews")

Output:

Product(
    aggregateRating=AggregateRating(
        bestRating=None,
        ratingValue=3.8,
        reviewCount=7,
    ),
)

Supports separate selectors per field.

Input HTML fragment:

<p class="brand">
  <img alt='Some Brand'>
</p>

Field implementation:

@field
def brand(self):
    return self.css(".brand")

Output:

Product(
    brand="Some Brand",
)

Input HTML fragment:

<div class="nav">
  <ul>
    <li>
      <a href="/home">Home</a>
    </li>
    <li>
      <a href="/about">About</a>
    </li>
  </ul>
</div>

Field implementation:

@field
def breadcrumbs(self):
    return self.css(".nav")

Output:

Product(
    breadcrumbs=[
        Breadcrumb(
            name="Home",
            url="https://example.com/home",
        ),
        Breadcrumb(
            name="About",
            url="https://example.com/about",
        ),
    ],
)

Input HTML fragment:

<div class="desc">
  <p>Ideal for <b>scraping</b> glass.</p>
  <p>Durable and reusable.</p>
</div>

Field implementation:

@field
def descriptionHtml(self):
    return self.css(".desc")

Output:

Product(
    description=(
        "Ideal for scraping glass.\n"
        "\n"
        "Durable and reusable."
    ),
    descriptionHtml=(
        "<article>\n"
        "\n"
        "<p>Ideal for "
        "<strong>scraping</strong> "
        "glass.</p>\n"
        "\n"
        "<p>Durable and reusable.</p>\n"
        "\n"
        "</article>"
    ),
)

Input HTML fragment:

<span class="gtin">
  978-1-933624-34-1
</span>

Field implementation:

@field
def gtin(self):
    return self.css(".gtin")

Output:

Product(
    gtin=[
        ("isbn13", "9781933624341"),
    ],
)

Input HTML fragment:

<div class="price">
  <del>13,2 €</del>
  <b>10,2 €</b>
</div>

Field implementation:

@field
def price(self):
    return self.css(".price b")

@field
def regularPrice(self):
    return self.css(".price del")

Output:

Product(
    currencyRaw="€",
    price="10.20",
    regularPrice="13.20",
)
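
To make the price example above more concrete, here is a rough sketch of the kind of normalization that turns "10,2 €" into a price of "10.20" and a raw currency of "€". It is not the library's price_processor(): parse_price is a hypothetical helper, and it assumes a comma is a decimal separator and that the text has no thousands separators.

```python
import re
from decimal import Decimal
from typing import Optional, Tuple


def parse_price(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (price, currency_raw) from text such as '10,2 €'.

    Assumption: a comma is a decimal separator and the text contains
    no thousands separators.
    """
    # Find the first number, allowing "." or "," before the decimals.
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        return None, None
    amount = Decimal(match.group().replace(",", "."))
    # Whatever surrounds the number is treated as the raw currency.
    currency = (text[: match.start()] + text[match.end():]).strip() or None
    # Format with exactly two decimal places, as in the example output.
    return f"{amount:.2f}", currency


print(parse_price("10,2 €"))
print(parse_price("13,2 €"))
```

Real-world price strings are far more varied (thousands separators, currency codes, multiple numbers), which is a good reason to return a selector from the field and let the built-in processor handle the parsing.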