Item API

Product

class zyte_common_items.Product(**kwargs)

Product from an e-commerce website.

url is the only required attribute.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

additionalProperties: List[AdditionalProperty] | None

List of name-value pairs of product data.

Additional properties usually appear in product pages in the form of a specification table or a free-form specification list that can be easily turned into key-value pairs, where keys indicate the name of a property and values indicate the value of that property.

Additional properties that require 1 or more extra requests may not be extracted.

See also features.

aggregateRating: AggregateRating | None

Aggregate data about reviews and ratings.

availability: str | None

Product availability status.

The value is expected to be one of: "InStock", "OutOfStock".

brand: Brand | None

Brand or manufacturer of the product.

breadcrumbs: List[Breadcrumb] | None

Webpage breadcrumb trail.

canonicalUrl: str | None

Canonical form of the URL, as indicated by the website.

See also url.

color: str | None

Color of the product.

It is extracted as displayed (e.g. "white").

See also: size, style.

currency: str | None

Price currency ISO 4217 alphabetic code (e.g. "USD").

See also currencyRaw.

currencyRaw: str | None

Price currency as it appears on the webpage (no post-processing).

This is usually the currency that appears next to the price visually on the webpage. It is commonly a symbol but can also appear normalized already next to the price. For example, both “$” and “USD” are possible values.

Non-currencies, such as "-", should not be extracted as currencyRaw.

See also currency.

description: str | None

Plain-text, complete product description.

If the description is split across different parts of the source webpage, only the main part, containing the most useful pieces of information, should be extracted into this attribute.

It may contain data found in other attributes (features, additionalProperties).

Format-wise:

  • Line breaks and non-ASCII characters are allowed.

  • There is no length limit for this attribute; the content should not be truncated.

  • There should be no whitespace at the beginning or end.

See also descriptionHtml.
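
As a rough illustration of the format rules above, the following sketch trims surrounding whitespace while leaving line breaks and non-ASCII characters untouched. The helper name clean_description is hypothetical and not part of zyte_common_items; mapping an empty result to None is also an assumption of this sketch.

```python
from typing import Optional


def clean_description(text: Optional[str]) -> Optional[str]:
    """Strip leading/trailing whitespace; keep line breaks and non-ASCII text."""
    if text is None:
        return None
    cleaned = text.strip()
    # Assumption of this sketch: an empty description is reported as missing.
    return cleaned or None


print(clean_description("  Café table\nSolid oak, 120 cm  "))
```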

descriptionHtml: str | None

HTML containing the complete product description.

See description for extraction details.

The format is not the raw HTML from the source webpage. See the HTML normalization specification for details.

features: List[str] | None

List of product features.

They are usually listed as bullet points in product webpages.

See also additionalProperties.

gtin: List[Gtin] | None

List of standardized GTIN product identifiers associated with the product, which are unique for the product across different sellers.

See also: mpn, productId, sku.

images: List[Image] | None

All product images.

The main image (see mainImage) should be first in the list.

Images only displayed as part of the product description are excluded.
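
The ordering rule can be sketched as follows, assuming images are handled as plain dictionaries with a "url" key before being turned into items. The function order_images is a hypothetical helper, and dropping repeated URLs is an extra convenience of this sketch, not a documented requirement.

```python
from typing import Dict, List, Optional


def order_images(
    images: List[Dict[str, str]], main_image: Optional[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Return the images with the main image first and no repeated URLs."""
    ordered = []
    seen = set()
    candidates = ([main_image] if main_image else []) + list(images)
    for image in candidates:
        url = image["url"]
        if url not in seen:
            seen.add(url)
            ordered.append(image)
    return ordered
```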

mainImage: Image | None

Main image of the product.

metadata: ProductMetadata | None

Data extraction process metadata.

mpn: str | None

Manufacturer part number (MPN) of the product.

The MPN is issued by the manufacturer, so a product should have the same MPN across different e-commerce websites.

See also: gtin, productId, sku.

name: str | None

Product name as it appears on the webpage (no post-processing).

price: str | None

Price at which the product is being offered at the moment.

It must be formatted with a full stop as decimal separator and no thousands separator or currency, e.g. "10500.99".

If there are any discounts, this is the price with discounts applied.

If the price is indicated with and without value-added tax (VAT), this is the price with VAT.

See also: regularPrice, currency, currencyRaw.
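
A minimal sketch of the price format described above, assuming the caller already knows which character the website uses as decimal separator. The function format_price and its decimal_separator parameter are hypothetical and not part of zyte_common_items.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def format_price(raw: str, decimal_separator: str = ".") -> Optional[str]:
    """Turn a displayed price such as "$10,500.99" into "10500.99"."""
    thousands_separator = "," if decimal_separator == "." else "."
    # Keep only digits and the two separator characters.
    kept = "".join(ch for ch in raw if ch.isdigit() or ch in ".,")
    kept = kept.replace(thousands_separator, "").replace(decimal_separator, ".")
    try:
        value = Decimal(kept)
    except InvalidOperation:
        # Nothing that looks like a number was found.
        return None
    return format(value, "f")


print(format_price("$10,500.99"))  # 10500.99
```

Passing decimal_separator="," handles prices displayed like "10.500,99 €".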

productId: str | None

Product identifier, unique within an e-commerce website.

It may come in the form of an SKU or any other identifier, a hash, or even a URL.

See also: gtin, mpn, sku.

regularPrice: str | None

Price shown on the webpage as a price at which the product has been offered in the past by the same retailer, presented as a reference next to the current price.

It may be labeled as the original price, the price before discount, the list price, or the maximum retail price for which the product is sold.

It must be formatted with a full stop as decimal separator and no thousands separator or currency, e.g. "15000.99".

regularPrice must be None if price is None. If not None, regularPrice must be higher than price.

If price is extracted with value-added tax (VAT), regularPrice must be extracted with VAT. If price is extracted without VAT, regularPrice must be extracted without VAT.

See also: price, currency, currencyRaw.
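
The two constraints between price and regularPrice could be checked along these lines. This is only a sketch: check_prices is a hypothetical helper, and it assumes both values already follow the string format described above. The values are compared as decimals rather than as strings, because string comparison would rank "9.99" above "10.00".

```python
from decimal import Decimal
from typing import List, Optional


def check_prices(price: Optional[str], regular_price: Optional[str]) -> List[str]:
    """Return the rule violations found; an empty list means the pair is consistent."""
    problems: List[str] = []
    if regular_price is None:
        return problems
    if price is None:
        problems.append("regularPrice must be None when price is None")
    elif Decimal(regular_price) <= Decimal(price):
        problems.append("regularPrice must be higher than price")
    return problems


print(check_prices("10500.99", "15000.99"))  # []
```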

size: str | None

Size, dimensions or volume of the product.

It is extracted as displayed (e.g. "XL", "32Wx34L", "Large", "750x450x800", "10m", "Height: 48cm - 86cm, Width: 204cm, Depth: 93cm").

See also: color, style.

sku: str | None

Stock keeping unit (SKU) identifier, i.e. a merchant-specific product identifier.

See also: gtin, mpn, productId.

style: str | None

Style, pattern or finish of the product.

It is extracted as displayed (e.g. "polka dots", "Striped", "Nickel finish with Translucent glass").

See also: color, size.

url: str

Main URL from which the data has been extracted.

See also canonicalUrl.

variants: List[ProductVariant] | None

List of product variants.

When slightly different versions of a product are displayed on the same product page, allowing you to choose a specific product version from a selection, each of those product versions is considered a product variant.

Product variants usually differ in color or size.

The following items are not considered product variants:

  • Other products.

  • Recommended products.

  • Different products within the same bundle of products.

  • Product add-ons, e.g. premium upgrades of a base product.

If only one “variant” is shown in the page, it is not considered a product variant.

Only variant-specific data is extracted as product variant details. For example, if variant-specific versions of the product description do not exist in the source webpage, the description attributes of the product variant are not filled with the base product description.

Extracted product variants may not include those that are not visible in the source webpage.

Product variant details may not include those that require multiple additional requests (e.g. 1 or more requests per variant).

There must not be duplicate variants.
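
One possible way to enforce the no-duplicates rule before building items is sketched below. It assumes variants are JSON-serializable dictionaries and treats two variants as duplicates when all their fields are equal; the source does not define duplication more precisely, and drop_duplicate_variants is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def drop_duplicate_variants(variants: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each variant, preserving order."""
    unique = []
    seen = set()
    for variant in variants:
        # Serialize with sorted keys so that key order does not affect equality.
        key = json.dumps(variant, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(variant)
    return unique
```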

class zyte_common_items.ProductVariant(**kwargs)

Product variant.

See Product.variants, ProductVariantExtractor, ProductVariantSelectorExtractor.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

additionalProperties: List[AdditionalProperty] | None

List of name-value pairs of data about a specific, otherwise unmapped feature.

Additional properties usually appear in product pages in the form of a specification table or a free-form specification list.

Additional properties that require 1 or more extra requests may not be extracted.

See also features.

availability: str | None

Availability status.

The value is expected to be one of: "InStock", "OutOfStock".

canonicalUrl: str | None

Canonical form of the URL, as indicated by the website.

See also url.

color: str | None

Color.

It is extracted as displayed (e.g. "white").

See also: size, style.

currency: str | None

Price currency ISO 4217 alphabetic code (e.g. "USD").

See also currencyRaw.

currencyRaw: str | None

Price currency as it appears on the webpage (no post-processing), e.g. "$".

See also currency.

gtin: List[Gtin] | None

List of standardized GTIN product identifiers associated with the product, which are unique for the product across different sellers.

See also: mpn, productId, sku.

images: List[Image] | None

All product images.

The main image (see mainImage) should be first in the list.

Images only displayed as part of the product description are excluded.

mainImage: Image | None

Main product image.

mpn: str | None

Manufacturer part number (MPN).

A product should have the same MPN across different e-commerce websites.

See also: gtin, productId, sku.

name: str | None

Name as it appears on the webpage (no post-processing).

price: str | None

Price at which the product is being offered.

It is a string with the price amount, with a full stop as decimal separator, and no thousands separator or currency (see currency and currencyRaw), e.g. "10500.99".

If regularPrice is not None, price should always be lower than regularPrice.

productId: str | None

Product identifier, unique within an e-commerce website.

It may come in the form of an SKU or any other identifier, a hash, or even a URL.

See also: gtin, mpn, sku.

regularPrice: str | None

Price at which the product was being offered in the past, and which is presented as a reference next to the current price.

It may be labeled as the original price, the list price, or the maximum retail price for which the product is sold.

See price for format details.

If regularPrice is not None, it should always be higher than price.

size: str | None

Size or dimensions.

Pertinent to products such as garments, shoes, accessories, etc.

It is extracted as displayed (e.g. "XL").

See also: color, style.

sku: str | None

Stock keeping unit (SKU) identifier, i.e. a merchant-specific product identifier.

See also: gtin, mpn, productId.

style: str | None

Style.

Pertinent to products such as garments, shoes, accessories, etc.

It is extracted as displayed (e.g. "polka dots").

See also: color, size.

url: str | None

Main URL from which the product variant data could be extracted.

class zyte_common_items.ProductMetadata(**kwargs)

Metadata class for zyte_common_items.Product.metadata.

dateDownloaded: str | None

Date and time when the product data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.
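
A string in this format can be produced with the standard library, for example as sketched below. The helper format_date_downloaded is hypothetical and expects a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def format_date_downloaded(moment: datetime) -> str:
    """Format an aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# A moment expressed at UTC+02:00 is converted to UTC before formatting.
plus_two = timezone(timedelta(hours=2))
print(format_date_downloaded(datetime(2024, 1, 2, 5, 4, 5, tzinfo=plus_two)))
# 2024-01-02T03:04:05Z
```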

probability: float | None

The probability (0 for 0%, 1 for 100%) that the resource features the expected data type.

For example, if the extraction of a product from a given URL is requested, and that URL points to the webpage of a product with complete certainty, the value should be 1. If with complete certainty the webpage features a job listing instead of a product, the value should be 0. When there is no complete certainty, the value could be anything in between (e.g. 0.96).
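
As an illustration of how the probability might be consumed, the sketch below reads it from an item represented as a plain dictionary and compares it with a cut-off. The helper names and the 0.5 threshold are assumptions made for this example, not values recommended by the library.

```python
from typing import Any, Dict, Optional


def probability_of(item: Dict[str, Any]) -> Optional[float]:
    """Read metadata.probability from a plain-dict item, or None if absent."""
    metadata = item.get("metadata") or {}
    return metadata.get("probability")


def is_confident(item: Dict[str, Any], threshold: float = 0.5) -> bool:
    """True only when a probability is present and reaches the threshold."""
    probability = probability_of(item)
    return probability is not None and probability >= threshold


print(is_confident({"url": "https://example.com", "metadata": {"probability": 0.96}}))
# True
```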

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Product list

class zyte_common_items.ProductList(**kwargs)

Product list from a product listing page of an e-commerce webpage.

It represents, for example, a single page from a category.

url is the only required attribute.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

breadcrumbs: List[Breadcrumb] | None

Webpage breadcrumb trail.

canonicalUrl: str | None

Canonical form of the URL, as indicated by the website.

See also url.

categoryName: str | None

Name of the product listing as it appears on the webpage (no post-processing).

For example, if the webpage is one of the pages of the Robots category, categoryName is 'Robots'.

metadata: ProductListMetadata | None

Data extraction process metadata.

pageNumber: int | None

Current page number, if displayed explicitly on the list page.

Numeration starts with 1.

paginationNext: Link | None

Link to the next page.

products: List[ProductFromList] | None

List of products.

It only includes product information found in the product listing page itself. Product information that requires visiting each product URL is not meant to be covered.

The order of the products reflects their position on the rendered page. Product order is top-to-bottom, and left-to-right or right-to-left depending on the webpage locale.

url: str

Main URL from which the data has been extracted.

See also canonicalUrl.

class zyte_common_items.ProductFromList(**kwargs)

Product from a product list from a product listing page of an e-commerce webpage.

See ProductList, ProductFromListExtractor, ProductFromListSelectorExtractor.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

currency: str | None

Price currency ISO 4217 alphabetic code (e.g. "USD").

See also currencyRaw.

currencyRaw: str | None

Price currency as it appears on the webpage (no post-processing), e.g. "$".

See also currency.

mainImage: Image | None

Main product image.

metadata: ProbabilityMetadata | None

Data extraction process metadata.

name: str | None

Name as it appears on the webpage (no post-processing).

price: str | None

Price at which the product is being offered.

It is a string with the price amount, with a full stop as decimal separator, and no thousands separator or currency (see currency and currencyRaw), e.g. "10500.99".

If regularPrice is not None, price should always be lower than regularPrice.

productId: str | None

Product identifier, unique within an e-commerce website.

It may come in the form of an SKU or any other identifier, a hash, or even a URL.

regularPrice: str | None

Price at which the product was being offered in the past, and which is presented as a reference next to the current price.

It may be labeled as the original price, the list price, or the maximum retail price for which the product is sold.

See price for format details.

If regularPrice is not None, it should always be higher than price.

url: str | None

Main URL from which the product data could be extracted.

class zyte_common_items.ProductListMetadata(**kwargs)

Metadata class for zyte_common_items.ProductList.metadata.

dateDownloaded: str | None

Date and time when the product data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Product navigation

class zyte_common_items.ProductNavigation(**kwargs)

Represents the navigational aspects of a product listing page on an e-commerce website.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

categoryName: str | None

Name of the category/page with the product list.

Format:

  • trimmed (no whitespace at the beginning or the end of the category name string)

items: List[ProbabilityRequest] | None

List of product links found on the category page, ordered by their position on the page.

metadata: ProductNavigationMetadata | None

Data extraction process metadata.

nextPage: Request | None

A link to the next page, if available.

pageNumber: int | None

Number of the current page.

It should only be extracted if the webpage shows a page number.

It must be 1-based. For example, if the first page of a listing is numbered as 0 on the website, it should be extracted as 1 nonetheless.
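
The 1-based rule amounts to a simple offset, sketched here with a hypothetical helper. The first_page_label parameter stands for the number the website shows on its first page and is an assumption of this example.

```python
def to_one_based(displayed_number: int, first_page_label: int = 1) -> int:
    """Convert the page number shown on the website to a 1-based page number."""
    return displayed_number - first_page_label + 1


print(to_one_based(0, first_page_label=0))  # 1
```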

subCategories: List[ProbabilityRequest] | None

List of sub-category links ordered by their position in the page.

url: str

Main URL from which the data is extracted.

class zyte_common_items.ProductNavigationMetadata(**kwargs)

Metadata class for zyte_common_items.ProductNavigation.metadata.

dateDownloaded: str | None

Date and time when the product data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Article

class zyte_common_items.Article(**kwargs)

Article, typically seen on online news websites, blogs, or announcement sections.

url is the only required attribute.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

articleBody: str | None

Clean text of the article, including sub-headings, with newline separators.

Format:

  • trimmed (no whitespace at the beginning or the end of the body string),

  • line breaks included,

  • no length limit,

  • no normalization of Unicode characters.

articleBodyHtml: str | None

Simplified and standardized HTML of the article, including sub-headings, image captions and embedded content (videos, tweets, etc.).

Format: HTML string normalized in a consistent way.

audios: List[Audio] | None

All audios.

authors: List[Author] | None

All authors of the article.

breadcrumbs: List[Breadcrumb] | None

Webpage breadcrumb trail.

canonicalUrl: str | None

Canonical form of the URL, as indicated by the website.

See also url.

dateModified: str | None

Date when the article was most recently modified.

Format: ISO 8601 format: “YYYY-MM-DDThh:mm:ssZ” or “YYYY-MM-DDThh:mm:ss±zz:zz”.

With timezone, if available.
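
Both accepted forms can be parsed with the standard library, as in the sketch below. The helper parse_item_date is hypothetical. Before Python 3.11, datetime.fromisoformat() does not accept a trailing "Z", so the sketch rewrites it as an explicit UTC offset first.

```python
from datetime import datetime


def parse_item_date(value: str) -> datetime:
    """Parse "YYYY-MM-DDThh:mm:ssZ" or "YYYY-MM-DDThh:mm:ss±zz:zz"."""
    if value.endswith("Z"):
        # Express UTC as an explicit offset for older Python versions.
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


print(parse_item_date("2024-05-01T12:30:00Z").isoformat())
# 2024-05-01T12:30:00+00:00
```

A value without any timezone information yields a naive datetime, which matches the "if available" wording above.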

dateModifiedRaw: str | None

Same date as dateModified, but before parsing/normalization, i.e. as it appears on the website.

datePublished: str | None

Publication date of the article.

Format: ISO 8601 format: “YYYY-MM-DDThh:mm:ssZ” or “YYYY-MM-DDThh:mm:ss±zz:zz”.

With timezone, if available.

If the actual publication date is not found, the value of dateModified is used instead.

datePublishedRaw: str | None

Same date as datePublished, but before parsing/normalization, i.e. as it appears on the website.

description: str | None

A short summary of the article.

It can be either human-provided (if available), or auto-generated.

headline: str | None

Headline or title.

images: List[Image] | None

All images.

inLanguage: str | None

Language of the article, as an ISO 639-1 language code.

Sometimes the article language is not the same as the overall language of the webpage.

mainImage: Image | None

Main image.

metadata: ArticleMetadata | None

Data extraction process metadata.

url: str

The main URL of the article page.

The URL of the final response, after any redirects.

Required attribute.

In case there is no article data on the page or the page was not reached, the returned “empty” item would still contain this URL field.

videos: List[Video] | None

All videos.

class zyte_common_items.ArticleMetadata(**kwargs)

Metadata class for zyte_common_items.Article.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

probability: float | None

The probability (0 for 0%, 1 for 100%) that the resource features the expected data type.

For example, if the extraction of a product from a given URL is requested, and that URL points to the webpage of a product with complete certainty, the value should be 1. If with complete certainty the webpage features a job listing instead of a product, the value should be 0. When there is no complete certainty, the value could be anything in between (e.g. 0.96).

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Article list

class zyte_common_items.ArticleList(**kwargs)

Article list from an article listing page.

url is the only required attribute; all other fields are optional.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

articles: List[ArticleFromList] | None

List of article details found on the page.

The order of the articles reflects their position on the page.

breadcrumbs: List[Breadcrumb] | None

Webpage breadcrumb trail.

canonicalUrl: str | None

Canonical form of the URL, as indicated by the website.

See also url.

metadata: ArticleListMetadata | None

Data extraction process metadata.

url: str

The main URL of the article list.

The URL of the final response, after any redirects.

Required attribute.

In case there is no article list data on the page or the page was not reached, the returned item still contains this URL field and all the other available datapoints.

class zyte_common_items.ArticleFromList(**kwargs)

Article from an article list from an article listing page.

See ArticleList.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

articleBody: str | None

Clean text of the article, including sub-headings, with newline separators.

Format:

  • trimmed (no whitespace at the beginning or the end of the body string),

  • line breaks included,

  • no length limit,

  • no normalization of Unicode characters.

authors: List[Author] | None

All authors of the article.

datePublished: str | None

Publication date of the article.

Format: ISO 8601 format: “YYYY-MM-DDThh:mm:ssZ” or “YYYY-MM-DDThh:mm:ss±zz:zz”.

With timezone, if available.

If the actual publication date is not found, the date of the last modification is used instead.

datePublishedRaw: str | None

Same date as datePublished, but before parsing/normalization, i.e. as it appears on the website.

headline: str | None

Headline or title.

images: List[Image] | None

All images.

inLanguage: str | None

Language of the article, as an ISO 639-1 language code.

Sometimes the article language is not the same as the overall language of the webpage.

mainImage: Image | None

Main image.

metadata: ProbabilityMetadata | None

Data extraction process metadata.

url: str | None

Main URL.

class zyte_common_items.ArticleListMetadata(**kwargs)

Metadata class for zyte_common_items.ArticleList.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Article navigation

class zyte_common_items.ArticleNavigation(**kwargs)

Represents the navigational aspects of an article listing webpage.

See ArticleList.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

categoryName: str | None

Name of the category/page.

Format:

  • trimmed (no whitespace at the beginning or the end of the category name string)

items: List[ProbabilityRequest] | None

Links to listed items in order of appearance.

metadata: ArticleNavigationMetadata | None

Data extraction process metadata.

nextPage: Request | None

A link to the next page, if available.

pageNumber: int | None

Number of the current page.

It should only be extracted if the webpage shows a page number.

It must be 1-based. For example, if the first page of a listing is numbered as 0 on the website, it should be extracted as 1 nonetheless.

subCategories: List[ProbabilityRequest] | None

List of sub-category links ordered by their position in the page.

url: str

Main URL from which the data is extracted.

class zyte_common_items.ArticleNavigationMetadata(**kwargs)

Metadata class for zyte_common_items.ArticleNavigation.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Business place

class zyte_common_items.BusinessPlace(**kwargs)

Business place, with properties typically seen on maps or business listings.

url is the only required attribute.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

actions: List[NamedLink] | None

List of actions that can be performed directly from the place page, including their URLs.

additionalProperties: List[AdditionalProperty] | None

List of name-value pairs of any unmapped additional properties specific to the place.

address: Address | None

The address details of the place.

aggregateRating: AggregateRating | None

The overall rating, based on a collection of reviews or ratings.

amenityFeatures: List[Amenity] | None

List of amenities of the place.

categories: List[str] | None

List of categories the place belongs to.

containedInPlace: ParentPlace | None

If the place is located inside another place, these are the details of the parent place.

description: str | None

The description of the place.

Stripped of leading and trailing whitespace.

features: List[str] | None

List of frequently mentioned features of this place.

images: List[Image] | None

A list of URL values of all images of the place.

isVerified: bool | None

Whether the information is verified by the owner of this place.

map: str | None

URL to a map of the place.

metadata: BusinessPlaceMetadata | None

Data extraction process metadata.

name: str | None

The name of the place.

openingHours: List[OpeningHoursItem] | None

Ordered specification of opening hours, including data for opening and closing time for each day of the week.

placeId: str | None

Unique identifier of the place on the website.

priceRange: str | None

How customers perceive the price range of the place (from z to zzzz).

reservationAction: NamedLink | None

The details of the reservation action, e.g. table reservation in case of restaurants or room reservation in case of hotels.

reviewSites: List[NamedLink] | None

List of partner review sites.

starRating: StarRating | None

Official star rating of the place.

tags: List[str] | None

List of the tags associated with the place.

telephone: str | None

The phone number associated with the place, as it appears on the page.

timezone: str | None

Timezone in which the place is situated.

Standard: Name compliant with IANA tz database (tzdata).

url: str | None

The main URL that the place data was extracted from.

The URL of the final response, after any redirects.

In case there is no place data on the page or the page was not reached, the returned “empty” item would still contain the url field and the metadata field with dateDownloaded.

website: str | None

The URL pointing to the official website of the place.

class zyte_common_items.BusinessPlaceMetadata(**kwargs)

Metadata class for zyte_common_items.BusinessPlace.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

probability: float | None

The probability (0 for 0%, 1 for 100%) that the resource features the expected data type.

For example, if the extraction of a product from a given URL is requested, and that URL points to the webpage of a product with complete certainty, the value should be 1. If with complete certainty the webpage features a job listing instead of a product, the value should be 0. When there is no complete certainty, the value could be anything in between (e.g. 0.96).

searchText: str | None

The search text used to find the item.

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Real estate

class zyte_common_items.RealEstate(**kwargs)

Real estate offer, typically seen on real estate offer aggregator websites.

url is the only required attribute.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

additionalProperties: List[AdditionalProperty] | None

List of name-value pairs holding information about specific features, usually presented as a specification table or a free-form specification list.

address: Address | None

The details of the address of the real estate.

area: RealEstateArea | None

Real estate area details.

breadcrumbs: List[Breadcrumb] | None

Webpage breadcrumb trail.

currency: str | None

The currency of the price, in 3-letter ISO 4217 format.

currencyRaw: str | None

Currency associated with the price, as it appears on the page (no post-processing).

datePublished: str | None

Publication date of the real estate offer.

Format: ISO 8601 format: “YYYY-MM-DDThh:mm:ssZ”

With timezone, if available.

datePublishedRaw: str | None

Same date as datePublished, but before parsing/normalization, i.e. as it appears on the website.

description: str | None

The description of the real estate.

Format:

  • trimmed (no whitespace at the beginning or the end of the description string),

  • line breaks included,

  • no length limit,

  • no normalization of Unicode characters,

  • no concatenation of description from different parts of the page.

images: List[Image] | None

A list of URL values of all images of the real estate.

mainImage: Image | None

The details of the main image of the real estate.

metadata: RealEstateMetadata | None

Contains metadata about the data extraction process.

name: str | None

The name of the real estate.

numberOfBathroomsTotal: int | None

The total number of bathrooms in the real estate.

numberOfBedrooms: int | None

The number of bedrooms in the real estate.

numberOfFullBathrooms: int | None

The number of full bathrooms in the real estate.

numberOfPartialBathrooms: int | None

The number of partial bathrooms in the real estate.

numberOfRooms: int | None

The number of rooms (excluding bathrooms and closets) of the real estate.

price: str | None

The offer price of the real estate.

propertyType: str | None

Type of the property, e.g. flat, house, land.

realEstateId: str | None

The identifier of the real estate, usually assigned by the seller and unique within a website, similar to product SKU.

rentalPeriod: str | None

The rental period to which the rental price applies, only available in case of rental. Usually weekly, monthly, quarterly, yearly.

tradeType: str | None

Type of a trade action: buying or renting.

url: str

The URL of the final response, after any redirects.

virtualTourUrl: str | None

The URL of the virtual tour of the real estate.

yearBuilt: int | None

The year the real estate was built.

class zyte_common_items.RealEstateMetadata(**kwargs)

Metadata class for zyte_common_items.RealEstate.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

probability: float | None

The probability (0 for 0%, 1 for 100%) that the resource features the expected data type.

For example, if the extraction of a product from a given URL is requested, and that URL points to the webpage of a product with complete certainty, the value should be 1. If with complete certainty the webpage features a job listing instead of a product, the value should be 0. When there is no complete certainty, the value could be anything in between (e.g. 0.96).
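One possible use of this value is to discard items that are unlikely to be of the expected type. The sketch below is an assumption about how a caller might do that: keep_item and the 0.5 threshold are hypothetical, and keeping items without a reported probability is a design choice of this example.

```python
from typing import Optional


def keep_item(probability: Optional[float], threshold: float = 0.5) -> bool:
    """Decide whether to keep an item based on its reported probability."""
    # None means that no probability was reported; this sketch keeps the item.
    if probability is None:
        return True
    # Otherwise keep the item only if it reaches the (hypothetical) threshold.
    return probability >= threshold


print(keep_item(0.96))  # True
print(keep_item(0.10))  # False
print(keep_item(None))  # True
```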

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.
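To make the Dict[str, List[str]] shape concrete, here is a small sketch that flattens such a mapping into readable lines. The field paths and messages in the example are made up for illustration, and summarize_validation is a hypothetical helper.

```python
from typing import Dict, List, Optional


def summarize_validation(messages: Optional[Dict[str, List[str]]]) -> List[str]:
    """Turn {field path: [issue, ...]} into one "path: issue" line per issue."""
    lines = []
    # Treat a missing mapping (None) the same as an empty one.
    for path, issues in (messages or {}).items():
        for issue in issues:
            lines.append(f"{path}: {issue}")
    return lines


# Made-up example data; real messages depend on the validation that ran.
example = {"price": ["first issue", "second issue"], "yearBuilt": ["third issue"]}
for line in summarize_validation(example):
    print(line)
```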

Job posting

class zyte_common_items.JobPosting(**kwargs)

A job posting, typically seen on job posting websites or websites of companies that are hiring.

url is the only required attribute.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

baseSalary: BaseSalary | None

The base salary of the job or of an employee in the proposed role.

dateModified: str | None

The date when the job posting was most recently modified.

Format: ISO 8601, “YYYY-MM-DDThh:mm:ssZ”

With timezone, if available.

dateModifiedRaw: str | None

Same date as dateModified, but before parsing/normalization, i.e. as it appears on the website.

datePublished: str | None

Publication date of the job posting.

Format: ISO 8601, “YYYY-MM-DDThh:mm:ssZ”

With timezone, if available.

datePublishedRaw: str | None

Same date as datePublished, but before parsing/normalization, i.e. as it appears on the website.

description: str | None

A description of the job posting including sub-headings, with newline separators.

Format:

  • trimmed (no whitespace at the beginning or the end of the description string),

  • line breaks included,

  • no length limit,

  • no normalization of Unicode characters.

descriptionHtml: str | None

Simplified HTML of the description, including sub-headings, image captions and embedded content.

employmentType: str | None

Type of employment (e.g. full-time, part-time, contract, temporary, seasonal, internship).

headline: str | None

The headline of the job posting.

hiringOrganization: HiringOrganization | None

Information about the organization offering the job position.

jobLocation: JobLocation | None

A (typically single) geographic location associated with the job position.

jobPostingId: str | None

The identifier of the job posting.

jobStartDate: str | None

Job start date.

Format: ISO 8601, “YYYY-MM-DDThh:mm:ssZ”

With timezone, if available.

jobStartDateRaw: str | None

Same date as jobStartDate, but before parsing/normalization, i.e. as it appears on the website.

jobTitle: str | None

The title of the job posting.

metadata: JobPostingMetadata | None

Contains metadata about the data extraction process.

remoteStatus: str | None

Specifies the remote status of the position.

requirements: List[str] | None

Candidate requirements for the job.

url: str

The URL of the final response, after any redirects.

validThrough: str | None

The date after which the job posting is not valid, e.g. the end of an offer.

Format: ISO 8601, “YYYY-MM-DDThh:mm:ssZ”

With timezone, if available.
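As a sketch of how a consumer might use validThrough, the hypothetical helper below checks whether a posting has expired. It assumes the value uses the exact “YYYY-MM-DDThh:mm:ssZ” layout with a trailing Z (UTC); values carrying another offset would need different parsing.

```python
from datetime import datetime, timezone


def is_expired(valid_through: str, now: datetime) -> bool:
    """Return True if `now` is later than the validThrough timestamp."""
    # Parse the documented layout and mark the result as UTC (the "Z" suffix).
    deadline = datetime.strptime(valid_through, "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return now > deadline


now = datetime(2024, 3, 6, 9, 0, 0, tzinfo=timezone.utc)
print(is_expired("2024-03-05T00:00:00Z", now))  # True
print(is_expired("2024-03-07T00:00:00Z", now))  # False
```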

validThroughRaw: str | None

Same date as validThrough, but before parsing/normalization, i.e. as it appears on the website.

class zyte_common_items.JobPostingMetadata(**kwargs)

Metadata class for zyte_common_items.JobPosting.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

probability: float | None

The probability (0 for 0%, 1 for 100%) that the resource features the expected data type.

For example, if the extraction of a product from a given URL is requested, and that URL points to the webpage of a product with complete certainty, the value should be 1. If with complete certainty the webpage features a job listing instead of a product, the value should be 0. When there is no complete certainty, the value could be anything in between (e.g. 0.96).

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Job posting navigation

class zyte_common_items.JobPostingNavigation(**kwargs)

Represents the navigational aspects of a job posting listing page on a job website.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

items: List[ProbabilityRequest] | None

List of job postings available on this page.

metadata: JobPostingNavigationMetadata | None

Data extraction process metadata.

nextPage: Request | None

A link to the next page, if available.

pageNumber: int | None

Number of the current page.

It should only be extracted if the webpage shows a page number.

It must be 1-based. For example, if the first page of a listing is numbered as 0 on the website, it should be extracted as 1 nonetheless.
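The 1-based rule can be sketched as a small conversion. The helper name to_page_number and its first_index_on_site parameter are hypothetical; they stand for whatever numbering the website uses.

```python
def to_page_number(site_page_index: int, first_index_on_site: int = 1) -> int:
    """Convert the page index shown by a website into a 1-based page number."""
    # Shift the index so that the first page of the listing always maps to 1.
    return site_page_index - first_index_on_site + 1


# A site that numbers its first page as 0:
print(to_page_number(0, first_index_on_site=0))  # 1
print(to_page_number(4, first_index_on_site=0))  # 5
# A site that is already 1-based is left unchanged:
print(to_page_number(3))  # 3
```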

url: str

Main URL from which the data is extracted.

class zyte_common_items.JobPostingNavigationMetadata(**kwargs)

Metadata class for zyte_common_items.JobPostingNavigation.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Search engine result

class zyte_common_items.SerpOrganicResult(**kwargs)

Data from a non-paid result of a search engine results page.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

description: str | None

Result excerpt.

displayedUrlText: str | None

Text displayed to represent url.

It may not be an actual URL, but some stylized or simplified representation of it. For example, if url is https://en.wikipedia.org/wiki/Foobar, displayedUrlText could be something like "https://en.wikipedia.org › wiki › Foobar".

name: str | None

Result title.

rank: int | None

Result position among other organic results from the same search engine results page.

This is the rank within a specific page, not within an entire search. That is, the first result of any page, even if it is not the first page of a search, must be 1.
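A minimal sketch of per-page ranking, using plain dictionaries instead of SerpOrganicResult objects to stay self-contained. The helper rank_results and the example URLs are hypothetical.

```python
from typing import Dict, List


def rank_results(urls_on_page: List[str]) -> List[Dict[str, object]]:
    """Assign ranks that restart at 1 on every results page."""
    # enumerate(start=1) numbers results by their position within this page
    # only, regardless of which page of the search is being processed.
    return [
        {"url": url, "rank": rank}
        for rank, url in enumerate(urls_on_page, start=1)
    ]


# Results from a later page of a search still start at rank 1.
later_page = ["https://example.com/a", "https://example.com/b"]
print(rank_results(later_page))
```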

url: str | None

Result URL.

Search engine results

class zyte_common_items.Serp(**kwargs)

Data from a search engine results page.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

metadata: SerpMetadata | None

Contains metadata about the data extraction process.

organicResults: List[SerpOrganicResult] | None

List of search results excluding paid results.

pageNumber: int | None

Page number.

url: str

Search URL.

class zyte_common_items.SerpMetadata(**kwargs)

Metadata class for zyte_common_items.Serp.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

displayedQuery: str | None

Search query as seen in the webpage.

searchedQuery: str | None

Search query as specified in the input URL.
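For illustration, a query could be read from an input URL as sketched below. This assumes the search engine passes the query in a q parameter, which varies between websites; searched_query is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def searched_query(url: str, param: str = "q") -> Optional[str]:
    """Return the decoded value of a query-string parameter, or None."""
    # parse_qs decodes percent-escapes and turns "+" into a space.
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None


print(searched_query("https://www.example.com/search?q=foo+bar&page=2"))  # foo bar
print(searched_query("https://www.example.com/search?page=2"))  # None
```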

totalOrganicResults: int | None

Total number of organic results reported by the search engine.

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Social media post

class zyte_common_items.SocialMediaPost(**kwargs)

Represents a single social media post.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

author: SocialMediaPostAuthor | None

Details of the author of the post.

It must not contain easily identifiable information, such as usernames.

datePublished: str | None

The timestamp at which the post was created.

Format: ISO 8601, “YYYY-MM-DDThh:mm:ssZ”, in the UTC timezone.

hashtags: List[str] | None

The list of hashtags contained in the post.

mediaUrls: List[Url] | None

The list of URLs of media files (images, videos, etc.) linked from the post.

metadata: SocialMediaPostMetadata | None

Contains metadata about the data extraction process.

postId: str | None

The identifier of the post.

reactions: Reactions | None

Details of reactions to the post.

text: str | None

The text content of the post.

url: str

The URL of the final response, after any redirects.

class zyte_common_items.SocialMediaPostMetadata(**kwargs)

Metadata class for zyte_common_items.SocialMediaPost.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

probability: float | None

The probability (0 for 0%, 1 for 100%) that the resource features the expected data type.

For example, if the extraction of a product from a given URL is requested, and that URL points to the webpage of a product with complete certainty, the value should be 1. If with complete certainty the webpage features a job listing instead of a product, the value should be 0. When there is no complete certainty, the value could be anything in between (e.g. 0.96).

searchText: str | None

The search text used to find the item.

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Forum thread

class zyte_common_items.ForumThread(**kwargs)

Represents a forum thread page.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

metadata: ForumThreadMetadata | None

Contains metadata about the data extraction process.

posts: List[SocialMediaPost] | None

List of posts available on the page, including the first or top post.

threadId: str | None

Thread ID.

topic: Topic | None

Topic discussed on the page.

url: str

The URL of the final response, after any redirects.

class zyte_common_items.ForumThreadMetadata(**kwargs)

Metadata class for zyte_common_items.ForumThread.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Search Request templates

class zyte_common_items.SearchRequestTemplate(**kwargs)

Request template to build a search Request.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

request(*, query: str | Any = <object object>, keyword: str | Any = <object object>) Request

Return a Request to search for keyword.

body: str | None

Jinja template for Request.body.

It must be a plain str, not bytes or a Base64-encoded str. Base64-encoding is done by request() after rendering this value as a Jinja template.

Defining a non-UTF-8 body is not supported.
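The sketch below illustrates only the encoding step described above: a body that has already been rendered to a plain str is encoded as UTF-8 and then Base64-encoded. It is not the implementation of request(); encode_rendered_body is a hypothetical helper and the Jinja rendering itself is left out.

```python
import base64


def encode_rendered_body(rendered_body: str) -> str:
    """Base64-encode a rendered body, going through UTF-8 bytes."""
    # The template is a plain str, so it has to become bytes before Base64
    # can be applied. This sketch always uses UTF-8 for that step.
    raw_bytes = rendered_body.encode("utf-8")
    return base64.b64encode(raw_bytes).decode("ascii")


# Example body, as it could look after rendering a template with a query.
encoded = encode_rendered_body('{"query": "foo bar"}')
print(encoded)
# Decoding restores the original text.
print(base64.b64decode(encoded).decode("utf-8"))
```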

headers: List[Header] | None

List of Header, for Request.headers, where every name and value is a Jinja template.

When a header name template renders into an empty string (after stripping whitespace), that header is removed from the resulting list of headers.
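The removal rule can be sketched on already-rendered (name, value) pairs. This is an illustration under assumptions, not the library code: drop_empty_headers is hypothetical, plain tuples stand in for Header objects, and whether kept names are themselves stripped is not specified here, so they are left untouched.

```python
from typing import List, Tuple


def drop_empty_headers(rendered: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Remove headers whose rendered name is empty or only whitespace."""
    # name.strip() is an empty (falsy) string when the name has no content.
    return [(name, value) for name, value in rendered if name.strip()]


# The second name could come from a template that rendered to nothing.
rendered_headers = [("Accept", "text/html"), ("  ", "ignored"), ("X-Query", "foo")]
print(drop_empty_headers(rendered_headers))
```

A conditional header name template can therefore be used to include a header only for some queries.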

metadata: SearchRequestTemplateMetadata | None

Data extraction process metadata.

method: str

Jinja template for Request.method.

url: str

Jinja template for Request.url.

class zyte_common_items.SearchRequestTemplateMetadata(**kwargs)

Metadata class for zyte_common_items.SearchRequestTemplate.metadata.

dateDownloaded: str | None

Date and time when the data was downloaded, in UTC timezone and the following format: YYYY-MM-DDThh:mm:ssZ.

probability: float | None

The probability (0 for 0%, 1 for 100%) that the resource features the expected data type.

For example, if the extraction of a product from a given URL is requested, and that URL points to the webpage of a product with complete certainty, the value should be 1. If with complete certainty the webpage features a job listing instead of a product, the value should be 0. When there is no complete certainty, the value could be anything in between (e.g. 0.96).

validationMessages: Dict[str, List[str]] | None

Contains paths to fields with the description of issues found with their values.

Custom attributes

class zyte_common_items.CustomAttributes(**kwargs)

Extracted custom attribute values and metadata.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

metadata: CustomAttributesMetadata

Custom attribute extraction metadata.

values: CustomAttributesValues

Custom attribute values.

class zyte_common_items.CustomAttributesValues(**kwargs)

Bases: Dict[str, Any]

Container for custom attribute values.

class zyte_common_items.CustomAttributesMetadata(**kwargs)

Custom attribute extraction metadata.

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.

error: str | None

Error message, if any.

  • The extraction/unparsable-response error is given when the LLM response could not be parsed or recovered. If this error happens, we suggest simplifying the task or reducing the number of attributes.

  • The extraction/schema-size-exceeded error is given when the schema did not fit into the input limits, leaving no space for the input text, and therefore the LLM could not be used. If this error happens, we suggest either making the schema smaller (fewer attributes and/or shorter descriptions), or increasing maxInputTokens.
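A caller could map these errors to the suggested follow-up actions, as in the sketch below. It assumes the error attribute contains the error names exactly as written above; suggest_fix and the wording of the suggestions are illustrative.

```python
from typing import Optional

# Suggestions paraphrased from the error descriptions above.
SUGGESTIONS = {
    "extraction/unparsable-response": (
        "Simplify the task or reduce the number of attributes."
    ),
    "extraction/schema-size-exceeded": (
        "Make the schema smaller or increase maxInputTokens."
    ),
}


def suggest_fix(error: Optional[str]) -> Optional[str]:
    """Return a suggestion for a known error, or None if there is none."""
    if error is None:
        return None  # No error was reported.
    return SUGGESTIONS.get(error)


print(suggest_fix("extraction/schema-size-exceeded"))
```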

excludedPIIAttributes: List[str] | None

A list of all attributes dropped from the output due to a risk of PII (Personally Identifiable Information) extraction.

inputTokens: int | None

Total number of used input tokens, excluding our internal fixed prompt with the LLM instruction, when using the “generate” method.

maxInputTokens: int | None

Maximum number of allowed input tokens for the model, when using the “generate” method.

outputTokens: int | None

Total number of used output tokens, when using the “generate” method.

textInputTokens: int | None

Total number of input tokens used for the text of the web page, excluding the schema and our internal fixed prompt with the LLM instruction, when using the “generate” method. Already included in inputTokens.

textInputTokensBeforeTruncation: int | None

textInputTokens before the text was truncated to fit into the input limits, either set via maxInputTokens or due to the model limitation returned in maxInputTokens, when using the “generate” method.
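Comparing the two text token counts shows whether the page text was truncated. The sketch below works on a plain dictionary with the attribute names documented above; text_was_truncated is a hypothetical helper and the numbers are made up.

```python
from typing import Any, Dict


def text_was_truncated(metadata: Dict[str, Any]) -> bool:
    """Return True if the page text lost tokens to fit the input limits."""
    before = metadata.get("textInputTokensBeforeTruncation")
    after = metadata.get("textInputTokens")
    # Without both counts there is nothing to compare.
    if before is None or after is None:
        return False
    return before > after


# Made-up numbers for illustration.
print(text_was_truncated({"textInputTokensBeforeTruncation": 5200, "textInputTokens": 4000}))  # True
print(text_was_truncated({"textInputTokensBeforeTruncation": 900, "textInputTokens": 900}))  # False
```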

Custom items

Subclass Item to create your own item classes.

class zyte_common_items.base.ProbabilityMixin(**kwargs)

Provides get_probability() to make it easier to access the probability of an item or item component that is nested under its metadata attribute.

get_probability() float | None

Returns the item probability if available, otherwise None.
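The behavior described above can be pictured with standard-library dataclasses. This is a simplified stand-in, not the zyte_common_items implementation: SketchMetadata and SketchItem are hypothetical classes used only to show the lookup under metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchMetadata:
    probability: Optional[float] = None


@dataclass
class SketchItem:
    metadata: Optional[SketchMetadata] = None

    def get_probability(self) -> Optional[float]:
        """Return metadata.probability, or None if there is no metadata."""
        if self.metadata is None:
            return None
        return self.metadata.probability


print(SketchItem(metadata=SketchMetadata(probability=0.96)).get_probability())  # 0.96
print(SketchItem().get_probability())  # None
```

The accessor saves callers from checking for a missing metadata object before reading the probability.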

class zyte_common_items.Item(**kwargs)

Base class for items.

_unknown_fields_dict: dict

Contains unknown attributes fed into the item through from_dict() or from_list().
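The idea of keeping unknown attributes apart from declared ones can be sketched with a dataclass. This is an illustration only and does not reproduce how Item is implemented: SketchListing, its fields and the unknown_fields name are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


@dataclass
class SketchListing:
    url: str
    name: Optional[str] = None
    # Holds any input keys that do not match a declared attribute.
    unknown_fields: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchListing":
        """Build an instance, setting aside keys that are not declared."""
        known = {f.name for f in fields(cls)} - {"unknown_fields"}
        declared = {k: v for k, v in data.items() if k in known}
        extra = {k: v for k, v in data.items() if k not in known}
        return cls(**declared, unknown_fields=extra)


listing = SketchListing.from_dict(
    {"url": "https://example.com/1", "name": "Flat", "floorPlanUrl": "https://example.com/plan"}
)
print(listing.name)
print(listing.unknown_fields)
```

Keeping the extra keys instead of rejecting them means input that contains newer or site-specific attributes can still be loaded.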

classmethod from_dict(item: Dict | None)

Read an item from a dictionary.

classmethod from_list(items: List[Dict] | None, *, trail: str | None = None) List

Read items from a list.

get_probability() float | None

Returns the item probability if available, otherwise None.