Pages

Overview

The pages dataset is the URL-level dataset for publishers within Sincera. It contains extensive metadata about a given URL, including classification, categorization, and sentiment data for the most popular pages scanned on the platform.

Dataset

Field	Type	Description
id	integer	Identifier for the page object.
publisher_id	integer	ID of the publisher associated with the page object.
url	string	URL of the page that was scanned. Note that this is the full URL, unlike the domain object that is used in other datasets, which is a top-level domain.
last_slot_scan	date	Date when the page was last scanned for ad slots.
last_pbjs_scan	date	Date when the page was last scanned for pbjs objects.
scan_count	integer	Count of how many times the page has been individually scanned.
publisher_assets_count	integer	Count of how many publisher assets (text, image, video) Sincera has logged. (New field.)
layout	string	Layout of the page, when it can be determined (article or nil). Useful for discovering content-rich pages.
valid_image_count	integer	Count of valid images that can be used for contextual classification.
invalid_image_count	integer	Count of invalid images that cannot be used for contextual classification.
total_images	integer	Count of total images that Sincera has found on this page / URL.
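
Example

The fields above can be modeled as a typed record. The following is a minimal sketch, not an official Sincera client: it assumes each page arrives as a dictionary keyed by the field names in the table, that dates are ISO 8601 strings (YYYY-MM-DD), and that a nil layout is represented as None. The names Page, page_from_record, and is_content_rich are hypothetical, as is the sample record.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Page:
    """One row of the pages dataset, mirroring the field table."""
    id: int
    publisher_id: int
    url: str
    last_slot_scan: Optional[date]
    last_pbjs_scan: Optional[date]
    scan_count: int
    publisher_assets_count: int
    layout: Optional[str]  # "article", or None when the layout is nil
    valid_image_count: int
    invalid_image_count: int
    total_images: int


def _parse_date(value: Optional[str]) -> Optional[date]:
    # Assumption: dates are ISO 8601 strings; missing values become None.
    return date.fromisoformat(value) if value else None


def page_from_record(record: dict) -> Page:
    """Build a Page from a dictionary keyed by the dataset field names."""
    return Page(
        id=int(record["id"]),
        publisher_id=int(record["publisher_id"]),
        url=record["url"],
        last_slot_scan=_parse_date(record.get("last_slot_scan")),
        last_pbjs_scan=_parse_date(record.get("last_pbjs_scan")),
        scan_count=int(record["scan_count"]),
        publisher_assets_count=int(record["publisher_assets_count"]),
        layout=record.get("layout"),
        valid_image_count=int(record["valid_image_count"]),
        invalid_image_count=int(record["invalid_image_count"]),
        total_images=int(record["total_images"]),
    )


def is_content_rich(page: Page, min_valid_images: int = 1) -> bool:
    # Hypothetical heuristic: an article layout with at least
    # min_valid_images images usable for contextual classification.
    return page.layout == "article" and page.valid_image_count >= min_valid_images


# Illustrative record only; the values are made up.
sample = {
    "id": 1,
    "publisher_id": 42,
    "url": "https://example.com/news/story",
    "last_slot_scan": "2024-01-15",
    "last_pbjs_scan": None,
    "scan_count": 3,
    "publisher_assets_count": 12,
    "layout": "article",
    "valid_image_count": 4,
    "invalid_image_count": 1,
    "total_images": 5,
}
page = page_from_record(sample)
```

Filtering a list of such records with is_content_rich is one way to use the layout and image-count fields together; the threshold is an arbitrary choice and should be tuned to the use case.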