Pages
Overview
The pages dataset is the URL-level dataset for publishers within Sincera. It holds metadata for each URL, including classification, categorization, and sentiment data, for the most popular pages scanned on the platform.
Dataset
| Field | Type | Description |
|---|---|---|
| id | integer | Identifier for the page object. |
| publisher_id | integer | ID of the publisher associated with the page object. |
| url | string | URL of the page that was scanned. This is the full URL, unlike the domain object used in other datasets, which is a top-level domain. |
| last_slot_scan | date | Date when the page was last scanned for ad slots. |
| last_pbjs_scan | date | Date when the page was last scanned for Prebid.js (pbjs) objects. |
| scan_count | integer | Count of how many times the page has been individually scanned. |
| publisher_assets_countnew | integer | Count of how many publisher assets (text, image, video) Sincera has logged. |
| layout | string | Layout of the page, when it can be determined (article or nil). Useful for discovering content-rich pages. |
| valid_image_count | integer | Count of valid images that can be used for contextual classification. |
| invalid_image_count | integer | Count of invalid images that cannot be used for contextual classification. |
| total_images | integer | Count of total images that Sincera has found on this page (URL). |
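
As an illustration of how these fields might be used, the following Python sketch loads page rows into a typed record and selects content-rich pages. It is a minimal example under stated assumptions, not an official client: the `Page` class and the `parse_page` and `content_rich` helpers are hypothetical names, rows are assumed to arrive as dictionaries keyed by the field names in the table, and a nil `layout` is assumed to arrive as `None` or an empty string. Only a subset of the fields is modeled.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Page:
    """Hypothetical record holding a subset of the pages dataset fields."""
    id: int
    publisher_id: int
    url: str
    scan_count: int
    layout: Optional[str]  # "article", or None when no layout was determined
    valid_image_count: int
    invalid_image_count: int
    total_images: int


def parse_page(row: dict) -> Page:
    """Build a Page from a raw row, coercing numeric fields to int."""
    # Assumption: a nil layout is delivered as None, "" or a missing key.
    layout = row.get("layout") or None
    return Page(
        id=int(row["id"]),
        publisher_id=int(row["publisher_id"]),
        url=str(row["url"]),
        scan_count=int(row["scan_count"]),
        layout=layout,
        valid_image_count=int(row["valid_image_count"]),
        invalid_image_count=int(row["invalid_image_count"]),
        total_images=int(row["total_images"]),
    )


def content_rich(pages: Iterable[Page], min_valid_images: int = 1) -> List[Page]:
    """Keep article-layout pages with enough images usable for classification."""
    return [
        p for p in pages
        if p.layout == "article" and p.valid_image_count >= min_valid_images
    ]
```

Filtering on `layout == "article"` follows the table's note that the layout field helps find content-rich pages. The `min_valid_images` threshold is an arbitrary choice for this sketch and can be tuned or dropped.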