Publisher Text Assets

Overview

This dataset is a subset of the publisher_assets dataset that contains only the text assets found on a publisher page. The metadata is useful for building cookieless, contextual targeting solutions and for extracting raw text to use in classification and categorization solutions.

Dataset

Field	Type	Description
id	integer	Unique identifier that applies to the detected GPID instance.
publisher_id	integer	Maps to the `id` primary key in the publishers dataset.
page_id	integer	Unique identifier for the specific publisher page asset where this GPID belongs.
asset	string	For the text-based asset type, this field contains the text extracted from the page, scrubbed of HTML and designed to be machine readable. For the image type, this field contains a URL to the source of the asset.
asset_collection_date	datetime	Timestamp for when the asset was collected.
classification	json	If the asset has been classified (via machine learning), the output is included here. For both image and text-based assets, classification includes labeling, confidence scores, and brand safety data.
asset_classification_date	datetime	Timestamp for when the classification of the asset was inferred.
created_at	datetime	Date when the scan was first generated by the Sincera platform.
updated_at	datetime	Date when the scan was last updated by the Sincera platform.
char_count	integer	Count of the text characters included on the page.
article_body	string	If the page type is “article”, the refined and processed text of the article body only.
article_body_chars	integer	Count of the text characters included in the article_body.
article_title	string	If the page type is “article”, the title of the article.
article_excerpt	string	If the page type is “article”, an excerpt of the article.
url_text	array	Collection of text assets specific to the page URL.
meta_description	string	Metadata description of the page overall.
meta_page_title	string	Metadata description of the page title.
meta_keywords	string	Metadata description of the keywords found on the page.
declared_language	string	Indicates the language declared for the text assets.
meta_type	string	Indicates what the metadata specifically refers to (e.g. article, website).
emr_ingested	boolean	Indicates whether the data has been ingested into Amazon EMR (Elastic MapReduce) for processing.
emr_completed	boolean	Indicates whether the data has been successfully uploaded into Amazon EMR (Elastic MapReduce) for processing.
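
Example

The fields above can be consumed as typed records. The following is a minimal Python sketch, not an official client: it assumes each row arrives as a dictionary of strings (for example from a CSV or JSON export), that timestamps are ISO 8601 strings, and that the classification field holds a JSON-encoded string. The names TextAsset, parse_row, and best_text are hypothetical, and the sample values (including the keys inside classification) are invented for illustration only.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional


@dataclass
class TextAsset:
    """Subset of the documented fields, typed per the table above."""
    id: int
    publisher_id: int
    page_id: int
    asset: str
    char_count: int
    article_title: Optional[str]
    article_body: Optional[str]
    classification: Optional[Any]
    asset_collection_date: Optional[datetime]


def parse_row(row: dict) -> TextAsset:
    """Convert one raw row (string values, assumed export format) into a TextAsset."""
    raw_classification = row.get("classification")
    raw_date = row.get("asset_collection_date")
    return TextAsset(
        id=int(row["id"]),
        publisher_id=int(row["publisher_id"]),
        page_id=int(row["page_id"]),
        asset=row["asset"],
        char_count=int(row["char_count"]),
        # Empty strings are treated as missing values.
        article_title=row.get("article_title") or None,
        article_body=row.get("article_body") or None,
        # Assumption: classification is delivered as a JSON-encoded string.
        classification=json.loads(raw_classification) if raw_classification else None,
        # Assumption: timestamps are ISO 8601 without a trailing "Z".
        asset_collection_date=datetime.fromisoformat(raw_date) if raw_date else None,
    )


def best_text(record: TextAsset) -> str:
    """Prefer the refined article body when present; otherwise use the full page text."""
    return record.article_body if record.article_body else record.asset


# Invented sample row for illustration only.
sample_row = {
    "id": "101",
    "publisher_id": "7",
    "page_id": "42",
    "asset": "Site header. Body text. Site footer.",
    "char_count": "36",
    "article_title": "Example title",
    "article_body": "Body text.",
    "classification": '{"labels": ["sports"]}',
    "asset_collection_date": "2024-01-15T08:30:00",
}

record = parse_row(sample_row)
text_for_classifier = best_text(record)
```

Falling back from article_body to asset keeps non-article pages usable in the same pipeline, at the cost of including navigation and other non-article text for those pages.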