
Publisher Text Assets

Overview

This dataset is a subset of the publisher_assets dataset, restricted to the text assets found on the publisher page. The metadata is useful for building cookieless, contextual targeting solutions, and the raw text assets can be extracted for use in classification and categorization solutions.

Dataset

Field Type Description
id integer Unique identifier that applies to the detected GPID instance.
publisher_id integer Maps to the id primary key in the publishers dataset.
page_id integer Unique identifier for the specific publisher page asset where this GPID belongs.
asset string For the text-based Asset type, this includes the text extracted from the page, which has been scrubbed of HTML and is designed to be machine readable. For the Image type, this field includes a URL to the source of the asset.
asset_collection_date datetime Timestamp for when the asset was collected.
classification json If the asset has been classified (via machine learning), the output is included here. For both image and text-based assets, classification includes labeling, confidence scores, and brand safety data.
asset_classification_date datetime Timestamp for when the classification of the asset was inferred.
created_at datetime Date when the scan was first generated by the Sincera platform.
updated_at datetime Date when the scan was last updated by the Sincera platform.
char_count integer Count of the text characters included on the page.
article_body string If page type = “article”, this is the refined and processed text of the article body only.
article_body_chars integer Count of the text characters included in the article_body.
article_title string If page type = “article”, this is the title of the article.
article_excerpt string If page type = “article”, this is an excerpt of the article.
url_text array Collection of text assets specific to the page URL.
meta_description string Description of the page overall, as given in the page metadata.
meta_page_title string Title of the page, as given in the page metadata.
meta_keywords string Keywords for the page, as given in the page metadata.
declared_language string Indicates the language declared on the text assets.
meta_type string Indicates what the metadata specifically refers to (e.g., article, website).
emr_ingested boolean Indicates whether the data has been ingested into Amazon’s Elastic MapReduce for processing.
emr_completed boolean Indicates whether the data has been successfully uploaded into Amazon’s Elastic MapReduce for processing.
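
Example

As a rough illustration of how the fields above might be consumed, the following Python sketch loads rows into a typed record and keeps only those with a populated article_body, for use as classifier input. It assumes each row arrives as a dictionary keyed by the field names in the table and that the classification column may be delivered as a JSON string; the delivery format is not specified here. The names TextAsset, parse_row, and article_assets are hypothetical helpers, and the sample rows (including the "en" language value and the empty classification object) are invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Optional


@dataclass
class TextAsset:
    """Subset of the documented fields; attribute names match the field table."""
    id: int
    publisher_id: int
    page_id: int
    declared_language: Optional[str]
    article_title: Optional[str]
    article_body: Optional[str]
    classification: Optional[Any]


def parse_row(row: dict) -> TextAsset:
    """Convert one raw row (assumed to be a dict) into a TextAsset."""
    raw = row.get("classification")
    # The classification field is typed json; decode it if it arrived as a string.
    classification = json.loads(raw) if isinstance(raw, str) and raw else raw
    return TextAsset(
        id=int(row["id"]),
        publisher_id=int(row["publisher_id"]),
        page_id=int(row["page_id"]),
        declared_language=row.get("declared_language"),
        article_title=row.get("article_title"),
        article_body=row.get("article_body"),
        classification=classification,
    )


def article_assets(
    rows: Iterable[dict], language: Optional[str] = None
) -> Iterator[TextAsset]:
    """Yield rows with a non-empty article_body, optionally filtered by language."""
    for row in rows:
        asset = parse_row(row)
        if not asset.article_body:
            continue  # article_body is only populated for article pages
        if language is not None and asset.declared_language != language:
            continue
        yield asset


# Invented sample rows; real values and the classification structure will differ.
sample_rows = [
    {"id": 1, "publisher_id": 10, "page_id": 100, "declared_language": "en",
     "article_title": "Example title", "article_body": "Example body text.",
     "classification": "{}"},
    {"id": 2, "publisher_id": 10, "page_id": 101, "declared_language": "en",
     "article_title": None, "article_body": None, "classification": None},
]

for asset in article_assets(sample_rows, language="en"):
    print(asset.page_id, asset.article_title, len(asset.article_body))
```

Filtering on a non-empty article_body, rather than on meta_type, avoids assuming which field carries the page type.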