Publisher Text Assets
Overview
This dataset is a subset of the publisher_assets dataset, limited to the text assets found on a publisher page. This metadata is useful for building cookieless, contextual targeting solutions and for extracting raw text to use in classification and categorization solutions.
Dataset
| Field | Type | Description |
|---|---|---|
| id | integer | Unique identifier that applies to the detected GPID instance. |
| publisher_id | integer | Maps to the id primary key in the publishers dataset. |
| page_id | integer | Unique identifier for the specific publisher page asset where this GPID belongs. |
| asset | string | For the text-based asset type, this is the text extracted from the page, scrubbed of HTML and designed to be machine readable. For the image type, this field contains a URL to the source of the asset. |
| asset_collection_date | datetime | Timestamp for when the asset was collected. |
| classification | json | If the asset has been classified (via machine learning), the output is included here. For both image and text-based assets, classification includes labeling, confidence scores, and brand safety data. |
| asset_classification_date | datetime | Timestamp for when the classification of the asset was inferred. |
| created_at | datetime | Date when the scan was first generated by the Sincera platform. |
| updated_at | datetime | Date when the scan was last updated by the Sincera platform. |
| char_count | integer | Count of the text characters included on the page. |
| article_body | string | If the page type is “article”, the refined and processed text of the article body only. |
| article_body_chars | integer | Count of the text characters included in the article_body. |
| article_title | string | If the page type is “article”, the title of the article. |
| article_excerpt | string | If the page type is “article”, an excerpt of the article. |
| url_text | array | Collection of text assets specific to the page URL. |
| meta_description | string | Metadata description of the page overall. |
| meta_page_title | string | Metadata description of the page title. |
| meta_keywords | string | Metadata description of the keywords found on the page. |
| declared_language | string | Indicates the language declared on the text assets. |
| meta_type | string | Indicates what the metadata specifically refers to (e.g., article, website). |
| emr_ingested | boolean | Indicates whether the data has been ingested into Amazon EMR (Elastic MapReduce) for processing. |
| emr_completed | boolean | Indicates whether the data has been successfully uploaded into Amazon EMR (Elastic MapReduce) for processing. |
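
As an illustration of how these fields might be consumed, the sketch below picks the text to send to a downstream classifier and decodes the classification field. It assumes rows arrive as Python dictionaries keyed by the field names above. The helper names (`pick_text`, `parse_classification`), the sample rows, and the key inside the sample classification value are hypothetical and not part of the dataset. The inner structure of classification is not documented here, so the code does not rely on it.

```python
import json
from typing import Any, Dict, Optional


def pick_text(row: Dict[str, Any]) -> str:
    """Prefer the refined article body; fall back to the full page text."""
    body = row.get("article_body") or ""
    if body.strip():
        return body
    return row.get("asset") or ""


def parse_classification(row: Dict[str, Any]) -> Optional[Any]:
    """Decode the classification field if present; its inner schema is not assumed."""
    raw = row.get("classification")
    if raw is None or raw == "":
        return None
    if isinstance(raw, str):
        return json.loads(raw)
    return raw  # already decoded by whatever loaded the row


# Hypothetical sample rows, for illustration only.
rows = [
    {
        "id": 1,
        "asset": "Full page text including navigation and footer",
        "article_body": "Only the article.",
        "classification": '{"example_key": "example_value"}',
    },
    {
        "id": 2,
        "asset": "Home page text",
        "article_body": None,
        "classification": None,
    },
]

# Map each row id to the text chosen for classification.
texts = {row["id"]: pick_text(row) for row in rows}
```

Falling back to asset when article_body is empty keeps non-article pages usable, at the cost of including any navigation or footer text that survives extraction.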