Crawl Profile
Overview
The crawl_profiles dataset contains metadata about a given publisher’s crawling activity. This includes any publisher-specific settings, such as login or consent selectors, as well as wait timers for events that the browser lazy-loads on that publisher.
Dataset
| Field | Type | Description |
|---|---|---|
| id | integer | Identifier for the specific crawl_profile. |
| publisher_id | integer | ID of the publisher associated with the crawl_profile. |
| crawl | boolean | Flag that indicates whether or not Sincera is crawling this publisher. |
| categorize | boolean | Flag that indicates whether or not the publisher content should be categorized by Sincera’s ML engine. |
| crawl_depth | integer | Maximum depth for any individual crawl. Default is 50 unique URLs. |
| timeout_count | integer | Number of times the publisher has timed out and was unable to be crawled. |
| sleep_timer | integer | Time (in seconds) the browser engine should wait before interacting with the publisher environment. If a publisher is experiencing low detection values, increasing the sleep_timer will help improve detection counts. |
| last_crawl_time | integer | Time (in milliseconds) it took Sincera to crawl the publisher’s environment. Abnormally high crawl times are correlated with some form of browser blocking or collision, and result in lower-than-expected detection results. |
| consent_selectors | array | Array of custom, publisher-specific selectors used for detecting consent modals on the publisher.domain. |
| login_selectors | array | Array of custom, publisher-specific selectors used for detecting login modals or divs on the publisher.domain. |
| last_shallow_crawl | datetime | Last date and time the publisher’s top-level domain was visited by Sincera. |
| last_medium_crawl | datetime | Last date and time the publisher’s environment was visited by Sincera beyond the top-level domain (additional pages crawled). |
| last_policy_scan | datetime | Last time the publisher’s ads.txt page was scanned. |
| last_webrisk_scan | datetime | Last time the publisher’s domain was scanned by Google’s Web Risk API. |
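
As an illustration of how a consumer might work with a row of this dataset, the sketch below maps the fields in the table onto a typed Python record. It is not Sincera code. It assumes rows arrive as dictionaries and that datetime fields are ISO 8601 strings, and neither is confirmed by this page. The names `CrawlProfile`, `parse_crawl_profile`, and `is_slow_crawl` are hypothetical, as are the 120,000 ms threshold and every default other than the documented `crawl_depth` of 50.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class CrawlProfile:
    """Hypothetical typed view of one crawl_profiles row."""
    id: int
    publisher_id: int
    crawl: bool
    categorize: bool
    crawl_depth: int = 50        # documented default: 50 unique URLs
    timeout_count: int = 0       # assumed default, not documented
    sleep_timer: int = 0         # seconds; assumed default
    last_crawl_time: int = 0     # milliseconds; assumed default
    consent_selectors: List[str] = field(default_factory=list)
    login_selectors: List[str] = field(default_factory=list)
    last_shallow_crawl: Optional[datetime] = None
    last_medium_crawl: Optional[datetime] = None
    last_policy_scan: Optional[datetime] = None
    last_webrisk_scan: Optional[datetime] = None


def _parse_ts(value: Optional[str]) -> Optional[datetime]:
    # Assumption: timestamps are ISO 8601 strings; missing values stay None.
    return datetime.fromisoformat(value) if value else None


def parse_crawl_profile(row: Dict[str, Any]) -> CrawlProfile:
    """Build a CrawlProfile from a dict keyed by the field names in the table."""
    return CrawlProfile(
        id=int(row["id"]),
        publisher_id=int(row["publisher_id"]),
        crawl=bool(row["crawl"]),
        categorize=bool(row["categorize"]),
        crawl_depth=int(row.get("crawl_depth", 50)),
        timeout_count=int(row.get("timeout_count", 0)),
        sleep_timer=int(row.get("sleep_timer", 0)),
        last_crawl_time=int(row.get("last_crawl_time", 0)),
        consent_selectors=list(row.get("consent_selectors") or []),
        login_selectors=list(row.get("login_selectors") or []),
        last_shallow_crawl=_parse_ts(row.get("last_shallow_crawl")),
        last_medium_crawl=_parse_ts(row.get("last_medium_crawl")),
        last_policy_scan=_parse_ts(row.get("last_policy_scan")),
        last_webrisk_scan=_parse_ts(row.get("last_webrisk_scan")),
    )


def is_slow_crawl(profile: CrawlProfile, threshold_ms: int = 120_000) -> bool:
    # The threshold is an illustrative assumption; this page does not define
    # what counts as an "abnormally high" crawl time.
    return profile.last_crawl_time > threshold_ms
```

A check like `is_slow_crawl` could be used to shortlist publishers whose `last_crawl_time` suggests browser blocking, which are the profiles where raising `sleep_timer` or adding custom selectors is most likely worth trying.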