Crawl Profile

Overview

The crawl_profiles dataset contains metadata about a given publisher’s crawling activity. This includes any publisher-specific settings, such as login or consent selectors, as well as publisher-specific wait timers for events that the browser lazy-loads on that publisher.

Dataset

Field Type Description
id integer Identifier for the specific crawl_profile.
publisher_id integer ID of the publisher associated with the crawl_profile.
crawl boolean Flag that indicates whether or not Sincera is crawling this publisher.
categorize boolean Flag that indicates whether or not the publisher content should be categorized by Sincera’s ML engine.
crawl_depth integer Maximum depth for any individual crawl. Default is 50 unique URLs.
timeout_count integer Number of times the publisher has timed out and was unable to be crawled.
sleep_timer integer Time (in seconds) that the browser engine should wait before interacting with the publisher environment. If a publisher is experiencing low detection values, increasing the sleep_timer will help improve detection counts.
last_crawl_time integer Time (in milliseconds) that it took Sincera to crawl the publisher’s environment. Abnormally high crawl times are correlated with some form of browser blocking or collision, and result in lower-than-expected detection results.
consent_selectors array Array of custom, publisher-specific selectors used for detecting consent modals on the publisher.domain.
login_selectors array Array of custom, publisher-specific selectors used for detecting login modals or divs on the publisher.domain.
last_shallow_crawl datetime Last time the publisher’s top-level domain was visited by Sincera.
last_medium_crawl datetime Last time the publisher’s environment was visited by Sincera beyond the top-level domain (additional pages crawled).
last_policy_scan datetime Last time the publisher’s ads.txt page was scanned.
last_webrisk_scan datetime Last time the publisher’s domain was scanned by Google’s Web Risk API.
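
As an illustration of how a consumer might model this dataset, the following is a minimal Python sketch of one crawl_profiles record, using the field names and types from the table above. Everything else is an assumption made for the example and not part of Sincera's API: the `CrawlProfile` class, the `parse_profile` and `is_crawl_suspect` helpers, the ISO 8601 encoding of datetime fields, and the 60,000 ms threshold for an "abnormally high" crawl time.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class CrawlProfile:
    """One row of the crawl_profiles dataset (hypothetical client-side model)."""
    id: int
    publisher_id: int
    crawl: bool
    categorize: bool
    crawl_depth: int = 50                     # documented default: 50 unique URLs
    timeout_count: int = 0
    sleep_timer: int = 0                      # seconds
    last_crawl_time: Optional[int] = None     # milliseconds
    consent_selectors: List[str] = field(default_factory=list)
    login_selectors: List[str] = field(default_factory=list)
    last_shallow_crawl: Optional[datetime] = None
    last_medium_crawl: Optional[datetime] = None
    last_policy_scan: Optional[datetime] = None
    last_webrisk_scan: Optional[datetime] = None


DATETIME_FIELDS = (
    "last_shallow_crawl",
    "last_medium_crawl",
    "last_policy_scan",
    "last_webrisk_scan",
)


def parse_profile(record: Dict[str, Any]) -> CrawlProfile:
    """Build a CrawlProfile from a dict, ignoring keys the model does not know.

    Assumption: datetime fields arrive as ISO 8601 strings or are missing/null.
    """
    known = {f.name for f in fields(CrawlProfile)}
    values = {k: v for k, v in record.items() if k in known}
    for name in DATETIME_FIELDS:
        raw = values.get(name)
        values[name] = datetime.fromisoformat(raw) if raw else None
    return CrawlProfile(**values)


def is_crawl_suspect(profile: CrawlProfile, max_crawl_ms: int = 60_000) -> bool:
    """Flag profiles whose last crawl took longer than max_crawl_ms.

    The threshold is an arbitrary placeholder; choose one that fits your data.
    """
    return profile.last_crawl_time is not None and profile.last_crawl_time > max_crawl_ms
```

A record flagged by a check like `is_crawl_suspect` is a candidate for the investigation described under last_crawl_time, since abnormally high crawl times tend to accompany lower-than-expected detection results.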