Bright Data Brightbot Crawler
What is Brightbot?
Brightbot is Bright Data’s data collection crawler. It is intended to serve as the main data collection pipeline for all of Bright Data’s products and services. A built-in cache layer prevents the same data from being downloaded repeatedly within a 24-hour period, unless there is a specific business justification for an exception. Brightbot uses extensive technical measures to ensure fair use of website resources and to prevent abuse. Its activity is fully transparent: it uses its own unique user-agent and source IP subnet, so its traffic can be separated from user traffic, tracked, and controlled using Bright Data’s Web Master console and collectors.txt.
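To illustrate the 24-hour cache behaviour described above, here is a minimal sketch. It is not Bright Data's implementation: the class name DownloadCache, the bypass flag (standing in for an approved exception), and the injected fetch and clock functions are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple

# 24-hour window taken from the text; everything else here is hypothetical.
CACHE_TTL_SECONDS = 24 * 60 * 60


class DownloadCache:
    """Serve a cached copy of a URL if it was downloaded within the last 24 hours."""

    def __init__(self, fetch: Callable[[str], bytes],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch          # function that actually downloads a URL
        self._clock = clock          # injectable clock, useful for testing
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str, bypass: bool = False) -> bytes:
        """Return the body for url; bypass=True models an approved exception."""
        now = self._clock()
        entry = self._entries.get(url)
        if not bypass and entry is not None and now - entry[0] < CACHE_TTL_SECONDS:
            return entry[1]          # still fresh: no new request is sent
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body
```

With this shape, two calls to get for the same URL inside the window result in a single download, which is the effect the cache layer is described as having on target websites.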
How can Brightbot traffic be identified?
Brightbot can easily be identified by two parameters:
- User-Agent: "Brightbot 1.0"
- Source IP subnet: 82.97.199.0/24
Checking both parameters together ensures accurate identification.
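A website operator could combine the two checks along these lines. This is a sketch only: the function name is_brightbot is hypothetical, and matching the user-agent as a substring (rather than an exact string) is an assumption, since the text does not say how the header is formatted in full.

```python
import ipaddress

# Values taken from the list above.
BRIGHTBOT_USER_AGENT = "Brightbot 1.0"
BRIGHTBOT_SUBNET = ipaddress.ip_network("82.97.199.0/24")


def is_brightbot(user_agent: str, source_ip: str) -> bool:
    """Return True only when both the User-Agent and the source IP match."""
    try:
        address = ipaddress.ip_address(source_ip)
    except ValueError:
        # Malformed IP strings are never treated as Brightbot.
        return False
    # Assumption: the header contains the documented string somewhere in it.
    return BRIGHTBOT_USER_AGENT in user_agent and address in BRIGHTBOT_SUBNET
```

Requiring both conditions matters because a user-agent string alone is trivial to spoof, while the subnet check alone does not distinguish the crawler from other traffic that might originate in the same range.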
Why should websites whitelist Brightbot?
- Filter out collection traffic from usage analytics and recommendation engine data.
- 99.99% protection from abuse: once Brightbot is whitelisted by the website, Bright Data is ready to require all customers to work only through Brightbot, at the risk of losing some of them, in order to gain 100% control over the traffic.
- Reduce retry and duplicate resource requests.
- Separate traffic and limit resources that handle it.
- Gain transparency into what others are looking at on your website, with statistics and dashboards provided.
- Make better-informed decisions about which pages (if any) should be moved behind a login rather than remain public data.
What is the Web Master console?
The Web Master console is Bright Data’s channel for communicating with website owners. It allows website owners to register ownership of their domains within the console and gain transparency into, and control over, the collection traffic targeting those domains.
Authentication is simple: the owner adds a generated token to the domain’s DNS entry.
On each domain’s statistics page, the owner can view traffic stats and domain health stats as measured by Bright Data.
Alerts
In the console, a Web Master can add alert rules to track, and be alerted, when certain types of information are being accessed, for example scanning the data for PII or accessing specific website endpoints.
What is collectors.txt?
The collectors.txt file is a configurable resource provided by Bright Data's Web Master console that allows Web Masters to define guidelines for ethical and efficient data collection from their websites. Its primary purpose is to enhance transparency and control by communicating specific access rules and limitations to Brightbot, Bright Data's web crawler. Web Masters can use collectors.txt to specify endpoints containing Personally Identifiable Information (PII), disallow access to interactive elements like ad links or reviews, report organic traffic loads, state the copyright status of data, and define peak traffic timeframes to prevent resource overloading. This file ensures that data collection aligns with privacy laws and resource constraints, promoting responsible interaction with the website. Once configured, Bright Data reviews the collectors.txt file, and Brightbot enforces the approved guidelines during its operations.
Protective Tech
Over the years, Bright Data has added many features and layers of tech to help identify, prevent, and mitigate intentional or accidental abuse of its network. Compliance tools, like KYC, are covered in the Compliance & Ethics section. Here we focus on the automatic tech deployed for this purpose.
Health monitors (DDoS Protection)
For every domain targeted by any of Bright Data’s products, the system opens a health monitor. The health monitor tracks domain responsiveness 24/7 across geo-locations and time frames. Each health monitor also receives a real-time feed of the aggregated Bright Data traffic targeting the domain it is monitoring. If the monitor finds a correlation between Bright Data traffic and a degradation in the domain’s responsiveness, it enforces a rate limit corresponding to the last rate of traffic that had no adverse impact on the domain. This rate limit is cached and not removed.
Below is an example of such a case: the impact was identified and a rate limit was enforced within 2 minutes. The red marker shows the traffic that was subsequently blocked by Bright Data, and the website RTT returning to normal.
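The rate-limiting rule described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not Bright Data's monitor: the HealthMonitor class, the use of RTT against a fixed baseline as the health signal, and the degradation_factor threshold are all hypothetical, and "correlation" is reduced to "the domain is degraded while traffic exceeds the last rate known to be safe".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthMonitor:
    baseline_rtt_ms: float               # assumed "healthy" round-trip time
    degradation_factor: float = 2.0      # assumed threshold for "degraded"
    last_safe_rate: float = 0.0          # last traffic rate seen with no impact
    rate_limit: Optional[float] = None   # None until a limit is enforced

    def observe(self, requests_per_sec: float, rtt_ms: float) -> Optional[float]:
        """Record one sample and return the rate limit currently in force."""
        degraded = rtt_ms > self.baseline_rtt_ms * self.degradation_factor
        if not degraded:
            # Traffic at this rate had no adverse impact on the domain.
            self.last_safe_rate = requests_per_sec
        elif self.rate_limit is None and requests_per_sec > self.last_safe_rate:
            # Degradation coincides with higher traffic: cap at the last safe rate.
            self.rate_limit = self.last_safe_rate
        # Once set, the limit is kept (the text says it is cached and not removed).
        return self.rate_limit
```

For example, after healthy samples at 50 and 200 requests per second, a degraded sample at 800 requests per second would set the limit to 200, and later healthy samples would leave that limit in place.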
Domain Classifications
Bright Data classifies every domain targeted by its customers on every product. More than 300,000 new domains are classified every day. Some classifications, such as malware and phishing, are permanently blacklisted. Other categories, such as government agencies and NGOs, are blocked by default and can be targeted only after special review and approval by compliance.
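The two tiers of blocking described above amount to a small decision rule, sketched here. The category labels, the Policy enum, and the policy_for function are hypothetical names chosen for illustration; the real classification taxonomy is not given in the text.

```python
from enum import Enum


class Policy(Enum):
    ALLOWED = "allowed"
    REVIEW_REQUIRED = "review_required"   # blocked until compliance approves
    BLACKLISTED = "blacklisted"           # never allowed


# Example categories named in the text; the string labels are assumptions.
PERMANENTLY_BLACKLISTED = {"malware", "phishing"}
BLOCKED_BY_DEFAULT = {"government", "ngo"}


def policy_for(category: str, compliance_approved: bool = False) -> Policy:
    """Map a domain classification to the access policy it implies."""
    if category in PERMANENTLY_BLACKLISTED:
        return Policy.BLACKLISTED          # approval cannot override this tier
    if category in BLOCKED_BY_DEFAULT and not compliance_approved:
        return Policy.REVIEW_REQUIRED
    return Policy.ALLOWED
```

The key design point is the ordering of the checks: the permanent blacklist is evaluated first, so a compliance approval can only ever unlock the blocked-by-default tier.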
Auth and Cookie blocks
By default, Bright Data considers all data behind a login to be private. As such, in all visible traffic, Bright Data blocks the use of authentication cookies, and when browsers are used it also blocks the ability to type passwords.
Special permission can be requested from compliance. It is granted only in very rare cases where the owner of the data has specifically consented to the customer’s access.
Use case tracking
During KYC, compliance records the target domains and verticals declared by the customer when petitioning for access to the residential proxy network.
After approval, Bright Data keeps track of the customer’s usage. If it deviates from the declared use cases, a flag is raised with the compliance team, which then investigates with the customer.
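At its simplest, the deviation check described above is a comparison between what was declared and what is observed. The sketch below assumes that use cases are compared at the level of domain names; the function name undeclared_domains is hypothetical, and the real process also considers verticals and involves human review.

```python
from typing import Iterable, Set


def undeclared_domains(declared: Iterable[str], observed: Iterable[str]) -> Set[str]:
    """Return observed target domains that were not declared during KYC."""
    declared_set = {domain.lower() for domain in declared}
    observed_set = {domain.lower() for domain in observed}
    # Any non-empty result would be a reason to raise a flag with compliance.
    return observed_set - declared_set
```

An empty result means the usage matches the declaration; a non-empty one lists the domains that would prompt a follow-up with the customer.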
Compliance & Ethics
- Acceptable Use Policy - https://brightdata.com/trustcenter/acceptable-use-policy-bright-data
- Bright Data KYC (Know Your Customer) Process - https://brightdata.com/trustcenter/kyc
- Usage Monitoring - https://brightdata.com/trustcenter/proxy-services-verticals-usage-monitoring
- Domain classification - https://brightdata.com/trustcenter/ethical-network-use-classification
- Abuse prevention and handling - https://brightdata.com/trustcenter/abuse
- Protecting the WWW - https://brightdata.com/trustcenter/brightbot-ethical-web-data-guardian
- Web Monitoring - https://brightdata.com/trustcenter/ethical-web-data-collection-monitoring
- Infosec - https://brightdata.com/trustcenter/data-security-overview-protection-measures