The Challenge
A customer has a public website that is often visited by automated bots.
An automated bot, or simply bot, is a software application or script that performs tasks on the internet in an automated and repetitive manner. Bots are designed to perform various functions and are typically programmed to interact with websites, applications, or other online services. They can be created for both legitimate and malicious purposes.
The problem the customer faces is that some bots and other unidentified web clients consume the resources of its web servers and make them less responsive.
The customer asked for a solution to this problem. The solution should preferably not require any changes to the existing web server infrastructure.
The Solution
We chose AWS WAF because it is easy to integrate into the customer’s setup and offers a predefined set of rules managed by AWS (the AWS WAF Bot Control rule group) that can detect different categories of bots. In addition, it is possible to use rate-based rules and rules based on specific request attributes such as HTTP methods, URL paths, HTTP headers, or IP address ranges.
Bot categories
List of all bot categories known to AWS WAF Bot Control:
Category: Advertising
Description: Bots that are used for advertising purposes
Category: Archiver
Description: Bots that are used for archiving purposes
Category: Content Fetcher
Description: Bots that are fetching content on behalf of a user
Category: Email Client
Description: Email clients
Category: HTTP Library
Description: HTTP libraries that are used by bots
Category: Link Checker
Description: Bots that check for broken links
Category: Miscellaneous
Description: Miscellaneous bots
Category: Monitoring
Description: Bots that are used for monitoring purposes
Category: Scraping Framework
Description: Web scraping frameworks
Category: Search Engine
Description: Search engine bots
Category: Security
Description: Security-related bots
Category: SEO
Description: Bots that are used for search engine optimization
Category: Social Media
Description: Bots that are used by social media platforms to provide content summaries
Category: AI
Description: Artificial intelligence (AI) bots
Category: Automated Browser
Description: Inspects the request’s token for indicators that the client browser might be automated
Category: Known Bot Data Center
Description: Inspects for data centers that are typically used by bots
Category: Non Browser User Agent
Description: Inspects for user agent strings that don’t seem to be from a web browser
Preparations
The next steps are performed in a dedicated test environment before the setup is deployed to production.
We start by writing some Terraform code that creates an Application Load Balancer (ALB) to which the WAF will be attached. We put the existing web servers into a target group to which the ALB routes its traffic. The ALB acts as an additional reverse proxy layer in front of the web servers.
We also create a Route 53 alias A record pointing to the ALB and a TLS certificate in AWS Certificate Manager (ACM) for that DNS record, since we only allow HTTPS traffic at the ALB. HTTP traffic is permanently redirected to HTTPS using HTTP status code 301 (the HTTP Strict-Transport-Security response header is handled by the Caddy reverse proxy layer behind the ALB).
In addition, the ALB and WAF logs are configured to be sent to Datadog.
Adding WAF rules
Filtering specific URLs
First, we add a filtering rule to the WAF (block_webfonts_path) that blocks requests for the URI path /webfonts/2775253ec3a5.css. This resource no longer exists on the web servers but is still frequently requested because some misconfigured resources reference it.
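For illustration, a path filter of this kind can be written in the AWS WAFv2 rule format roughly as follows. This is a minimal sketch expressed as a Python dictionary, not the Terraform code used in the project; the exact-match constraint, the absence of text transformations, and the metric name are assumptions.

```python
import json

# Sketch of a WAFv2 rule that blocks one URI path.
# Assumptions: exact path match, no text transformation,
# metric name identical to the rule name.
block_webfonts_path = {
    "Name": "block_webfonts_path",
    "Priority": 0,
    "Statement": {
        "ByteMatchStatement": {
            # Plain string as used in the JSON rule format; some SDKs expect bytes here.
            "SearchString": "/webfonts/2775253ec3a5.css",
            "FieldToMatch": {"UriPath": {}},
            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
            "PositionalConstraint": "EXACTLY",
        }
    },
    # Block is a terminating action: no later rule is evaluated on a match.
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "block_webfonts_path",
    },
}

print(json.dumps(block_webfonts_path, indent=2))
```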
Bot Control
Then we add the AWS WAF Bot Control managed rule group (AWSManagedRulesBotControlRuleSet) to the WAF and select the “common” inspection level. At this inspection level, the WAF detects bots by statically analyzing request data.
This second rule, bot-control-default, is processed if the first rule does not match.
This rule will do the following:
If a request matches the signature of a verified bot, for example Microsoft’s Bing search engine, the WAF labels the request and processes the next rule (rules whose only action is to add labels are not terminating rules, which means that the next rule is processed even if there is a match).
When a verified bot is matched, the WAF adds labels to the request, among them awswaf:managed:aws:bot-control:bot:verified.
If the WAF cannot match a request against one of its signatures of verified bots, it labels the request as unverified and blocks it. No further WAF rule is processed.
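A rule that references the Bot Control managed rule group might look like the following in the WAFv2 rule format. This is a hedged sketch as a Python dictionary, assuming the rule group's default actions are kept unchanged; the metric name is an assumption.

```python
import json

# Sketch of a WAFv2 rule that references the Bot Control managed rule group
# with the "common" inspection level.
bot_control_default = {
    "Name": "bot-control-default",
    "Priority": 1,
    "Statement": {
        "ManagedRuleGroupStatement": {
            "VendorName": "AWS",
            "Name": "AWSManagedRulesBotControlRuleSet",
            "ManagedRuleGroupConfigs": [
                {"AWSManagedRulesBotControlRuleSet": {"InspectionLevel": "COMMON"}}
            ],
        }
    },
    # Rules that reference a rule group use OverrideAction instead of Action.
    # "None" keeps the actions defined inside the rule group (assumed here).
    "OverrideAction": {"None": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "bot-control-default",  # assumed metric name
    },
}

print(json.dumps(bot_control_default, indent=2))
```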
In this scenario, where our main goal is to prevent the web servers from running out of resources, we start with the most restrictive bot control rule set. We block not only unverified bots but also verified bots. This is done with the rule bot-control-block-verified.
This rule looks for requests that are labeled awswaf:managed:aws:bot-control:bot:verified and blocks them.
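A label-matching rule of this kind can be sketched in the WAFv2 rule format as shown here. The dictionary is an illustration only, not the project's Terraform code; the metric name is an assumption.

```python
import json

# Sketch of a WAFv2 rule that blocks requests carrying the "verified bot" label.
# The label must have been added by an earlier rule (lower priority number).
bot_control_block_verified = {
    "Name": "bot-control-block-verified",
    "Priority": 2,
    "Statement": {
        "LabelMatchStatement": {
            "Scope": "LABEL",  # match the full label, not only a namespace
            "Key": "awswaf:managed:aws:bot-control:bot:verified",
        }
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "bot-control-block-verified",  # assumed metric name
    },
}

print(json.dumps(bot_control_block_verified, indent=2))
```

Because labels only exist once an earlier rule has added them, this rule needs a higher priority number than the Bot Control rule it depends on.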
Rate-based rule
As the last rule, we add a rate-based rule (rate-based-captcha) with the following condition:
If a specific source IP sends more than 500 requests in 5 minutes, that client will be presented with a CAPTCHA for 5 minutes.
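The condition above could be expressed in the WAFv2 rule format roughly like this. It is a sketch as a Python dictionary under stated assumptions: requests are aggregated by source IP, the 5-minute window is given as 300 seconds, and the metric name is made up for the example.

```python
import json

# Sketch of a WAFv2 rate-based rule: more than 500 requests per source IP
# within a 5-minute window triggers a CAPTCHA challenge.
rate_based_captcha = {
    "Name": "rate-based-captcha",
    "Priority": 3,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 500,                # requests per evaluation window
            "EvaluationWindowSec": 300,  # 5 minutes
            "AggregateKeyType": "IP",    # count per source IP address
        }
    },
    "Action": {"Captcha": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "rate-based-captcha",  # assumed metric name
    },
}

print(json.dumps(rate_based_captcha, indent=2))
```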
Final rule set
Our final set of rules looks like this:
Rule name: block_webfonts_path
Rule priority: 0
Purpose: Blocks URI path
Rule name: bot-control-default
Rule priority: 1
Purpose: Blocks unverified bots, labels verified bots
Rule name: bot-control-block-verified
Rule priority: 2
Purpose: Blocks verified bots by matching label
Rule name: rate-based-captcha
Rule priority: 3
Purpose: Presents a CAPTCHA to source IPs that exceed the rate limit
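To make the evaluation order concrete, the following simplified Python model walks a request through the four rules in priority order. It is only a sketch of the behavior described in this article, not of the WAF's internals: the Request fields, the bot classification values, and the default action "ALLOW" are assumptions made for the example.

```python
from dataclasses import dataclass, field

VERIFIED_LABEL = "awswaf:managed:aws:bot-control:bot:verified"


@dataclass
class Request:
    path: str
    bot: str = "none"              # hypothetical: "verified", "unverified" or "none"
    over_rate_limit: bool = False  # hypothetical: source IP exceeded the rate limit
    labels: set = field(default_factory=set)


def block_webfonts_path(req):
    return "BLOCK" if req.path == "/webfonts/2775253ec3a5.css" else None


def bot_control_default(req):
    if req.bot == "unverified":
        return "BLOCK"                  # terminating action
    if req.bot == "verified":
        req.labels.add(VERIFIED_LABEL)  # label only, evaluation continues
    return None


def bot_control_block_verified(req):
    return "BLOCK" if VERIFIED_LABEL in req.labels else None


def rate_based_captcha(req):
    return "CAPTCHA" if req.over_rate_limit else None


RULES = [
    (0, "block_webfonts_path", block_webfonts_path),
    (1, "bot-control-default", bot_control_default),
    (2, "bot-control-block-verified", bot_control_block_verified),
    (3, "rate-based-captcha", rate_based_captcha),
]


def evaluate(req, default_action="ALLOW"):
    """Return (action, rule name) of the first rule with a terminating action."""
    for _, name, rule in sorted(RULES, key=lambda r: r[0]):
        action = rule(req)
        if action is not None:
            return action, name
    return default_action, "default"


print(evaluate(Request("/index.html", bot="verified")))
print(evaluate(Request("/index.html", over_rate_limit=True)))
print(evaluate(Request("/index.html")))
```

In this model a verified bot passes the second rule with only a label attached and is then stopped by the third rule, which mirrors the split between labeling and blocking described above.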
After installing the ruleset on the test environment, we waited for a few days to collect enough logs to be able to generate some traffic statistics and also to check for false positives (WAF blocks requests that should be allowed) or false negatives (WAF fails to detect harmful traffic).
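Traffic statistics of this kind can also be derived directly from the WAF logs. The sketch below counts blocked requests per terminating rule from JSON log lines, relying on the action and terminatingRuleId fields of AWS WAF log records. The sample records are invented for the example and are not taken from the customer's logs.

```python
import json
from collections import Counter


def count_blocks_by_rule(log_lines):
    """Count BLOCK decisions per terminating rule in JSON-formatted WAF log lines."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        record = json.loads(line)
        if record.get("action") == "BLOCK":
            counts[record.get("terminatingRuleId", "unknown")] += 1
    return counts


# Invented sample records that carry only the fields used above.
sample_logs = [
    json.dumps({"action": "BLOCK", "terminatingRuleId": "block_webfonts_path",
                "httpRequest": {"clientIp": "203.0.113.10", "uri": "/webfonts/2775253ec3a5.css"}}),
    json.dumps({"action": "BLOCK", "terminatingRuleId": "bot-control-block-verified",
                "httpRequest": {"clientIp": "198.51.100.7", "uri": "/"}}),
    json.dumps({"action": "BLOCK", "terminatingRuleId": "bot-control-block-verified",
                "httpRequest": {"clientIp": "198.51.100.8", "uri": "/about"}}),
    json.dumps({"action": "ALLOW", "terminatingRuleId": "Default_Action",
                "httpRequest": {"clientIp": "192.0.2.44", "uri": "/"}}),
]

for rule_id, count in count_blocks_by_rule(sample_logs).most_common():
    print(f"{rule_id}: {count}")
```

Reviewing the blocked requests per rule alongside the client IPs and URIs is one way to spot candidate false positives before the rules go to production.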
We found no problems, so we pointed the DNS record of the production environment to the production ALB, to which the WAF is also attached. With this DNS change, production traffic now passes through the WAF.
Datadog WAF Logs Example Views
Datadog showing requests blocked by the WAF using rule block_webfonts_path in the last 24 hours.
Datadog statistics for blocked requests from unverified bots for the last 24 hours.
Datadog statistics for blocked requests from verified bots for the last 24 hours.
Results and Benefits
By using this set of rules, we were able to reduce the number of unwanted requests reaching the web servers and thereby prevent further outages.
About PCG
Public Cloud Group (PCG) supports companies in their digital transformation through the use of public cloud solutions.
With a product portfolio designed to accompany organisations of all sizes on their cloud journey and highly qualified staff that clients and partners like to work with, PCG is positioned as a reliable and trustworthy partner for the hyperscalers, with repeatedly validated competence and credibility.
We have the highest partnership status with the three relevant hyperscalers: Amazon Web Services (AWS), Google, and Microsoft. As experienced providers, we advise our customers independently with cloud implementation, application development, and managed services.