Monday 17 December 2018

How to block Baidu using Cloudflare

Bots and crawlers don't always play nice, Baidu in particular. They consume your server's bandwidth and resources. If your website's content isn't aimed at readers in China, it is better to block Baidu Spider from crawling it, especially when your server resources are limited. In my case, the library OPAC runs on the Koha open source library system and serves a lot of data about books and library resources, all of it queried from a database (e.g. MariaDB or MySQL). Small libraries can't afford a high-end server or a pricey firewall. When spiders such as Baidu, MJ12bot and Yandex crawl the site extensively and repeatedly, they consume all the CPU resources.

Multiple bots, spiders and crawlers hitting the site at the same time can have a DDoS-like effect: the CPU is fully consumed and RAM load is high.
Searching the library OPAC database requires a lot of CPU processing. Sometimes the librarians couldn't do their daily tasks, such as cataloging and checking books in and out, because the server's CPU and RAM were completely occupied. Static web pages do not suffer from this kind of resource limitation.

Most advice says to set up robots.txt to block spiders from accessing your website's content, with rules such as:

User-agent: Baidu
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /

However, when I checked the Apache logs, bad bots such as Baidu Spider kept ignoring the robots.txt rules and consumed most of the server bandwidth. After some research, I advised the library to sign up with Cloudflare (it is free, by the way) for protection.
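
Before blaming the bot, it may be worth confirming that the robots.txt rules really say what you think they say. The following is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the OPAC path are my own illustration (assumptions, not part of Koha or Cloudflare), and Python's parser may not interpret the file exactly the way Baidu's crawler does.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in this post.
ROBOTS_TXT = """\
User-agent: Baidu
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /
"""


def is_allowed(bot_token, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets the crawler named bot_token fetch path.

    bot_token is the short crawler name (e.g. "Baiduspider"), not the
    full User-Agent header. is_allowed is a hypothetical helper.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot_token, path)


if __name__ == "__main__":
    # Illustrative path; adjust it to a URL your own site serves.
    opac_search = "/cgi-bin/koha/opac-search.pl"
    for bot in ("Baiduspider", "Googlebot"):
        print(bot, "allowed:", is_allowed(bot, opac_search))
```

If the rules parse as a full disallow for Baiduspider and the crawler still shows up in the logs, then the bot really is ignoring them, and blocking it before it reaches the server is the remaining option.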

The following are the steps to block Baidu Spider from accessing your website.

A. Prerequisites

  • Sign up with Cloudflare.
  • Point your domain's nameservers to the Cloudflare nameservers.

B. Block Baidu

  • Proxy your subdomain through Cloudflare (the "Orange Cloud" in the Cloudflare DNS settings).

  • In the Cloudflare Firewall, create the following rule: select Field="User Agent", Operator="Contains", Value="Baiduspider/2.0", and choose "Block" as the action. Click Save.

  • Done.

Keep monitoring the server resources and web server logs. After a few hours, you can check the firewall events.
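
For the web server logs, a small script can show how many requests from each bot still reach Apache. This is only a sketch under my own assumptions: the logs use Apache's "combined" format (user agent as the last quoted field), the function name count_bot_hits is hypothetical, and the sample lines are made up for illustration, not taken from the library's server.

```python
import re
from collections import Counter

# Substrings that identify the bots mentioned in this post.
BOT_TOKENS = ("Baiduspider", "MJ12bot", "YandexBot")

# In the combined log format the user agent is the last quoted field.
USER_AGENT = re.compile(r'"([^"]*)"\s*$')


def count_bot_hits(log_lines, tokens=BOT_TOKENS):
    """Count log lines per bot token (case-insensitive substring match)."""
    hits = Counter()
    for line in log_lines:
        match = USER_AGENT.search(line)
        if match is None:
            continue  # line is not in the expected format
        agent = match.group(1).lower()
        for token in tokens:
            if token.lower() in agent:
                hits[token] += 1
                break  # count each request once
    return hits


# Made-up sample lines (documentation IP addresses, illustrative paths).
SAMPLE_LINES = [
    '203.0.113.5 - - [17/Dec/2018:10:00:00 +0800] "GET /cgi-bin/koha/opac-search.pl HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '203.0.113.5 - - [17/Dec/2018:10:00:01 +0800] "GET /cgi-bin/koha/opac-detail.pl HTTP/1.1" 200 7300 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '198.51.100.7 - - [17/Dec/2018:10:00:02 +0800] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"',
    '192.0.2.10 - - [17/Dec/2018:10:00:03 +0800] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/64.0"',
]

if __name__ == "__main__":
    for bot, count in count_bot_hits(SAMPLE_LINES).most_common():
        print(bot, count)
```

To use it on a real log, pass an open file instead of SAMPLE_LINES, e.g. count_bot_hits(open(path_to_access_log)). Once the Cloudflare rule is working, the count for a blocked bot should drop towards zero, because the requests are stopped before they reach Apache.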

Firewall events show that Baidu Spider was successfully blocked.

Details of the firewall events. Baidu has been successfully blocked.

Thanks to Cloudflare, the library system's resources are now freed up.

CPU usage is now below 10%.

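If you want to block every crawler whose user agent contains "bot", you can try a firewall rule with Field="User Agent", Operator="Contains", Value="bot". Be aware that this also matches crawlers you may want to keep, such as Googlebot, and that it does not match Baiduspider, whose name does not contain "bot". You can also identify each bot specifically and add a rule for each one. Since Cloudflare only gives 5 rules on the Free account, I combine all the bots I want to block in one rule.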

Bots can be combined using OR, so I use only one rule to block them, which leaves me another 4.
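
To show what such a combined rule looks like, here is a rough sketch that builds the expression text for a list of bots, in the form Cloudflare's expression editor uses (http.user_agent contains "..." joined with or). The function names and the bot list are my own assumptions for illustration. The would_block helper is only a local approximation, on the assumption that "contains" is a case-sensitive substring match; it does not call Cloudflare.

```python
def build_block_expression(tokens):
    """Join one user-agent `contains` test per bot token with `or`."""
    if not tokens:
        raise ValueError("at least one bot token is required")
    clauses = []
    for token in tokens:
        # Escape backslashes and double quotes inside the quoted string.
        escaped = token.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'(http.user_agent contains "{escaped}")')
    return " or ".join(clauses)


def would_block(user_agent, tokens):
    """Local approximation of the rule: case-sensitive substring match."""
    return any(token in user_agent for token in tokens)


if __name__ == "__main__":
    # Illustrative list; replace it with the bots seen in your own logs.
    bots = ["Baiduspider", "MJ12bot", "YandexBot"]
    print(build_block_expression(bots))
    baidu_agent = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
                   "+http://www.baidu.com/search/spider.html)")
    print(would_block(baidu_agent, bots))
```

The printed expression can be pasted into the rule's expression editor, with the action set to block. Keeping all bots in one expression is what leaves the other rules on the Free account available.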

I am not an expert in web security; this is just research and trial and error. I hope this article helps those who face similar problems.
