This article describes my nginx configuration and strategy on how to prevent web crawlers from putting down my instance while still serving most people with minimal amount of friction.

TL;DR:

Put that in your nginx config:

location / {
  # needed to still allow git clone from http/https URLs
  if ($http_user_agent ~* "git/|git-lfs/") {
        set $bypass_cookie 1;
  }
  # If we see the expected cookie; we could also bypass the blocker page
  if ($cookie_Yogsototh_opens_the_door = "1") {
        set $bypass_cookie 1;
  }
  # Redirect to 418 if neither condition is met
  if ($bypass_cookie != 1) {
     add_header Content-Type text/html always;
     return 418 '<script>document.cookie = "Yogsototh_opens_the_door=1; Path=/;"; window.location.reload();</script>';
  }
  # rest of your nginx config

Preferably run a string replace from Yogsototh_opens_the_door to your own personal Cookie name.

Main advantage, is that it is almost invisible to the users of my website compartively to other solutions like Anubis.

More detail

Not so long ago I started to host my code to forgejo. There is a promise that in the future it will support federation and forgejo is the same project that is used for codeberg.

The only problem I had was one day, I discovered that my entire node was down. At first I didn't investigate and just restarted the node. But soon after a few hours, it was down again. Looking at the reason, clearly thousands of requests that looked at every commit which put too much pressure on the system. Who could be so interested in using the web API to look at every commit instead, of… you know, clone the repository locally and explore it. Quickly, yep, like so many of you, I discovered that tons of crawlers that did not respect the robots.txt are crawling my forgejo instance until death ensues.

So I had no choice, I first used a radical approach and blocked my website entirely except from me. But hey, why having a public forge if not for people to be able to look into it time to time?

I then installed Anubis, but it wasn't really for me. It is way too heavy for my needs, not as easy as I would have hoped to configure and install.

Then I saw this article You don't need anubis on lobste.rs using a simple configuration in caddy that should block these pesky crawlers. I made some adjustments to adapt it to nginx. For now, this is working perfectly well, my users are just redirected once, without really noticing it. And they could use forgejo as they could before. And this puts the crawlers away.

The strategy is pretty basic; in fact, a lot less advanced than the strategy adopted by Anubis. For every access of my website, I just check if the user has a specific cookie set. If not, I redirect the user to a 418 HTML page containing some js code to execute that set this specific cookie and reload the page.

That's it.

I also tried to return a 302 and add a cookie from the response without javascript, but the crawlers are immune to that second strategy. Unfortunately this means, my website could only be seen if you enable javascript in your browser. I feel this is acceptable. I guess, someday this very basic protection will not be enough and my forgejo instance will break again, and I will be forced to use more advanced system like Anubis or perhaps even iocaine.

I hope this could be helpful, because, I recently saw many discussions on that subject where people were not totally happy to use Anubis, while at least for me, this quick dirty fix does the trick. And I am fully aware that this would be very easy to bypass. But for now, I think the volume is more important than the quality for these crawlers and it may take a while for them to need to adapt. Also, by publishing this, I know if too many people use the same trick, quickly, these crawlers will adapt.