TL;DR:
Put this in your nginx config:
location / {
    # default: no bypass (avoids an uninitialized-variable warning)
    set $bypass_cookie 0;
    # needed to still allow git clone from http/https URLs
    if ($http_user_agent ~* "git/|git-lfs/") {
        set $bypass_cookie 1;
    }
    # If we see the expected cookie, we can also bypass the blocker page
    if ($cookie_Yogsototh_opens_the_door = "1") {
        set $bypass_cookie 1;
    }
    # Return the 418 challenge page if neither condition is met
    if ($bypass_cookie != 1) {
        add_header Content-Type text/html always;
        return 418 '<script>document.cookie = "Yogsototh_opens_the_door=1; Path=/;"; window.location.reload();</script>';
    }
    # rest of your nginx config
}
Preferably, run a string replace to change
Yogsototh_opens_the_door to a cookie name
of your own.
The main advantage is that it is almost invisible to the users of my
website, compared to other solutions like Anubis.
More detail
Not so long ago I started hosting my code on forgejo. There is a promise that in the
future it will support federation, and forgejo is the same project that
is used for codeberg.
The only problem came one day when I discovered that my entire node
was down. At first I didn't investigate and just restarted the node. But
a few hours later, it was down again. Looking for the cause, I clearly
saw thousands of requests walking through every commit, which put too
much pressure on the system. Who could be so interested in using the web
API to look at every commit instead of, you know, cloning the repository
locally and exploring it? Quickly, yep, like so many of you, I discovered
that tons of crawlers that do not respect robots.txt
were crawling my forgejo instance until death ensued.
So I had no choice. I first took a radical approach and blocked my
website entirely for everyone except me. But hey, why have a public forge if
not for people to be able to look into it from time to time?
I then installed Anubis, but it wasn't really for me. It is way too
heavy for my needs, and not as easy to configure and install as I would
have hoped.
Then I saw the article "You don't need anubis" on lobste.rs, which uses
a simple configuration in caddy that should block these pesky crawlers.
I made some adjustments to adapt it to nginx. For now, this is working
perfectly well: my users just see one extra page reload, without really
noticing it, and they can use forgejo as they did before. And this
keeps the crawlers away.
The strategy is pretty basic; in fact, a lot less advanced than the
strategy adopted by Anubis. For every request to my website, I just check
whether the user has a specific cookie set. If not, I return a
418 HTML page containing some JS code that sets this specific
cookie and reloads the page.
That's it.
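To make the decision logic explicit, here is a minimal Python sketch of what the nginx rules above do for each request. It is only a model for illustration, not something to deploy: the function name needs_challenge and the dict-based cookie argument are hypothetical, while the cookie name and the user-agent pattern are copied from the config.

```python
import re

# Cookie name and user-agent pattern copied from the nginx config.
COOKIE_NAME = "Yogsototh_opens_the_door"
GIT_UA = re.compile(r"git/|git-lfs/", re.IGNORECASE)  # mirrors the `~*` match


def needs_challenge(user_agent, cookies):
    """Return True when the request should get the 418 challenge page.

    `user_agent` is the User-Agent header value (may be None) and
    `cookies` is a dict of cookie names to values. Both argument
    shapes are assumptions made for this sketch.
    """
    if GIT_UA.search(user_agent or ""):
        return False  # git and git-lfs clients bypass the check
    if cookies.get(COOKIE_NAME) == "1":
        return False  # the challenge page already set the cookie
    return True  # neither condition met: serve the JS challenge
```

A browser's first visit has no cookie, so it gets the challenge; after the script runs and the page reloads, the same browser sends the cookie and passes straight through.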
I also tried returning a 302 that sets the cookie in the response,
without JavaScript, but the crawlers are immune to that second strategy.
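One plausible explanation, which is an assumption on my part and not something measured here, is that these crawlers follow redirects and keep cookies like any HTTP client, but do not execute JavaScript. The small simulation below sketches that difference; the server and client functions are hypothetical stand-ins, not real nginx or crawler behavior.

```python
COOKIE = "Yogsototh_opens_the_door"


def server_302(jar):
    """302 strategy: the cookie is set through a response header."""
    if jar.get(COOKIE) == "1":
        return {"status": 200, "header_cookies": {}, "script_cookies": {}}
    return {"status": 302, "header_cookies": {COOKIE: "1"}, "script_cookies": {}}


def server_js(jar):
    """418 strategy: the cookie is only set by a script in the body."""
    if jar.get(COOKIE) == "1":
        return {"status": 200, "header_cookies": {}, "script_cookies": {}}
    return {"status": 418, "header_cookies": {}, "script_cookies": {COOKIE: "1"}}


def client_reaches_content(server, runs_js, max_requests=3):
    """Simulate a client with a cookie jar; True if it ever gets a 200."""
    jar = {}
    for _ in range(max_requests):
        response = server(jar)
        if response["status"] == 200:
            return True
        # Any HTTP client with a cookie jar stores header cookies.
        jar.update(response["header_cookies"])
        # Only a client with a JavaScript engine applies script cookies.
        if runs_js:
            jar.update(response["script_cookies"])
    return False
```

Under these assumptions, a client without JavaScript gets through the 302 variant on its second request but never gets past the 418 page, while a real browser passes both.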
Unfortunately, this means my website can only be seen if you enable
JavaScript in your browser. I feel this is acceptable. I guess someday
this very basic protection will not be enough and my forgejo instance
will break again, and I will be forced to use a more advanced system like
Anubis or perhaps even iocaine.
I hope this is helpful, because I recently saw many
discussions on the subject where people were not totally happy with
Anubis, while at least for me, this quick and dirty fix does the trick.
I am fully aware that this would be very easy to bypass. But for now, I
think volume matters more than quality to these crawlers,
and it may take a while before they need to adapt. Also, by publishing
this, I know that if too many people use the same trick, these
crawlers will quickly adapt.