

New page on seirdy.one: Scrapers I block (and allow), with explanations.

I’ve replaced all the comments in my robots.txt file with a more readable and detailed web page on the scrapers I block. It includes info on the multiple blocking approaches and criteria I use, commonly blocked scrapers I allow, and more fact-checking than most of the more comprehensive alternatives.


#RobotsTxt #Scrapers #POSSE


in reply to Seirdy

hat-tip to @jmjl for pointing me to OpenWebSearch.EU/Owler’s new “GenAI” product token and SemRush’s array of product tokens.
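
For anyone unfamiliar: a product token is the name a crawler matches against User-agent lines in robots.txt, so a vendor that publishes a separate token per product lets you opt out of one use without the others. A minimal sketch of opting out of a single token; the exact “GenAI” spelling below is an assumption, so check the vendor’s docs:

  # Opt out of the generative-AI product only (token spelling is an assumption)
  User-agent: GenAI
  Disallow: /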
in reply to Seirdy

what the fuck??? there are people who fucking inject ads into sites without ads?????
in reply to solo

@solonovamax Very common form of malware, especially through browser extensions. Also a thing that happens on unencrypted connections.
in reply to Seirdy

oh, I thought you meant there were legitimate (non-malware) companies that do it
in reply to Seirdy

might want to edit it

you wrote

in that it’s a vague polite request with incentive for compliance.


I assume here you intended to write "with no incentive for compliance"

in reply to Seirdy

Update: I added NoCache to my X-Robots-Tag header and documented why.
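
X-Robots-Tag is the HTTP-header counterpart of the robots meta tag; nocache is a Bing-recognized directive similar to noarchive. A minimal sketch of such a response header (pairing it with noarchive here is an assumption, not my exact directive list):

  # HTTP response header; a sketch, not the exact set of directives I ship
  X-Robots-Tag: noarchive, nocache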
in reply to Seirdy

yo, I'm looking into additional things to block in the robots.txt for my website (basing a decent bit of it off of yours, plus any additional stuff I find), and I figured I'd throw this your way

I'm personally planning to block everything from the first URL, as well as the following from the second URL (rough robots.txt sketch after this list)

  • all AI related tools
  • several of the "Intelligence Gatherers"
  • possibly several of the "Scrapers"
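
here's roughly what that could look like in robots.txt, with several User-agent lines sharing one rule; the tokens are common examples, not the exact contents of those lists:

  # group multiple product tokens under a single rule (example tokens only)
  User-agent: GPTBot
  User-agent: CCBot
  User-agent: Bytespider
  Disallow: /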

I would also like to note that BLEXBot is listed on the second site as an "SEO Crawler", and the site indicates that it does not believe BLEXBot is AI-related. (nvm, I mis-remembered and thought you had blocked it due to it being AI-related)

I'll mention any other resources as I find them.

in reply to solo

@solonovamax I’m aware of those resources. They are error-prone. I cite them at the end.
in reply to Seirdy

ah, I see
did not look at the things you cite lol

do you have some examples of things that are incorrect?

there are several on that list that you don't block but that would probably be good to block (unsure if they're actually used in practice anymore or if they're just historical), such as

  • Claude-Web
  • cohere-ai
  • anthropic-ai

there is also the aiHitBot one that I mentioned
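
if I do end up blocking those, the grouped syntax keeps it compact (probably worth verifying each token is still in active use first):

  User-agent: Claude-Web
  User-agent: cohere-ai
  User-agent: anthropic-ai
  User-agent: aiHitBot
  Disallow: /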

in reply to solo

@solonovamax Several of those are LLM clients, but they aren’t actually used to train LLMs: they’re search crawlers that power links in search engines built on third-party pre-trained LLMs. I decided that opting out of those only stops my site from being cited; it doesn’t stop LLM training.
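
To illustrate with OpenAI’s commonly documented tokens (an example, not my exact policy): GPTBot gathers training data, while ChatGPT-User only fetches pages on demand to answer or cite for a user, so blocking the latter mostly removes citations:

  # blocking the training crawler opts out of model training
  User-agent: GPTBot
  Disallow: /

  # blocking the on-demand fetcher mainly stops pages from being cited
  User-agent: ChatGPT-User
  Allow: /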
in reply to Seirdy

@solonovamax I describe my criteria in the “Criteria for bad-bot blocking” section, and elaborate in “Exceptions: scrapers I allow, despite meeting some block-criteria”.