

New page on seirdy.one: Scrapers I block (and allow), with explanations.

I’ve replaced all the comments in my robots.txt file with a more readable and detailed web page on the scrapers I block. It includes info on the multiple blocking approaches and criteria I use, commonly blocked scrapers I allow, and more fact-checking than most of the more comprehensive alternatives.


#RobotsTxt #Scrapers #POSSE


in reply to Seirdy

hat-tip to @jmjl for pointing me to OpenWebSearch.EU/Owler’s new “GenAI” product token and SemRush’s array of product tokens.
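
For anyone unfamiliar: a product token is the name a crawler matches against User-agent lines in robots.txt, so a vendor that publishes a separate token per product lets you opt out of one use without the others. A minimal sketch of opting out of a single token; the exact “GenAI” spelling below is an assumption, so check the vendor’s docs:

  # Opt out of the generative-AI product only (token spelling is an assumption)
  User-agent: GenAI
  Disallow: /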
in reply to Seirdy

what the fuck??? there are people who fucking inject ads into sites without ads?????
in reply to solo

@solonovamax Very common form of malware, especially through browser extensions. Also a thing that happens on unencrypted connections.
in reply to Seirdy

oh, I thought you meant there were legitimate (non-malware) companies that do it
in reply to Seirdy

might want to edit it

you wrote

in that it’s a vague polite request with incentive for compliance.


I assume here you intended to write "with no incentive for compliance"

in reply to Seirdy

Update: I added NoCache to my X-Robots-Tag header and documented why.
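
X-Robots-Tag is the HTTP-header counterpart of the robots meta tag; nocache is a Bing-recognized directive similar to noarchive. A minimal sketch of such a response header (pairing it with noarchive here is an assumption, not my exact directive list):

  # HTTP response header; a sketch, not the exact set of directives I ship
  X-Robots-Tag: noarchive, nocache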
in reply to Seirdy

yo, I'm looking into additional things to block in the robots.txt for my website (basing a decent bit of it off of yours, plus any additional stuff I find), and I figured I'd throw this your way

I'm personally planning to block everything from the first URL, as well as the following from the second URL (rough robots.txt sketch after this list)

  • all AI related tools
  • several of the "Intelligence Gatherers"
  • possibly several of the "Scrapers"
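
here's roughly what that could look like in robots.txt, with several User-agent lines sharing one rule; the tokens are common examples, not the exact contents of those lists:

  # group multiple product tokens under a single rule (example tokens only)
  User-agent: GPTBot
  User-agent: CCBot
  User-agent: Bytespider
  Disallow: /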

I would also like to note that BLEXBot is listed on the second site as an "SEO Crawler", and the site indicates that it does not believe BLEXBot is AI-related. (nvm, I mis-remembered and thought you had blocked it due to it being AI-related)

I'll mention any other resources as I find them.

in reply to solo

@solonovamax I’m aware of those resources. They are error-prone. I cite them at the end.
in reply to Seirdy

ah, I see
did not look at the things you cite lol

do you have some examples of things that are incorrect?

there are several on that list that you don't block but that would probably be good to block (unsure if they're actually used in practice anymore or if they're just historical), such as

  • Claude-Web
  • cohere-ai
  • anthropic-ai

there is also the aiHitBot one that I mentioned
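
if I do end up blocking those, the grouped syntax keeps it compact (probably worth verifying each token is still in active use first):

  User-agent: Claude-Web
  User-agent: cohere-ai
  User-agent: anthropic-ai
  User-agent: aiHitBot
  Disallow: /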

in reply to solo

@solonovamax Several of those are LLM clients, but they aren’t actually used to train LLMs: they’re search crawlers that power links in search engines built on third-party pre-trained LLMs. I decided that opting out of those only stops my site from being cited; it doesn’t stop LLM training.
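
To illustrate with OpenAI’s commonly documented tokens (an example, not my exact policy): GPTBot gathers training data, while ChatGPT-User only fetches pages on demand to answer or cite for a user, so blocking the latter mostly removes citations:

  # blocking the training crawler opts out of model training
  User-agent: GPTBot
  Disallow: /

  # blocking the on-demand fetcher mainly stops pages from being cited
  User-agent: ChatGPT-User
  Allow: /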
in reply to Seirdy

@solonovamax I describe my criteria in the “Criteria for bad-bot blocking” section, and elaborate in “Exceptions: scrapers I allow, despite meeting some block-criteria”.