<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>aicrawlers &#8212; jolek78&#39;s blog</title>
    <link>https://jolek78.writeas.com/tag:aicrawlers</link>
    <description>thoughts from a friendly human being</description>
    <pubDate>Sat, 16 May 2026 10:27:09 +0000</pubDate>
    <image>
      <url>https://i.snap.as/DEj7yFm4.png</url>
      <title>aicrawlers &#8212; jolek78&#39;s blog</title>
      <link>https://jolek78.writeas.com/tag:aicrawlers</link>
    </image>
    <item>
      <title>Guests on our own web</title>
      <link>https://jolek78.writeas.com/guests-on-our-own-web?pk_campaign=rss-feed</link>
      <description><![CDATA[A few months ago I spun up a new VPS on Linode, London datacentre. Nothing special – Debian, Nginx, a Let&#39;s Encrypt certificate, a domain I was going to use for my daily notes and my homelab experiments. No link posted anywhere, no entries in my feeds, no backlinks from the sites I run. Just a freshly assigned IP, from a subnet that a week earlier had belonged to someone else.]]></description>
      <content:encoded><![CDATA[<p>A few months ago I spun up a new VPS on <strong>Linode</strong>, London datacentre. Nothing special – <strong>Debian</strong>, <strong>Nginx</strong>, a <strong>Let&#39;s Encrypt</strong> certificate, a domain I was going to use for my daily notes and my homelab experiments. No link posted anywhere, no entries in my feeds, no backlinks from the sites I run. Just a freshly assigned IP, from a subnet that a week earlier had belonged to someone else.</p>



<p>The one thing I had configured carefully was the logs: nginx with an extended format, journald with audit, a few baseline <strong>fail2ban</strong> jails. I wanted to see what happens to a server that doesn&#39;t yet have a life, before I gave it one. Twenty-four hours later, I opened the logs. No humans. That was expected – I hadn&#39;t told anyone the domain. But there was already a small zoo of other presences. A <code>wget</code> from a Polish VPS with a phantom reverse DNS, the kind registered with a placeholder that never got updated. Three GETs, same resource, thirty-six second intervals. Then nothing. An SSH scan on port 80 – yes, an SSH scan on the HTTP port – written in Go, with a user-agent that claimed to be Mozilla/5.0 but was negotiating TLS the way only Go&#39;s crypto libraries do. <strong>VisionHeight</strong>, a commercial scanner that bills itself as ethical, mapped seven ports in two and a half minutes. <strong>Censys</strong> came through twice, identifying itself, leaving its own PTR and a link to its opt-out page. A <strong>Common Crawl</strong> crawler. <strong>GPTBot</strong>. <strong>ClaudeBot</strong>. <strong>AppleBot</strong>.</p>
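
<p>For anyone who wants to set up the same observation post, the format in question is nothing exotic. A minimal sketch along these lines is enough; the format name and paths are illustrative, the variables are standard nginx ones:</p>

<pre><code># inside the http { } block of /etc/nginx/nginx.conf (illustrative)
log_format extended '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent" '
                    'host=$host rt=$request_time '
                    'tls=$ssl_protocol cipher=$ssl_cipher';

# then, per server block:
# access_log /var/log/nginx/access.log extended;
</code></pre>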

<p>People: zero.</p>

<p>I spent the evening watching those logs the way you&#39;d watch a sequence of read-heads on a tape. It was like opening the door to a flat you&#39;d just rented and finding it already occupied by intruders. <em>This is a public network</em>, they seemed to be saying, <em>and nobody told you what public means</em>.</p>

<p>Since then I&#39;ve done what everyone does: I&#39;ve built defences. <strong>nftables</strong> to drop ASNs known for aggressive scanning. fail2ban with custom jails for nginx that recognise the patterns of the noisier scans – probes against <code>/wp-login.php</code> on a server that doesn&#39;t run <strong>WordPress</strong>, attempts at <code>/.env</code>, requests for phpMyAdmin paths that don&#39;t exist. <strong>GoAccess</strong> to visualise what little organic traffic remains once the rest is filtered out. An alert system over <strong>ntfy</strong> for out-of-band anomalies. It is routine – every sysadmin running a homelab has their own variant. But building it calmly, rather than as a patch on something that has already fallen over, is precisely what gets you to look at things that would otherwise scroll past, filtered away.</p>
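
<p>To make the shape of those jails concrete, here is a hedged sketch of the kind of filter and jail I mean. The filter name is mine and the regex is deliberately crude; adjust both to your own log format:</p>

<pre><code># /etc/fail2ban/filter.d/nginx-probes.conf (illustrative)
[Definition]
failregex = ^&lt;HOST&gt; .*"(GET|POST) /(wp-login\.php|xmlrpc\.php|\.env|phpmyadmin)
ignoreregex =

# /etc/fail2ban/jail.local
[nginx-probes]
enabled  = true
port     = http,https
filter   = nginx-probes
logpath  = /var/log/nginx/access.log
maxretry = 2
findtime = 600
bantime  = 86400
</code></pre>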

<p>And while I was building it, a question came to mind, maybe a banal one, an extremely banal one: <em>who am I doing all this for</em>?</p>

<p>Not for the readers – those are few, almost none, they arrive via RSS, shared links, the occasional search engine. I was defending the server from a network that is predominantly non-human. I was configuring jails for scanners that don&#39;t know me, for crawlers that don&#39;t read me, for botnets that don&#39;t particularly mean me harm – they mean harm to anyone reachable on port 22 or 80.</p>

<h2 id="the-threshold-51-and-already-53" id="the-threshold-51-and-already-53">The threshold: 51% (and already 53%)</h2>

<p>In 2024, for the first time in ten years, bot-generated traffic surpassed human traffic on the internet. Fifty-one percent against forty-nine. The figure comes from <strong>Imperva</strong>&#39;s <em>Bad Bot Report</em>, 2025 edition, the twelfth in the annual series – the analysis is based on thirteen trillion requests blocked by their global mitigation network in 2024 alone. It is the number that best sums up where we have ended up.</p>

<p>The 2026 <em>Bad Bot Report</em>, published a few weeks ago with 2025 data, has updated the figure: 53% bots, 47% humans. Another point and a half lost in twelve months. It did not happen all at once. Here is the historical series, from 2015 onwards:</p>

<table>
<thead>
<tr>
<th>Year</th>
<th>Humans</th>
<th>Bad bots</th>
<th>Good bots</th>
</tr>
</thead>

<tbody>
<tr>
<td>2015</td>
<td>54%</td>
<td>27%</td>
<td>19%</td>
</tr>

<tr>
<td>2018</td>
<td>62%</td>
<td>22%</td>
<td>17%</td>
</tr>

<tr>
<td>2020</td>
<td>59%</td>
<td>26%</td>
<td>15%</td>
</tr>

<tr>
<td>2022</td>
<td>53%</td>
<td>30%</td>
<td>17%</td>
</tr>

<tr>
<td>2023</td>
<td>50%</td>
<td>32%</td>
<td>18%</td>
</tr>

<tr>
<td>2024</td>
<td>49%</td>
<td>37%</td>
<td>14%</td>
</tr>

<tr>
<td>2025</td>
<td>47%</td>
<td>n/a</td>
<td>n/a</td>
</tr>
</tbody>
</table>

<p><em>(Source: Imperva, Bad Bot Report 2025 and 2026)</em></p>

<p>Humans have lost seven percentage points in ten years. The erosion is slow and steady – a descending curve measured in years, not in months. Nobody cut a ribbon to announce <em>we have crossed the threshold</em>. It was a gradual shift of the axis, a median that moved while we were looking elsewhere. Meanwhile, the bad bots grew from 27% to 37%. Ten percentage points in ten years, all on the predatory side. Brute force, credential stuffing, data scraping, account takeover, API fraud. Imperva records that ATOs – <strong>Account Takeover Attacks</strong> – grew by 40% in 2024 alone, and in 2025 the financial sector absorbed 46% of all ATO incidents worldwide. And, <em>dulcis in fundo</em>, the “good bots” – <strong>Googlebot</strong>, <strong>Bingbot</strong>, the legitimate aggregators, the health checkers – went down. From 19% in 2015 to 14% in 2024. The indexing services that historically justified bandwidth consumption have lost ground: the network has become more automated, but in a direction that does not pay off for those who publish.</p>

<p><strong>Cloudflare</strong> confirms this with independent data. Their <em>Radar Year in Review 2025</em>, published at the end of December, reports that global internet traffic grew by 19% in 2025, and a substantial share of that growth is attributable to bots and AI crawlers. Googlebot still dominates – around 28% of verified traffic – but the new generation is gaining fast: <strong>OpenAI</strong>&#39;s GPTBot went from 4.7% in July 2024 to 11.7% in July 2025. ChatGPT-User, the bot that acts on explicit user command, recorded a year-on-year growth of 2,825% in request volume. That is not a typo. <strong>PerplexityBot</strong>, even more extreme: +157,490%.</p>

<p>The 51% threshold has to be read in this context. The curve has been rising for years, and 2024 is not the peak. The network we are using today is not the 2015 network with a few more bots: <em>it is a structurally different network, where humans have gone from being the main signal to being the background noise</em>.</p>

<h2 id="who-is-talking-in-this-network" id="who-is-talking-in-this-network">Who is talking in this network?</h2>

<p>The word “bot” does not name a single thing. The presences in the logs belong to three families that do different jobs, have different economies, and produce different pressures on the infrastructure. Three main categories, then.</p>

<p><strong>The cartographers.</strong> These are the scanners that map the entire IPv4 space – four billion three hundred million addresses – across all or nearly all known ports, and maintain queryable databases of exposed services. The founding project is <strong>ZMap</strong>, released in 2013 by a team at the <strong>University of Michigan</strong>. ZMap is a port scanner that can scan the entire IPv4 space on a single port in under 45 minutes from a single machine, from userspace, over a gigabit connection. Technically remarkable: it cuts by an order of magnitude the time needed to “see” all of the internet. Censys was built on top of ZMap, launched in 2015 by the same authors. Censys continuously scans IPv4, collects TLS certificates, service banners, software fingerprints, and keeps everything in a queryable commercial database. <strong>Shodan</strong>, founded in 2009 by <strong>John Matherly</strong>, is the conceptual predecessor: less polished technically, but longer-lived and more deeply rooted in sysadmin culture. <strong>Rapid7</strong>&#39;s Project Sonar, ZoomEye, Fofa, Netlas – all follow the same logic.</p>
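
<p>To make the claim concrete: the invocation is roughly this simple. A hedged example only; check the current ZMap documentation before trusting the flags, and never scan ranges you are not authorised to probe:</p>

<pre><code># one port, one pass, bandwidth capped at 10 Mbit/s (192.0.2.0/24 is a documentation range)
zmap -p 443 -B 10M -o results.csv 192.0.2.0/24
</code></pre>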

<p>In 2012, an anonymous researcher decided he wanted to take a census of the internet but did not have the bandwidth. He built an illegal botnet of compromised routers – the <strong>Carna Botnet</strong> – and ran the first Internet Census: he published the dataset online and openly admitted the offences he had committed to get it. It remains a case study in the asymmetry between technical capability and legality – what Censys does today from a datacentre, ten years ago was a federal crime in the United States. The scanners describe themselves as <em>ethical</em>. They respect <code>abuse@</code>, publish their methodology, exclude networks on request, leave identifiable PTRs. All true. But the data they produce – the complete, near-real-time map of what is exposed on the internet – is sold by subscription, and the clients include academic research and the surveillance industry, corporate threat intelligence and aspiring attackers with seventy-nine dollars a month for a base account. In 2014, at <strong>Def Con 22</strong>, researchers Dan Tentler, Paul McMillan and Robert Graham ran a live IPv4 scan on port 5900 looking for <strong>VNC</strong> servers without authentication. They found thirty thousand systems accessible without a password. Among them: two hydroelectric power stations, the cameras of a Czech casino, industrial control systems, ATMs, a caviar production plant. The map exists because producing it is cheap, and <em>who consults it – for what purposes, with what consequences – is a consequence of that price, not the reason for the project</em>.</p>

<p><strong>The extractors.</strong> These are the AI crawlers. They existed in embryonic form before too – Common Crawl for years, the indexing archives of search engines forever – but since November 2022, with the release of ChatGPT, they have changed in nature and in volume.</p>

<p>Cloudflare&#39;s data, collected from a fixed sample of clients to eliminate the growth bias, is explicit. Between July 2024 and July 2025:</p>
<ul><li>GPTBot (<strong>OpenAI</strong>): from 4.7% to 11.7% of total crawler traffic</li>
<li>ClaudeBot (<strong>Anthropic</strong>): from 6% to nearly 10%</li>
<li>Meta-ExternalAgent (<strong>Meta</strong>): from 0.9% to 7.5%</li>
<li>PerplexityBot (<strong>Perplexity</strong>): growth of 157,490%</li>
<li>Bytespider (<strong>ByteDance</strong>): declining, from 14.1% to 2.4%</li></ul>

<p>The most revealing figure is the composition by purpose. Cloudflare classifies AI crawling into three categories: <em>training</em> (data collection to train models), <em>search</em> (indexing for chat search), <em>user action</em> (visits on explicit user command). Over the past twelve months, 80% of AI crawling has been for training. 18% for search. 2% for user action. In the most recent six months the training share has risen further, to 82%. <em>The overwhelming majority of the work these bots do around the web does not, then, serve network mapping – it serves to extract content, process it, and turn it into training data for models that will then sell access or use the output to generate responses that compete with the originating site</em>.</p>

<p>Another Cloudflare metric measures the imbalance directly: the <em>crawl-to-refer ratio</em>, that is, how many requests a bot makes versus how much traffic it then sends back to the source site. In July 2025, Anthropic was crawling 38,000 pages for every human visitor it sent back – a clear improvement on the 286,000:1 ratio recorded in January of the same year, but still the most lopsided extreme among the major AI platforms. OpenAI in the same period was running at around 1,500:1. Perplexity 194:1. The economic model is asymmetric extraction: take a lot, give back little.</p>
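
<p>You can compute a rough, per-site version of that ratio from an ordinary access log. A sketch with GPTBot as the example; the assumption that referred visits arrive with a chatgpt.com referrer is mine, and many arrive with no referrer at all:</p>

<pre><code># requests made by the crawler vs. visits it referred back (very rough)
crawled=$(grep -c 'GPTBot' /var/log/nginx/access.log)
referred=$(grep -c 'chatgpt\.com' /var/log/nginx/access.log)
echo "crawl-to-refer on this site: ${crawled}:${referred}"
</code></pre>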

<p><strong>The parasites.</strong> These are the bad bots in the strict sense: 37% of total internet traffic in 2024. Thirteen trillion requests blocked by Imperva&#39;s network alone in that year.</p>

<p>Here the composition changes. Imperva observes that “simple” attacks – basic scripts, dictionary attacks, automated scans – grew from 40% to 45% in 2024. The report explicitly attributes this growth to the arrival of generative AI: tools like ChatGPT, <strong>Claude</strong>, <strong>Llama</strong> have lowered the technical barrier to writing a brute forcer, a credential stuffer, a malicious crawler. What ten years ago required Perl and John the Ripper today requires a prompt and ten minutes. 31% of total attacks recorded by Imperva fall into one of the twenty-one OWASP Automated Threats categories. 44% of advanced bot traffic attacks APIs, no longer web pages – because APIs expose business logic with fewer defences and more value. 21% of attacks use residential proxies: IP addresses belonging to real domestic connections, rented on the grey market, allowing the bot to blend in as legitimate user traffic. Geo-fencing, per-IP rate limiting, ASN blacklists – all useless against an attacker who routes traffic through a residential fibre line in Milan.</p>

<p>One detail demolishes a widespread myth. <em>Attackers usually do not want to take a site down. They want to use it</em>. A compromised site is worth more alive than dead: as a host for phishing, cryptocurrency mining, botnet command-and-control, traffic redirect for black-hat SEO, file storage for warez. When the site falls, the attacker has done something wrong – they have saturated resources, triggered detection, burned their foothold. <strong>Akamai</strong> regularly publishes reports that confirm this: the economic model of the malicious bot is the long stay, not the raid. This changes the reading of visible symptoms. If a site falls over with intermittent 502s, the structural explanation is almost always: saturation of a PHP-FPM pool due to medium-scale bot traffic, on infrastructure that was not dimensioned to absorb half the internet knocking at the same time. The political explanation – <em>they are attacking us to silence us</em> – is almost always false, because anyone who knows how to attack seriously does not let the site fall over.</p>
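
<p>The saturation in question lives in one small file. An illustration of the knobs involved; the path and the numbers are examples, not a recommendation:</p>

<pre><code>; /etc/php/8.2/fpm/pool.d/www.conf (illustrative values)
pm = dynamic
; hard ceiling: beyond this, requests queue and nginx starts answering 502/504
pm.max_children = 12
pm.start_servers = 3
pm.min_spare_servers = 2
pm.max_spare_servers = 4
; recycle workers periodically to keep memory in check
pm.max_requests = 500
</code></pre>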

<h2 id="robots-txt-or-the-death-of-a-social-pact" id="robots-txt-or-the-death-of-a-social-pact">robots.txt, or the death of a social pact</h2>

<p>In June 1994 <strong>Martijn Koster</strong>, a Dutch sysadmin running the early web crawlers for <strong>ALIWEB</strong>, proposed a convention: a text file at the root of the site, <code>robots.txt</code>, in which the operator could declare which parts of their domain crawlers were kindly asked not to visit. No central authority would enforce it, no network protocol would verify it. <em>It was a gentleman&#39;s pact, full stop</em>. It worked because in the nineties crawlers were few, they were run by people who knew each other, and nobody had an economic interest strong enough to burn their reputation by ignoring a directive. For thirty years it held. Googlebot, Bingbot, Yandex, Common Crawl – all respected <code>robots.txt</code> as part of the basic etiquette of indexing. It was so established that the formal specification only arrived in 2022 (<strong>RFC 9309</strong>), decades after the daily practice. When the <strong>IETF</strong> standardised it, they did so to document a consolidated practice, not to create a new one.</p>
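
<p>For the record, this is the entire mechanism: a plain text file at the root of the site, asking nicely. The user-agent tokens below are the ones the AI vendors themselves document:</p>

<pre><code># https://example.org/robots.txt
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
</code></pre>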

<p>That pact, in the last three years, has been broken.</p>

<p><strong>Drew DeVault</strong>, founder of <strong>SourceHut</strong> – the niche git platform much loved by those who do not want to be on GitHub – published a post in March 2025 that became a manifesto, titled <em>Please stop externalising your costs directly into my face</em>. The piece describes, with technical coldness, the behaviour of LLM crawlers:</p>

<blockquote><p>they crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit of every repository, and they do this using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each making no more than one HTTP request over any window we tried to measure – actively and maliciously adapting and blending in with legitimate traffic to evade any attempt at characterisation or blocking</p></blockquote>

<p>It is the description of a distributed DDoS attack carried out by companies that present themselves as legitimate consumers of bandwidth. SourceHut had to unilaterally block entire cloud providers – Google Cloud, Microsoft Azure – because it was the only viable defence.</p>

<p>The <strong>Wikimedia Foundation</strong>, in April 2025, published data that complements this. Since January 2024, the bandwidth consumed by media downloads on <strong>Wikimedia Commons</strong> has grown by 50%. The increase does not come from new human readers: it comes from AI scrapers vacuuming up the entire catalogue of 144 million open-licence files. Wikimedia has quantified it: 65% of the most expensive traffic hitting the central datacentres is bot-generated, even though bots account for only 35% of total pageviews. <em>The bots read in bulk</em> – they request obscure pages that the regional cache does not have, forcing the infrastructure to fetch them from the centre. A human reader costs little; an AI crawler costs a lot, and the cost-to-benefit ratio for the body hosting the content has become unsustainable. The Foundation has set as a 2025/2026 annual goal: “reduce by 20% the traffic generated by scrapers”. <em>An organisation that hosts the largest free encyclopaedia in the world is forced to invest engineering in repelling those who want to read it</em>.</p>

<p>The <strong>KDE</strong> project&#39;s GitLab went down temporarily because of a crawler coming from <strong>Alibaba</strong> IP ranges. <strong>GNOME</strong>&#39;s GitLab installed <strong>Anubis</strong>, a proof-of-work challenge written by <strong>Xe Iaso</strong> – on arrival at the page, the browser has to solve a small computational problem before the content is shown. It costs a human nothing, and it costs dearly for a bot that has to solve millions of them a day. The numbers published by Bart Piotrowski, GNOME&#39;s sysadmin, after switching on Anubis: in two and a half hours, 81,000 total requests, of which only 3% made it through the proof-of-work. 97% were bots. Anubis&#39; default loading screen shows a girl in anime style – an explicitly provocative aesthetic choice: Iaso has said the point is to make the experience annoying for those who use these tools to extract.</p>
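
<p>The mechanism is simple enough to sketch in a few lines of shell. This is only the shape of the idea, not Anubis&#39; actual code:</p>

<pre><code># find a nonce whose SHA-256, combined with a server-issued challenge,
# starts with three zero hex digits: trivial once, expensive millions of times a day
challenge="server-issued-string"
nonce=0
until printf '%s%s' "$challenge" "$nonce" | sha256sum | grep -q '^000'; do
  nonce=$((nonce + 1))
done
echo "solved with nonce $nonce"
</code></pre>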

<p>Kevin Fenzi, who administers <strong>Fedora</strong>&#39;s infrastructure, has blocked traffic from entire countries. Drew DeVault, in the same post, writes:</p>

<blockquote><p>Every time I sit down for a beer with my friends and fellow sysadmins, it is not long before we start complaining about the bots and asking each other whether the other has found the definitive way to get rid of them. The desperation in these conversations is palpable</p></blockquote>

<p>It is the first-person chronicle of a technical community that has watched a thirty-year cooperative protocol break in thirty-six months.</p>

<p>Anthropic, OpenAI and the others publicly respond that they respect <code>robots.txt</code>. The sysadmins&#39; logs say otherwise. Cloudflare, in its December 2025 report, writes unambiguously that “crawling activity can be aggressive, often ignoring the directives found in robots.txt files”. The structural problem is simple: <em><code>robots.txt</code> never had an enforcement mechanism. It rested on reputation</em>. For those extracting data today to train AI models, the dataset is worth more than the reputation lost by ignoring it.</p>

<h2 id="what-it-means-for-those-who-publish" id="what-it-means-for-those-who-publish">What it means for those who publish</h2>

<p>For anyone running a small site – a blog, an online magazine, a collective&#39;s server, a personal homelab – the 51% (and more) figure translates into a daily operational reality that those who do not administer do not see. <em>A server receives, in proportion, the same kind of bot traffic as the New York Times</em>. Not the same volume, of course – but the same mix. GPTBot downloads <code>wp-content</code>, Censys maps the ports, some botnet tries credentials against three or four well-known WordPress endpoints. Even publishing three articles a month to a readership of two hundred people, you end up statistically anonymous, inside a scanning distribution that is uniform across all of IPv4.</p>

<p>This produces two effects.</p>

<p>The first is that <em>the technical barrier to publishing on one&#39;s own has grown</em>. In the 2000s it was enough to install WordPress on a shared host and forget about it. Today that model survives only if there is someone taking care of the maintenance – timely updates, well-curated plugins, robust passwords, offsite backups, monitoring. Without it, the site does not get attacked in a targeted way: it simply gets consumed by background pressure, like a cliff that erodes without any particular wave breaking on it.</p>

<p>The second is centralisation. The industry&#39;s response to the problem has been “managed everything”: Cloudflare in front of everything, managed WAFs, hosting with automatic protection built in, CDNs that absorb anomalous traffic. They work. But the price is that a large chunk of the web now passes through a single provider – <em>Cloudflare handles something like 20% of global HTTP requests</em> – and the small independent publisher who would like to remain small and independent has to choose between delegating their traffic to a commercial intermediary and standing alone, out in the wind.</p>

<p>On the defensive front there is a ferment of countermeasures – creative and desperate at the same time. Beyond Anubis, there are tar pits: <strong>Nepenthes</strong>, written by an anonymous developer who signs himself “Aaron”, responds to crawlers with infinite labyrinths of generated content – pages that link to other pages that link to others, all synthetic, all designed to consume the bot&#39;s resources without giving anything useful in return. Cloudflare has released a commercial equivalent, <strong>AI Labyrinth</strong>, which does the same thing serving irrelevant text to recognised crawlers. There is the community project <strong>ai.robots.txt</strong>, which maintains an up-to-date list of AI crawler user-agents and provides both a ready-made <code>robots.txt</code> and <code>.htaccess</code> rules to block them. <em>A small archipelago of individual countermeasures – effective in some cases, but also a symptom: the fight is site by site, sysadmin by sysadmin, because no higher level exists where the question can be resolved</em>.</p>
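
<p>On Apache, the .htaccess side of that project boils down to a handful of mod_rewrite lines. An abridged, illustrative version rather than their actual list:</p>

<pre><code># .htaccess, in the spirit of ai.robots.txt (abridged, illustrative)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Meta-ExternalAgent|Bytespider|PerplexityBot) [NC]
RewriteRule .* - [F,L]
</code></pre>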

<p>Self-hosting is still possible. I do it myself, many others do. But it requires time, competence, continuous attention. <em>It has become a niche</em>. What in the 1990s was the normal way of being online is today an exception that needs to be justified – and maintained by hand.</p>

<p>We publish for human readers. But the infrastructure is shaped by bots. The visible web – the one humans see, navigate, read – is the surface tip of an iceberg made mostly of traffic invisible to the eyes and visible in the logs. The real web – the one the bots see – is all of IPv4, scanned in search of usable surfaces.</p>

<h2 id="guests-on-our-own-web" id="guests-on-our-own-web">Guests on our own web</h2>

<p>When <strong>Tim Berners-Lee</strong> described the World Wide Web in the early 1990s, he spoke of a space for connecting people: documents, ideas, knowledge, communities. The cyberlibertarian narrative of the years that followed – <strong>Barlow</strong>&#39;s <em>Declaration of the Independence of Cyberspace</em> in 1996, the Californian dream of the internet as individual emancipation from the hierarchies of the twentieth century – amplified that promise until it became myth. Thirty years later, a single figure sums it up: in 2025, humans are 47% of internet traffic. <em>The majority is machines</em>. And 80% of the work of those machines is the extraction of value from pages that other humans have written, to be processed and sold as predictive, classificatory, generative capability.</p>

<p><strong>Lawrence Lessig</strong> saw it in 1999, in <em>Code and Other Laws of Cyberspace</em>. The thesis was simple: <em>code is law</em>. The technical architecture of a network is already political, because it determines what behaviours are possible. Changing the code – the protocols, the specifications, the design choices – means changing which practices are economically viable and which are not. TCP/IP does not speak about identity, and that is a political choice with thirty-year consequences. <code>robots.txt</code> was cooperative, and that is a political choice that has become a vulnerability. Those who have controlled the architecture – the <strong>ARPANET</strong> engineers first, the large infrastructure companies later – have already written the rules of the game, regardless of who won the elections or wrote the laws. Lessig has been repeating it for twenty-five years. It is happening now, on a global scale.</p>

<p><em>We are guests on our own web</em>. We have been for at least a decade, and for two years we have been statistically a minority. The rent we pay is in data extracted without our noticing, in attention consumed by content generated by those who have scraped ours, and in administration hours spent keeping in place infrastructure that is not designed for us. It is not a metaphor: it is an accounting that could be done line by line, if anyone felt like keeping it. The interesting question, then, is not <em>how we block the bots</em>: it is <em>what it means to publish and administer in an internet where the intended audience is no longer the majority of the recipients</em>. A question we should have asked ourselves a long time ago, and one that concerns not only technical operators, but anyone who considers the internet a common good – political, cultural, material.</p>

<hr/>

<h2 id="sources-and-further-reading" id="sources-and-further-reading">Sources and further reading</h2>

<p><strong>On bot traffic statistics and trends</strong></p>
<ul><li>Imperva (Thales) (2025). <em>2025 Bad Bot Report: The Rapid Rise of Bots and the Unseen Risk for Business</em>. Twelfth annual edition. The decade-long historical series, the headline 51% figure, composition by attack category, estimates on residential proxies and account takeovers. Thirteen trillion bot requests blocked in 2024. <a href="https://www.imperva.com/resources/resource-library/reports/bad-bot-report/">https://www.imperva.com/resources/resource-library/reports/bad-bot-report/</a></li>
<li>Imperva (Thales) (2026). <em>2026 Bad Bot Report: Bad Bots in the Agentic Age</em>. Updated figures for 2025: 53% bots, 47% humans. <a href="https://www.imperva.com/blog/">https://www.imperva.com/blog/</a></li>
<li>Cloudflare Radar (2025). <em>2025 Year in Review: The rise of AI, post-quantum, and record-breaking DDoS attacks</em>. AI crawler composition by purpose (training/search/user action), GPTBot/ClaudeBot/Meta-ExternalAgent share, crawl-to-refer ratio by platform. Independent confirmation of the Imperva data from a completely different network angle. <a href="https://radar.cloudflare.com/year-in-review/2025">https://radar.cloudflare.com/year-in-review/2025</a></li>
<li>Cloudflare Blog (2025). <em>From Googlebot to GPTBot: who&#39;s crawling your site in 2025</em>. <a href="https://blog.cloudflare.com/">https://blog.cloudflare.com/</a></li></ul>

<p><strong>On the breakdown of cooperative protocols</strong></p>
<ul><li>DeVault, D. (2025). <em>Please stop externalising your costs directly into my face</em>. SourceHut blog, March 2025. The manifesto, in first person, of a sysadmin who watches the cooperative <code>robots.txt</code> pact break. Essential reading to understand what it means to administer a FOSS service under pressure from LLM crawlers. <a href="https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html">https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html</a></li>
<li>Wikimedia Foundation (2025). <em>How crawlers impact the operations of the Wikimedia projects</em>. Diff blog, April 2025. The internal data: bots generate 65% of the most expensive traffic while accounting for 35% of pageviews. The most documented case of asymmetry between the costs borne by the body hosting free content and the benefits extracted by crawlers. <a href="https://diff.wikimedia.org/">https://diff.wikimedia.org/</a></li>
<li>Iaso, X. (2024–present). <em>Anubis (proof-of-work anti-AI-scraper)</em>. The concrete tool that GNOME, KDE and several other FOSS communities have adopted to defend public infrastructure from aggressive crawlers. Demonstrates that defence, today, is proof-of-work – that is, computational friction applied to those who want to read. <a href="https://anubis.techaro.lol/">https://anubis.techaro.lol/</a></li></ul>

<p><strong>On scanning infrastructure</strong></p>
<ul><li>Durumeric, Z., Adrian, D., Mirian, A., Bailey, M., Halderman, J. A. (2015). “A Search Engine Backed by Internet-Wide Scanning”. <em>Proceedings of the 22nd ACM Conference on Computer and Communications Security (CCS &#39;15)</em>. Founding paper of Censys. Describes how scanning IPv4 has become economically trivial. Essential technical reading to understand the discovery/defence asymmetry. <a href="https://zmap.io/">https://zmap.io/</a></li>
<li>Akamai (various years). <em>The Web Scraping Problem</em> and related Threat Intelligence reports. Economic model of the malicious bot as a parasitic <em>long-stay guest</em>, not a destroyer. Demolishes the common intuition that a site that falls over has been “attacked”: those who know how to attack well do not make anything fall over. <a href="https://www.akamai.com/blog/security">https://www.akamai.com/blog/security</a></li></ul>

<p><strong>On the political economy of digital infrastructure</strong></p>
<ul><li>Lessig, L. (1999, updated as <em>Code v2</em> in 2006). <em>Code and Other Laws of Cyberspace</em>. Basic Books. <em>Code is law</em>. The technical architecture of a network is already political because it defines what is possible. Twenty-five years later, the thesis is the single most useful conceptual tool for reading what is happening to <code>robots.txt</code>. <a href="http://codev2.cc/">http://codev2.cc/</a></li>
<li>Zuboff, S. (2019). <em>The Age of Surveillance Capitalism</em>. PublicAffairs. Framework of non-consensual extraction as the dominant economic model of Silicon Valley. To be read thinking that its thesis, written about behaviour, applies today one level deeper: to the textual raw material.</li>
<li>Crawford, K. (2021). <em>Atlas of AI</em>. Yale University Press. The materiality of AI as extractive asymmetry: mines, datacentres, underpaid human labour. I would add: your server.</li></ul>

<p><strong>Original protocol specifications</strong></p>
<ul><li>Postel, J. (ed.) (1981). <em>Internet Protocol</em>. RFC 791. The original IP specification, fourteen pages that never talk about identity. <a href="https://datatracker.ietf.org/doc/html/rfc791">https://datatracker.ietf.org/doc/html/rfc791</a></li>
<li>Koster, M., Illyes, G., Zeller, H., Sassman, L. (2022). <em>Robots Exclusion Protocol</em>. RFC 9309. The formal specification of <code>robots.txt</code>, arriving thirty years after the practice it codifies and already obsolete in practice. Worth rereading every so often to remember that today&#39;s internet is a palimpsest of hacks on top of a protocol conceived for a world that no longer exists. <a href="https://datatracker.ietf.org/doc/html/rfc9309">https://datatracker.ietf.org/doc/html/rfc9309</a></li></ul>

<p><a href="https://remark.as/p/jolek78/guests-on-our-own-web">Discuss...</a></p>

<p><a href="https://jolek78.writeas.com/tag:Bots" class="hashtag"><span>#</span><span class="p-category">Bots</span></a> <a href="https://jolek78.writeas.com/tag:AICrawlers" class="hashtag"><span>#</span><span class="p-category">AICrawlers</span></a> <a href="https://jolek78.writeas.com/tag:robotsTxt" class="hashtag"><span>#</span><span class="p-category">robotsTxt</span></a> <a href="https://jolek78.writeas.com/tag:DigitalSovereignty" class="hashtag"><span>#</span><span class="p-category">DigitalSovereignty</span></a> <a href="https://jolek78.writeas.com/tag:SelfHosting" class="hashtag"><span>#</span><span class="p-category">SelfHosting</span></a> <a href="https://jolek78.writeas.com/tag:Cloudflare" class="hashtag"><span>#</span><span class="p-category">Cloudflare</span></a> <a href="https://jolek78.writeas.com/tag:SurveillanceCapitalism" class="hashtag"><span>#</span><span class="p-category">SurveillanceCapitalism</span></a> <a href="https://jolek78.writeas.com/tag:FOSS" class="hashtag"><span>#</span><span class="p-category">FOSS</span></a> <a href="https://jolek78.writeas.com/tag:Internet" class="hashtag"><span>#</span><span class="p-category">Internet</span></a> <a href="https://jolek78.writeas.com/tag:SolarPunk" class="hashtag"><span>#</span><span class="p-category">SolarPunk</span></a> <a href="https://jolek78.writeas.com/tag:Writing" class="hashtag"><span>#</span><span class="p-category">Writing</span></a></p>

<div class="center">
· 🦣 <a href="https://fosstodon.org/@jolek78">Mastodon</a> · 📸 <a href="https://pixelfed.social/jolek78">Pixelfed</a> ·  📬 <a href="mailto:jolek78@jolek78.dev">Email</a> ·
· ☕ <a href="https://liberapay.com/jolek78">Support this work on Liberapay</a>
</div>
]]></content:encoded>
      <guid>https://jolek78.writeas.com/guests-on-our-own-web</guid>
      <pubDate>Sat, 16 May 2026 07:58:18 +0000</pubDate>
    </item>
  </channel>
</rss>