<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>ai &#8212; jolek78&#39;s blog</title>
    <link>https://jolek78.writeas.com/tag:ai</link>
    <description>thoughts from a friendly human being</description>
    <pubDate>Sun, 19 Apr 2026 22:30:25 +0000</pubDate>
    <image>
      <url>https://i.snap.as/DEj7yFm4.png</url>
      <title>ai &#8212; jolek78&#39;s blog</title>
      <link>https://jolek78.writeas.com/tag:ai</link>
    </image>
    <item>
      <title>Anna&#39;s Archive: Robin Hood of knowledge or</title>
      <link>https://jolek78.writeas.com/annas-archive-robin-hood-of-knowledge-or?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[3:00 AM. Another one of those nights where my brain decided sleep was overrated. After my usual nocturnal walk through the streets of a remote Scottish town—where even a fox observed me with that &#34;humans are weird&#34; look—I sat back down at my server. Just a quick scan of my RSS feeds, I told myself, then I can start work. When...&#xA;&#xA;  We backed up Spotify (metadata and music files). It&#39;s distributed in bulk torrents (~300TB), grouped by popularity.&#xA;  This release includes the largest publicly available music metadata database with 256 million tracks and 186 million unique ISRCs.&#xA;  It&#39;s the world&#39;s first &#34;preservation archive&#34; for music which is fully open (meaning it can easily be mirrored by anyone with enough disk space), with 86 million music files, representing around 99.6% of listens.&#xA;&#xA;The news came from Anna&#39;s Archive—the world&#39;s largest pirate library—which had just scraped Spotify&#39;s entire catalog. Not just metadata, but also the audio files. 86 million tracks, 300 terabytes. I stopped to reread those numbers, then thought: holy shit, how big is this thing? &#xA;&#xA;And so, while the rest of the world slept, I started digging. This is one of those stories that needs to be told—a story weaving together hacker idealism, technology, billions of dollars in AI training data, and an ethical paradox few want to truly confront.&#xA;&#xA;When Z-Library fell&#xA;November 3, 2022. The FBI seized Z-Library&#39;s domains, one of the world&#39;s largest pirate libraries. Two alleged operators were arrested in Argentina. The community panicked—Z-Library served millions of students, researchers, and readers. And suddenly, everything vanished.&#xA;&#xA;But someone was prepared. A group called PiLiMi (Pirate Library Mirror) had created complete backups of all shadow libraries for years. LibGen, Z-Library, Sci-Hub. Everything. 
When Z-Library fell, these backups were ready. But there was a problem: petabytes of unusable data with no way to search them.&#xA;&#xA;Enter Anna Archivist—a pseudonym, probably a collective—who understood something fundamental: preserving data is useless if it&#39;s not accessible. Days after Z-Library&#39;s seizure, Anna&#39;s Archive was online with a meta-search engine aggregating all shadow library catalogs, making them searchable and—crucially—virtually impossible to censor.&#xA;&#xA;The numbers&#xA;December 2025:&#xA;&#xA;61.3 million books (PDF, EPUB, MOBI, DjVu)&#xA;95.5 million academic papers&#xA;256 million music tracks (Spotify metadata)&#xA;86 million audio files (~300TB)&#xA;Total: ~1.1 Petabyte in unified torrents&#xA;&#xA;To put this in perspective: the sum of all academic knowledge produced by humanity, plus a gigantic slice of world literary production, plus now music. All indexed, searchable, downloadable. Free. And virtually impossible to shut down.&#xA;&#xA;Why it can&#39;t be killed&#xA;Remember Napster? Centralized servers, one lawsuit, shut down in a day. BitTorrent learned from that—decentralized everything. But Anna&#39;s Archive goes further, combining layers of resilience that make it practically immortal:&#xA;&#xA;Distributed Frontend: Multiple domain mirrors (.li, .se, .org, .gs), Tor hidden service, Progressive Web App that works offline. Block one, others continue.&#xA;&#xA;Distributed Database: Elasticsearch + PostgreSQL + public API. Anyone can download the entire database and host their own instance. No central server to attack.&#xA;&#xA;Distributed Files: This is the genius part. Anna&#39;s Archive hosts almost nothing directly. 
Instead:&#xA;&#xA;IPFS (InterPlanetary File System): Files identified by cryptographic hash, served by volunteer nodes worldwide&#xA;BitTorrent: Classic torrents with multiple trackers, self-sustaining swarms&#xA;HTTP Gateways: For normal users who just want to click-and-download, links redirect to public IPFS gateways&#xA;&#xA;Result: user downloads via normal HTTP, but content comes from a decentralized network. Can&#39;t shut down IPFS. Can&#39;t stop BitTorrent. Can block gateways, but hundreds exist and anyone can create new ones.&#xA;&#xA;OpSec: Domains registered via privacy-focused Icelandic registrar, bulletproof hosting in non-cooperative jurisdictions, Bitcoin payments, PGP-encrypted communications, zero personal information.&#xA;&#xA;The only way to stop Anna&#39;s Archive would be to shut down the internet. Or convince every single seeder to stop. Good luck.&#xA;&#xA;81.7 terabytes free for meta&#xA;And here&#39;s where it gets disturbing.&#xA;&#xA;February 2025. Documents from Kadrey v. Meta are unsealed—a class action by authors against Meta for using their pirated books to train Llama AI models. Internal emails reveal a shocking timeline:&#xA;&#xA;October 2022 - Melanie Kambadur, Senior Research Manager:&#xA;&#xA;  I don&#39;t think we should use pirated material. I really need to draw a line there.&#xA;&#xA;Eleonora Presani, Meta employee:&#xA;&#xA;  Using pirated material should be beyond our ethical threshold. SciHub, ResearchGate, LibGen are basically like PirateBay... they&#39;re distributing content that is protected by copyright and they&#39;re infringing it.&#xA;&#xA;January 2023 - Meeting with Mark Zuckerberg present:&#xA;&#xA;  [Zuckerberg] wants to move this stuff forward, and we need to find a way to unblock all this.&#xA;&#xA;April 2023 - Nikolay Bashlykov, Meta engineer:&#xA;&#xA;  Using Meta IP addresses to load through torrents pirate content... 
torrenting from a corporate laptop doesn&#39;t feel right.&#xA;&#xA;2023-2024: The Operation&#xA;&#xA;Meta downloaded:&#xA;&#xA;81.7 TB via Anna&#39;s Archive torrents (35.7 TB from Z-Library alone)&#xA;80.6 TB from LibGen&#xA;Total: ~162 TB of pirated books&#xA;&#xA;Method: BitTorrent client on separate infrastructure, VPN to obscure origin, active seeding to other peers. Result: 197,000 copyrighted books integrated into Llama training data.&#xA;&#xA;June 2025: the ruling&#xA;Judge Vince Chhabria (Northern District California) applied the four-factor fair use test. The decision is legally fascinating and ethically disturbing.&#xA;&#xA;Factor 1 - Transformative Use: Meta wins decisively. The judge ruled AI training is &#34;spectacularly transformative&#34;—fundamentally different from human reading. The purpose isn&#39;t to express the content but to learn statistical relationships between words.&#xA;&#xA;Factor 2 - Nature of Work: Neutral. Creative fiction gets more copyright protection than factual works, but this didn&#39;t tip the scales either way.&#xA;&#xA;Factor 3 - Amount Used: Meta wins. Even though they used entire books, the judge found this necessary for training. You can&#39;t cherry-pick sentences and expect an AI to learn language patterns.&#xA;&#xA;Factor 4 - Market Effect: This is where the judge&#39;s discomfort shows through:&#xA;&#xA;  Generative AI has the potential to flood the market with endless amounts of images, songs, articles, books... So by training generative AI models with copyrighted works, companies are creating something that often will dramatically undermine the market for those works, and thus dramatically undermine the incentive for human beings to create things the old-fashioned way.&#xA;&#xA;He sees the problem clearly. AI trained on copyrighted works will compete with and potentially destroy the market for those very works. 
But the plaintiffs couldn&#39;t prove specific economic harm with hard data.&#xA;&#xA;The final ruling: &#34;Given the state of the record, the Court has no choice but to grant summary judgment.&#34; Meta wins on these specific facts. But the judge adds a critical caveat: &#34;In most cases, training LLMs on copyrighted works without permission is likely infringing and not fair use.&#34;&#xA;&#xA;Meta didn&#39;t win because what they did was legitimate. They won because the authors&#39; lawyers didn&#39;t build a strong enough evidentiary case. It&#39;s a technical legal victory that sidesteps the ethical question entirely.&#xA;&#xA;The precedent this sets is chilling: AI companies can pirate with relative impunity if they have good lawyers and plaintiffs can&#39;t prove specific damages.&#xA;&#xA;The math&#xA;Scenario A (legal):&#xA;&#xA;Meta negotiates licenses with publishers&#xA;Cost: $50-100 million (conservative estimate)&#xA;Authors receive royalties&#xA;&#xA;Scenario B (what they did):&#xA;&#xA;Download 81.7 TB for free&#xA;Legal defense: ~$5 million&#xA;Win in court&#xA;Authors receive: $0&#xA;&#xA;Meta&#39;s savings: $45-95 million&#xA;&#xA;And now every AI company knows: download from Anna&#39;s Archive, risk a lawsuit with weak evidence, save tens of millions.&#xA;&#xA;Anna&#39;s Archive also revealed they provide &#34;SFTP bulk access to approximately 30 companies&#34;—primarily Chinese LLM startups and data brokers—who contribute money or data. DeepSeek publicly admitted using Anna&#39;s Archive data for training. No consequences in Chinese jurisdiction.&#xA;&#xA;Aaron Swartz and the question that haunts this story&#xA;There&#39;s a ghost here. His name is Aaron Swartz, and his story illuminates everything wrong with how we treat information access.&#xA;&#xA;2011: Aaron, 24, brilliant programmer, Reddit co-founder, and information freedom activist, connected to MIT&#39;s network and downloaded 4.8 million academic papers from JSTOR. 
His intent was to make publicly-funded research freely available. He wasn&#39;t enriching himself. He was acting on principle.&#xA;&#xA;The response was swift and brutal. Federal prosecutors threw the book at him: 13 felony charges, maximum penalty of 50 years in prison and $1 million in fines. For downloading academic papers. The prosecution was led by U.S. Attorney Carmen Ortiz, who called it &#34;stealing is stealing, whether you use a computer command or a crowbar.&#34;&#xA;&#xA;The pressure was immense. Aaron faced financial ruin, decades in prison, complete destruction of his life. In January 2013, at age 26, he hanged himself. His family and partner blamed the aggressive prosecution. The internet mourned a brilliant mind and passionate advocate crushed by prosecutorial overreach.&#xA;&#xA;Now consider the parallel:&#xA;&#xA;Aaron Swartz: 4.8 million papers → federal persecution, suicide at 26&#xA;&#xA;Meta: 162 TB (~162 million papers) → wins in court, saves $95 million&#xA;&#xA;Aaron was an individual acting on idealistic principles about information freedom. Meta is a trillion-dollar corporation acting on profit motives. Aaron faced the full weight of federal prosecution. Meta faced a civil lawsuit they successfully defended with their massive legal team.&#xA;&#xA;The system punishes idealism and rewards profit. The disparity isn&#39;t just unjust—it reveals something fundamental about who gets to break rules and who doesn&#39;t.&#xA;&#xA;The paradox no one wants to see&#xA;Anna&#39;s Archive claims to fight publishing monopolies and inequality in access to knowledge. But the reality:&#xA;&#xA;Who benefits most?&#xA;&#xA;Meta: 81.7 TB free, $95M saved&#xA;~30 AI companies: privileged access&#xA;Corporations with $100M+ compute budgets&#xA;&#xA;Resources needed to benefit:&#xA;&#xA;Storage/Bandwidth: trivial for Meta ($1000s)&#xA;Computing for training: MASSIVE ($10-100M)&#xA;Legal defense: MASSIVE ($millions)&#xA;&#xA;Only big tech can afford this. 
The result:&#xA;&#xA;Data: socialized (Anna&#39;s Archive, shared risk)&#xA;Profits: privatized (proprietary LLMs, paid APIs)&#xA;Costs: externalized (authors not compensated)&#xA;&#xA;But what about students in the Global South?&#xA;&#xA;This is where the story gets complicated, because the benefits are real and they matter immensely.&#xA;&#xA;Consider a medical student in India. Her family earns about $400/month. A single medical textbook costs $300-500. She needs fifteen of them. The math is impossible. Her options: don&#39;t graduate, or Anna&#39;s Archive. She chose the latter and completed her degree. She&#39;s now a practicing physician.&#xA;&#xA;Or take a PhD researcher in South Africa studying climate change impacts. The critical papers for his dissertation are behind Elsevier&#39;s paywall at $35 each. He needs twenty papers minimum—$700 his university can&#39;t afford. Without Sci-Hub (accessible through Anna&#39;s Archive), his dissertation would have been impossible. He completed it, published findings that inform local climate policy.&#xA;&#xA;An art history teacher in Argentina wanted to enrich her curriculum with Renaissance art analysis. The books she needed weren&#39;t available in local libraries. Importing them? Prohibitive between shipping costs and customs. Anna&#39;s Archive gave her access to rare texts that transformed her teaching.&#xA;&#xA;The data backs this up: literature review times for researchers in developing countries reduced 60-80%. Citation patterns show researchers in Nigeria, Bangladesh, Ecuador now cite contemporary research at parity with Harvard and Oxford. Publications from developing countries have increased. Methodological quality has improved. International collaborations have expanded.&#xA;&#xA;This matters. This changes lives. 
This is not hypothetical.&#xA;&#xA;The problem is: both things are simultaneously true.&#xA;&#xA;Anna&#39;s Archive saves academic careers in the Global South&#xA;Anna&#39;s Archive allows Meta to save $95 million&#xA;&#xA;But Meta downloaded more data in one week than all Indian students download in a year. How do we square that?&#xA;&#xA;The broken system that created this monster&#xA;To understand why Anna&#39;s Archive exists and why it&#39;s grown so explosively, you need to understand how fundamentally broken academic publishing has become.&#xA;&#xA;Here&#39;s the perverse cycle:&#xA;&#xA;Researcher writes paper (unpaid)&#xA;Other researchers peer review it (unpaid)&#xA;Publisher publishes it&#xA;Researcher&#39;s own university must pay to read it&#xA;Publisher profits: Elsevier and Wiley report 35-40% profit margins&#xA;&#xA;Today, over 70% of academic papers sit behind paywalls. Access costs $35-50 per paper for individuals, or $10,000-100,000+ per year for institutional subscriptions. Universities in developing countries simply cannot afford these subscriptions. Neither can most universities in developed countries—Harvard famously called journal subscription costs &#34;fiscally unsustainable&#34; in 2012.&#xA;&#xA;The system extracts free labor from researchers, locks up publicly-funded research behind paywalls, charges exorbitant fees to access it, and funnels enormous profits to publishers who add relatively little value. Academic institutions create the knowledge, do the quality control, and then pay again to access their own work.&#xA;&#xA;Sci-Hub and Anna&#39;s Archive didn&#39;t emerge from nowhere. They&#39;re responses to a genuinely broken system. 
The question is whether they&#39;re the right response—and who ultimately benefits most from that response.&#xA;&#xA;The architecture determines the ethics&#xA;Anna&#39;s Archive can&#39;t discriminate because:&#xA;&#xA;Open source philosophy: everyone or no one&#xA;Technical impossibility: how do you block Meta but not students?&#xA;Legal strategy: claiming &#34;non-hosting&#34; makes usage control impossible&#xA;&#xA;IPFS and BitTorrent are magnificent tools for resisting censorship. But resistance to censorship also means resistance to ethical control. You can&#39;t have one without the other.&#xA;&#xA;The system is structurally designed to be unkillable. Which also means it&#39;s structurally designed to serve whoever has the resources to benefit most.&#xA;&#xA;Where does it end?&#xA;December 2025: Anna&#39;s Archive announced they&#39;d scraped Spotify. The same preservation narrative, the same pattern. 256 million tracks, 86 million audio files, 300TB available to anyone with the infrastructure to use it.&#xA;&#xA;&#34;This Spotify scrape is our humble attempt to start such a &#39;preservation archive&#39; for music,&#34; they wrote. The justification mirrors the books argument: Spotify loses licenses, music disappears; platform risk if Spotify fails; regional blocks prevent access; long tail poorly preserved.&#xA;&#xA;All true. But who downloads 300TB of music? Not the kid in Malawi who just wants to listen to his favorite artist. ByteDance, training the next AI music generator. Startups building Spotify competitors. The same companies with compute budgets in the tens of millions.&#xA;&#xA;Anna&#39;s Archive is pivoting from text to multimedia, and each escalation follows a predictable pattern:&#xA;&#xA;Books → Justified by paywalls and academic access&#xA;Papers → Justified by broken academic publishing&#xA;Music → Justified by platform risk and preservation&#xA;Video? 
→ What&#39;s the justification for the next step?&#xA;&#xA;With each escalation:&#xA;&#xA;The value for big tech increases exponentially&#xA;The proportion of benefit for individual students decreases&#xA;Mass piracy becomes normalized as &#34;preservation&#34;&#xA;The ethical questions get harder to answer&#xA;&#xA;And the international precedent is already being set. Japan&#39;s AI Minister (January 2025) stated explicitly: &#34;AI companies in Japan can use whatever they want for AI training... whether it is content obtained from illegal sites or otherwise.&#34;&#xA;&#xA;The message from governments: pirate freely if it serves AI supremacy. We&#39;re in a race to the bottom where copyright becomes meaningless for AI training, and the companies with the most resources benefit most.&#xA;&#xA;Conclusions: I don&#39;t know which way to turn&#xA;I started from that sleepless night, 256 million songs in an RSS feed, and ended up here with more questions than answers.&#xA;&#xA;Anna&#39;s Archive is a technological marvel—IPFS, BitTorrent, distributed databases creating something genuinely uncensorable. It&#39;s also a lifeline for millions of students and researchers locked out of knowledge by an exploitative publishing system. And simultaneously, it&#39;s the largest intellectual property expropriation operation in history, saving corporations hundreds of millions while creators receive nothing.&#xA;&#xA;All of these things are true at once. This isn&#39;t a simple story with heroes and villains.&#xA;&#xA;The academic publishing system is genuinely broken. Researchers create knowledge for free, review it for free, then their institutions must pay exorbitant fees to access it while publishers extract 35-40% profit margins. This system deserves to be disrupted.&#xA;&#xA;But Anna&#39;s Archive isn&#39;t disrupting it equitably. 
The architecture that makes it uncensorable also makes it impossible to distinguish between a student in Lagos accessing a textbook and Meta downloading 162TB for AI training. You can&#39;t have selective resistance to censorship—it&#39;s all or nothing.&#xA;&#xA;Aaron Swartz died fighting for information freedom with idealistic principles. Meta achieves the same result with corporate profit motives and walks away victorious. The system rewards power and punishes principle.&#xA;&#xA;Can this be fixed? Copyright reform moves at the speed of politics—years, decades. Compulsory licensing for AI training? Just beginning to be discussed. Open access mandates? Facing massive publisher resistance. Meanwhile, Anna&#39;s Archive operates at the speed of software, and data flows freely to those with $100M compute clusters.&#xA;&#xA;The question isn&#39;t whether Anna&#39;s Archive will be stopped—it won&#39;t be, that&#39;s the point of the architecture. The question is what world we&#39;re building where the same technology that liberates a medical student in India also bankrolls Meta&#39;s AI ambitions, and we can&#39;t separate one from the other.&#xA;&#xA;I don&#39;t have answers. I have a functioning IPFS node, a Tor relay, and the uncomfortable knowledge that every byte I help distribute might be saving a researcher&#39;s career or training someone&#39;s proprietary AI model. Probably both.&#xA;&#xA;Free for everyone. The problem is that &#34;everyone&#34; has very different resources to benefit from that freedom.&#xA;&#xA;Now, if you&#39;ll excuse me, I&#39;m going to check how much bandwidth my nodes are using. And reflect on whether participation is complicity or resistance. Maybe it&#39;s both. 
Maybe that&#39;s the point.&#xA;&#xA;&lt;a href=&#34;https://remark.as/p/jolek78/annas-archive-robin-hood-of-knowledge-or&#34;&gt;Discuss...&lt;/a&gt;&#xA;&#xA;#AnnaArchive #AI #Copyright #AaronSwartz #Meta #AcademicPublishing #IPFS #InformationFreedom&#xA;&#xA;&lt;div class=&#34;center&#34;&gt;&#xA;· 🦣 &lt;a href=&#34;https://fosstodon.org/@jolek78&#34;&gt;Mastodon&lt;/a&gt; · 📸 &lt;a href=&#34;https://pixelfed.social/jolek78&#34;&gt;Pixelfed&lt;/a&gt; · 📬 &lt;a href=&#34;mailto:jolek78@jolek78.dev&#34;&gt;Email&lt;/a&gt; ·&#xA;· ☕ &lt;a href=&#34;https://liberapay.com/jolek78&#34;&gt;Support this work on Liberapay&lt;/a&gt;&#xA;&lt;/div&gt;]]&gt;</description>
      <content:encoded><![CDATA[<p>3:00 AM. Another one of those nights where my brain decided sleep was overrated. After my usual nocturnal walk through the streets of a remote Scottish town—where even a fox observed me with that “humans are weird” look—I sat back down at my server. Just a quick scan of my RSS feeds, I told myself, then I can start work. When...</p>

<blockquote><p>We backed up Spotify (metadata and music files). It&#39;s distributed in bulk torrents (~300TB), grouped by popularity.
This release includes the largest publicly available music metadata database with 256 million tracks and 186 million unique ISRCs.
It&#39;s the world&#39;s first “preservation archive” for music which is fully open (meaning it can easily be mirrored by anyone with enough disk space), with 86 million music files, representing around 99.6% of listens.</p></blockquote>

<p>The news came from <a href="https://annas-archive.li/">Anna&#39;s Archive</a>—the world&#39;s largest pirate library—which had just scraped Spotify&#39;s entire catalog. Not just metadata, but also the audio files. 86 million tracks, 300 terabytes. I stopped to reread those numbers, then thought: holy shit, how big is this thing?</p>



<p>And so, while the rest of the world slept, I started digging. This is one of those stories that needs to be told—a story weaving together hacker idealism, technology, billions of dollars in AI training data, and an ethical paradox few want to truly confront.</p>

<h3 id="when-z-library-fell">When Z-Library fell</h3>

<p>November 3, 2022. The FBI seized Z-Library&#39;s domains, one of the world&#39;s largest pirate libraries. Two alleged operators were arrested in Argentina. The community panicked—Z-Library served millions of students, researchers, and readers. And suddenly, everything vanished.</p>

<p>But someone was prepared. A group called PiLiMi (Pirate Library Mirror) had created complete backups of all shadow libraries for years. LibGen, Z-Library, Sci-Hub. Everything. When Z-Library fell, these backups were ready. But there was a problem: petabytes of unusable data with no way to search them.</p>

<p>Enter Anna Archivist—a pseudonym, probably a collective—who understood something fundamental: preserving data is useless if it&#39;s not accessible. Days after Z-Library&#39;s seizure, Anna&#39;s Archive was online with a meta-search engine aggregating all shadow library catalogs, making them searchable and—crucially—virtually impossible to censor.</p>

<h3 id="the-numbers">The numbers</h3>

<p>December 2025:</p>
<ul><li>61.3 million books (PDF, EPUB, MOBI, DjVu)</li>
<li>95.5 million academic papers</li>
<li>256 million music tracks (Spotify metadata)</li>
<li>86 million audio files (~300TB)</li>
<li>Total: ~1.1 Petabyte in unified torrents</li></ul>

<p>To put this in perspective: the sum of all academic knowledge produced by humanity, plus a gigantic slice of world literary production, plus now music. All indexed, searchable, downloadable. Free. And virtually impossible to shut down.</p>

<h3 id="why-it-can-t-be-killed">Why it can&#39;t be killed</h3>

<p>Remember Napster? Centralized servers, one lawsuit, shut down in a day. BitTorrent learned from that—decentralized everything. But Anna&#39;s Archive goes further, combining layers of resilience that make it practically immortal:</p>

<p><strong>Distributed Frontend:</strong> Multiple domain mirrors (.li, .se, .org, .gs), Tor hidden service, Progressive Web App that works offline. Block one, others continue.</p>

<p><strong>Distributed Database:</strong> Elasticsearch + PostgreSQL + public API. Anyone can download the entire database and host their own instance. No central server to attack.</p>

<p><strong>Distributed Files:</strong> This is the genius part. Anna&#39;s Archive hosts almost nothing directly. Instead:</p>
<ul><li>IPFS (InterPlanetary File System): Files identified by cryptographic hash, served by volunteer nodes worldwide</li>
<li>BitTorrent: Classic torrents with multiple trackers, self-sustaining swarms</li>
<li>HTTP Gateways: For normal users who just want to click-and-download, links redirect to public IPFS gateways</li></ul>

<p>Result: user downloads via normal HTTP, but content comes from a decentralized network. Can&#39;t shut down IPFS. Can&#39;t stop BitTorrent. Can block gateways, but hundreds exist and anyone can create new ones.</p>
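<p>A minimal sketch of why this works: with content addressing, the identifier is derived from the file's bytes, so any gateway can serve it and blocking one changes nothing. The hash here is plain SHA-256 for illustration (real IPFS CIDs use multihash/multibase encoding), and the two gateway hostnames are just well-known public examples:</p>

```python
import hashlib

# Content addressing in a nutshell: the ID is computed from the bytes
# themselves, not assigned by a server. Any node holding the same bytes
# serves the same ID. (Simplified: real IPFS CIDs wrap the digest in a
# multihash; plain SHA-256 hex is used here for illustration.)
def content_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Two well-known public gateways; if one is blocked, the identical path
# works on any other, because the ID names the content, not a location.
GATEWAYS = ["https://ipfs.io", "https://dweb.link"]

def gateway_urls(cid: str) -> list[str]:
    return [f"{gw}/ipfs/{cid}" for gw in GATEWAYS]

book = b"some public-domain text"
cid = content_id(book)
urls = gateway_urls(cid)
```

<p>The same bytes always produce the same identifier, which is what makes mirrors interchangeable and censorship a game of whack-a-mole.</p>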

<p><strong>OpSec:</strong> Domains registered via privacy-focused Icelandic registrar, bulletproof hosting in non-cooperative jurisdictions, Bitcoin payments, PGP-encrypted communications, zero personal information.</p>

<p>The only way to stop Anna&#39;s Archive would be to shut down the internet. Or convince every single seeder to stop. Good luck.</p>

<h3 id="81-7-terabytes-free-for-meta">81.7 terabytes free for Meta</h3>

<p>And here&#39;s where it gets disturbing.</p>

<p>February 2025. Documents from <em>Kadrey v. Meta</em> are unsealed—a class action by authors against Meta for using their pirated books to train Llama AI models. Internal emails reveal a shocking timeline:</p>

<p><strong>October 2022</strong> – Melanie Kambadur, Senior Research Manager:</p>

<blockquote><p>I don&#39;t think we should use pirated material. I really need to draw a line there.</p></blockquote>

<p>Eleonora Presani, Meta employee:</p>

<blockquote><p>Using pirated material should be beyond our ethical threshold. SciHub, ResearchGate, LibGen are basically like PirateBay... they&#39;re distributing content that is protected by copyright and they&#39;re infringing it.</p></blockquote>

<p><strong>January 2023</strong> – Meeting with Mark Zuckerberg present:</p>

<blockquote><p>[Zuckerberg] wants to move this stuff forward, and we need to find a way to unblock all this.</p></blockquote>

<p><strong>April 2023</strong> – Nikolay Bashlykov, Meta engineer:</p>

<blockquote><p>Using Meta IP addresses to load through torrents pirate content... torrenting from a corporate laptop doesn&#39;t feel right.</p></blockquote>

<p><strong>2023-2024: The Operation</strong></p>

<p>Meta downloaded:</p>
<ul><li>81.7 TB via Anna&#39;s Archive torrents (35.7 TB from Z-Library alone)</li>
<li>80.6 TB from LibGen</li>
<li>Total: ~162 TB of pirated books</li></ul>

<p>Method: BitTorrent client on separate infrastructure, VPN to obscure origin, active seeding to other peers. Result: 197,000 copyrighted books integrated into Llama training data.</p>

<h3 id="june-2025-the-ruling">June 2025: the ruling</h3>

<p>Judge Vince Chhabria (Northern District of California) applied the four-factor fair use test. The decision is legally fascinating and ethically disturbing.</p>

<p><strong>Factor 1 – Transformative Use:</strong> Meta wins decisively. The judge ruled AI training is “spectacularly transformative”—fundamentally different from human reading. The purpose isn&#39;t to express the content but to learn statistical relationships between words.</p>

<p><strong>Factor 2 – Nature of Work:</strong> Neutral. Creative fiction gets more copyright protection than factual works, but this didn&#39;t tip the scales either way.</p>

<p><strong>Factor 3 – Amount Used:</strong> Meta wins. Even though they used entire books, the judge found this necessary for training. You can&#39;t cherry-pick sentences and expect an AI to learn language patterns.</p>

<p><strong>Factor 4 – Market Effect:</strong> This is where the judge&#39;s discomfort shows through:</p>

<blockquote><p>Generative AI has the potential to flood the market with endless amounts of images, songs, articles, books... So by training generative AI models with copyrighted works, companies are creating something that often will dramatically undermine the market for those works, and thus dramatically undermine the incentive for human beings to create things the old-fashioned way.</p></blockquote>

<p>He sees the problem clearly. AI trained on copyrighted works will compete with and potentially destroy the market for those very works. But the plaintiffs couldn&#39;t prove specific economic harm with hard data.</p>

<p>The final ruling: “Given the state of the record, the Court has no choice but to grant summary judgment.” Meta wins on these specific facts. But the judge adds a critical caveat: “In most cases, training LLMs on copyrighted works without permission is likely infringing and not fair use.”</p>

<p>Meta didn&#39;t win because what they did was legitimate. They won because the authors&#39; lawyers didn&#39;t build a strong enough evidentiary case. It&#39;s a technical legal victory that sidesteps the ethical question entirely.</p>

<p>The precedent this sets is chilling: AI companies can pirate with relative impunity if they have good lawyers and plaintiffs can&#39;t prove specific damages.</p>

<h3 id="the-math">The math</h3>

<p><strong>Scenario A (legal):</strong></p>
<ul><li>Meta negotiates licenses with publishers</li>
<li>Cost: $50-100 million (conservative estimate)</li>
<li>Authors receive royalties</li></ul>

<p><strong>Scenario B (what they did):</strong></p>
<ul><li>Download 81.7 TB for free</li>
<li>Legal defense: ~$5 million</li>
<li>Win in court</li>
<li>Authors receive: $0</li></ul>

<p><strong>Meta&#39;s savings: $45-95 million</strong></p>
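<p>The savings range is simple subtraction over the estimates above (both figures are the article's rough estimates, not numbers from the ruling):</p>

```python
# Scenario A: estimated cost of negotiated licenses (conservative range)
license_cost_low, license_cost_high = 50_000_000, 100_000_000
# Scenario B: approximate legal-defense spend
legal_defense = 5_000_000

savings = (license_cost_low - legal_defense,
           license_cost_high - legal_defense)
# → (45_000_000, 95_000_000), i.e. the $45-95 million range above
```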

<p>And now every AI company knows: download from Anna&#39;s Archive, risk a lawsuit with weak evidence, save tens of millions.</p>

<p>Anna&#39;s Archive also revealed they provide “SFTP bulk access to approximately 30 companies”—primarily Chinese LLM startups and data brokers—who contribute money or data. DeepSeek publicly admitted using Anna&#39;s Archive data for training. No consequences in Chinese jurisdiction.</p>

<h3 id="aaron-swartz-and-the-question-that-haunts-this-story">Aaron Swartz and the question that haunts this story</h3>

<p>There&#39;s a ghost here. His name is Aaron Swartz, and his story illuminates everything wrong with how we treat information access.</p>

<p>2011: Aaron, 24, brilliant programmer, Reddit co-founder, and information freedom activist, connected to MIT&#39;s network and downloaded 4.8 million academic papers from JSTOR. His intent was to make publicly-funded research freely available. He wasn&#39;t enriching himself. He was acting on principle.</p>

<p>The response was swift and brutal. Federal prosecutors threw the book at him: 13 felony charges, carrying a maximum penalty of 50 years in prison and $1 million in fines. For downloading academic papers. The prosecution was led by U.S. Attorney Carmen Ortiz, who declared that “stealing is stealing, whether you use a computer command or a crowbar.”</p>

<p>The pressure was immense. Aaron faced financial ruin, decades in prison, complete destruction of his life. In January 2013, at age 26, he hanged himself. His family and partner blamed the aggressive prosecution. The internet mourned a brilliant mind and passionate advocate crushed by prosecutorial overreach.</p>

<p>Now consider the parallel:</p>

<p><strong>Aaron Swartz: 4.8 million papers → federal persecution, suicide at 26</strong></p>

<p><strong>Meta: 162 TB (~162 million papers) → wins in court, saves $95 million</strong></p>

<p>Aaron was an individual acting on idealistic principles about information freedom. Meta is a trillion-dollar corporation acting on profit motives. Aaron faced the full weight of federal prosecution. Meta faced a civil lawsuit they successfully defended with their massive legal team.</p>

<p>The system punishes idealism and rewards profit. The disparity isn&#39;t just unjust—it reveals something fundamental about who gets to break rules and who doesn&#39;t.</p>

<h3 id="the-paradox-no-one-wants-to-see">The paradox no one wants to see</h3>

<p>Anna&#39;s Archive claims to fight publishing monopolies and inequality in access to knowledge. But the reality:</p>

<p><strong>Who benefits most?</strong></p>
<ul><li>Meta: 81.7 TB free, $95M saved</li>
<li>~30 AI companies: privileged access</li>
<li>Corporations with $100M+ compute budgets</li></ul>

<p><strong>Resources needed to benefit:</strong></p>
<ul><li>Storage/Bandwidth: trivial for Meta ($1000s)</li>
<li>Computing for training: MASSIVE ($10-100M)</li>
<li>Legal defense: MASSIVE ($millions)</li></ul>

<p>Only big tech can afford this. The result:</p>
<ul><li>Data: socialized (Anna&#39;s Archive, shared risk)</li>
<li>Profits: privatized (proprietary LLMs, paid APIs)</li>
<li>Costs: externalized (authors not compensated)</li></ul>

<p><strong>But what about students in the Global South?</strong></p>

<p>This is where the story gets complicated, because the benefits are real and they matter immensely.</p>

<p>Consider a medical student in India. Her family earns about $400/month. A single medical textbook costs $300-500. She needs fifteen of them. The math is impossible. Her options: don&#39;t graduate, or Anna&#39;s Archive. She chose the latter and completed her degree. She&#39;s now a practicing physician.</p>

<p>Or take a PhD researcher in South Africa studying climate change impacts. The critical papers for his dissertation are behind Elsevier&#39;s paywall at $35 each. He needs twenty papers minimum—$700 his university can&#39;t afford. Without Sci-Hub (accessible through Anna&#39;s Archive), his dissertation would have been impossible. He completed it, published findings that inform local climate policy.</p>

<p>An art history teacher in Argentina wanted to enrich her curriculum with Renaissance art analysis. The books she needed weren&#39;t available in local libraries. Importing them? Prohibitive between shipping costs and customs. Anna&#39;s Archive gave her access to rare texts that transformed her teaching.</p>

<p>The data backs this up: literature review times for researchers in developing countries have dropped by 60-80%. Citation patterns show that researchers in Nigeria, Bangladesh, and Ecuador now cite contemporary research at parity with Harvard and Oxford. Publications from developing countries have increased, methodological quality has improved, and international collaborations have expanded.</p>

<p>This matters. This changes lives. This is not hypothetical.</p>

<p>The problem is: <em>both things are simultaneously true.</em></p>
<ol><li>Anna&#39;s Archive saves academic careers in the Global South</li>
<li>Anna&#39;s Archive allows Meta to save $95 million</li></ol>

<p>But Meta downloaded more data in one week than all Indian students download in a year. How do we square that?</p>

<h3 id="the-broken-system-that-created-this-monster">The broken system that created this monster</h3>

<p>To understand why Anna&#39;s Archive exists and why it&#39;s grown so explosively, you need to understand how fundamentally broken academic publishing has become.</p>

<p>Here&#39;s the perverse cycle:</p>
<ol><li>Researcher writes paper (unpaid)</li>
<li>Other researchers peer review it (unpaid)</li>
<li>Publisher publishes it</li>
<li>Researcher&#39;s own university must pay to read it</li>
<li>Publisher profits: Elsevier and Wiley report 35-40% profit margins</li></ol>

<p>Today, over 70% of academic papers sit behind paywalls. Access costs $35-50 per paper for individuals, or $10,000-100,000+ per year for institutional subscriptions. Universities in developing countries simply cannot afford these subscriptions. Neither can most universities in developed countries—Harvard famously called journal subscription costs “fiscally unsustainable” in 2012.</p>

<p>The system extracts free labor from researchers, locks up publicly-funded research behind paywalls, charges exorbitant fees to access it, and funnels enormous profits to publishers who add relatively little value. Academic institutions create the knowledge, do the quality control, and then pay again to access their own work.</p>

<p>Sci-Hub and Anna&#39;s Archive didn&#39;t emerge from nowhere. They&#39;re responses to a genuinely broken system. The question is whether they&#39;re the right response—and who ultimately benefits most from that response.</p>

<h3 id="the-architecture-determines-the-ethics">The architecture determines the ethics</h3>

<p>Anna&#39;s Archive can&#39;t discriminate because:</p>
<ol><li>Open source philosophy: everyone or no one</li>
<li>Technical impossibility: how do you block Meta but not students?</li>
<li>Legal strategy: claiming “non-hosting” makes usage control impossible</li></ol>

<p>IPFS and BitTorrent are magnificent tools for resisting censorship. But resistance to censorship also means resistance to ethical control. You can&#39;t have one without the other.</p>

<p>The system is structurally designed to be unkillable. Which also means it&#39;s structurally designed to serve whoever has the resources to benefit most.</p>

<h3 id="where-does-it-end">Where does it end?</h3>

<p>December 2025: Anna&#39;s Archive announced they&#39;d scraped Spotify. The same preservation narrative, the same pattern. 256 million tracks, 86 million audio files, 300TB available to anyone with the infrastructure to use it.</p>

<p>“This Spotify scrape is our humble attempt to start such a &#39;preservation archive&#39; for music,” they wrote. The justification mirrors the books argument: Spotify loses licenses, music disappears; platform risk if Spotify fails; regional blocks prevent access; long tail poorly preserved.</p>

<p>All true. But who downloads 300TB of music? Not the kid in Malawi who just wants to listen to his favorite artist. ByteDance, training the next AI music generator. Startups building Spotify competitors. The same companies with compute budgets in the tens of millions.</p>

<p>Anna&#39;s Archive is pivoting from text to multimedia, and each escalation follows a predictable pattern:</p>
<ul><li><strong>Books</strong> → Justified by paywalls and academic access</li>
<li><strong>Papers</strong> → Justified by broken academic publishing</li>
<li><strong>Music</strong> → Justified by platform risk and preservation</li>
<li><strong>Video?</strong> → What&#39;s the justification for the next step?</li></ul>

<p>With each escalation:</p>
<ul><li>The value for big tech increases exponentially</li>
<li>The proportion of benefit for individual students decreases</li>
<li>Mass piracy becomes normalized as “preservation”</li>
<li>The ethical questions get harder to answer</li></ul>

<p>And the international precedent is already being set. Japan&#39;s AI Minister (January 2025) stated explicitly: “AI companies in Japan can use whatever they want for AI training... whether it is content obtained from illegal sites or otherwise.”</p>

<p>The message from governments: pirate freely if it serves AI supremacy. We&#39;re in a race to the bottom where copyright becomes meaningless for AI training, and the companies with the most resources benefit most.</p>

<h3 id="conclusions-i-don-t-know-which-way-to-turn">Conclusions: I don&#39;t know which way to turn</h3>

<p>I started from that sleepless night, 256 million songs in an RSS feed, and ended up here with more questions than answers.</p>

<p>Anna&#39;s Archive is a technological marvel—IPFS, BitTorrent, distributed databases creating something genuinely uncensorable. It&#39;s also a lifeline for millions of students and researchers locked out of knowledge by an exploitative publishing system. And simultaneously, it&#39;s the largest intellectual property expropriation operation in history, saving corporations hundreds of millions while creators receive nothing.</p>

<p>All of these things are true at once. This isn&#39;t a simple story with heroes and villains.</p>

<p>The academic publishing system is genuinely broken. Researchers create knowledge for free, review it for free, then their institutions must pay exorbitant fees to access it while publishers extract 35-40% profit margins. This system deserves to be disrupted.</p>

<p>But Anna&#39;s Archive isn&#39;t disrupting it equitably. The architecture that makes it uncensorable also makes it impossible to distinguish between a student in Lagos accessing a textbook and Meta downloading 162TB for AI training. You can&#39;t have selective resistance to censorship—it&#39;s all or nothing.</p>

<p>Aaron Swartz died fighting for information freedom with idealistic principles. Meta achieves the same result with corporate profit motives and walks away victorious. The system rewards power and punishes principle.</p>

<p>Can this be fixed? Copyright reform moves at the speed of politics—years, decades. Compulsory licensing for AI training? Just beginning to be discussed. Open access mandates? Facing massive publisher resistance. Meanwhile, Anna&#39;s Archive operates at the speed of software, and data flows freely to those with $100M compute clusters.</p>

<p>The question isn&#39;t whether Anna&#39;s Archive will be stopped—it won&#39;t be, that&#39;s the point of the architecture. The question is what world we&#39;re building where the same technology that liberates a medical student in India also bankrolls Meta&#39;s AI ambitions, and we can&#39;t separate one from the other.</p>

<p>I don&#39;t have answers. I have a functioning IPFS node, a Tor relay, and the uncomfortable knowledge that every byte I help distribute might be saving a researcher&#39;s career or training someone&#39;s proprietary AI model. Probably both.</p>

<p>Free for everyone. The problem is that “everyone” has very different resources to benefit from that freedom.</p>

<p>Now, if you&#39;ll excuse me, I&#39;m going to check how much bandwidth my nodes are using. And reflect on whether participation is complicity or resistance. Maybe it&#39;s both. Maybe that&#39;s the point.</p>

<p><a href="https://remark.as/p/jolek78/annas-archive-robin-hood-of-knowledge-or">Discuss...</a></p>

<p><a href="https://jolek78.writeas.com/tag:AnnaArchive" class="hashtag"><span>#</span><span class="p-category">AnnaArchive</span></a> <a href="https://jolek78.writeas.com/tag:AI" class="hashtag"><span>#</span><span class="p-category">AI</span></a> <a href="https://jolek78.writeas.com/tag:Copyright" class="hashtag"><span>#</span><span class="p-category">Copyright</span></a> <a href="https://jolek78.writeas.com/tag:AaronSwartz" class="hashtag"><span>#</span><span class="p-category">AaronSwartz</span></a> <a href="https://jolek78.writeas.com/tag:Meta" class="hashtag"><span>#</span><span class="p-category">Meta</span></a> <a href="https://jolek78.writeas.com/tag:AcademicPublishing" class="hashtag"><span>#</span><span class="p-category">AcademicPublishing</span></a> <a href="https://jolek78.writeas.com/tag:IPFS" class="hashtag"><span>#</span><span class="p-category">IPFS</span></a> <a href="https://jolek78.writeas.com/tag:InformationFreedom" class="hashtag"><span>#</span><span class="p-category">InformationFreedom</span></a></p>

<div class="center">
· 🦣 <a href="https://fosstodon.org/@jolek78">Mastodon</a> · 📸 <a href="https://pixelfed.social/jolek78">Pixelfed</a> ·  📬 <a href="mailto:jolek78@jolek78.dev">Email</a> ·
· ☕ <a href="https://liberapay.com/jolek78">Support this work on Liberapay</a>
</div>
]]></content:encoded>
      <guid>https://jolek78.writeas.com/annas-archive-robin-hood-of-knowledge-or</guid>
      <pubDate>Mon, 29 Dec 2025 14:24:51 +0000</pubDate>
    </item>
    <item>
      <title>ChatGPT didn&#39;t invent anything.</title>
      <link>https://jolek78.writeas.com/chatgpt-didnt-invent-anything?pk_campaign=rss-feed</link>
      <description>&lt;![CDATA[When the world woke up astonished in November 2022 to this &#34;magical&#34; chatbot, few realized that this magic was the result of decades of research. The history of artificial intelligence begins in 1943, when Warren McCulloch and Walter Pitts proposed the first mathematical model of an artificial neuron. In 1956, at the Dartmouth Conference, John McCarthy coined the term &#34;Artificial Intelligence&#34; and the discipline was officially born.&#xA;&#xA;The &#39;60s and &#39;70s were characterized by excessive optimism: people thought strong AI was just around the corner. Two &#34;AI winters&#34; followed – periods when funding disappeared and research slowed – because promises weren&#39;t materializing. But some continued working in the shadows. Geoffrey Hinton, Yann LeCun, Yoshua Bengio – those we now call the &#34;godfathers of deep learning&#34; – continued their studies on neural networks when no one believed in them anymore.&#xA;&#xA;!--more--&#xA;&#xA;The real breakthrough came with three converging factors: computational power (GPUs), enormous amounts of data, and better algorithms. In 2012, AlexNet won the ImageNet Challenge by an overwhelming margin, demonstrating that deep learning really worked. From there, an unstoppable acceleration.&#xA;&#xA;Once upon a time in the Carboniferous...&#xA;Before ChatGPT exploded, my only knowledge of AI came from science fiction books. Philip K. Dick and his reflections on what it means to be human. Cyberpunk in general, with its technological dystopias. Gibson&#39;s Sprawl trilogy, where AIs live in cyberspace like digital deities. Those pages were my only window to a future that seemed incredibly distant.&#xA;&#xA;When I hosted the podcast Caccia al Fotone (a nice thing, but now belonging to the Carboniferous period...), I delved deeper into the subject. I read several papers published on arXiv and dedicated two episodes to AI development. 
In 2019, during the pandemic period, I devoured &#34;Artificial Intelligence: A Guide for Thinking Humans&#34; by Melanie Mitchell – a book that also helped me write a &#34;thing&#34; (those who know, know; those who don&#39;t, never mind...) on the evolution of computer systems and surveillance capitalism.&#xA;&#xA;I thought I had a clear picture. I thought I was prepared.&#xA;&#xA;Mea culpa&#xA;Then ChatGPT arrived.&#xA;&#xA;November 2022. First approach: total amazement. I couldn&#39;t believe my eyes. I kept asking questions, and despite all the initial hallucinations I encountered, I continued to have that &#34;wow effect&#34; typical of a child finding the most beautiful shell on the seashore (forgive me Newton for stealing that phrase, but it&#39;s always too beautiful).&#xA;&#xA;And here&#39;s my mea culpa: I set aside all my protective filters that I generally have regarding privacy, open source, control over my data. I let myself go for hours of conversations on the most diverse topics. Until one night – one of many sleepless nights – I found myself discussing with that LLM about depression, various mental disorders, and how one or more abuses can influence a person&#39;s life.&#xA;&#xA;When I realized what was happening, I stopped abruptly. I deleted the conversation, canceled my OpenAI subscription and didn&#39;t touch any LLM for more than a month. I was entrusting my most intimate thoughts to a proprietary system controlled by a corporation. I was betraying every principle I believed in.&#xA;&#xA;But I work in IT. This is a huge revolution. I couldn&#39;t afford to fall behind, nor could I simply reject it on principle. I had to find an alternative. I began to study seriously.&#xA;&#xA;Local, always local&#xA;I encountered the first models I could test locally. I discovered Hugging Face, and it was like finding an oasis in the desert. I began studying transformers, the datasets developed by the community. 
And I was astounded.&#xA;&#xA;Transformers are the architecture that revolutionized AI. Presented in the 2017 paper &#34;Attention Is All You Need&#34;, they replaced old recurrent neural networks (RNNs) with a more elegant and efficient mechanism: the attention mechanism.&#xA;&#xA;In simple words: instead of processing text word by word in sequence, a transformer looks at all words simultaneously and calculates which ones are most relevant to the context. When you read &#34;The bank of the river was green,&#34; the attention mechanism understands that &#34;bank&#34; refers to the river and not the financial institution, because it evaluates the weight of each word relative to the others.&#xA;&#xA;This architecture made models like BERT, GPT, and all modern LLMs possible. It&#39;s scalable, parallelizable, and extremely powerful.&#xA;&#xA;Hugging Face and the Open Source revolution&#xA;Hugging Face is much more than a platform: it has become the Library of Alexandria of the artificial intelligence era. Founded in 2016, it now hosts over 500,000 pre-trained models, 250,000 datasets, and thousands of demo applications.&#xA;&#xA;Their transformers library has democratized access to AI. With a few lines of Python you can download and use models that would cost millions of dollars to train from scratch. Hugging Face isn&#39;t the only platform doing this – there are also Ollama, LM Studio, GPT4All – but it&#39;s certainly the most extensive and collaborative.&#xA;&#xA;Here, praise must be given to the developers: this community of people scattered around the world is doing extraordinary work. They release open source models, share knowledge, meticulously document everything. They&#39;re building a real alternative to Big Tech&#39;s monopoly on AI.&#xA;&#xA;History repeating&#xA;Watching this explosion of open models, global collaboration, shared code, I had a powerful déjà-vu. 
This is incredibly similar to the open source revolution that happened 30 years ago.&#xA;&#xA;In the &#39;90s, Linux and the free software movement challenged Microsoft&#39;s dominance and proprietary systems. Many said it was impossible, that free software would never work. Today Linux powers 96% of the world&#39;s servers, all Android smartphones, and much of the Internet infrastructure.&#xA;&#xA;Now the same thing is happening with AI. Llama, Mistral, Falcon, Mixtral – &#34;open weight/open source&#34; models that compete with (and often surpass) their proprietary counterparts. History repeats itself, and this time I know which side to be on.&#xA;&#xA;Another server in my homeLab&#xA;I resumed studying Python, a study I had left on standby years ago. I began experimenting with training local LLM models. I added old scripts to provide my writing style (yes, it seems incredible but every coder has their own style, and it says a lot about their personality). I used Llama 3 to improve my Bash coding.&#xA;&#xA;And when I was ready, I decided to make an important purchase: I bought a small server – to add to my homelab: Proxmox, pfSense, Nextcloud, WireGuard etc... – that I would transform into an OpenWebUI system.&#xA;&#xA;OpenWebUI is a self-hosted web interface for local language models. Like ChatGPT, but running entirely on local hardware, without sending a single byte to someone else&#39;s servers.&#xA;&#xA;For the nerds reading: the simplest way to install is obviously through Docker. Here&#39;s a basic example:&#xA;&#xA;docker run -d -p 3000:8080 \&#xA;  -v open-webui:/app/backend/data \&#xA;  --name open-webui \&#xA;  --restart always \&#xA;  ghcr.io/open-webui/open-webui:main&#xA;&#xA;Once installed, just connect OpenWebUI to Ollama (the runtime for local models), download your preferred models, and you&#39;re operational.&#xA;&#xA;GPU usage is fundamental: a medium-sized LLM requires a lot of RAM and computing power. 
A dedicated GPU (like an NVIDIA GTX of various types) makes an enormous difference. For those using AMD, there&#39;s ROCm. With 16GB of RAM and an 8GB GPU, you can comfortably run 7B parameter models quantized to 4-bit.&#xA;&#xA;My favorite combo? AMD, Debian, Docker, OpenWebUI, Ollama and Mistral.&#xA;&#xA;A revolution. and a choice to make&#xA;We&#39;re facing a revolution that we cannot avoid. AI is here, it&#39;s powerful, and it&#39;s evolving rapidly. There are two roads ahead of us.&#xA;&#xA;The first: avoid it now, close our eyes, hope it passes or that someone else deals with it. And then, in twenty years, find ourselves chasing an evolved AI, probably impossible to understand, completely in the hands of those who controlled it from the beginning. This is the path of least resistance, but also of maximum risk. It means ceding control, understanding, and ultimately power to whoever gets there first.&#xA;&#xA;The second: study it, analyze it, use it and understand it today to be able to handle it better tomorrow. Actively participate in its evolution. Contribute to the open source community, ensure that this technology remains accessible, understandable, in the hands of many instead of a few. This path requires effort, time, sometimes admitting we were wrong (as I did). But it&#39;s the only path that leads to actual agency over our technological future.&#xA;&#xA;The choice seems obvious when stated this way, but it&#39;s not easy in practice. It requires overcoming fear, investing time, challenging our assumptions. It means getting our hands dirty with code, running models locally, understanding how these systems actually work instead of treating them as black boxes.&#xA;&#xA;I made my choice that night when I deleted my ChatGPT conversation history. I chose not to be a passive consumer of AI technology controlled by corporations. 
I chose to understand, to build, to contribute to the alternative that&#39;s being constructed by thousands of developers around the world.&#xA;&#xA;The technology is already here. The question is: will it be controlled by a few companies optimizing for profit and control, or will it be a tool accessible to everyone, understandable, modifiable, improvable by the community?&#xA;&#xA;As I&#39;ve learned on this journey, choosing to understand – even when it&#39;s difficult, even when it means admitting you were wrong – is always better than passively submitting.&#xA;&#xA;AI is not magic. It&#39;s mathematics, code, hardware, and above all: it&#39;s made by people. And if it&#39;s made by people, it can be understood, modified and shaped by people. For the better, not for the worse.&#xA;&#xA;The revolution is happening. The only question is: are you participating, or are you watching?&#xA;&#xA;#AI #OpenSource #LocalLLM #Privacy #ChatGPT #HuggingFace #Ollama #SelfHosted #MachineLearning #DigitalSovereignty]]&gt;</description>
      <content:encoded><![CDATA[<p>When the world woke up astonished in November 2022 to this “magical” chatbot, few realized that this magic was the result of decades of research. The history of artificial intelligence begins in 1943, when Warren McCulloch and Walter Pitts proposed the first mathematical model of an artificial neuron. In 1956, at the Dartmouth Conference, John McCarthy coined the term “Artificial Intelligence” and the discipline was officially born.</p>

<p>The &#39;60s and &#39;70s were characterized by excessive optimism: people thought strong AI was just around the corner. Two “AI winters” followed – periods when funding disappeared and research slowed – because promises weren&#39;t materializing. But some continued working in the shadows. Geoffrey Hinton, Yann LeCun, Yoshua Bengio – those we now call the “godfathers of deep learning” – continued their studies on neural networks when no one believed in them anymore.</p>



<p>The real breakthrough came with three converging factors: computational power (GPUs), enormous amounts of data, and better algorithms. In 2012, AlexNet won the ImageNet Challenge by an overwhelming margin, demonstrating that deep learning really worked. From there, an unstoppable acceleration.</p>

<h3 id="once-upon-a-time-in-the-carboniferous">Once upon a time in the Carboniferous...</h3>

<p>Before ChatGPT exploded, my only knowledge of AI came from science fiction books. Philip K. Dick and his reflections on what it means to be human. Cyberpunk in general, with its technological dystopias. Gibson&#39;s Sprawl trilogy, where AIs live in cyberspace like digital deities. Those pages were my only window to a future that seemed incredibly distant.</p>

<p>When I hosted the podcast Caccia al Fotone (a nice thing, but now belonging to the Carboniferous period...), I delved deeper into the subject. I read several papers published on arXiv and dedicated two episodes to AI development. During the pandemic, I devoured “Artificial Intelligence: A Guide for Thinking Humans” by Melanie Mitchell – a book that also helped me write a “thing” (those who know, know; those who don&#39;t, never mind...) on the evolution of computer systems and surveillance capitalism.</p>

<p>I thought I had a clear picture. I thought I was prepared.</p>

<h3 id="mea-culpa">Mea culpa</h3>

<p>Then ChatGPT arrived.</p>

<p>November 2022. First approach: total amazement. I couldn&#39;t believe my eyes. I kept asking questions, and despite all the initial hallucinations I encountered, I continued to have that “wow effect” typical of a child finding the most beautiful shell on the seashore (forgive me Newton for stealing that phrase, but it&#39;s always too beautiful).</p>

<p>And here&#39;s my mea culpa: I set aside all the protective filters I normally keep up regarding privacy, open source, and control over my data. I let myself go for hours of conversations on the most diverse topics. Until one night – one of many sleepless nights – I found myself discussing with that LLM about depression, various mental disorders, and how one or more abuses can influence a person&#39;s life.</p>

<p>When I realized what was happening, I stopped abruptly. I deleted the conversation, canceled my OpenAI subscription and didn&#39;t touch any LLM for more than a month. I was entrusting my most intimate thoughts to a proprietary system controlled by a corporation. I was betraying every principle I believed in.</p>

<p>But I work in IT. This is a huge revolution. I couldn&#39;t afford to fall behind, nor could I simply reject it on principle. I had to find an alternative. I began to study seriously.</p>

<h3 id="local-always-local">Local, always local</h3>

<p>I encountered the first models I could test locally. I discovered <a href="https://huggingface.co">Hugging Face</a>, and it was like finding an oasis in the desert. I began studying transformers, the datasets developed by the community. And I was astounded.</p>

<p><strong>Transformers</strong> are the architecture that revolutionized AI. Presented in the 2017 paper <a href="https://arxiv.org/abs/1706.03762">“Attention Is All You Need”</a>, they replaced old recurrent neural networks (RNNs) with a more elegant and efficient mechanism: the attention mechanism.</p>

<p>In simple terms: instead of processing text word by word in sequence, a transformer looks at all the words at once and computes how relevant each one is to every other. When you read “The bank of the river was green,” the attention mechanism understands that “bank” refers to the river and not the financial institution, because it evaluates the weight of each word relative to the others.</p>
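<p>That weighting step can be sketched in a few lines of NumPy. This is a toy illustration only: random vectors stand in for word embeddings, so the numbers are meaningless and only the mechanics matter.</p>

<pre><code>import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # similarity of every word to every other
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Four "words", each an 8-dimensional embedding (random, for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, w = attention(x, x, x)              # self-attention: Q = K = V
</code></pre>

<p>Each row of <code>w</code> is a probability distribution over the four words: literally the “weight of each word relative to the others.”</p>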

<p>This architecture made models like BERT, GPT, and all modern LLMs possible. It&#39;s scalable, parallelizable, and extremely powerful.</p>

<h3 id="hugging-face-and-the-open-source-revolution">Hugging Face and the Open Source revolution</h3>

<p><a href="https://huggingface.co">Hugging Face</a> is much more than a platform: it has become the Library of Alexandria of the artificial intelligence era. Founded in 2016, it now hosts over 500,000 pre-trained models, 250,000 datasets, and thousands of demo applications.</p>

<p>Their <a href="https://github.com/huggingface/transformers">transformers library</a> has democratized access to AI. With a few lines of Python you can download and use models that would cost millions of dollars to train from scratch. Hugging Face isn&#39;t the only platform doing this – there are also <a href="https://ollama.com">Ollama</a>, <a href="https://lmstudio.ai">LM Studio</a>, <a href="https://gpt4all.io">GPT4All</a> – but it&#39;s certainly the most extensive and collaborative.</p>
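<p>To give an idea of what “a few lines of Python” means in practice, here&#39;s a minimal sketch. The model name <code>distilgpt2</code> is just one small example from the Hub, and the first call downloads the weights, so a network connection is needed once.</p>

<pre><code>from transformers import pipeline

def generate(prompt):
    """Build a local text-generation pipeline and complete a prompt."""
    generator = pipeline("text-generation", model="distilgpt2")
    out = generator(prompt, max_new_tokens=20)
    return out[0]["generated_text"]

# Example (downloads the model from the Hub on first run):
# print(generate("Open source AI is"))
</code></pre>

<p>After the first download, the model is cached locally and everything runs on your own machine.</p>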

<p>Here, praise must be given to the developers: this community of people scattered around the world is doing extraordinary work. They release open source models, share knowledge, meticulously document everything. They&#39;re building a real alternative to Big Tech&#39;s monopoly on AI.</p>

<h3 id="history-repeating">History repeating</h3>

<p>Watching this explosion of open models, global collaboration, shared code, I had a powerful déjà-vu. This is incredibly similar to the open source revolution that happened 30 years ago.</p>

<p>In the &#39;90s, Linux and the free software movement challenged Microsoft&#39;s dominance and proprietary systems. Many said it was impossible, that free software would never work. Today Linux powers roughly 96% of the top million web servers, every Android smartphone, and much of the Internet&#39;s infrastructure.</p>

<p>Now the same thing is happening with AI. Llama, Mistral, Falcon, Mixtral – “open weight/open source” models that compete with (and often surpass) their proprietary counterparts. History repeats itself, and this time I know which side to be on.</p>

<h3 id="another-server-in-my-homelab">Another server in my homelab</h3>

<p>I resumed studying Python, a study I had left on standby years ago. I began experimenting with training local LLMs. I fed in my old scripts to capture my writing style (yes, it seems incredible, but every coder has a style of their own, and it says a lot about their personality). I used Llama 3 to improve my Bash coding.</p>

<p>And when I was ready, I made an important purchase: a small server to add to my homelab (Proxmox, pfSense, Nextcloud, WireGuard, and so on) that I would transform into an <a href="https://openwebui.com">OpenWebUI</a> system.</p>

<p>OpenWebUI is a self-hosted web interface for local language models. Like ChatGPT, but running entirely on local hardware, without sending a single byte to someone else&#39;s servers.</p>

<p>For the nerds reading: the simplest way to install it is, of course, through Docker. Here&#39;s a basic example:</p>

<pre><code># Run OpenWebUI as a background container:
#   -p 3000:8080                     the web UI becomes reachable on http://localhost:3000
#   -v open-webui:/app/backend/data  persist chats and settings in a named volume
#   --restart always                 bring the container back up after reboots or crashes
docker run -d -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
</code></pre>

<p>Once installed, just connect OpenWebUI to <a href="https://ollama.com">Ollama</a> (the runtime for local models), download your preferred models, and you&#39;re operational.</p>
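<p>Under the hood, OpenWebUI talks to Ollama over a small HTTP API, which listens on port 11434 by default. As a sketch, the same endpoint can be scripted directly from Python; <code>mistral</code> here is just an example of a model already pulled with <code>ollama pull mistral</code>:</p>

```python
import json
import urllib.request

def ask_ollama(prompt, model="mistral", host="http://localhost:11434"):
    """Send a one-shot prompt to a local Ollama server and return its reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        host + "/api/generate",
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (requires a running Ollama server with the model pulled):
#   print(ask_ollama("Describe yourself in one sentence."))
```

<p>Not a single byte leaves the machine: the prompt, the model, and the answer all live on localhost.</p>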

<p>GPU usage is fundamental: even a medium-sized LLM needs plenty of RAM and computing power. A dedicated GPU (an NVIDIA GeForce card, for instance) makes an enormous difference; for those on AMD, there&#39;s ROCm. With 16GB of system RAM and an 8GB GPU, you can comfortably run 7B-parameter models quantized to 4 bits.</p>
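<p>The back-of-the-envelope arithmetic behind that sizing is simple: at 4-bit quantization each parameter costs half a byte, so the weights of a 7B model fit in roughly 3.5 GB, leaving headroom on an 8GB card for the KV cache and activations. A quick sketch:</p>

```python
def weight_size_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate size of a model's weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

# A 7B-parameter model quantized to 4 bits: ~3.5 GB of weights
print(round(weight_size_gb(7e9, 4), 2))   # 3.5

# The same model at 16-bit precision needs ~14 GB; too big for an 8GB card
print(round(weight_size_gb(7e9, 16), 2))  # 14.0
```

<p>Real memory use is a bit higher than the weights alone, but this estimate tells you at a glance which models your card can host.</p>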

<p>My favorite combo? AMD, Debian, Docker, OpenWebUI, Ollama and Mistral.</p>

<h3 id="a-revolution-and-a-choice-to-make">A revolution, and a choice to make</h3>

<p>We&#39;re facing a revolution that we cannot avoid. AI is here, it&#39;s powerful, and it&#39;s evolving rapidly. There are two roads ahead of us.</p>

<p><strong>The first:</strong> avoid it now, close our eyes, hope it passes or that someone else deals with it. And then, in twenty years, find ourselves chasing an evolved AI, probably impossible to understand, completely in the hands of those who controlled it from the beginning. This is the path of least resistance, but also of maximum risk. It means ceding control, understanding, and ultimately power to whoever gets there first.</p>

<p><strong>The second:</strong> study it, analyze it, use it and understand it today to be able to handle it better tomorrow. Actively participate in its evolution. Contribute to the open source community, ensure that this technology remains accessible, understandable, in the hands of many instead of a few. This path requires effort, time, sometimes admitting we were wrong (as I did). But it&#39;s the only path that leads to actual agency over our technological future.</p>

<p>The choice seems obvious when stated this way, but it&#39;s not easy in practice. It requires overcoming fear, investing time, challenging our assumptions. It means getting our hands dirty with code, running models locally, understanding how these systems actually work instead of treating them as black boxes.</p>

<p>I made my choice that night when I deleted my ChatGPT conversation history. I chose not to be a passive consumer of AI technology controlled by corporations. I chose to understand, to build, to contribute to the alternative that&#39;s being constructed by thousands of developers around the world.</p>

<p>The technology is already here. The question is: will it be controlled by a few companies optimizing for profit and control, or will it be a tool accessible to everyone, understandable, modifiable, improvable by the community?</p>

<p>As I&#39;ve learned on this journey, choosing to understand – even when it&#39;s difficult, even when it means admitting you were wrong – is always better than passively submitting.</p>

<p>AI is not magic. It&#39;s mathematics, code, hardware, and above all: it&#39;s made by people. And if it&#39;s made by people, it can be understood, modified and shaped by people. For the better, not for the worse.</p>

<p>The revolution is happening. The only question is: are you participating, or are you watching?</p>

<p><a href="https://jolek78.writeas.com/tag:AI" class="hashtag"><span>#</span><span class="p-category">AI</span></a> <a href="https://jolek78.writeas.com/tag:OpenSource" class="hashtag"><span>#</span><span class="p-category">OpenSource</span></a> <a href="https://jolek78.writeas.com/tag:LocalLLM" class="hashtag"><span>#</span><span class="p-category">LocalLLM</span></a> <a href="https://jolek78.writeas.com/tag:Privacy" class="hashtag"><span>#</span><span class="p-category">Privacy</span></a> <a href="https://jolek78.writeas.com/tag:ChatGPT" class="hashtag"><span>#</span><span class="p-category">ChatGPT</span></a> <a href="https://jolek78.writeas.com/tag:HuggingFace" class="hashtag"><span>#</span><span class="p-category">HuggingFace</span></a> <a href="https://jolek78.writeas.com/tag:Ollama" class="hashtag"><span>#</span><span class="p-category">Ollama</span></a> <a href="https://jolek78.writeas.com/tag:SelfHosted" class="hashtag"><span>#</span><span class="p-category">SelfHosted</span></a> <a href="https://jolek78.writeas.com/tag:MachineLearning" class="hashtag"><span>#</span><span class="p-category">MachineLearning</span></a> <a href="https://jolek78.writeas.com/tag:DigitalSovereignty" class="hashtag"><span>#</span><span class="p-category">DigitalSovereignty</span></a></p>

<p><a href="https://remark.as/p/jolek78/chatgpt-didnt-invent-anything">Discuss...</a></p>

<div class="center">
· 🦣 <a href="https://fosstodon.org/@jolek78">Mastodon</a> · 📸 <a href="https://pixelfed.social/jolek78">Pixelfed</a> ·  📬 <a href="mailto:jolek78@jolek78.dev">Email</a> ·
· ☕ <a href="https://liberapay.com/jolek78">Support this work on Liberapay</a>
</div>
]]></content:encoded>
      <guid>https://jolek78.writeas.com/chatgpt-didnt-invent-anything</guid>
      <pubDate>Tue, 28 Oct 2025 12:56:35 +0000</pubDate>
    </item>
  </channel>
</rss>