Libraries have been trying to collect humanity’s knowledge almost since the invention of writing. In the digital age, it might actually be possible to create a comprehensive collection of all human writing that meets certain criteria. That’s what shadow libraries do - collect and share as many books as possible.
One shadow library, Anna’s Archive (which I will not link here directly due to copyright concerns), recently posed a question: How could we effectively visualize 100,000,000 books or more at once? There’s lots of data to view: Titles, authors, which countries the books come from, which publishers, how old they are, how many libraries hold them, whether they are available digitally, etc.
- https://phiresky.github.io/blog/2025/visualizing-all-books-i...
Basically, legally gray online book repositories such as Anna's Archive, which created this bounty, are trying to collect a lot of books. The question quickly arises: how many books are there?
The best way to track books is by ISBN (International Standard Book Number), basically the personal ID of any given book, assigned to books by an international agency. Once you know which books exist, you can check which ones your repository already has and which ones are missing.
But the ISBN space covers roughly 2 billion possible books (the 978 and 979 prefixes, with a billion numbers each). That's a lot. So Anna's Archive created a contest to display this space in the cleanest way possible. The winning submission is very nicely done, and in my view well deserving of the $6,000 bounty.
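To make the size of that space concrete, here is a minimal Python sketch of how an ISBN-13 can be mapped to a position in it. The check-digit rule (alternating weights 1 and 3) is the standard ISBN-13 one; the function names and the choice of a flat 0-based index are my own assumptions for illustration, not how the contest entries necessarily do it.

```python
# Sketch only: the names and the flat-index layout are illustrative assumptions.
BASE = 978_000_000_000        # first 12 digits of the lowest 978-prefixed ISBN
ISBN_SPACE = 2_000_000_000    # 978 and 979 prefixes, 10**9 numbers each


def check_digit(first12: str) -> int:
    """Standard ISBN-13 check digit: weights alternate 1, 3, 1, 3, ..."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def isbn_to_index(isbn: str) -> int:
    """Map an ISBN-13 (hyphens allowed) to a position in the 2-billion space."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        raise ValueError(f"not a 13-digit ISBN: {isbn!r}")
    if check_digit(digits[:12]) != int(digits[12]):
        raise ValueError(f"bad check digit: {isbn!r}")
    index = int(digits[:12]) - BASE
    if not 0 <= index < ISBN_SPACE:
        raise ValueError(f"prefix is not 978 or 979: {isbn!r}")
    return index


def index_to_isbn(index: int) -> str:
    """Inverse mapping: position in the space back to a full ISBN-13."""
    if not 0 <= index < ISBN_SPACE:
        raise ValueError(f"index out of range: {index}")
    first12 = str(BASE + index)
    return first12 + str(check_digit(first12))


print(isbn_to_index("978-0-306-40615-7"))
print(index_to_isbn(30640615))
```

Since the check digit is fully determined by the first 12 digits, it adds no extra books; that is why 13 digits still only give about 2 billion slots.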
There are multiple ways to look at this. For example, my Central European country's laws explicitly state that infringing copyright is okay if the material is used for teaching purposes. Downloading for personal use is also allowed.
Are they breaking the laws of the country where they host their own data? I can't really say.
Honestly, I don't believe copyright laws will survive this decade, much less this century. With models being trained on copyrighted material and no cases setting the precedent that this is not okay, I feel like the new reality is that you can steal anything, as long as you 'launder' it through an AI model.
Maybe that will be the next big startup: re-creating copyrighted books through AI models, just different enough to skirt the laws. Who wouldn't like to read 'Owner of Numerous Pieces of Jewelry' instead of 'Lord of the Rings'?
There are places with minimal or no formal recognition of IP rights. Not counting stateless or breakaway regions like Transnistria and Sealand, countries like Somalia and South Sudan either do not have a government-run IP system or, in the case of South Sudan, are not part of the Berne Convention. I doubt that Anna's Archive operates in one of these places, but there are still safe harbors for their mission.
OK, so from what I understood, this visualisation displays all the ISBNs as they are assigned to countries, and then to publishers. Books that are not highlighted are the ones that are not present in Anna's Archive? Is that so?
Anna's Archive has the books in their archive, but they also have other datasets that connect a book's ISBN to its metadata (title, author, publisher, ...).
In my visualisation https://isbnviz.pages.dev you can see which books they actually have the files of (blue) and which ones they know exist because they have the metadata from some other source such as Google Books (red). Finally, there are also ISBNs not contained in any of the sets that Anna's Archive has, and these are either assigned or not assigned. A lot of the 979-prefixed ISBNs are not assigned, meaning no country or publisher has the right to assign them to a book. Other ISBNs are assigned to a publisher that just hasn't published a book with that ISBN yet. Or the publisher may have published a book, but Anna's Archive doesn't know about it because it's not in their dataset or in the ones they scraped.
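The four cases above can be written down as a small decision function. This is only a sketch of the logic as described in this comment, not the code behind the visualisation: the `Status` names, the `classify` function, and the toy data are all made up for illustration, and ISBNs are represented by their first 12 digits as integers.

```python
from enum import Enum


class Status(Enum):
    HAVE_FILE = "blue"                  # file is in the archive
    METADATA_ONLY = "red"               # known from some metadata source only
    ASSIGNED_UNKNOWN = "assigned"       # range belongs to a publisher, no known book
    UNASSIGNED = "unassigned"           # nobody has the right to use this number


def classify(isbn12, have_files, have_metadata, assigned_ranges):
    """Return the status of one ISBN (first 12 digits as an int).

    have_files / have_metadata: sets of ints (hypothetical datasets).
    assigned_ranges: list of inclusive (low, high) int pairs (hypothetical).
    """
    if isbn12 in have_files:
        return Status.HAVE_FILE
    if isbn12 in have_metadata:
        return Status.METADATA_ONLY
    if any(low <= isbn12 <= high for low, high in assigned_ranges):
        return Status.ASSIGNED_UNKNOWN
    return Status.UNASSIGNED


# Toy data, invented purely to exercise the four branches.
files = {978030640615}
metadata = {978030640615, 978030640616}
ranges = [(978030600000, 978030699999)]

for isbn12 in (978030640615, 978030640616, 978030640617, 979999999999):
    print(isbn12, classify(isbn12, files, metadata, ranges).name)
```

The order of the checks matters: an ISBN with a file usually also has metadata, so the file check comes first and the metadata check only catches what is left.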