I doubt any "anti-scraper" system will actually work.
But if one is found, it will pave the way for a very dangerous counter-attack: browser vendors that need data (e.g. Google) simply using the vast fleet of installed browsers to do the scraping for them. Chrome, Safari, Edge, sending the pages you visit to their data centers.
This feels like it was already half happening anyway, so it isn't too big a leap.
I also think this is the endgame of things like Recall in Windows: steal the training data right off your PC, no need to wait for the sucker to upload it to the web first.
I've done that as well. The PoC worked, but the statelessness did prove a hurdle.
It enforces a pattern in which a client must redo the PoW on every request.
Other difficulties uncovered in our PoC were:
Not all clients are equal: this punishes an old mobile phone or a Raspberry Pi much more than a client running on a beefy server with GPUs, or clients running on compromised hardware. In other words, real users are likely punished the most, while illegitimate users are often punished the least.
Not all endpoints are equal: we experimented with higher difficulties for POST/PUT/PATCH/DELETE than for GET, and with different difficulties for different endpoints, attempting to match how expensive a call would be for us. That requires back-and-forth to exchange difficulties (see the sketch below).
It discourages proper HATEOAS or REST, where a client browses through the API by following links, and encourages calls that "just include as much as possible in one query", diminishing our ability to cache, to stay flexible, and to leverage good HTTP practices.
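To make that back-and-forth concrete, here is a minimal sketch of the per-request pattern with per-method difficulties. It's illustrative only: the names and difficulty numbers are made up for this comment, not our actual PoC code.

    // Hypothetical per-request PoW sketch (TypeScript / Node).
    import { createHash, randomBytes } from "node:crypto";

    // Difficulty per HTTP method: write operations cost the client more than reads.
    const powDifficulty: Record<string, number> = {
      GET: 2, // ~2 leading zero hex chars
      POST: 4,
      PUT: 4,
      PATCH: 4,
      DELETE: 5,
    };

    interface Challenge {
      nonce: string;
      difficulty: number;
    }

    // Server: issue a fresh challenge for every request. This is where the
    // statelessness hurts: the nonce must be stored or signed to verify it later.
    function issueChallenge(method: string): Challenge {
      return {
        nonce: randomBytes(16).toString("hex"),
        difficulty: powDifficulty[method] ?? 3,
      };
    }

    // Client: brute-force a counter until the hash has enough leading zeros.
    // An old phone grinds through this far more slowly than a GPU server.
    function solve(challenge: Challenge): number {
      const target = "0".repeat(challenge.difficulty);
      for (let counter = 0; ; counter++) {
        const digest = createHash("sha256")
          .update(`${challenge.nonce}:${counter}`)
          .digest("hex");
        if (digest.startsWith(target)) return counter;
      }
    }

    // Server: verification is a single hash, so it stays cheap for us.
    function verify(challenge: Challenge, counter: number): boolean {
      const digest = createHash("sha256")
        .update(`${challenge.nonce}:${counter}`)
        .digest("hex");
      return digest.startsWith("0".repeat(challenge.difficulty));
    }

    // Example round trip for a POST request.
    const challenge = issueChallenge("POST");
    console.log(verify(challenge, solve(challenge))); // true

Even in this toy form you can see the extra round trip: the client has to learn the difficulty for this method and endpoint before it can even start hashing.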
That's a perfect tool for monopolists to widen their moat even more.
In line with how email is technically still federated and distributed, but practically oligopolized by a handful of big-tech companies, through "fighting spam".
That's not true for the vast amount of Creative Commons, open-source, and other permissively licensed content.
(Aside: those licenses and that distribution were advocated by much of the same demographic (the information-wants-to-be-free folks, JSTOR protestors, GPL zealots) that now opposes LLMs using that content.)
I'm sure GPL zealots would be happier about this situation if LLM vendors abided by the spirit of the license by releasing their models under GPL after ingesting GPL data, but we all know that isn't happening.
Your users - we, browsing the web - are already threatened by this. Adding a PoW changes nothing here.
My browser already has several layers of protection in place, and it lets me add more with add-ons (uBlock etc.); my OS adds even more on top of that. This is enough to allow legitimate PoW while blocking malicious code.
WebGPU is experimental in Firefox on all platforms, especially Linux. Chrome on Linux should have it, but I've not gotten it to work - might be Chromium, might be a flag, or something else.
Java bytecode was never intended to be used with anything other than Java - unlike WASM, it's very much designed to describe programs using virtual dispatch and automatic memory management. Sun eventually added features like invokedynamic to make it easier to implement dynamic languages (at the time, things like Ruby and Python), but it was always a bit of a round peg in a square hole.
By comparison, WASM is really more like traditional assembly, only running inside a sandbox.
I think so, but that was the '90s, when we lacked the hindsight needed to get it right. Plus, that was mostly just Sun, right? WASM is backed by all browsers, and it looks like MS might be looking at bridging it with its own kernel or something?
I understand why, but I still lament that Java applets were dropped like a hot potato rather than having their (fundamental) issues solved.
Back then, I learned Java just to have fancy menus, quirky gimmicks and such. Until Flash came along, nothing else could do this. Where Java was rather open and free/libre, Flash was proprietary and even patented. A big step back. And it took decades before JavaScript reached parity in the ability to create such interactive, multimedia experiences in a cross-browser way.
I can only imagine how much further along things like videoconferencing, realtime collaboration or gaming on the web would be if Java applet tech had kept improving since its inception.
(edit: I'm all for semantic, accessible, clean HTML/CSS/JS in my web apps. But there are lots of use cases for gimmicks, fancy visuals, immersive experiences etc., and no, that's not the hotel reservation form or the hackers' forum. But art. Or fun. Or ?)
Sure, I just think that's a very odd way to characterize the project. Basically anything can be a universal VM if you put enough effort into reimplementing the languages. Much of what sets Parrot apart is its support for frontend tooling.
I certainly think the humor in Parrot/Rakudo (and why they still come up today) is how little of their own self-image the proponents could perceive. The absolute irony of thinking that Perl's strength was due to familiarity with text manipulation rather than its cultural mass...
So far I've only ever been using a private symbol that only exists within the codebase in question (and is then exported to other parts of said codebase as required).
If I ever decide to generalise the approach a bit, I'll hopefully remember to do precisely what you describe.
Possibly with the addition of providing an "I am overriding this deliberately" flag that blows up if it doesn't already have said symbol.
But for the moment, the maximally dumbass approach in my original post is DTRT for me so far.
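For what it's worth, here's a rough, entirely hypothetical sketch of how I picture that pattern in TypeScript; the real codebase isn't shown in this thread, so every name below is made up:

    // Private symbol, shared only with other modules in this codebase as needed.
    const HANDLED = Symbol("handled");

    type Handler = ((input: string) => string) & { [HANDLED]?: boolean };

    const handlers = new Map<string, Handler>();

    export function register(
      name: string,
      fn: Handler,
      opts: { deliberateOverride?: boolean } = {},
    ): void {
      const existing = handlers.get(name);
      if (opts.deliberateOverride) {
        // The "I am overriding this deliberately" flag: blow up unless a
        // symbol-tagged handler is already registered under this name.
        if (!existing || !existing[HANDLED]) {
          throw new Error(`nothing to deliberately override for "${name}"`);
        }
      } else if (existing) {
        throw new Error(`handler "${name}" is already registered`);
      }
      fn[HANDLED] = true; // tag with the private symbol
      handlers.set(name, fn);
    }

The dumbass version is just the symbol tag without the override path, which is roughly where I am now.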
I've used Codeberg for some projects, and while their work and services are impressive and their progress steady and good, it's really not a proper alternative to GitHub for many use cases.
"It depends", as always, but codeberg lacks features (that your use-case may not need, or may require), uptime/performance (that may be crucial or inconsequential to your use-case), familiarity (that may deter devs), integration (that may be time-consuming to build yourself or be unnessecary for your case) etc etc.