
To share a related river issue: a few years back the city of Gothenburg, Sweden switched how it allocates students to schools. It used to be that a student was assigned a school in their own district of the city ("stadsdel"). The new system lets you choose schools, listed in rank order, even ones in another part of the city.

You could select from a list of schools, with distances given as straight-line distance (or perhaps as route distance? I can't tell from the articles I've read), which meant some of the schools across the river were considered "close".

In one case, a student had a 45-minute commute to get to school, due to waiting for the ferry. The parents had listed that school as their 5th choice, based on the stated distance. The more important factor should be travel time, but the computer system doesn't take mass-transit schedules into account.

As a contributing factor, the list of schools did not include each school's stadsdel. This extra bit of reverse geocoding - trivially available - might have helped them realize the issue.

One news story about it, in Swedish: https://www.svt.se/nyheter/lokalt/vast/kaos-i-antagningssyst...


I tried loading 2PLV (by id and by PDB file) and 9PTI (by id) but nothing happened.

LibreWolf on macOS, if it matters.


This 1981 Nova episode starts with Nixon's radio address about the American Right of Privacy. "A system that fails to respect its citizens' right to privacy fails to respect the citizens themselves."

It discusses cryptography, including public key cryptography; describes worries about the possibility for abuse and the use of personalized data for targeted political mailings and marketing; and covers the Minitel introduction in Saint-Malo. There's also a staged pen test and a kid who gets caught breaking into a university computer.

One scene mentions how people are worried about the tracking made possible by using ATMs. The banking rep says that would require computers 1,000 times more powerful.


I see ChatGPT's bots pull down all of my Python wheels every couple of weeks.

Wheels that haven't changed in years, with a "Last-Modified" and "ETag" that haven't changed.

The only thing that makes sense to me is that it's cheaper for them to re-pull and re-analyze the data than to develop a cache.


Or, could it be, just possibly be (gasp), that some of the devs at these "hotshot" AI companies are just ignorant or lazy or pressured enough not to do such normal checks? Wouldn't be surprised if so.


You think they do cache the data but don't use it?

For what it's worth, mj12bot.com is even worse. They pull down every wheel every two or three days, even though something like chemfp-3.4-cp35-cp35m-manylinux1_x86_64.whl hasn't changed in years - it's for Python 3.5, after all.


>You think they do cache the data but don't use it?

That's not what I meant.

And it's not "they", it's "it".

I.e., the web server - not the bots or devs on the other end of the connection - is what tells you the needed info. All you have to do is check it and act accordingly: download the resource if it changed, skip it if it didn't.

Google:

http header last modified

and look for the ETag link too.
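A polite client is maybe a dozen lines of Python. An untested sketch, using the third-party requests library; a real crawler would persist the cache rather than keep a dict in memory:

  import requests

  cache = {}  # url -> (etag, last_modified, body)

  def polite_get(url):
      # Send back the validators the server gave us last time, if any.
      headers = {}
      if url in cache:
          etag, last_modified, _ = cache[url]
          if etag:
              headers["If-None-Match"] = etag
          if last_modified:
              headers["If-Modified-Since"] = last_modified
      resp = requests.get(url, headers=headers)
      if resp.status_code == 304:
          # Not modified: the server sent no body, so reuse ours.
          return cache[url][2]
      resp.raise_for_status()
      cache[url] = (resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"),
                    resp.content)
      return resp.content

A 304 costs a few hundred bytes instead of a multi-megabyte wheel.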



I have a small static site. I haven't touched it in a couple of years.

Even then, I see bot after bot, pulling down about 1/2 GB per day.

Like, I distribute Python wheels from my site, with several release versions × several Python versions.

I can't understand why ChatGPT, PetalBot, and other bots want to pull down wheels, much less the full contents when the header shows it hasn't changed:

  Last-Modified: Thu, 25 May 2023 09:07:25 GMT
  ETag: "8c2f67-5fc80f2f3b3e6"
Well, I know the answer to the second question, as DeVault's title highlights - it's cheaper to re-read the data and re-process the content than to set up a local cache.

Externalizing their costs onto me.

I know 1/2 GB/day is not much. It's well under the 500 GB/month I get from my hosting provider. But again, I have a tiny site with only static hosting, and as far as I can tell, the vast majority of transfers from my site are worthless.

Just like accessing 'expensive endpoints like git blame, every page of every git log, and every commit in every repo' seems worthless.


This is a pet peeve of Rachel by the Bay. She sets strict limits on her RSS feed for clients that don't properly use the caching headers she provides. I wonder if anyone has made a WAF that automates this sort of thing.
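I haven't seen one, but the core rule fits in a few lines. A hypothetical sketch - the names and threshold are invented, and a real WAF would persist this in something like Redis with expiry rather than plain dicts:

  served = {}   # (ip, url) -> ETag we handed out earlier
  strikes = {}  # ip -> count of re-fetches that ignored our validators

  def allow(ip, url, if_none_match):
      # Re-fetching a URL we already tagged, without sending the
      # validator back, counts as a strike; too many and you're out.
      if (ip, url) in served and not if_none_match:
          strikes[ip] = strikes.get(ip, 0) + 1
      return strikes.get(ip, 0) < 10

  def record(ip, url, etag):
      if etag:
          served[(ip, url)] = etag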


Given that they're actively trying to obfuscate their activity (according to Drew's description), identifying and blocking clients seems unlikely to work. I'd be tempted to de-prioritize the more expensive types of queries (like "git blame") and set per repository limits. If a particular repository gets hit too hard, further requests for it will go on the lowest-priority queue and get really slow. That would be slightly annoying for legitimate users, but still better than random outages due to system-wide overload.
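Something like this, say - a sketch with made-up endpoint costs and budgets, not a tested implementation:

  import time
  from collections import defaultdict, deque

  COST = {"blame": 10, "log": 3, "commit": 1}  # relative expense, invented
  WINDOW = 3600  # seconds
  BUDGET = 500   # per-repo cost allowed per window

  hits = defaultdict(deque)  # repo -> deque of (timestamp, cost)

  def queue_for(repo, endpoint):
      # Charge the request, expire old entries, then pick a queue.
      now = time.time()
      q = hits[repo]
      q.append((now, COST.get(endpoint, 1)))
      while q and q[0][0] < now - WINDOW:
          q.popleft()
      spent = sum(c for _, c in q)
      return "lowest-priority" if spent > BUDGET else "normal"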

BTW isn't the obfuscation of the bots' activity a tacit admission by their owners that they know they're doing something wrong and causing headaches for site admins? In the copyright world that becomes wilful infringement and carries triple damages. Maybe it should be the same for DoS perpetrators.


Just to clarify, my understanding is that she doesn't block user agent strings, she blocks based on IP and not respecting caching headers (basically, "I know you already looked at this resource and are not including the caching tags I gave to you"). It's a different problem than the original article discusses, but perhaps more similar to @dalke's issue.


I'm pretty sure:

"I have to review our mitigations several times per day to keep that number from getting any higher. When I do have time to work on something else, often I have to drop it when all of our alarms go off because our current set of mitigations stopped working. ... it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all."

means that no one has managed it.


I don't understand the thing about the cache. Presumably they have a model that they are training; isn't that effectively their cache? Are they retraining the same model on the same data, on the theory that repetition will weight higher-ranked pages more heavily or something? Or is this about training slightly different models?

If they are really just training the same model, and there's no benefit to training it multiple times on the same data, then presumably they could use a statistical data structure like https://en.wikipedia.org/wiki/HyperLogLog to check whether they've already trained on a page, based on the Last-Modified header + URI? That would be far cheaper than a full cache, and cheaper than re-scraping.
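(Though on reflection, HLL only estimates how many distinct items you've seen; the membership test I'm describing really wants its cousin, a Bloom filter, which is just as cheap. A hand-rolled sketch - the bitmap size and hash count are illustrative:)

  import hashlib

  NBITS = 1 << 23  # a 1 MiB bitmap
  bits = bytearray(NBITS // 8)

  def _positions(key, k=4):
      # Derive k bit positions from a single SHA-256 digest.
      digest = hashlib.sha256(key.encode()).digest()
      for i in range(k):
          yield int.from_bytes(digest[4*i:4*i + 4], "big") % NBITS

  def add(key):
      for pos in _positions(key):
          bits[pos // 8] |= 1 << (pos % 8)

  def maybe_seen(key):
      # Never a false "no"; a small, tunable chance of a false "yes".
      return all(bits[pos // 8] & (1 << (pos % 8))
                 for pos in _positions(key))

Keying on URI + Last-Modified means a page that actually changes gets a new key and is fetched again.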

I was also under the impression that the name of the game with training was to get high quality, curated training sets, which by their nature are quite static? Why are they all still hammering the web?


> such that they have restricted what a citizen can do with it

My grandfather, born in Canada and later naturalized as a US citizen, got his ham ticket back in the 1960s, but, as he wrote: "This was O.K. for one year but to renew & become general I would have to obtain more than just a US passport; It would be necessary to get a certificate of citizenship. This took years and during those years I landed up in the Dom. Republic & got my Ham ticket there without it, HI3XRD."

He later moved to Miami. When Hurricane David came through the D.R. in 1979, he was one of the ham volunteers who helped handle communications from the island.

Oh, and he never got Extra because while he could manage 13 wpm for General or Advanced, he couldn't manage the 20 wpm for Extra.


"It would be necessary to get a certificate of citizenship. This took years and during those years I landed up in the Dom. Republic & got my Ham ticket there without it, HI3XRD.""

Thank you very much for pointing that out. I'm in Australia and I've often pointed to the fact that many countries restricted access to the radio spectrum for many reasons—to limit EMI, for state security and strategic reasons, to ensure secrecy of communications, etc.

For example, when I got my amateur ticket whilst still at school in the 1960s I had to sign a Declaration of Secrecy and have it witnessed by a registered JP. The reason was that people such as us could come across important transmissions (messages) of a strategic nature that should not be allowed to fall into the wrong hands.

Come mobile phones, WiFi, etc., and that changed without any real public discussion whatsoever.

What I find absolutely amazing is how—by sleight-of-hand—Big Tech sideslipped both very tight telephony and radiocommunications laws to violate, say, privacy on smartphones, and the fact that they've gotten away with it. The smartphone generation hasn't a clue about any of this stuff. Right, once the privacy of telephonic communications was inviolable; now it's a fucking joke.

On the matter of the declaration of secrecy: amateurs could possibly come across unencrypted telephonic communications, ship-to-shore etc., and as these were deemed secret, they (rightly) were not allowed to act on that information in any way - in fact, jail-time penalties applied if the laws were violated.

Incidentally, as my Declaration of Secrecy has never been rescinded I'm still bound by its conditions.


I found an error in a published version of TAOCP, 2nd edition I think, in the material on improvements to the Sieve of Eratosthenes. I was so excited - then found it was already listed in the errata.

I later got a check for identifying a minor issue with the early history of superimposed coding. I happened to have copies of the relevant patent case containing examples predating Mooers' randomized superimposed coding.

("Happened to" because I had visited the Mooers archive at the Charles Babbage Institute in Minnesota to research some of the early history of chemical information management. Mooers is one of the "fathers" of information retrieval, and in fact coined the term "information retrieval" at a chemistry conference.)


When people ask for examples, I point to a NYT report about a man in San Francisco whose young son had redness on his penis and complained that it felt sore. The pediatrician asked for some photos to make a diagnosis online. Google flagged them as child porn and notified the police. The police said it wasn't, but Google still declined to restore his account.

"A Dad Took Photos of His Naked Toddler for the Doctor. Google Flagged Him as a Criminal.", https://www.nytimes.com/2022/08/21/technology/google-surveil...

In India, Google locked an engineer from Gujarat out of his Google account, saying it contained explicit content potentially involving child abuse or exploitation. The engineer believes it's because the account contained images of him as a child being bathed by his grandmother.

"HC notice to Google India after engineer loses access Gmail, Google Drive, and more over childhood photo labelled 'porn'" https://timesofindia.indiatimes.com/technology/tech-news/hc-...

I use these examples specifically because many in my government want "Chat Control", where snitchware scans messages for child porn and the like, and notifies the police. It will be full of false positives like these, especially if the scanning software continues to be built by puritanical American companies.

Another class is people whom the US deems a security threat. How long will it be until the US extends its sanctions against the ICC by ordering Microsoft, Apple, Amazon, Oracle, and Google to shut down the accounts - both work and personal - of the ICC and anyone involved in its genocide investigation?


Here in Sweden, it took me about 5 months, but I was able to get the school to send me info by email. They had switched to an app-only system, and I have no smartphone.

Setting aside any issues related to privacy or US corporate control over my life, I'm one of the people who doesn't use a smartphone because the temptation to be online, at the drop of a hat, is too much to resist.

I compare it to being like someone who needs to lose weight, so keeps all chocolate out of the house - while everyone seems to expect me to carry a luscious bar of high-quality chocolate with me all the time, just sitting there, begging me to eat it.

There is an ongoing debate about smartphones at school, and the addiction and distraction they can be for kids. I think my strongest argument is that the addiction and distraction don't simply disappear for adults, and there was no way they were going to force me to get a smartphone.

I don't think that would work for those who already have a smartphone, but it's a crack that keeps an alternative open.


I live in Sweden with two kids in school and can do everything through the web (except BankID authentication, of course).


In January our city switched from a web portal to an app-only system.

Last fall I was working on getting username+password access to the portal, since the students and teachers don't need BankID to log in. I was using my son's login to read the weekly newsletters.

At the time I warned our skolnämnden (school board) that I would not be getting a smartphone. They switched to app-only anyway, making it impossible for me to get info or to report absences and late arrivals.

It took talking with the teachers, with fritids (the after-school program), and with the principal to work out an email-based option.


> “It feels good,” Dettmers said when I asked him how it felt to have contributed to what some are calling a “Sputnik moment” in AI.

I'll present an alternative narrative for the "Sputnik moment" trope, as described by someone who lived through it.

In "Introduction of Computers in Chemical Structure Information Systems, or What Is Not Recorded in the Annals", once easily available at https://www.asis.org/History/12-lynch.pdf but now only in archive.org at https://web.archive.org/web/20130702201907/https://www.asis....

"There were great stirrings in science information at that time because of Sputnik, the challenge to the United States from the Soviet Union in October 1957. Sputnik’s beep-beep tones took the world totally by surprise. When the dust had settled, it became apparent that the Soviets had published their intentions in the open literature, but the science information system in the West was in disarray. The system had not been considered sufficiently important nor was it well enough funded to keep up with the vast increases in the numbers of scientists employed and publishing in the postwar period."

In this interpretation, the issue isn't publishing in the open, but rather who is paying attention to what's published.

