
For anyone interested in how the obsession with organization can end: https://en.wikipedia.org/wiki/Paul_Otlet



I thought for a second the title was missing a (1936).


Link is broken.


Somewhere on HN there's a post about disrupting or revolutionizing the laundromat industry, where some person is showered with praise (and later money) for setting up this lousy system.

If it ain't broken, don't fix it.


A long time ago I lived in a shady building where it was broke: broken into, that is, to get at the coins in the machine.


And now the LavaWash server could be hacked to steal user data that a washing machine would never need, but that the implementer chose to store without reasonable protection.


In countries that communicate in non-English languages written in the Latin script, Latin-1 is still very widely used. Even where Latin-1 has been "phased out", there are tons and tons of documents and databases encoded in it, not to mention millions of ill-configured terminals.

I think it makes total sense to implement this.
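For example, here is a minimal sketch of the usual fallback trick (nothing specific to any particular tool, and the function name is made up): try UTF-8 first and fall back to Latin-1, which can decode any byte sequence.

    def decode_legacy(data: bytes) -> str:
        # Try UTF-8 first; fall back to Latin-1, which maps every byte to a character.
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            return data.decode("latin-1")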


This article contains an excellent description of the work of a mathematician. It should be part of any curriculum in the field.


The discussion notes are awesome too. Lots of examples raised.


Thanks for the replication, this is important.

One question: did you try to replicate the other result table (Table 3)?

If I understand correctly, top-2 accuracy would be 1 if you have only 2 classes, but it will differ from "normal" accuracy less and less as the number of classes increases (on average). So this shouldn't change the results for Table 3 thaaat much, as the datasets have large numbers of classes (see Table 1).

In any case, top-2 accuracy of 0.685 for the 20-newsgroups dataset is pretty neat for a method that doesn't even consider characters as characters[1], let alone tokens, n-grams, embeddings and all the nice stuff that those of us working on NLP have been devoting years to.

[1] In my understanding of gzip, it considers only bit sequences, which are not necessarily aligned with character boundaries (i.e. bytes).
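For anyone curious what is being replicated, here is a minimal sketch of the idea: gzip-based classification via normalized compression distance, with a nearest-neighbour vote and a top-2 check. This is my own toy reconstruction, not the authors' code, and all names are made up.

    import gzip
    from collections import Counter

    def clen(s: str) -> int:
        # Compressed length of a string, in bytes.
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(a: str, b: str) -> float:
        # Normalized compression distance between two texts.
        ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    def top2_predict(query: str, train: list[tuple[str, str]], k: int = 5) -> list[str]:
        # Rank training texts by NCD to the query and vote among the k nearest.
        nearest = sorted(train, key=lambda tl: ncd(query, tl[0]))[:k]
        votes = Counter(label for _, label in nearest)
        return [label for label, _ in votes.most_common(2)]  # the two best guesses

    def top2_accuracy(test: list[tuple[str, str]], train: list[tuple[str, str]]) -> float:
        # A prediction counts as correct if the true label is among the two guesses.
        hits = sum(label in top2_predict(text, train) for text, label in test)
        return hits / len(test)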


I haven't yet replicated Table 3 because most of those datasets are much larger and it will take a while to run (they said the YahooAnswers dataset took them 6 days).

Also, I have only tried the "gzip" row because that is all that is in the GitHub repo they referenced.

Yeah, you're right: the more classes there are, the smaller the effect this will probably have.


Just don't.

SQL:

- does not allow for easy and clean importing of modules/libraries

- is not easy to write tests for

- has limited support for a debugger

- lacks a consistent style for such large queries (plus most textbooks cover fairly simple stuff), which means it's hard for a developer to start reading someone else's code (more than in other languages)

- clearly indicates in its name that it is a Query language.

Save yourself the trouble and all your collaborators the pain of working with this code in the future, of trying to add new features, of trying to reuse it in another project.

If you want to operate near the data, use PL/Python for PostgreSQL.
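As a rough illustration (the function, table, and column names are made up, and the plpython3u extension has to be installed), a PL/Python function keeps the Python logic next to the data, while callers just see an ordinary SQL function:

    CREATE EXTENSION IF NOT EXISTS plpython3u;

    -- Hypothetical example: normalize messy names with Python, callable from SQL.
    CREATE OR REPLACE FUNCTION normalize_name(raw text)
    RETURNS text
    LANGUAGE plpython3u
    AS $$
        # Ordinary Python runs here, right next to the data.
        return " ".join(raw.split()).title()
    $$;

    -- Used like any other SQL function:
    SELECT normalize_name(customer_name) FROM customers LIMIT 10;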

EDIT: Fixed formatting.


-PostgreSQL extensions are easy to include and use.

-pgTAP exists for testing.

-A large query in SQL is not made smaller by translating it into an ORM DSL.

-If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in say Java or Python means those languages are NOT meant for querying data. If that's true, then why would you use them for querying data?


> If "Query" in "SQL" means it's for querying data, then evidently "Query" not being in say Java or Python means those languages are NOT meant for querying data

If X then Y does not imply if not X then not Y. Java and Python do not indicate a purpose in their name because they are general-purpose.


Are they meant for querying data?


Re modules/libraries: I meant it is not easy to write a piece of SQL code and then import it into several queries to reuse it, or lend it to someone else for use on their own schema. It is possible, yes, but seldom done, because it is hell. PostgreSQL extensions could be used for this purpose, but developing an extension requires a different set of SQL statements (or, luckily, Python or C) than those used by the user of the extension, which makes composing them a bit hard. Not impossible, just hard to maintain.

About your last point, I don't think that was my line of reasoning, but, yes, for the love of what is precious, don't open SQL files as Python/Java file objects and then parse and rummage through them to find the data you are looking for. Not impossible, just hard to maintain.

Thanks for pointing out pgTAP, didn't know about this.

For some reason, data-science folks haven't yet caught up with ORMs. I don't know if this is good or bad, but (as the OP shows) they are more used to rows and columns (or graphs) than to objects. Maybe that will change one day.


> maybe that will change one day

I pray that it never does.

https://blog.codinghorror.com/object-relational-mapping-is-t...


As for sharing SQL, that's easy to do within a database using views. Across databases with possibly different data models, that's not something I personally ever want to do.


Also, there is MindsDB: https://mindsdb.com/


Yes, but not in the form of chatbots.

Among other things, an LLM can be seen as a store which you query and get results from. A chatbot is cute because it formats output text to look like conversation, and the recent applications are nice because the query (now known as a prompt) can be complicated and long, and can influence the format and length of the results.

But the cool stuff is being able to link the relatively small amount of text you input as a query to many other chunks of text that are semantically similar (waves hands around like helicopter blades). So an LLM is a sort of "knowledge" store that can be used for expanding queries, and search results, to make it more likely that a good result seems similar to the input query.

What do I mean by similar? Well, the first iteration of this idea is vector similarity (e.g. https://github.com/facebookresearch/DPR). The second iteration is to store the results in the model itself, so that the search operation is performed by the model itself.
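To make that "first iteration" concrete, here is a toy sketch of vector similarity (the embedding function below is a random stand-in; in practice it would be a trained encoder such as DPR):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Stand-in embedding: a real system would use a trained model (e.g. DPR).
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.normal(size=128)
        return v / np.linalg.norm(v)

    def most_similar(query: str, documents: list[str], top_k: int = 3) -> list[str]:
        # Rank documents by cosine similarity of their embeddings to the query.
        q = embed(query)
        return sorted(documents, key=lambda d: float(q @ embed(d)), reverse=True)[:top_k]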

That second iteration will lead, IMHO, to a different sort of search engine. Not one over "all the pages", as, in theory at least, Google and the like currently work. Instead, it will be restricted to the "well learnt pages": those which, because of volume of repetition, structure of text, or just availability to the training algorithm, get picked up and encoded into the weights.

To make an analogy, it is like asking a human who the Knights of the Round Table are and getting back the usual "Percival, Lancelot and Galahad", but just because the other thousand knights mentioned in some works are not popular enough for that given human to know them.

This is a different sort of search engine than we are used to, one which might be more useful for many (most?) applications. The biases and dangers of it are things we are only starting to imagine.


Exactly. Unfortunately, I think the "chat" aspect is obscuring what is actually happening here, and distracting from the achievement.

First, the human input is extremely flexible and can include instructions. It is natural language programming.

Second, the "conversation" has state. I can give an instruction, and then a followup instruction that adds to the first instruction. Someday down the road there will be two states, your account state (instructions you taught it that it retains as long as you are logged in. Maybe my account can have multiple state buckets/buildings I can enter, one of one set of rules, one for another. Could call them programs or routines. (computer execute study routine)) and temporary state (instructions it retains only for the duration of the conversation/search.)

The exciting part here is being able to query data and manipulate it in memory. Making a search, refining the search, redirecting the search in a different direction when it's not working. That collaborative, iterative type of search doesn't really exist at the moment. I can't tell Google "the results you just returned are garbage, here is why, try again."

It is more like a fuzzy command line. The chatbot-ness is just a layer of cute on top that isn't completely necessary.


I never thought of this, but now that I see your idea of the "two levels of state", I find it incredibly likely and clear that it's going to work like that eventually, yes.


And let's look at the natural evolution/expectation from a corporation like Microsoft. They will sell this as a service to businesses. It is again, like Vista and others, an enterprise product being beta-tested on the consumer Joe Public.

The enterprise product will start with its own state. It will be something you can limit to specific tasks. You'll be able to write an app that takes instructions and outputs data in magical ways completely different from anything before. Normal data-manipulation programming will be replaced by this weird opaque black box, which is ever changing, and which you will need to trust to consistently output the same thing given an identical input.


This is bad advice for a couple of reasons:

1. It is expensive.

2. It moves complexity away from you and onto your providers, so it doesn't really solve the problem, only hides it from you (at a price).

3. The overall cost (energy, person-hours, material) of even the smallest project grows a lot with this approach. Even if you have the money to pay for it, you are wasting a bunch of resources around the world just for an illusion of peace of mind.

4. Most importantly, it will still fail (as all systems eventually do), and then you have no idea where it failed or how to fix it. All you can do is file some support tickets at the big-corp support center and watch for updates on their Twitter feed.

A lot of people complain here on HN about the sad, over-complicated, state of software-engineering, the need to know more and more concepts and to manage more and more tech "stacks" just to accomplish boring, formerly simple, tasks. One reason for this sad state is the philosophy expressed in the parent comment.

