
The whole "mystery" of transformer is that instead of a linear sequence of static weights times values in each layer, you now have 3 different matrices that are obtained from the same input through multiplication of learned weights, and then you just multiply the matrices together. I.e more parallelism which works out nice, but very restrictive since the attention formula is static.

We aren't going to see more progress until we have a way to generalize the compute graph as a learnable parameter. I dunno if this is even possible in the traditional sense of gradients due to chaotic effects (i.e., small changes produce big shifts in performance); it may have to be some form of genetic algorithm or PSO that happens under the hood.
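
If it does end up being evolutionary, a toy version of that search might look something like this (purely a sketch; the graph encoding and mutation scheme are placeholders, and fitness would be something like the validation accuracy of a model built from the graph):

    import random

    def random_dag(n):
        # encode a compute graph as a set of edges i -> j with i < j
        return {(i, j) for i in range(n) for j in range(i + 1, n)
                if random.random() < 0.3}

    def mutate(graph, n):
        g = set(graph)
        i, j = sorted(random.sample(range(n), 2))
        g.symmetric_difference_update({(i, j)})          # flip one edge on or off
        return g

    def evolve(fitness, n=8, pop_size=32, generations=100):
        # fitness(graph) stands in for training/evaluating a model built from the graph
        pop = [random_dag(n) for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            elite = pop[: pop_size // 2]
            pop = elite + [mutate(random.choice(elite), n)
                           for _ in range(pop_size - len(elite))]
        return max(pop, key=fitness)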




>The whole "mystery" of transformer is that instead of a linear sequence of static weights times values in each layer, you now have 3 different matrices that are obtained from the same input through multiplication of learned weights, and then you just multiply the matrices together. I.e more parallelism which works out nice, but very restrictive since the attention formula is static.

That's not it at all. What's special about transformers is they allow each element in a sequence to decide which parts of data are most important to it from each other element in the sequence, then extract those out and compute on them. The big theoretical advantage over RNNs (which were used for sequences prior to transformers), is that transformers support this in a lossless way, as each element has full access to all the information in every other element in the sequence (or at least all the ones that occurred before it in time sequences). RNNs and "linear transformers" on the other hand compress past values, so generally the last element of a long sequence will not have access to all the information in the first element of the sequence (unless the RNN internal state was really really big so it didn't need to discard any information).


>What's special about transformers is they allow each element in a sequence to decide which parts of data are most important to it from each other element in the sequence, then extract those out and compute on them.

They do that in theory. In practice, it's just all matrix multiplication. You could easily structure a transformer as a bunch of fully connected deep layers and it would be mathematically equivalent, just computationally inefficient.


> We aren't going to see more progress until we have a way to generalize the compute graph as a learnable parameter

That's a bold statement since a ton of progress has been made without learning the compute graph.


From my naive perspective, there seems to be a plateau that everyone is converging on, somewhere between the ChatGPT 3.5 and 4 levels of performance, with some suspecting that the implementation of 4 might involve several expert models, which would already be extra sauce external to the LLM. Combine that with the observation that generative models converge to the same output given the same training data, regardless of architecture (having trouble finding the link; it was posted here some weeks ago), and external secret sauce, outside the model, might be where the near-term gains are.

I suppose we'll see in the next year!


We already have competitors to Transformers, e.g. Mamba:

https://arxiv.org/abs/2312.00752


Where do I enter in my credit card info?


You hire people to implement a product based on this?


A ton of progress can be made climbing a tree, but if your goal is reaching the moon it becomes clear pretty quickly that climbing taller trees will never get you there.


True, but it is the process of climbing trees that gives you the insight into whether taller trees help or not, and if not, what to do next.


Not true. Climbing trees for millions of years taught us nothing about orbits, or rockets, or distances literally incomprehensible to humans, or the vacuum of space, or any possible way to get higher than a tree.

We eventually moved on to lighter-than-air flight, which once again did not teach us any of those things and was also a dead end from the "get to the sky/moon" perspective, so then we invented heavier-than-air flight, which once again could not teach us about orbits, rockets, distances, or the vacuum of space.

What got us to the moon was rigorous analysis of reality with math to discover Newton's laws of motion, from which you can derive rockets, orbits, the insane scale of space, etc. No amount of further progress in planes, airships, kites, birds, anything on earth would ever have taught us the techniques to get to the moon. We had to analyze the form and nature of reality itself and derive an internally consistent model of that physical reality in order to understand anything about doing space.


> Climbing trees for millions of years taught us nothing about

Considering the chasm in the number of neurons between the apes and most other animals, I think one could claim that climbing those trees had some contribution to the ability to understand those things. ;) Navigating trees, at weight and speed, has a minimum intelligence requirement.


With enough thrust, even p̵i̵g̵s̵ trees can fly.


We have made progress in efficiency, not functionality. Instead of searching Google or Stack Overflow or any particular documentation, we just go to ChatGPT.

Information compression is cool, but I want actual AI.


The idea that there has been no progress in functionality is silly.

Your whole brain might just be doing "information compression" by that analogy. An LLM is sort of learning concepts. Even Word2Vec "learned" that king - male + female = queen, and that's a small model that's really just one part (not exact, but similar) of a transformer.
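
That analogy is easy to reproduce, e.g. with gensim's pretrained Google News vectors (the usual man/woman formulation; note it's a sizeable download, roughly 1.6 GB):

    import gensim.downloader as api

    vectors = api.load("word2vec-google-news-300")    # pretrained word2vec embeddings
    # "king" - "man" + "woman" should land near "queen"
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # [('queen', 0.71...)]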


Let me rephrase that.

One level deep information compression is cool, but I want actual AI.

It's true that our brains compress information, but we compress it in a much more complex manner, in the sense that we can not only recall stuff, but also execute a decision tree that often involves physical actions to find the answer we are looking for.


An LLM isn't just recalling stuff. Brand new stuff, which it never saw in its training, can come out.

The minute you take a token and turn it into an embedding, then start changing the numbers in that embedding based on other embeddings and learned weights, you are playing around with concepts.

As for executing a decision tree, ReAct or Tree of Thought or Graph of Thought is doing that. It might not be doing it as well as a human does, on certain tasks, but it's pretty darn amazing.


>Brand new stuff, which it never saw in its training, can come out.

Sort of. You can get LLMs to produce some new things, but these are statistical averages of existing information. It's kinda like a static "knowledge tree", where it can do some interpolation, but even then, it's interpolation based on statistically occurring text.


The interpolation isn't really based on statistically occurring text. It's based on statistically occurring concepts. A single token can have many meanings depending on context and many tokens can represent a concept depending on context. A (good) LLM is capturing that.


Neither just text nor just concepts, but text-concepts — LLMs can only manipulate concepts as they can be conveyed via text. But I think wordlessly, in pure concepts and sense-images, and serialize my thoughts to text. That I have thoughts that I am incapable of verbalizing is what makes me different from an LLM - and, I would argue, actually capable of conceptual synthesis. I have been told some people think “in words” though.


Nope, you could shove in an embedding that didn't represent an existing token. It would work just fine.

(if not obvious.. you'd shove it in right after the embedding layer...)
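
Something like this with HuggingFace transformers, assuming GPT-2 (the prompt and the blended vector are just for illustration; the point is that the model happily consumes an embedding no token maps to):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    with torch.no_grad():
        ids = tok("The capital of France is", return_tensors="pt").input_ids
        embeds = model.transformer.wte(ids)              # ordinary token embeddings

        # Build a vector that corresponds to no vocabulary token: the midpoint
        # of the " Paris" and " Rome" embeddings, and append it to the sequence.
        paris = model.transformer.wte(torch.tensor(tok.encode(" Paris")))[0]
        rome = model.transformer.wte(torch.tensor(tok.encode(" Rome")))[0]
        blend = ((paris + rome) / 2).view(1, 1, -1)
        embeds = torch.cat([embeds, blend], dim=1)

        out = model(inputs_embeds=embeds)                # forward pass works just fine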


Fascinating. What’s “actual AI”?


> What’s “actual AI”

Is Ibn Sina (Avicenna, year ~1000) fine?

> [the higher faculty proper of humans is] the primary function of a natural body possessing organs in so far as it commits acts of rational choice and deduction through opinion; and in so far as it perceives universal matters

Or, "Intelligence is the ability to reason, determining concepts".

(And a proper artificial such thing is something that does it well.)


It is a tool that has the ability to craft a prompt that will break the current state-of-the-art model.

It is a tool that can be given a project in language X and produce an idiomatic port in language Y.

It is a tool that, given a 20-page paper spec, will ask the questions needed to clarify the specs.


Something that can reason and figure things out without having ever been exposed to the information during training.


This either includes GPT-4 or excludes people


It’s whatever computers can’t do, dummy! :P


This basically already happens: the network can learn to ignore some paths and amplify something more important, and then you can just cut those paths without a noticeable loss of quality. The problem is that you are not going to win anything from this - non-matrix multiplication would be slower or the same.


The issue is that you are thinking of this in terms of information compression, which is what LLMs are.

I'm more concerned with an LLM having the ability to be trained to the point where a subset of the graph represents all the NAND gates necessary for a CPU and RAM, so when you ask it questions it can actually run code to compute the answers accurately instead of offering a statistical best guess, i.e., decompression after lossy compression.


Just give it a computer? Even a virtual machine. It can output assembly code or high level code that gets compiled.


The issue is not having access to the CPU; the issue is whether the model can be trained in such a way that it has representative structures for applicable problem solving. Furthermore, the structures themselves should

Philosophically, you can't just start ad hoc-ing functionalities on top of LLMs and expect major progress. Sure, you can make them better, but you will never get to the state where AI is massively useful.

For example, let's say you gather a whole bunch of experts in respective fields, and you give them a task to put together a detailed plan on how to build a flying car. You will have people doing design, doing simulations, researching material sourcing, creating CNC programs for manufacturing parts, sourcing tools and equipment, writing software, etc. And when executing this plan, they would be open to feedback for anything missed, and could advise on how to proceed.

An AI with the above capability should be able to go out on the internet, gather the respective data, run any sort of algorithms it needs to run, and perhaps after a month of number crunching on a cloud-rented TPU rack produce a step-by-step plan, with costs, on how to do all of that. And it would be better than those experts, because it should be able to create much higher fidelity simulations to account for things like vibration and predict if some connector is going to wobble loose.


> Philosophically, you can't just start ad hoc-ing functionalities on top of LLMs and expect major progress. Sure, you can make them better, but you will never get to the state where AI is massively useful.

Evolution created various neural structures in biological brains (visual cortex, medulla, thalamus, etc) rather ad-hoc, and those resulted in "massively useful" systems. Why should AI be different?


I mean, we could definitely run architectures through simulated evolution with genetic algorithms, but then you arrive at the same problem as humans do, which is that you end up with a statistically best solution for given conditions. Sure, that could be a form of AI but there is likely a better (and likely faster) way to build an AI that isn't fundamentally statistical in nature and is adaptable to any and all problems.


LLMs seem like the least efficient way to accomplish this. NAND gates, for example, are inherently 1-bit operators, but LLMs use more. If weights are all binary, then gradients are restricted to -1, 0, and 1, which doesn't give you much room to make incremental improvements. You can add extra bits back, but that's pure overhead. But all this is beside the real issue, which is that LLMs and NNs in general are inherently fuzzy; they guess. Computers aren't; we have perfect simulators.

Consider how humans design things. We don't talk through every CPU cycle to convince ourselves a design works; we use bespoke tooling. Not all problems are language-shaped.


From what you've written, I don't see why any of this would require the LLM to "be trained to the point where a subset of the graph represents all the NAND gates necessary for a CPU and RAM" - you'd just be emulating a CPU, but slower.

Tool usage is better, because the LLM can access the relevant computing/simulation at the highest fidelity and as fast as they can run on a real or virtual computer, rather than emulated poorly in a giant pyramid of matrix multiplications.

Am I missing the point?


Well, just remember that NAND gates are made of transistors themselves, which are a statistical model of a sort… just designed to appear digital when combined up to the NAND level.

This is why I am very interested in analog again—quantum stuff is statistical already, so why go from statistical (analog) to digital (a huge drop-off in performance, e.g. just look at basic addition in an ALU) and back to statistical? Very interested. Not sure if it will ever be worth it, but can't rule it out.


>a way to generalize the compute graph as a learnable parameter.

Agreed. Seems analogous to how human mental processes are used to solve the kind of problems we'd like LLMs to solve (going beyond "language processing", which transformers do well, to actual reasoning, which they can only mimic). Although you risk it becoming a Turing machine by giving it flow control, and then training is a problem, as you say. Perhaps not intractable though.


Hyperparameter tuning does already go some of the way towards learning the compute graph, though it's very constrained and requires a lot more training.


How can gradient descent work on compute graphs when the space of compute graphs is discrete?


> How can gradient descent work on compute graphs when the space of compute graphs is discrete?

You can un-discretize the space of compute graphs by interpolating its points with simplices. More precisely, each graph is a subgraph of the complete graph, and the subgraph is identified by the indicator function of its edges, whose values are either 0 or 1. By using weighted edges with values between 0 and 1, the space of all graphs (with the same number of vertices) becomes continuous and connected, and you can move around it by gradient steps.

Of course, "compute graphs" are more general beasts than "graphs", but it is likely that the same idea will apply. At least, for a reasonably large class of compute graphs.


It can't. There's no gradient, since it's not a sufficiently nice space for them. You can use gradient-free methods, but I'd be shocked if there was an efficient enough way to do that.


I don't know if it can in the traditional sense of back propagation.

I think that Hebbian learning is going to make a comeback at some point, and it will be used to connect static subgraphs to other subgraphs, which can be trained either separately or on the fly.
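
For reference, the basic Hebbian rule is tiny (toy numpy sketch; the decay term is just one common way of keeping the weights bounded):

    import numpy as np

    def hebbian_step(W, x, lr=0.01, decay=0.001):
        # x: pre-synaptic activations; W: associative weights between two groups of units
        y = np.tanh(W @ x)                        # post-synaptic activity
        W += lr * np.outer(y, x)                  # "fire together, wire together"
        W -= decay * W                            # mild decay so weights don't blow up
        return W, y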


Perhaps in a way similar to this paper: https://arxiv.org/abs/1806.09055
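
The core trick in that paper (DARTS) is to replace each discrete choice of operation on an edge with a softmax-weighted mixture, so the architecture parameters get gradients too. Roughly (the candidate ops below are stand-ins for the paper's conv/pool choices):

    import torch
    import torch.nn as nn

    class MixedOp(nn.Module):
        # one edge of the cell: a softmax-weighted mix of candidate operations
        def __init__(self, dim):
            super().__init__()
            self.ops = nn.ModuleList([
                nn.Identity(),                                # skip connection
                nn.Linear(dim, dim),                          # stand-in for a conv op
                nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
            ])
            self.alpha = nn.Parameter(torch.zeros(len(self.ops)))   # architecture params

        def forward(self, x):
            w = torch.softmax(self.alpha, dim=0)
            return sum(wi * op(x) for wi, op in zip(w, self.ops))

    # after the search, discretize: keep only the op with the largest alpha on each edge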


I wonder why this hasn't taken off.


From a brief look at the paper, they are doing gradient descent on the architecture based on validation loss, which is good for efficiency, but it's not groundbreaking. The problem is that you are still training towards the target of a correct answer. I don't think this is going to be applicable in the future, in the sense that we have to train on other things (like logical consistency somehow encoded into the network) as well as correct answers.


Your expectations are pretty high. Differentiable architecture search as you mentioned in the original comment is one thing; going beyond empirical risk minimization-based learning is another thing entirely. In fact, they seem mostly orthogonal.

That aside, it seems like AI has had the most empirical success by not imposing hard constraints/structure, but letting models learn completely "organically". The computationalists (the folks who have historically been more into this "AI has to have things like logical consistency embedded into its structure" kind of thinking) seem to have basically lost, empirically. Who even knows what Soar[1] is nowadays? Maybe some marriage of the two paradigms will lead to better results, but I doubt that things will head in that direction anytime soon given how massively far just having parallelizable architectures and adding more parameters has gotten us.

[1] https://en.wikipedia.org/wiki/Soar_(cognitive_architecture)


The expectations are high, but it's not so much orthogonal as more basic. Our brains work on add/multiply/activation; this is well known. But the composition of the neural connection strengths in our brain that makes us us is definitely not trained on any sort of final loss. Or at least not completely.


I'm not sure that AI has been successful recently because of its similarities to the human brain. It seems like the project of making human-like AI (in the sense of models that function similarly to the brain) has had a lot less empirical success than the project of trying to minimize loss on a dataset, whatever that takes. Like, look what happened to Hebbian learning, as you mentioned in your other comment. Completely absent from models that are seriously trying to beat SOTA on benchmarks.

Like, it really just seems like LLMs are a really good way of doing statistics rather than the closest model we have of the brain/mind, even if there are some connections we can draw post-hoc between transformers and the human brain.


Genetic algorithms figured out GI the first time, but it took a while.


Could you please expand?


Evolution built our brains.

Though to be fair, actual biological evolution is more complex than simple genetic algorithms. More like evolution strategies with meta-parameter-learning and adaptive rate tuning among other things.


Can you explain exactly what you mean by this? I understand what a compute graph is, but I'm not getting the idea of making it a learnable parameter.


Never mind, after looking at my own question 3-4 times it clicked.


Couldn't find your contact info. Email me?



