
It’s interesting that they decided to move all of the architecture-specific image-to-embedding preprocessing into a separate library.

Similar to how we ended up with the huggingface/tokenizers library for text-only Transformers.


This seems to be a system to generate better prompts to be fed into a base multimodal model.

Interesting, but the title is definitely clickbait.


They only did that for image generation. The more interesting part is that an LLM can approach or find the correct caption for an image, video, or audio at test time, with no training, using only the score as a guide. It's essentially working blind, almost like the game Marco Polo, where the scorer says "warmer" or "colder" while the LLM finds its way toward the goal. This is an example of emergent capabilities, since there are no examples of this in the training data.


Actually, it's the name of the paper. And while the team also developed and released a system to elicit the behavior by doing what you described, it's entirely possible that the researchers considered the title to capture the most important finding in their work.


Exactly! There is definitely something wrong with FAIR.


You could run the build process with chroot or inside Docker, so that the hardcoded paths actually resolve to a designated subdirectory.
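For instance, something like this (a rough sketch only; the image, paths, and build commands are placeholders, not anything from the thread):

    # Bind-mount a scratch directory over the hardcoded prefix, so anything the
    # build writes to /usr/local ends up in ./out/usr/local on the host.
    mkdir -p out/usr/local
    docker run --rm \
      -v "$PWD:/src" -w /src \
      -v "$PWD/out/usr/local:/usr/local" \
      ubuntu:24.04 \
      bash -c "./configure --prefix=/usr/local && make && make install"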


Incidentally, that’s what is usually done in Nixpkgs in similar situations when there’s no better alternative, see buildFHSEnv et al.


In many cases the build output also has hardcoded paths, unfortunately, so doing `brew install` inside a container with the proper volumes is not sufficient to fix the issue. Everything would have to run from within the container as well.


From their Notion page:

> Skywork-OR1-32B-Preview delivers the 671B-parameter Deepseek-R1 performance on math tasks (AIME24 and AIME25) and coding tasks (LiveCodeBench).

Impressive, if true: much better performance than the vanilla distills of R1.

Plus it’s a fully open-source release (including data selection and training code).


“Fill in the gaps by using context” is the hard part.

You can’t pre-bake the context into an LLM because it doesn’t exist yet. It gets created through the endless back-and-forth between programmers, designers, users etc.


But the end result should be a fully-specced design document. That might theoretically be recoverable from a complete program given a sufficiently powerful transformer.


Peter Naur would disagree with you. From "Programming as Theory Building":

A very important consequence of the Theory Building View is that program revival, that is reestablishing the theory of a program merely from the documentation, is strictly impossible. Lest this consequence may seem unreasonable it may be noted that the need for revival of an entirely dead program probably will rarely arise, since it is hardly conceivable that the revival would be assigned to new programmers without at least some knowledge of the theory had by the original team. Even so the Theory Building View suggests strongly that program revival should only be attempted in exceptional situations and with full awareness that it is at best costly, and may lead to a revived theory that differs from the one originally had by the program authors and so may contain discrepancies with the program text.

The definition of theory used in the article:

a person who has or possesses a theory in this sense knows how to do certain things and in addition can support the actual doing with explanations, justifications, and answers to queries, about the activity of concern.

And the main point on how this relates to programming:

- 1 The programmer having the theory of the program can explain how the solution relates to the affairs of the world that it helps to handle. Such an explanation will have to be concerned with the manner in which the affairs of the world, both in their overall characteristics and their details, are, in some sense, mapped into the program text and into any additional documentation.

- 2 The programmer having the theory of the program can explain why each part of the program is what it is, in other words is able to support the actual program text with a justification of some sort. The final basis of the justification is and must always remain the programmer’s direct, intuitive knowledge or estimate.

- 3 The programmer having the theory of the program is able to respond constructively to any demand for a modification of the program so as to support the affairs of the world in a new manner. Designing how a modification is best incorporated into an established program depends on the perception of the similarity of the new demand with the operational facilities already built into the program. The kind of similarity that has to be perceived is one between aspects of the world.


The point is that you’d expect a roughly even distribution of clockwise and counterclockwise spins, not all of them to rotate in the same direction.


Wouldn't it be the case that you would see almost exactly 50/50 if all galaxies had parallel axes and rotated in the same absolute direction?


Why? If you subscribe to the big bang, then all matter got the same "initial kick". Wouldn't it be easier to assume the same spin?


From my understanding, the big bang requires that the proto-universe was in a completely homogeneous state that was then pushed out of that equilibrium for some reason. But that reason doesn't require non-zero angular momentum. It only requires that the proto-universe was homogeneous and now the universe isn't. And that is what separates pre and post big bang. I could be wrong, I am not a cosmologist. Would be happy to hear from one though.


So what caused the "initial kick" to favor one side?


What causes a perfectly symmetric ball on top of a perfectly symmetric hill to roll down via one side? (Probably quantum randomness if everything else is perfectly symmetric)


What caused this universe to favor matter over anti-matter?

So many unanswered questions.


If the base models already have the “reasoning” capability, as they claim, then it’s not surprising that they were able to get to SOTA using a relatively negligible amount of compute for RL fine-tuning.

I love this sort of “anti-hype” research. We need more of it.


The install speed alone makes it worthwhile for me. It went from minutes to seconds.


I was working on a Raspberry Pi at a hackathon, and pip install was eating several minutes at a time.

Tried uv for the first time and it was down to seconds.


Why would you be redoing your venv more than once?


Once rebuilding your venv takes negligible time, it opens up all kinds of new ways to develop. For example, I now always run my tests in a clean environment, just to make sure I haven't added anything that only happens to work in my dev venv.
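With uv this is fast enough to do before every test run. A minimal sketch of that workflow, assuming the project declares its dependencies (and pytest) in a `test` extra in pyproject.toml:

    rm -rf .venv && uv venv            # fresh virtualenv
    uv pip install -e ".[test]"        # install only what pyproject.toml declares
    .venv/bin/pytest                   # run the suite against the clean environment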


That's smart. Oh, you used `pip install` to fix a missing import, but forgot to add it to pyproject.toml? You'll find out quickly.


I thought you wanted to test your program, but it seems you want to make sure that your unzip library keeps working :D


It has nothing to do with redoing the venv: some package installs were just taking multiple minutes.

I cancelled one at 4 minutes before switching to uv and having it finish in a few seconds.


If only Linux distributions had existed for decades…


Not sure how a Linux distribution replaces Python package management, but ok?


The HN submission title is editorialized in a non-helpful way. Why beat a dead horse instead of focusing on what’s actually new in TFA?

The linked paper proposes an obvious-in-retrospect form of data augmentation: shuffle the order of the premises, so that the model can’t rely on spurious patterns. That’s kinda neat.
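As a toy illustration of the idea (not the paper's code; the file names are hypothetical), with one premise per line:

    # Build an augmented copy by shuffling the premise order, keeping the question last.
    shuf premises.txt > premises_shuffled.txt
    cat premises_shuffled.txt question.txt > augmented_example.txt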


Correct, I updated the title to that of the original paper. Thank you for bringing it up.


Would be curious to know how this stacks up against Coconut [1] which also uses latent space for reasoning.

[1] https://arxiv.org/abs/2412.06769


Definitely curious, this looks very similar to Coconut, even down to the CoT encoding process in Figure 2. They go into a lot more detail though, seems like parallel innovation.


I wonder whether even those models which emit thinking tokens in reality do most of the work within the latent space, so that the difference is only superficial.


I'm behind on reading, but don't all models use continuous embeddings to represent reasoning?


I believe the "continuous" in Coconut means that the CoT is in the continuous latent space, instead of being on output tokens (see Fig. 1).

