
At some point it will be computationally cheaper to predict the next pixel than to classically render the scene, when talking about scenes beyond a certain graphical fidelity.

The model can infinitely zoom in to some surface and depict(/predict) what would really be there. Trying to do so via classical rendering introduces many technical challenges



I work in game physics in the AAA industry, and I have studied and experimented with ML on my own. I'm sceptical that that's going to happen.

Imagine you want to build a model that renders a scene with the same style and quality as rasterisation. The fastest way to project a point onto the screen is to apply a matrix multiplication. If the model needs to keep the same level of spatial consistency as the rasteriser, it has to reproject points in space somehow.
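As a rough sketch of how cheap that projection is (the matrix and point below are made-up placeholders, not anything from a real engine):

    import numpy as np

    # A hypothetical model-view-projection matrix (placeholder values).
    mvp = np.array([
        [1.3, 0.0,  0.0,  0.0],
        [0.0, 2.4,  0.0,  0.0],
        [0.0, 0.0, -1.0, -0.2],
        [0.0, 0.0, -1.0,  0.0],
    ])

    point = np.array([1.0, 2.0, -5.0, 1.0])  # homogeneous world-space point

    clip = mvp @ point        # one 4x4 matrix-vector multiply: 16 mults, 12 adds
    ndc = clip[:3] / clip[3]  # perspective divide to normalized device coords
    print(ndc)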

But a model is made of a huge number of matrix multiplications interspersed with non-linear activations. Because of those non-linearities, it can't implement a single matrix multiplication directly with its own internal multiplications; it has to recover the linearity by approximating the transformation with many more operations.
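To make that concrete, here's a toy comparison (the layer widths are arbitrary and the weights untrained - it only illustrates the operation count, it's not a real renderer):

    import numpy as np

    rng = np.random.default_rng(0)

    # A small 2-hidden-layer MLP with the same 4-in/4-out shape as the
    # projection above; the ReLU activations break the linearity.
    W1, b1 = rng.standard_normal((64, 4)),  rng.standard_normal(64)
    W2, b2 = rng.standard_normal((64, 64)), rng.standard_normal(64)
    W3, b3 = rng.standard_normal((4, 64)),  rng.standard_normal(4)

    def mlp(x):
        h = np.maximum(W1 @ x + b1, 0.0)
        h = np.maximum(W2 @ h + b2, 0.0)
        return W3 @ h + b3

    # Multiplies for one point: 4*64 + 64*64 + 64*4 = 4608,
    # versus 16 for the single matrix-vector product.
    print(4*64 + 64*64 + 64*4)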

Now, I know that transformers can exploit superposition when processing a lot of data. I also know neural networks can come up with all sorts of heuristics and approximations based on distance or other criteria. However, I've read multiple papers showing that large models carry a huge number of effectively useless parameters (the last one I read showed that their model could be reduced to just 4% of the original parameters, but the process they used requires re-training the model from scratch many times in a deterministic way, so it's not practical for large models).

This doesn't mean we won't end up using them anyway for real-time rendering. We could accept the trade-off and give up some coherence for more flexibility. Or, given enough computational power, a larger model could be coherent enough for the human eye, with its much higher cost justified by its flexibility. Much like analog systems are far faster than digital ones, yet we use digital ones anyway because they can be reprogrammed.

With frame prediction and upscaling, we have this trade-off already.


I imagine a future where the “high level” stuff in the environment is predefined by a human (with or without assistance from AI), and then AI sort of fills in the blanks on the fly.

So for example, a game designer might tell the AI the floor is made of mud, but won’t tell the AI what it looks like if the player decides to dig a 10 ft hole in the mud, or how difficult it is to dig, or what the mud sounds like when thrown out of the hole, or what a certain NPC might say when thrown down the hole, etc.


> At some point it will be computationally cheaper to predict the next pixel than to classically render the scene,

This is already happening to some extent: some games struggle to reach 60 FPS at 4K resolution with maximum graphics settings using traditional rasterization alone, so technologies like DLSS 3 frame generation are used to improve performance.


Instead of the binary of traditional games vs AI, it's worth thinking more about hybrids.

You could have a stripped down traditional game engine, but without any rendering, that gives a richer set of actions to the neural net. Along with some asset hints, story, a database (player/environment state) the AI can interact with, etc. The engine also provides bounds and constraints.

Basically, we need to work out the new boundary between engine and AI. Right now it's "upsample and interpolate frames", but as AI gets better, what does that boundary become?
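A minimal sketch of what that boundary could look like - the types and fields here are invented purely for illustration, not any existing engine's API:

    from dataclasses import dataclass

    @dataclass
    class WorldState:
        """Authoritative state kept by the stripped-down engine."""
        player_position: tuple[float, float, float]
        inventory: list[str]
        flags: dict[str, bool]          # story/quest state the model must respect

    @dataclass
    class FrameRequest:
        """What the engine hands to the neural renderer each tick."""
        state: WorldState
        asset_hints: list[str]          # e.g. "muddy ground", "oak trees"
        constraints: list[str]          # bounds the model may not violate
        previous_frame: bytes | None    # for temporal consistency

    def render_frame(request: FrameRequest) -> bytes:
        # Placeholder: a real system would run the world model here.
        raise NotImplementedError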


I think that's more to do with poor optimization than with an actual level of graphical fidelity that requires it.


Can you explain why this is the case? I don't understand why.


I'll try! Let's consider a tree blowing in the wind:

To classically render this in any realistic fashion, it quickly gets complex. Between the physics simulation (rather involved) and the number of triangles (trees have many branches and leaves), you're going to be doing a lot of math.
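As a very rough back-of-the-envelope (every number below is invented for illustration):

    # Hypothetical tree: 200k leaf cards, 8 vertices each, transformed every frame.
    leaves = 200_000
    verts_per_leaf = 8
    mults_per_vertex = 16   # one 4x4 matrix-vector transform

    mults_per_frame = leaves * verts_per_leaf * mults_per_vertex
    print(f"{mults_per_frame:,} multiplies per frame just to place the leaves")
    # ...before any wind physics, shading, or shadowing has been done.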

I'll emphasize "realistic" - sure, we can real-time render trees in 2025 that look... OK. However, look at one for more than a second and you will quickly start to see where we have made compromises to the tree's fidelity to ensure it renders at an adequate speed on contemporary hardware.

Now consider a world model trained on enough tree footage that it has gained an "intuition" about how trees look and behave. This world model doesn't need to actually simulate the entire tree to get it to look decent - it can instead directly output the pixels that "make sense". Much like a human brain can "simulate" the movement of an object through space without expending much energy - we do it via prediction based on a lot of training data, not by accurately crunching a bunch of numbers.
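In rough pseudocode, the inference loop of such a world model might look like this (a sketch only - WorldModel and its method are hypothetical stand-ins, not a real API):

    import numpy as np

    class WorldModel:
        """Hypothetical learned model that predicts the next frame from context."""
        def predict_next_frame(self, frames, action):
            # A real model would run a large neural net here; we return noise.
            return np.random.rand(720, 1280, 3)

    model = WorldModel()
    frames = [np.zeros((720, 1280, 3))]   # seed frame

    for step in range(3):
        # No tree geometry, no wind sim: the model just emits pixels that
        # "make sense" given what it has seen so far and the player's input.
        frames.append(model.predict_next_frame(frames, action="look_left"))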

That's just one tree, though - the real world has a lot of fidelity to it. Fidelity that would be extremely expensive to simulate to get a properly realistic output on the other side.

Instead we can use these models, which have an intuition for how things ought to look. They can skip the simulation and just give you an end result that looks passable because it's based on predictions informed by real-world data.


Don't you think a sufficiently advanced model will end up emulating what normal 3D engines already do mathematically? At least for the rendering part, I don't see how you can "compress" the meaning behind light interaction without ending up with a somewhat latent representation of the rendering equation.
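For reference, the rendering equation being "compressed" here is usually written as:

    L_o(x, \omega_o) = L_e(x, \omega_o)
        + \int_{\Omega} f_r(x, \omega_i, \omega_o) \, L_i(x, \omega_i) \, (\omega_i \cdot n) \, d\omega_i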


If I am not mistaken, we are already past that. The pixel, or token, gets probability-predicted in real time. The complete, shaded pixel, if you will, gets computed ‘at once’ instead of through layers of simulation. That’s the LLM’s core mechanism.

If the mechanism allows for predicting what the next pixel will look like, which includes the lighting equation, then there is no need for a light simulation anymore.

I'd also like to know how Genie works. Maybe some parts are indeed already simulated, in a hybrid approach.


The model has multiple layers which are basically a giant non-linear equation to predict the final shaded pixel; I don't see how it's inherently different from a shader outputting a pixel "at once".

Correct me if I'm wrong, but I don't see how you can simulate a PBR pixel without doing ANY PBR computation whatsoever.

For example, one could imagine a very simple program computing sin(x), or a giant multi-layered model that does the same - wouldn't the latter just be a latent, more-or-less compressed version of sin(x)?
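Something like this, I think - a toy sketch where the tiny untrained network below just stands in for the "giant multi-layered model":

    import math
    import numpy as np

    def sin_direct(x):
        return math.sin(x)          # the "very simple program"

    # A toy 1-hidden-layer ReLU network with the same 1-in/1-out shape.
    # Trained well enough, it would only ever be a latent, approximate,
    # compressed re-encoding of sin(x) over some interval.
    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((32, 1)), rng.standard_normal(32)
    W2, b2 = rng.standard_normal((1, 32)), rng.standard_normal(1)

    def sin_mlp(x):
        h = np.maximum(W1 @ np.array([x]) + b1, 0.0)
        return (W2 @ h + b2)[0]

    print(sin_direct(1.0), sin_mlp(1.0))   # untrained, so the values differ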


In this case I assume it would be learned from motion / pixels of actual trees blowing in the wind. Which does serve up the challenge of how dust would blow on a hypothetical gameworld alien planet.



