NVFP4 is the thing no one saw coming. I wasn't watching the MX process really, so I cast no judgements, but it's exactly what it sounds like: a serious compromise in resource-constrained settings. And it's in the silicon pipeline.
NVFP4 is, to put it mildly, a masterpiece: the UTF-8 of its domain, and in strikingly similar ways it is (1) general, (2) robust to gross misuse, and (3) not optional if success and cost both matter.
It's not a gap that can be closed by a process node or an architecture tweak: it's an order of magnitude where the polynomials that were killing you on the way up are now working for you.
sm_120 (what NVIDIA's quiet repos call CTA1) consumer gear does softmax attention and projection/MLP blockscaled GEMM at a bit over a petaflop at 300W and close to two (dense) at 600W.
This changes the whole game, and it's not clear anyone outside the lab even knows the new equilibrium points: it's nothing like Flash3 on Hopper, a lot of stuff looks FLOPs-bound, and GDDR7 looks like a better deal than HBM3e. The DGX Spark is in no way deficient; it has ample memory bandwidth.
This has been in the pipe for something like five years, and even if everyone else had started at the beginning of the year, when this was knowable, it would still be 12-18 months until tape-out. And they haven't started.
Years Until Anyone Can Compete With NVIDIA is back up to the 2-5 it was 2-5 years ago.
This was supposed to be the year ROCm and the new Intel stuff became viable.
This reads like a badly done, sponsored hype video on YouTube.
So if we look at what NVIDIA has to say about NVFP4, it sure sounds impressive [1]. But look closely: that initial graph never compares FP8 and FP4 on the same hardware. They jump from H100 to B200 while implying a 5x gain from going to FP4, which it isn't. This is accompanied by scary words like: if you use MXFP4, there's a "Risk of noticeable accuracy drop compared to FP8".
Contrast that with what AMD has to say about the open MXFP4 approach, which is quite similar to NVFP4 [2]. Oh, the horror of getting 79.6 instead of 79.9 on GPQA Diamond when using MXFP4 instead of FP8.
Looking into NVFP4/NVIDIA vs MXFP4/AMD, my takeaway was that they seem to be pretty close once you include the MI355X, which leads in VRAM and throughput but trails slightly in accuracy, and mixing in MXFP6 makes up for that.
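To make the comparison concrete, here's a toy sketch of how I understand the two block-scaling schemes from the public format descriptions (both use 4-bit E2M1 elements; NVFP4 scales 16-element blocks with an FP8 E4M3 factor plus a per-tensor FP32 scale I'm omitting, MXFP4 scales 32-element blocks with a power-of-two E8M0 factor). This is illustrative pseudo-quantization, not any vendor's kernel code:

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdio>

// Representable magnitudes of an FP4 E2M1 element (shared by NVFP4 and MXFP4).
constexpr std::array<float, 8> kE2M1 = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};

// Round one value to the nearest E2M1 code after dividing out the block scale.
float quantize_e2m1(float x, float scale) {
    float t = std::fabs(x) / scale, best = 0.f;
    for (float q : kE2M1)
        if (std::fabs(t - q) < std::fabs(t - best)) best = q;
    return std::copysign(best * scale, x);
}

int main() {
    // A toy 16-element block (NVFP4's block size; MXFP4 blocks are 32 wide).
    std::array<float, 16> block = {0.02f, -0.7f, 1.3f,  5.9f, -0.1f, 0.4f, -2.2f, 0.9f,
                                   0.05f, -1.6f, 3.1f, -0.3f,  0.8f, -4.4f, 0.2f, 1.1f};
    float amax = 0.f;
    for (float x : block) amax = std::max(amax, std::fabs(x));

    // NVFP4-style scale: the block amax maps onto the largest E2M1 code (6.0);
    // the scale itself is then stored in FP8 E4M3, so it is NOT forced to a
    // power of two (E4M3 rounding of the scale is omitted here).
    float nv_scale = amax / 6.f;

    // MXFP4-style scale: constrained to a power of two (E8M0). The OCP spec's
    // exact rounding rule differs slightly from this no-clipping ceil.
    float mx_scale = std::exp2(std::ceil(std::log2(amax / 6.f)));

    for (float x : block)
        std::printf("%+5.2f -> nvfp4 %+5.2f | mxfp4 %+5.2f\n",
                    x, quantize_e2m1(x, nv_scale), quantize_e2m1(x, mx_scale));
}
```

The finer 16-wide blocks plus the non-power-of-two scale are, as far as I can tell, most of the reason NVFP4 gives up a bit less accuracy than MXFP4 on outlier-heavy blocks.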
I'm a bit later in my career and I've been involved with modern machine learning for a long time which probably affects my views on this, but I can definitely relate to aspects of it.
I think there are a couple of good signals in what you've said, but also some stuff (at least by implication/phrasing) that I would be mindful of.
The reason why I think your head is fundamentally in a good place is that you seem to be shooting for an outcome where already high effort stays high, and with the assistance of the tools your ambition can increase. That's very much my aspiration with it, and I think that's been the play for motivated hackers forever: become as capable as possible as quickly as possible by using every effort and resource. Certainly in my lifetime I've seen things like widely distributed source code in the 90s, Google a little later, StackOverflow indexed by Google, the mega-grep when I did the FAANG thing, and now the language models. They're all related (and I think less impressive/concerning to people who remember pre-SEO Google, that was up there with any LLM on "magic box with reasonable code").
But we all have to self-police on this because with any source of code we don't understand, the abstraction almost always leaks, and it's a slippery slope: you get a little tired or busy or lazy, it slips a bit, next thing you know the diff or project or system is jeopardized, and you're throwing long shots that compound.
I'm sure the reviewers can make their own call about whether you're making a sincere effort or have slipped into the low-integrity zone (LLVM people are serious people). Just be mindful that if you want the most out of it, and to be welcome on projects and teams generally, you have to keep the gap between ability and scope in a band: pushing hard enough to need the tools and reviewers generous with their time is good, it's how you improve, but go too far and everyone loses, because you stop learning and they could have prompted the bot themselves.
There's nontrivial historical precedent for this exact playbook: when a new paradigm (Lisp machines and GOFAI search, GPU backprop, softmax self-attention) is scaling fast, a lot of promises get made, a lot of national security money gets involved, and AI Summer is just balmy.
But the next paradigm breakthrough is hard to forecast, and the current paradigm's asymptote is just as hard to predict, so it's +EV to say "tomorrow" and "forever".
When the second becomes clear before the first, you Turk and expert-label like it's 1988 and pray the next paradigm breakthrough comes soon: you bridge the gap with expert labeling and compute until it works, or you run out of money and the DoD guy stops taking your calls. AI Winter is cold.
And just like in Game of Thrones, no one, and I mean no one (not Altman, not Amodei, not Allah Most Blessed) knows when the seasons in A Song of Math and Grift will change.
Now imagine someone combining Jia Tan patience with the swiss-cheese security of all our editor plugins, nifty shell userland stuff, and all that.
Developer stuff is arguably the least scrutinized thing that routinely runs as mega root.
I wish I could say that I audit every elisp, neovim, vscode plugin and every nifty modern replacement for some creaky GNU userland tool. But bat, zoxide, fzf, atuin, starship, viddy, and about 100 more? Nah, I get them from nixpkgs in the best case, and I've piped things to sh.
Write a better VSCode plugin for some terminal panel LLM gizmo, wait a year or two?
I'd like to "reclaim" both AI and machine learning as relatively emotionally neutral terms of art for useful software we have today or see a clearly articulated path towards.
Trying to get the most out of tools that sit somewhere between "the killer robots will eradicate humanity", "there goes my entire career", "fuck that guy with the skill I don't want to develop, let's take his career", and "I'm going to be so fucking rich if we can keep the wheels on this" is exhausting.
I don't think that's achievable with all the science fiction surrounding "AI" specifically. You wouldn't be "reclaiming" the term, you'd be conquering an established cultural principality of emotionally-resonant science fiction.
Which is, of course, the precise reason why stakeholders are so insistent on using "AI" and "LLM" interchangeably.
Personally I think the only reasonable way to get us out of that psycho-linguistic space is to just say "LLMs" and "LLM agents" when that's what we mean (am I leaving out some constellation of SotA technology? no, right?)
I personally regard posterior/score-gradient/flow-matching style models as the most interesting thing going on right now, ranging from rich media diffusers (the extended `SDXL` family tree, which is now MMDiT and other heavy transformer stuff rapidly absorbing all of 2024's `LLM` tune-ups) all the way through to protein discovery and other medical applications (tomography, it's a huge world).
LLMs are very useful, but they're into the asymptote of expert labeling and other data-bounded stuff (God knows why the GB200-style Blackwell build-out is looking like a trillion bucks when Hopper is idle all over the world and we don't have a second Internet to pretrain a bigger RoPE/RMSNorm/GQA/MLA mixture GPT than the ones we already have).
The fast interconnect between nodes has applications in inference at scale (big KV caches and other semi-durable state, multi-node tensor parallelism on mega models).
But this article in particular is emphasizing extreme performance ambitions for columnar data processing with hardware acceleration. Relevant to many ML training scenarios, but also other kinds of massive MapReduce-style (or at least scale) workloads. There are lots of applications of "magic massive petabyte plus DataFrame" (which is not I think solved in the general case).
I dramatically prefer modern C++ to either Python or Rust in domains where it's a toss-up. It's really nice these days.
Like any language that lasts (including Python and Rust) you subset it over time: you end up with linters and sanitizers and static analyzers and LSP servers, you have a build. But setting up a build is a one-time cost and maintaining a build is a fact of life, even JavaScript is often/usually the output of a build.
And with the build done right? Maybe you don't want C++ if you're both moving fast and doing safety- or security-critical stuff (as in, browser, sshd, avionics critical), but you shouldn't be moving fast on avionics software to begin with.
And for stuff outside of that "even one buffer overflow is too many" Venn?
C++ doesn't segfault more than either of those after it's cleared clang-tidy and ASAN. Python linking shoddy native stuff crashes way more.
I agree that python is often painful to get right. But asan really doesn't help you unless you have tests that cover the path with the mistake it would catch. Safe rust doesn't segfault, ever. Unsafe rust is still easier to get right for me than c, but mostly because c targets weird architectures with weird rules. Rust is also easier for me to develop fast in than either python or c++. For c++ that is because of fragmented package management and build systems, and syntax that lets me express things I don't want to, like a race condition. Python slows me down because the duck typing can't help me reason about my design like static, explicitly converted types can.
> That is why cve-rs uses #![deny(unsafe_code)] in the entire codebase. There is not a single block of unsafe code (except for some tests) in this project.
> cve-rs implements the following bugs in safe Rust:
Rust’s borrow checker has been mathematically verified to be always sound.[0] These are edge case implementation bugs that user code is unlikely to touch. I have never had a soundness hole/segfault from safe Rust, and I am not aware of any code in the wild which stumbled into one of these soundness holes either. The main bug is [1], which is very contrived lifetime shenanigans that wouldn't appear in real code. Some bugs [2] are miscompilations coming from the LLVM side. If implemented perfectly, all safe Rust code would be sound, but of course all programs have bugs. The important part is that 99.9% of safe Rust code is sound.
I don't feel I was bombastic, simply ignorant. I knew there had been unsoundness bugs in the past, but I thought they had all been fixed at this point. I see why it is taking them so long to fix this one, but yes, it is not 100%. I will say instead that I have had no segfaults from safe rust in the decade I've been using it, and that even after 35 years in c++ I still average a couple a month. Probably worse now because I'm rusty when I go back to use it for something. It could be argued I'm really bad at c++, and maybe that is true, but then maybe I shouldn't be using it and should use rust instead; it allows me to get the same job done.
If you don't have good tests with coverage information then your Python and Rust code is buggy too. If you want "as low of defects as I can get from the compiler alone" then your options are things like Haskell and theory-laden TypeScript.
If you have low-defect appropriate test coverage, ASAN approaches linear typing for the class of bugs that the borrow checker catches.
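To make that class of bug concrete, here's a toy sketch (names and values made up): a pointer into a vector's buffer that only dangles on the branch that grows the vector. The borrow checker rejects this shape at compile time; ASAN reports it at runtime, but only if some test actually takes the reallocating branch, which is exactly the coverage caveat above.

```cpp
#include <cstdio>
#include <vector>

// A pointer into the vector's storage dangles only when the growth branch runs,
// because growing past capacity reallocates and frees the old buffer.
int read_first(std::vector<int>& v, bool also_grow) {
    int* first = &v[0];
    if (also_grow) {
        v.resize(v.capacity() + 1);  // forces reallocation: old buffer is freed
    }
    return *first;  // heap-use-after-free, but only on the also_grow path
}

int main() {
    std::vector<int> v{1, 2, 3};
    std::printf("%d\n", read_first(v, false));  // line coverage alone: ASAN stays silent
    std::printf("%d\n", read_first(v, true));   // branch covered: ASAN reports here
    // Build sketch (assumed toolchain): clang++ -g -fsanitize=address this_file.cpp
}
```

In Rust the equivalent borrow wouldn't compile; in C++ the cost is writing the test that actually takes the branch.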
Python and Rust have their sweet spots just like C++ does; I use all three. The meme that either is a blanket replacement for all of C++'s assigned duties is both silly by observation and not demonstrated by any enthusiastic adoption over C/C++ in the highest-demand regimes: AI inference, HFT, extreme CERN-scale physics, your web browser.
Python is better for fast prototyping and other iteration-critical work; Rust is unmatched for "zero-defect target with close to zero-cost abstractions" (it was designed for writing a web browser, and in that wildly demanding domain I can think of nothing better).
C++ excels in extreme machine economics scenarios where you want a bit more flexibility in your memory management and a bit more license to overrule the compiler and a trivial FFI story. Zig to me looks likely to succeed it in this role someday.
All have their place, software engineers in these domains should know all of them well, and any "one true language" crap is zealotry or cynical marketing or both.
A coworker found a stack use-after-free in production that no amount of linters or tidiers caught. It only actually happened in the release build, and asan caught it, but only when we pushed to 100% branch coverage instead of just line coverage. Rust makes coverage much easier to get with fewer tests due to its type system.

It seems really clear to me that rust will be the c++ successor. If you need that extra flexibility, you use unsafe. That still feels less footgun-ridden than c++. The slow adoption of rust is because c++ interoperability is bad and people are stuck maintaining c++ behemoths. Greenfield rust adoption in c++ domains is much higher.

Zig is more of a c replacement, but as much as I like it, I don't think that will happen. C is like a standardized IR. But when Sonos wanted to make an ARM-CPU-only ONNX inference engine, for instance, they made it in rust (tract). When c++ developers get to choose rust, we often do. I can throw earlier-career c++ devs with modern c++ experience directly into a rust codebase and have them be immediately productive, but also not worry about them doing horrible things.
Yeah, you and I are both entitled to our guesses about Rust and Zig respectively in the future. Both have serious production projects (bun and TigerBeetle are best-in-class to head off any "hobby project" stuff), neither is making serious inroads to the domains I'm talking about: it doesn't get any hotter than the new CUTLASS stuff, and that's greenfield modern C++ with a standardization regime (mdarray) with 100% industry voting as a bloc for "it's still modern C++ at the frontier". HFT shops have at least some Rust at least auditioning, but the reqs are still C++, and they rewrite anything that alpha decays out.
When it's for all the marbles today? It's C++. The future is an open question, but a lot of us are pushing hard for C++.
Ryzen AI Max+ 395 with 64 GB of LPDDR5 is $1500 new in a ton of form factors, and $2k with 128 GB. If I have $1500 for a unified-memory inference machine, I'm probably not getting a Mac. It's not a bad choice per se, llama.cpp supports that hardware extremely well, but a modern Ryzen APU at the same price is more of what I want for that use case; with the M1 Mac you're paying for a Retina display and a bunch of stuff unrelated to inference.
Not just LPDDR5, but LPDDR5X-8000 on a 256-bit bus. The 40 CUs of RDNA 3.5 are nice, but that's less raw compute than e.g. a desktop 4060 Ti dGPU. The memory is fast, 200+ GB/s real-world read and write (the AIDA64 thread about limited read speeds is misleading: that's just what the CPU side sees given how the memory controller is configured; GPU tooling reveals the full 200+ GB/s read and write). Though you can only allocate 96 GB to the iGPU on Windows or 110 GB on Linux.
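For anyone sanity-checking those numbers, the back-of-the-envelope peak follows straight from the bus width and transfer rate (a rough calc, not a spec sheet):

```cpp
#include <cstdio>

int main() {
    // LPDDR5X-8000 moves 8000 million transfers per second per pin,
    // and a 256-bit bus carries 32 bytes per transfer.
    const double transfers_per_s = 8000e6;
    const double bytes_per_transfer = 256.0 / 8.0;
    const double peak_gb_s = transfers_per_s * bytes_per_transfer / 1e9;
    std::printf("theoretical peak: %.0f GB/s\n", peak_gb_s);  // 256 GB/s
    // So ~200+ GB/s measured is ordinary ~80% efficiency, not a crippled config.
}
```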
The ROCm and Vulkan stacks are okay, but they're definitely not fully optimized yet.
Strix Halo's biggest weakness compared to Mac setups is memory bandwidth. M4 Max gets something like 500+ GB/s, and M3 Ultra gets something like 800 GB/s, if memory serves correctly.
I just ordered a 128 GB Strix Halo system, and while I'm thrilled about it, in fairness, for people who don't have an adamant insistence against proprietary kernels, refurbished Apple silicon does offer a compelling alternative with superior performance options. AFAIK there's nothing like AppleCare for any of the Strix Halo systems either.
The 128 GB Strix Halo system was tempting me, but I think I'm going to hold out for the Medusa Point memory bandwidth gains to expand my cluster setup.
I have a Mac Mini M4 Pro 64GB that does quite well with inference on the Qwen3 models, but is hell on networking with my home K3s cluster, which going deeper on is half the fun of this stuff for me.
>The 128 GB Strix Halo system was tempting me, but I think I'm going to hold out for the Medusa Point
I was initially thinking this way too, but I realized a 128 GB Strix Halo system would make an excellent addition to my homelab / LAN even once it's no longer the star of the stable for LLM inference - i.e. I will probably get a Medusa Halo system as well once they're available. My other devices are Zen 2 (3600X) / Zen 3 (5950X) / Zen 4 (8840U), an Alder Lake N100 NUC, a Twin Lake N150 NUC, along with a few Pis and Rockchip SBCs, so a Zen 5 system makes a nice addition to the high end of my lineup anyway.

Not to mention, everything else I have maxes out at 2.5GbE. I've been looking for an excuse to upgrade my switch from 2.5GbE to 5 or 10 GbE, and the Strix Halo system I ordered was the BeeLink GTR9 Pro with dual 10GbE. Regardless of whether it's doing LLM or other gen-AI inference, some extremely light ML training / light fine-tuning, media transcoding, or just being yet another UPS-protected server on my LAN, there's just so much capability offered at this price and TDP point compared to everything else I have.
Apple Silicon would've been a serious competitor for me on the price/performance front, but I'm right up there with RMS in terms of ideological hostility towards proprietary kernels. I'm not totally perfect (privacy and security are a journey, not a destination), but I am at the point where I refuse to use anything running an NT or Darwin kernel.
That is sweet! The extent of my cluster is a few Pis that talk to the Mac Mini over the LAN for inference stuff, which I could definitely use some headroom on. I tried to integrate it into the cluster directly by running k3s in colima - but to join an existing cluster via IP, I had to run colima in host networking mode - so any pods on the mini that were trying to do CoreDNS networking were hitting collisions with mDNSResponder when dialing port 53 for DNS. Finally decided that the Macs are nice machines but not a good fit as a member of a cluster.
Love that AMD seems to be closing the gap on the performance _and_ power efficiency of Apple Silicon with the latest Ryzen advancements. Seems like one of these new mini PCs would be a dream setup to run a bunch of data- and AI-centric hobby projects on - particularly workloads like geospatial imagery processing in addition to the LLM stuff. It's a fun time to be a tinkerer!
It’s not better than the Macs yet. There’s no half-assing this AI stuff; AMD is behind even the 4-year-old MacBooks.
NVIDIA is so greedy that doling out $500 will only get you 16 GB of VRAM at half the speed of an M1 Max. You can get a lot more speed with more expensive NVIDIA GPUs, but you won’t get anything close to a decent amount of VRAM for less than 700-1500 dollars (well, truly, you will not even get close to 32 GB).
Makes me wonder just how much secret effort is being put in by MAG7 to strip NVIDIA of this pricing power, because they are absolutely price gouging.
> This was supposed to be the year ROCm and the new Intel stuff became viable.
They had a plan.