
This seems like a great example of why people don't like C/C++, and probably a good example of why some people _do_ like it.

How is a non-expert in the language supposed to learn tricks/... things like this? I'm asking as a C++ developer of 6+ years in high performance settings, most of this article is esoteric to me.




They have a computation to do, and they want to do it using a particular instruction. That's the only reason they are fiddling with the compiler. You would have to do so too, if you had decided to solve a problem at hand using a similar method.

The author is talking about a way to get a particular (version of) C/C++ compiler to emit the desired instruction. So I'd call this clang-18.1.0-specific but not C/C++-specific since this has nothing to do with the language.

Also, such solutions are neither portable nor stable, since optimization behavior changes between compiler versions. As far as I can tell, they would also have to implement a compiler-level unit test that ensures the desired machine code is still emitted as toolchain versions change.


Is this really C++ specific though? It seems like the optimisations are happening on a lower level, and so would 'infect' other languages too.

Whatever the language, at some point in performance tweaking you will end up having to look at the assembly produced by your compiler, and discovering all kinds of surprises.


LLVM isn't perfect, but the problem here is that there's a C++ compiler flag (-ffast-math) which says OK, disregard how arithmetic actually works, we're going to promise across the entire codebase that we don't actually care and that's fine.

This is nonsense, but it's really common, distressingly common, for C and C++ programmers to use this sort of inappropriate global modelling. It's something which cannot scale; it works OK for one-man projects: "Oh, I use the Special Goose Mode to make routine A better, so even though normal Elephants can't Teleport I need to remember that in Special Goose Mode the Elephants in routine B might Teleport". In practice you'll screw this up, but it feels like you'll get it right often enough to be valuable.

In a large project where we're doing software engineering this is complete nonsense, now Jenny, the newest member of the team working on routine A, will see that obviously Special Goose Mode is a great idea, and turn it on, whereupon the entirely different team handling routine B find that their fucking Elephants can now Teleport. WTF.

The need to never do this is why I was glad to see Rust stabilize (e.g) u32::unchecked_add fairly recently. This (unsafe obviously) method says no, I don't want checked arithmetic, or wrapping, or saturating, I want you to assume this cannot overflow. I am formally promising that this addition is never going to overflow, in order to squeeze out the last drops of performance.

Notice that's not a global flag. I can write let a = unsafe { b.unchecked_add(c) }; in just one place in a 50MLOC system, and for just that one place the compiler can go absolutely wild optimising for the promise that overflows never happen - and yet right next door, even on the next line, I can write let x = y + z; and that still gets the kid gloves, if it overflows nothing catches on fire. That's how granular this needs to be to be useful, unlike C++ -ffast-math.


You can set fast math (or a subset of it) on a translation unit basis.


Because the language works by textual inclusion, a "translation unit" isn't really just your code, so this is much more likely to result in nasty surprises, up to and including ODR violations.


Yes, if you try hard enough I'm sure you can find ways to screw up.

From a practical point of view it is fine.


That's not really true. If you link a shared library that was compiled with -ffast-math, that will affect the entire program. https://moyix.blogspot.com/2022/09/someones-been-messing-wit...


To enable this "feature" I believe you have to specify fast math while linking the .so, it is not enough to do it when compiling.

It has also been fixed in recent GCC versions.


Or even on a per function basis, at least with gcc (no clue about clang...)


In principle yes, you can use Attribute optimize, but I wouldn't rely on it. Too many bugs open against it.
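For reference, the syntax looks roughly like this (a hypothetical sketch of my own, not anyone's production code; the attribute is GCC-specific, Clang ignores it with a warning, and as noted above it has open bugs):

```cpp
#include <cstddef>

// Hypothetical sketch: ask GCC to apply fast-math only to this one function,
// leaving the rest of the translation unit under normal IEEE semantics.
// GCC-specific; Clang parses but ignores the attribute (with a warning).
__attribute__((optimize("fast-math")))
double dot(const double* a, const double* b, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i] * b[i];  // reassociation/FMA contraction allowed here only
    return s;
}
```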


It's worked for me in the past but maybe I got lucky.


LLVM actually also supports instruction-level granularity for fast-math (using essentially the same mechanism as things like unchecked_add), but Clang doesn't expose that level of control.


clang does have pragma clang fp to enable a subset of fast math flags within a scope
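For instance (a minimal sketch under the assumption you're building with Clang; GCC merely warns about the unknown pragma and compiles the loop with normal semantics):

```cpp
#include <cstddef>

// Minimal sketch: allow floating-point reassociation only inside the inner
// block, instead of enabling -ffast-math globally. Clang-specific pragma;
// other compilers ignore it with an unknown-pragma warning.
float sum(const float* a, std::size_t n) {
    float s = 0.0f;
    {
        #pragma clang fp reassociate(on)  // scoped to this compound statement
        for (std::size_t i = 0; i < n; ++i)
            s += a[i];
    }
    return s;
}
```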


If you're using floating point at all you have declared you don't care about determinism or absolute precision across platforms.

Fast math is simply saying "I care even less than IEEE"

This is perfectly appropriate in many settings, but _especially_ video games where such deterministic results are completely irrelevant.


Actually, floating-point math is mostly deterministic. There is an exception for rounding errors in transcendental functions.

The perception of nondeterminism came specifically from x87, whose native 80-bit floating-point registers differed from every other platform's 64-bit default. Forcing values down to 64 bits all the time cost performance, so compilers silently used wider types when compiling for x87, thereby giving different results. It would be as if the compiler for ARM secretly changed every use of 'float' into 'double'.


The existence of a theoretically pure floating-point behavior doesn't have any impact on the reality of the implementations we program for.


> This is perfectly appropriate in many settings, but _especially_ video games where such deterministic results are completely irrelevant.

I'm not sure I'd agree. Off the top of my head, a potentially significant benefit to deterministic floating-point is allowing you to send updates/inputs instead of world state in multiplayer games, which could be a substantial improvement in network traffic demand. It would also allow for smaller/simpler cross-machine/platform replays, though I don't know how much that feature is desired in comparison.


Indeed, but this isn't hypothetical. A large fraction of multiplayer games operate by sending inputs and requiring the simulation to be deterministic.

(Some of those are single-platform or lack cross-play support, and thus only need consistency between different machines running the same build. That makes compiler optimizations less of an issue. However, some do support cross-play, and thus need consistency between different builds – using different compilers – of the same source code.)


The existence of situations where the optimization is not appropriate, such as multiplayer input replay, does not invalidate the many situations where it is appropriate.


Sure, but I'm not sure anyone is trying to argue that -ffast-math is never appropriate. I was just trying to point out that your original claim that deterministic floating point is "completely irrelevant" to games was too strong.


Everything in this post applies to C too, so it's not C++ specific. And the same gotchas apply for every case when you use inline assembly. I wouldn't call it a trick... Just an interesting workaround for a performance bug in Clang.

The post can be boiled down to "Clang doesn't compile this intrinsic nicely, so just use inline asm directly. But remember that you need to have a non-asm special case to optimize constants too, and you can achieve this with __builtin_constant_p".
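A condensed sketch of that idea (my own illustrative code, not the article's; the asm is x86-specific, with a portable fallback):

```cpp
#include <cmath>

// Sketch of the pattern: compile-time constants keep a plain C++ expression
// the optimizer can fold, while runtime values go through the exact
// instruction we want. Illustrative only.
static inline float rsqrt(float x) {
    if (__builtin_constant_p(x))
        return 1.0f / std::sqrt(x);  // foldable at compile time
#if defined(__x86_64__) || defined(__i386__)
    float r;
    asm("rsqrtss %1, %0" : "=x"(r) : "x"(x));  // ~12-bit reciprocal sqrt estimate
    return r;
#else
    return 1.0f / std::sqrt(x);  // portable fallback for non-x86 targets
#endif
}
```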


I wouldn't call this a performance bug in clang. It's an optimization working as intended.


I would challenge you to find a processor on which rsqrt plus two Newton-Raphson iterations is not slower than a plain sqrt. (We don't know what -mtune the author used.)


According to Intel, any processor before Skylake (section 15.12 from [1]).

[1]: https://cdrdv2.intel.com/v1/dl/getContent/814198?fileName=24...


The author probably didn't use any mtune setting, which is likely the problem. If you look at older cores on Agner's instruction tables, SQRT has been getting steadily faster over time. This implementation is slightly faster on old Intel machines, for example.


A lesson I learned very early on: don't use -ffast-math unless you know what it does. It has a very appealing name that suggests nice things. You probably won't ever need it.


I prefer -funsafemath. Who doesn't want their math to be fun and safe?


I don't know how many people got your joke, but that flag actually says unsafe math. The -f prefix signifies a floating point flag.


I am pretty sure the -f is for feature, because there are flags like -fno-exceptions and -fcoroutines.


It's a feature flag, they're not all necessarily floating point related. As an example though, I still intentionally humorously misread -funroll-loops as funroll loops even though it's f(eature) unroll loops


The f in that case stands for fruit


And even if you do know what it does, it’s very impractical given that it’s a global option. Turning it on for selected portions of code would be a completely different game.


You're mistaken.

You can/would just use it in the translation units where you want it; usually for numerical code where you want certain optimizations or behaviors and know that the tradeoffs are irrelevant.

It's mostly harmless for everyday application math anyway, so enabling it for your whole application isn't a catastrophe, but it's not what people who know what they're doing would usually do. It's usually used for a specific file or perhaps a specific support library.


-ffast-math affects other translation units too, because it introduces a global initialization that affects some CPU flags. You can't really contain it to a single TU.


It seems like this depends on whether you specify -ffast-math when linking or only when compiling: https://stackoverflow.com/a/68938551

My understanding is that if you don't specify -ffast-math when linking then you shouldn't get crtfastmath.o linked in.
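One way to check whether something has been messing with your floating point (a small sketch of my own; the FTZ flush-to-zero bit is bit 15 of the x86 MXCSR register, and on non-x86 targets this sketch just reports false):

```cpp
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#endif

// Sketch: report whether flush-to-zero has been enabled behind your back,
// e.g. by crtfastmath.o getting linked in via -ffast-math. FTZ is bit 15
// of MXCSR on x86; other targets simply report false here.
bool ftz_enabled() {
#if defined(__x86_64__) || defined(__i386__)
    return (_mm_getcsr() & 0x8000) != 0;
#else
    return false;
#endif
}
```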


So this is complicated by the fact that GCC had a bug until last year where it would set the subnormal-truncation flag when it wasn't supposed to.


Julia has this (via a @fastmath macro) and it's so nice!


This is a Clang or even LLVM code generation issue and entirely unrelated to the C and C++ standards (-ffast-math, intrinsics and inline assembly are all compiler-specific features not covered by the standards).

Most other language ecosystems most likely suffer from similar problems if you look under the hood.

At least compilers like Clang also give you the tools to work around such issues, as demonstrated by the article.


> How is a non-expert in the language supposed to learn tricks/... things like this?

just like everything else in life that's complex: slowly and diligently.

i hate to break it to you - C++ is complex not for fun but because it has to both be modern (support lots of modern syntax/sugar/primitives) and compile to target code for an enormous range of architectures, and modern architectures are horribly complex and varied (x86/ARM isn't the only game in town). forgo either of those and it could be a much simpler language, but it also wouldn't be nearly as compelling.


I'm a security student. My main experience has been python and java....but I have started to learn c to better learn how low level stuff works without so much abstraction.

My understanding is that C is a great language, but I also get that it's not for everyone. It's really powerful, and yet you can easily make mistakes.

For me, I'm just learning how to use C; I'm not trying to understand the compiler or makefiles yet. From what I get, the compiler is how you achieve even better performance, but you need to understand how it does its black magic... otherwise you might just make your code slower or more inefficient.


First-order optimization is always overall program architecture, then hotspots, then fixing architectural issues in the code (e.g. getting rid of misspeculated branches, reducing instruction counts, etc.), and only then optimizing the code the compiler is generating. And at no point does it require knowing the internals of the optimization passes or how they generate the code.

As for the compiler’s role in C, it’s equivalent to javac - it’s taking your source and creating machine code, except the machine code isn’t an abstract bytecode but the exact machine instructions intended to run on the CPU.

The issues with C and C++ are around memory safety. Practice has repeatedly shown that the defect rate with these languages is high enough that it results in lots of easily exploitable vulnerabilities. That’s a bit more serious than a personal preference. That’s why there’s pushes to shift the professional industry itself to stop using C and C++ in favor of Rust or even Go.


Nah, it is not that bad.

Sure, you can mess up your performance by picking bad compiler options, but most of the time you are fine with just the default optimizations enabled, letting the compiler do its thing. No need to understand the black magic behind it.

This is only really necessary if you want to squeeze the last bit of performance out of a piece of code. And honestly, how often does this occur in day-to-day coding unless you write a video or audio codec?


The main flags to look at:

* mtune/march - specifying a value of native optimizes for the current machine, x86-64-v1/v2/v3/v4 for generations or you can specify a specific CPU (ARM has different naming conventions). Recommendation: use the generation if distributing binaries, native if building and running locally unless you can get much much more specific

* -O2 / -O3 - turn on most optimizations for speed. Alternatively Os/Oz for smaller binaries (sometimes faster, particularly on ARM)

* -flto=thin - get most of the benefits of LTO with minimal compile time overhead

* pgo - if you have a representative workload you can use this to replace compiler heuristics with real world measurements. AutoFDO is the next evolution of this to make it easier to connect data from production environments to compile time.

* math: -fno-math-errno and -fno-trapping-math are “safe” subsets of ffast-math (i.e. don’t alter the numerical accuracy). -fno-signed-zeros can also probably be considered if valuable.


Also, I learned recently that there's `-Og`, which enables optimizations suitable for a debug build.


In practice I’ve had limited success with that flag. It still seems to enable optimizations that make debugging difficult.


Agreed. I like to compile most translation units with -O3 and then only compile the translation units that I want to debug with -O0. That way I can often end up with a binary that's reasonably fast but still debuggable for the parts that I care about.


Yup that’s what I’ve resorted to (in Rust I do it at the crate level instead of translation unit). The only downside is forgetting about it and then wondering why stepping through is failing to work properly.


Indeed, C++ is different from most languages (other than C) because "knowing C++" does not mean just knowing the syntax and standard library API; it implies understanding how source code is turned into bytes inside an executable image. A Python or Java (or whatever) programmer could write books and be praised as a language expert without the slightest idea of how memory is allocated; a C++ programmer who doesn't know that is probably going to end up in some legacy shop supporting an ancient MFC/Qt data-entry app.


In my opinion, `__builtin_constant_p()` is not that obscure of a feature. In C, it is used in macros to imitate constant functions, and in C++ it is useful for determining the current alternative that has lifetime in a `union` within a constant function. Granted that `__builtin_is_constant_evaluated()` has obsoleted its primary purpose, but there are enough ways it's still useful that I see it from time to time.
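The classic macro shape looks roughly like this (a hypothetical example of the idiom, loosely modeled on patterns like the Linux kernel's ilog2, not taken from any particular codebase):

```cpp
// Hypothetical illustration of the idiom: route compile-time constants to
// an expression the compiler can fold, and everything else to a runtime
// helper. Both paths compute floor(log2(x)) for nonzero x.
static inline unsigned ilog2_runtime(unsigned x) {
    unsigned r = 0;
    while (x >>= 1) ++r;
    return r;
}

#define ILOG2(x) \
    (__builtin_constant_p(x) ? (31u - (unsigned)__builtin_clz(x)) \
                             : ilog2_runtime(x))
```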


> How is a non-expert in the language supposed to learn tricks/... things like this?

By learning C and inline asm. For a C developer, this is nothing out of the ordinary. C++ focuses too much on new abstractions and hiding everything in the stdlib++, where the actual implementations of course use all of this and look more like C, which includes using (OMG!) raw pointers.


arguably the vast-vast-vaaast majority of projects, problems, solutions, and developers simply don't need this.

but yes, given the premise of the article is that a friend wants to use a specific CPU instruction, yeah, at least minimum knowledge of one's stack is required (and usually the path leads through Assembly, some C/Rust and an FFI interface - like JNI for Java, cffi/Cython for Python, and so on)


Not only that but this is so far away from the actual problem being solved that it’s just kind of annoying.

I wish there was some sensible way for code that’s purely about optimization to live entirely separated from the code that’s about solving the problem at hand…


I mean, you don’t have to care about this unless you have an application where you do. And if you do there is enough transparency (ie ability to inspect the assembly and ask questions) that you can solve this one issue without knowing everything under the sun.

If you had an application where this sort of thing made a difference in JavaScript, the problem would likely still be there; you'd just have a lot less visibility into it.

I guess you’re still right - at the end of the day you see discussions like this far more often in C, so it impacts the feel of programming in C more.


Take a computer architecture course for starters.


Got any you’d recommend?


for an appetizer I recommend Cliff Click's "A Crash Course in Modern Hardware" talk

https://www.youtube.com/watch?v=5ZOuCuGrw48 (and here's the 2009 version https://www.infoq.com/presentations/click-crash-course-moder... .. might be interesting for comparison )


Thanks!


If you understand that C/C++'s purpose at first was to write an OS... you are somewhat aware of this... but that would depend upon your CS classroom exposure...

In my case it was by accident as I picked up assembly and machine language before I touched C in the late 1980s.



