Core_bench: better micro-benchmarks through linear regression (2014)

jasim · on Oct 20, 2018

Couple of things I found interesting:

> 2: Time.now is an expensive function that takes a while to execute and typically requires control transfer to a VDSO. (On my older laptop this takes 800+ nanos to run. On more expensive server class machines, I have seen numbers as low as 40 nanos.)

Then there is system noise, for which you still need a large sample size; the result has high variation, and we still haven't accounted for GC.

So they do this:

> This brings us to how Core_bench works: Core_bench runs f in increasing batch sizes and reports the estimated time of f () by doing a linear regression. In the simplest case, the linear regression uses execution time as the predicted variable and batch size as the predictor. Several useful properties follow as a consequence...
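
A minimal sketch of the idea in Python, not Core_bench's actual implementation (which is written in OCaml): time `f` in batches of increasing size, then fit an ordinary least-squares line of total time against batch size. The slope estimates the per-call cost, and the intercept absorbs constant overhead such as reading the clock. The names `fit_line` and `estimate_per_call_ns` and the default batch sizes are hypothetical, chosen only for illustration.

```python
import time


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_per_call_ns(f, batch_sizes=(1, 2, 4, 8, 16, 32, 64, 128, 256, 512)):
    """Time f in increasing batches; return (ns per call, fixed overhead in ns)."""
    times = []
    for n in batch_sizes:
        start = time.perf_counter_ns()
        for _ in range(n):
            f()
        times.append(time.perf_counter_ns() - start)
    # Predictor: batch size. Predicted variable: total elapsed time.
    return fit_line(list(batch_sizes), times)


if __name__ == "__main__":
    slope, intercept = estimate_per_call_ns(lambda: sum(range(100)))
    print(f"~{slope:.1f} ns per call, ~{intercept:.1f} ns fixed overhead")
```

Because the clock is read only twice per batch, its cost shows up in the intercept rather than in the per-call estimate. This sketch does nothing about GC or system noise.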

While we're on the topic of benchmarking, what is the best way to do micro-benchmarks for JavaScript code (both Node and browser)? I tried node --prof, but the report is a little difficult to parse. Chrome's performance tab would've been helpful, but it doesn't seem to profile WebWorkers.

vardump · on Oct 20, 2018

> On more expensive server class machines, I have seen numbers as low as 40 nanos.

That's pretty unusual. Usually larger NUMA systems take longer to retrieve time than single socket consumer ones. They need to synchronize between multiple sockets.

dragontamer · on Oct 20, 2018

Unnecessary in the general case.

RDTSC is the assembly instruction needed to read the current timestamp for the current core (not necessarily synchronized to the whole system). Modern RDTSCs do NOT read the cycle-count, because frequencies change due to "Turbo Boosts" and other such optimizations. (See here for more details: https://randomascii.wordpress.com/2011/07/29/rdtsc-in-the-ag...)

A modern RDTSC reads at the "base clock". If your base clock is 3.6 GHz, the RDTSC ticks at 3.6GHz (even if your processor can turbo to 4GHz or 5GHz like the i9-9900k). So RDTSC is not the same from processor-to-processor, but it is consistently the most granular micro-benchmarking clock available on x86 systems.

The main issue with RDTSC is that task-switches may cause your thread of execution to change cores or mess up your timing, especially in long runs. So Windows / Linux have higher-level performance counters that take task switches into account, but have lower granularity. For Windows, this is "QueryPerformanceCounter", which ticks on my system at roughly 3MHz. Still useful for microbenchmarks, and the guarantee for cross-thread behavior is useful more often than not.

Microsoft documents "QueryPerformanceCounter" as using rdtsc in most cases. https://docs.microsoft.com/en-us/windows/desktop/dxtecharts/...

The rdtsc instruction seems to take ~40 clocks or so according to Agner Fog's instruction tables, which suggests you can read a high-speed clock in as little as 10ns on a 4GHz x86 computer if you use the raw assembly instruction.
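
The arithmetic behind those figures, as a small Python sketch. The function names are hypothetical; the inputs (about 40 cycles, 4 GHz, and a counter ticking at roughly 3 MHz) are the numbers quoted in this comment, not measurements.

```python
def cycles_to_ns(cycles, clock_hz):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1e9 / clock_hz


def tick_period_ns(counter_hz):
    """Duration of a single counter tick, in nanoseconds."""
    return 1e9 / counter_hz


if __name__ == "__main__":
    # ~40 cycles for rdtsc on a 4 GHz core.
    print(f"rdtsc latency: {cycles_to_ns(40, 4e9):.1f} ns")
    # A counter ticking at ~3 MHz (the QueryPerformanceCounter rate above).
    print(f"3 MHz tick period: {tick_period_ns(3e6):.1f} ns")
```

So the raw instruction costs about 10 ns, while a 3 MHz counter cannot resolve anything shorter than roughly 333 ns.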

Add a few nanoseconds for a function call, setting up the stack, and some math to "normalize" the rdtsc clock... and the CPUID to clear out pipelines... and 50ns total seems reasonable.

There are some slower clocks available from the motherboard or I/O system. But those should be avoided unless you are running a 2008-era x86 processor (a.k.a. Nehalem/Westmere, the only Intel processors with "Turbo" frequencies that changed RDTSC timing). Older systems don't have turbo, newer systems lock RDTSC to the base clock.

------------

Anyway, I'm no JavaScript guru. But surely there's a way to pass the RDTSC instruction "up" from the assembly level to JavaScript code? Or alternatively, maybe an OS-level timer function that's built on top of RDTSC. (QueryPerformanceCounter in Windows, or CLOCK_MONOTONIC_RAW in Linux)
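
For illustration, here is how a high-level language can reach an OS-level clock, sketched in Python rather than JavaScript. It reads `CLOCK_MONOTONIC_RAW` through the standard `time` module where the platform exposes it and falls back to Python's portable monotonic clock otherwise. The helper names `read_monotonic_ns` and `time_call_ns` are hypothetical, and whether the underlying clock is backed by RDTSC depends on the OS and hardware.

```python
import time


def read_monotonic_ns():
    """Read a monotonic clock in nanoseconds.

    Prefers CLOCK_MONOTONIC_RAW where the platform exposes it (e.g. Linux),
    otherwise falls back to Python's portable monotonic clock.
    """
    raw = getattr(time, "CLOCK_MONOTONIC_RAW", None)
    if raw is not None and hasattr(time, "clock_gettime_ns"):
        return time.clock_gettime_ns(raw)
    return time.monotonic_ns()


def time_call_ns(f):
    """Elapsed nanoseconds for a single call to f, including clock overhead."""
    start = read_monotonic_ns()
    f()
    return read_monotonic_ns() - start


if __name__ == "__main__":
    print(time_call_ns(lambda: sum(range(1000))), "ns")
```

A single reading like this includes the cost of the two clock reads, so for very short functions it should be combined with batching.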

vardump · on Oct 21, 2018

> Microsoft documents "QueryPerformanceCounter" as using rdtsc in most cases. https://docs.microsoft.com/en-us/windows/desktop/dxtecharts/....

And the case where it doesn't is when it's running on more than two CPU sockets. Then it'll fall back on other timers that can take even microseconds to query.

DavidBuchanan · on Oct 20, 2018

> The main issue with RDTSC is that task-switches may cause your thread of execution to change cores or mess up your timing

I got around this by running my benchmarks in a kernel module, with interrupts disabled. Obviously this is only possible under certain circumstances.

I also disabled caching via the CR0 register for maximum repeatability, although of course that isn't at all reflective of "real world" performance, so it depends on what you're actually trying to measure.

twtw · on Oct 20, 2018

> The main issue with RDTSC is that task-switches may cause your thread of execution to change cores or mess up your timing

Task switches messing with timing, yes. Changing cores, on the other hand, is not so much of a problem by itself anymore. I'm pretty sure recent Intel CPUs guarantee that rdtsc is synchronized across cores.

jasim · on Oct 20, 2018

Thanks for the low-level perspective and the references! I used the wrong term though - I was looking for plain old statistical profiling at the function level for running applications. This is available for Node, but the tooling doesn't seem to be very friendly.