
Seems much more restricted than perf or DTrace, or all the other tracing tools. I don't understand the point of this at all. NIH?

I'd love to hear brendangregg's or bcantrill's thoughts on this. (Does HN give users some kind of alert when they are mentioned?)



(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google)

I recommend reading page 2 of the paper, which discusses the specific set of features that XRay offers. How do I get perf or DTrace to give me the six things listed there? I can only think of ways to get a couple of them.


> The cost is acceptable when tracing and barely measurable when not tracing.

"Acceptable" is obviously relative, but with DTrace's pid provider, the cost is zero when not tracing, and about the cost of a fastcall per probe point when enabled.

> Instrumentation is automatic and directed towards functions that are important for understanding the binary’s execution time.

I'm not sure what this means, but with DTrace, you enumerate the functions or binary objects (with wildcards and such) that you want to instrument, and the framework takes care of reliably instrumenting them, no matter the state of the process. Is that "automatic" and "directed"? I need to read the rest of the paper more closely.

> Tracing is efficient in both space and time -- only recording what is required and what matters.

DTrace records exactly what you ask it to. It supports in-situ aggregation for cases where it's not tenable to record a complete log of all interesting events. This is an important part of the design.
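To illustrate what in-situ aggregation buys you (a Python sketch of the idea, not DTrace's actual D language): each event is folded into a summary at the moment the probe fires, so memory scales with the number of distinct keys rather than with the event count.

```python
from collections import defaultdict

# Sketch of in-situ aggregation: fold each event into a running
# summary at probe-fire time instead of appending it to a log.
counts = defaultdict(int)
total_ns = defaultdict(int)

def record(func_name, duration_ns):
    # O(1) work per event; storage grows with distinct function
    # names, not with the (potentially huge) number of events.
    counts[func_name] += 1
    total_ns[func_name] += duration_ns

for _ in range(1000):
    record("strlen", 600)

print(counts["strlen"], total_ns["strlen"])  # 1000 600000
```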

> Tracing is configurable with thresholds for storage (how much memory to use) and accuracy (whether to log everything or only function calls taking at least some amount of time).

With DTrace, it's pretty easy to filter on function execution time. The buffer size is configurable. There are also multiple buffer policies for different use-cases (e.g., ringbuffer of the last N events leading up to some other event).
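The ring-buffer policy is easy to picture with a sketch (Python here, purely illustrative): hold only the last N events, so that when some trigger finally fires you still have the lead-up.

```python
from collections import deque

# Ring-buffer policy sketch: a bounded buffer that silently drops the
# oldest events, keeping only the last N leading up to a trigger.
BUFFER_SLOTS = 4  # the configurable storage threshold

ring = deque(maxlen=BUFFER_SLOTS)
for event_id in range(10):  # 10 events flow through...
    ring.append(event_id)

print(list(ring))  # ...but only the most recent 4 survive: [6, 7, 8, 9]
```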

> Tracing does not require changes to the operating system nor super-user privileges.

If they're running Linux, as I imagine they are, DTrace isn't necessarily an option, though it has been ported to several other platforms. Using it to record user-level state on your own processes does not require superuser privileges.

> Tracing can be turned on and off dynamically without having to restart the server.

Absolutely -- that's what the "D" is for.

I'd strongly recommend checking out the DTrace paper: https://www.usenix.org/legacy/event/usenix04/tech/general/fu...

There may be good reasons not to use DTrace for this, but I'm not sure which of those six goals would be the sticking point other than OS availability. (edit: I also haven't read beyond that yet!)


It's definitely the sort of paper that would get reviewer feedback if it were submitted to a conference, in that it doesn't compare against any related work. (But it's unfair to grade it by that metric because it reads like more of a "here's what we did" blog post, not a research paper.)

From reading it, I believe no other tool is both (1) non-sampled and (2) supports instrumenting all functions simultaneously. (At least none of the other comments in this thread point at tools that do this.)


> From reading it, I believe no other tool is both (1) non-sampled and (2) supports instrumenting all functions simultaneously. (At least none of the other comments in this thread point at tools that do this.)

Perf supports dynamic tracing of probepoints, and probing multiple points simultaneously.

Perhaps I'm misunderstanding things, but I don't see how perf doesn't fit those two requirements. It does a lot more than just sampling.


1. Errr, perf definitely does not support inserting N probepoints, where N is the number of function entry and exit points in a program.

2. Perf is 14x slower with tracepoints, and is definitely not that low overhead with tons and tons of probe points.

See http://events.linuxfoundation.org/sites/events/files/slides/...


Is there something I'm missing in the XRay paper that mentions speed? The only overhead I can find mentioned in the paper is CPU and memory utilization.

If XRay is able to trace millions of probe points with less overhead than uprobe + frontends, that's extremely impressive... But I just don't see any numbers so far that have said this is the case?


Errr, it's tracing every function, which is equivalent to tracing every probe point. That is definitely less overhead than a system that has to figure out, at runtime, which probes exist and what each one should do.

x-ray doesn't even make any attempt to optimize things well. If it did, it would be even lower overhead.

I'm honestly a bit stymied. I'm not sure why anyone would think a dynamic probing infrastructure that has to keep track of millions or billions of probes would ever go faster than something that has constant time and storage space by knowing what it is tracing and that it is tracing ahead of time.

The tradeoff is pretty simple. If you know what you want to probe ahead of time, and what the probes do, which is the case with x-ray, you can do it in fixed time and storage cost. You can never add or remove probes. Nothing has to worry about what probes exist.
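A toy model of that tradeoff (a Python sketch; both schemes here are caricatures for illustration, not either tool's real implementation):

```python
# Static scheme (XRay-like): probe sites and their storage are fixed
# when the binary is built, so a probe fire is just an array index.
N_SITES = 3
static_hits = [0] * N_SITES

def static_probe(site_id):
    static_hits[site_id] += 1   # constant time, constant space

# Dynamic scheme (DTrace/perf-like): probes come and go at runtime,
# so every fire pays for a lookup in a registry of live probes.
dynamic_probes = {}

def add_probe(name, action):
    dynamic_probes[name] = action

def dynamic_probe(name):
    action = dynamic_probes.get(name)  # per-fire bookkeeping cost
    if action is not None:
        action()
```

The registry lookup is trivial at this scale, but the shape of the cost is the point: the flexibility to add and remove probes is bought with per-fire and per-probe bookkeeping that the static scheme never pays.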

If you want to be able to add or remove probes at any time, you now have a cost of having to keep track of all the probes that exist and possibly doing something different with them at different times and ...


So, to be fair to the xray people, what they have done meets a very specific set of use cases you can't meet with most other tracing tools

(I haven't looked at dtrace hard enough. at a glance, it looks like it would not meet several things, but dtrace is not a sane option for them for a variety of reasons).

Past that, the mechanism used to do this, and the system itself, is not uncommon to build. I don't see any claims that say they think they've broken new ground. Just that they built it :)


Does perf or DTrace have the same functionality?

>XRay allows you to get accurate function call traces with negligible overhead when off and moderate overhead when on, suitable for services deployed in production. XRay enables efficient function call entry/exit logging with high accuracy timestamps, and can be dynamically enabled and disabled.
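For intuition, a toy analogue of that description (a Python sketch; real XRay patches nop sleds in the compiled binary and is nothing like a decorator): instrumented functions log entry/exit with timestamps, and tracing flips on and off at runtime without a restart.

```python
import time
from functools import wraps

TRACING = False   # toggled at runtime, no restart required
TRACE_LOG = []

def set_tracing(on):
    global TRACING
    TRACING = on

def traced(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        if not TRACING:                 # cheap check when disabled
            return fn(*args, **kwargs)
        TRACE_LOG.append(("enter", fn.__name__, time.monotonic_ns()))
        try:
            return fn(*args, **kwargs)
        finally:
            TRACE_LOG.append(("exit", fn.__name__, time.monotonic_ns()))
    return wrapper

@traced
def work():
    return 42
```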


Yes?


First, I always assume DTrace can do something. But DTrace is not that low-overhead in the situation XRay is describing.

http://dtrace.org/blogs/brendan/2011/02/18/dtrace-pid-provid...

When they traced everything, it was two orders of magnitude slower.

At 100k event/s (quite possible in the situations x-ray is used), the app would be about 60% slower. For just a simple probe, not full on tracing.

Does this make it better than dtrace? No. It just serves a different set of use cases.


> When they traced everything, it was two orders of magnitude slower.

> At 100k event/s (quite possible in the situations x-ray is used), the app would be about 60% slower. For just a simple probe, not full on tracing.

I think you've misunderstood the post. The actual time reported in that post is 600 ns per probe, with sanity checks reporting as much as 2000ns, which he concluded was in the same ballpark.

He measured that by observing a program that just calls two functions in a tight loop, one of which is strlen() on a dozen-character string. That's basically the worst possible case for any function call tracing system -- I don't think it's fair to say that's "a simple probe, not full on tracing".

The "60%" slower conclusion is only by construction: if you have N events per second, and the framework adds overhead of T nanoseconds per probe, then you can pick N such that the tracing framework adds whatever percentage overhead you like. In this case, Brendan picked N=100K and came up with 60% overhead for that case, but I think there's a math error in there. For that calculation, he assumes a 6us probe time instead of 600ns. I think the overhead would be 6% for 100,000 events, not 60%.
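The arithmetic (a quick check in Python, using the numbers from the post):

```python
# 600ns per probe (the measured cost) at 100K events/second:
probe_ns = 600
events_per_sec = 100_000

overhead = events_per_sec * probe_ns / 1e9   # probe time per wall-clock second
print(f"{overhead:.0%}")   # 6% -- i.e., 60ms of probe time per second

# Getting to "60% slower" requires a 6us (6000ns) probe cost instead:
print(f"{events_per_sec * 6000 / 1e9:.0%}")  # 60%
```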


I didn't misunderstand. The problem with this is that you assume that if you probe every function instead of just strlen that you get the same time bounds.

The only way to do this is, afaik, something similar to http://docs.oracle.com/cd/E19253-01/819-5488/gcgmc/index.htm...

This really is just going to expand to putting a probe on every function. I'd really like to see numbers on how dtrace handles many millions of probes at once (which is what xray is handling).

It's not uncommon to have 10 or 100 million functions in some of these programs. I have strong doubts that dtrace has the same overhead given 200 million probes.

(since the probe data structures alone will likely take up gigabytes of memory, accessing them is unlikely to be cache friendly, etc)

AFAICT, there is no more generic function entry probe than what that blog post describes. But i'd love to be wrong, and understand how dtrace is going to determine what instructions are a function entry in several ns :P

TL;DR dynamic probing infrastructures are not a panacea


> I'd really like to see numbers on how dtrace handles many millions of probes at once (which is what xray is handling).

> It's not uncommon to have 10 or 100 million functions in some of these programs. I have strong doubts that dtrace has the same overhead given 200 million probes.

I see now. That's a fair question, and I'm not aware of data either way. I just tried a pretty simple experiment inspired by Brendan's that suggests that on my machine, the overhead is about 1450ns per probe for as many as 140,000 probes:

https://gist.github.com/davepacheco/a12a0d45d55f0d7a28c312c2...

> AFAICT, there is no more generic function entry probe than what that blog post describes. But i'd love to be wrong, and understand how dtrace is going to determine what instructions are a function entry in several ns :P

Well, DTrace as architected is always going to pay the cost of a context switch into the kernel for each probe, and I think it's fair to take Brendan's result of 600ns as a lower bound of the per-probe overhead, at least on his machine. However, once in the kernel, for a typical native program (i.e., not JIT), I expect DTrace would only record the current userland thread instruction pointer. Names are typically resolved asynchronously by the consumer. So I would be surprised if it really was much slower, especially given the result above, but I too would like to see data.

I'm not saying that DTrace solves all problems or even that the OP should have used it instead. It's certainly true that for the special case of userland function boundary tracing, one might expect to do better by skipping the context switch (at the expense of much functionality, including any ability to correlate with broader system activity). But since DTrace was brought up, I wanted to help clarify the uncertainty about what it can do and what its overhead is.


I don't know about DTrace, but perf is a sampling profiler, not an accurate call tracer.


Perf is much more than a sampling profiler: https://perf.wiki.kernel.org/index.php/Main_Page. It can use uprobes and kprobes.

There are many ways to do dynamic tracing in Linux, apart from perf. Ftrace, raw kprobes/uprobes, eBPF, bcc compiler for eBPF, etc.


After perusing the perf man page, the only way I could figure out how to make it accurately count userspace function calls was using hardware breakpoints, e.g.: "perf stat -e mem:0xADDRESS:x"

Obviously that's not a very good approach, because you're limited to the number of breakpoints that your CPU can handle simultaneously (4 on my machine) and there's a lot of overhead. If you know of a better way to accomplish the same thing with perf, I'd be happy to hear it.



Ah, cool. I just tested it out and it seems to work as documented. Unfortunately it requires root access, and incurs about 1 microsecond of overhead per function call on my machine.


Reading the X-Ray abstract, unless I'm missing it, it doesn't go into detail about the time overhead. I do see mention of CPU and RAM usage, though.

I'm not sure that we have any evidence that X-Ray offers less overhead here, though it's certainly cool if it does.


Can you give an example where perf works with 10 million+ probe points?

:)

(and where this does not have an antagonistic effect on the other things on the machine?)



And for those poor people who haven't seen bcc, do yourself a favor and look into it: http://iovisor.github.io/bcc/

Besides, with a unicorn for the logo, what's not to like?



