Width in a flame graph is directly proportional to runtime. Optimizing a block that covers x% of the graph can cut total runtime by at most x%, so probably don't bother with blocks less than 0.5% wide.
This by itself should already tell you what NOT to optimize.
But really, you should be looking for operations that take a long time but shouldn't (wide blocks that should be thin). To find them you need to have an intuitive idea of how fast things should be beforehand.
If you have no idea how fast things should be, no amount of graphs will help you with this.
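To put a number on that upper bound, here is a minimal sketch. The function name and parameters are made up for illustration; it just assumes the block's width is its share of total runtime and that nothing else changes when you speed it up.

```python
def runtime_saved(width_fraction, block_speedup):
    """Fraction of total runtime saved when a block that is
    `width_fraction` of the graph becomes `block_speedup` times faster.

    Hypothetical helper: assumes width equals share of runtime.
    """
    if not 0.0 <= width_fraction <= 1.0:
        raise ValueError("width_fraction must be between 0 and 1")
    if block_speedup <= 0:
        raise ValueError("block_speedup must be positive")
    # The block's new cost is width_fraction / block_speedup,
    # so the saving is the difference from its old cost.
    return width_fraction * (1.0 - 1.0 / block_speedup)


if __name__ == "__main__":
    # Making a 0.5%-wide block infinitely fast still saves only 0.5%.
    print(runtime_saved(0.005, float("inf")))
    # Doubling the speed of a 40%-wide block saves 20% of total runtime.
    print(runtime_saved(0.40, 2.0))
```

Even an infinite speedup of a thin block cannot save more than its width, which is why the thin ones are rarely worth the effort.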
They meant that different characters don't change the performance, unlike the Linux and macOS versions, where different characters go through different code paths.
But yes, a bigger file does take longer to process.
It will have essentially the same performance. A box filter has to add one pixel entering the kernel and subtract one pixel exiting the kernel. This one has one pixel entering and one leaving each half of the kernel, and then combines the two halves. So three box blurs take six additions, and one pass of this filter also takes six additions. The box filter might be somewhat slower because it has to increment two pointers three times, while stack blur increments three pointers once and also incurs the loop overhead only once. For a fixed radius you could, however, use only one pointer and realize the offsets between the pointers with an appropriate addressing mode, if available on the target architecture.
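For anyone who wants to count the operations themselves, here is a rough 1D sketch of the two running-sum updates described above. The function names, the edge clamping, and the float output are my own assumptions for illustration, not taken from any particular implementation. The box filter does one add and one subtract per pixel; the stack blur (triangular weights) does six.

```python
def at(x, i):
    """Read x[i], clamping out-of-range indices to the edges (an assumption)."""
    return x[min(max(i, 0), len(x) - 1)]


def box_blur(x, r):
    """One box-blur pass of radius r: one add and one subtract per pixel."""
    width = 2 * r + 1
    acc = sum(at(x, k) for k in range(-r, r + 1))
    result = []
    for i in range(len(x)):
        result.append(acc / width)
        # One pixel enters the kernel, one pixel leaves it.
        acc += at(x, i + r + 1) - at(x, i - r)
    return result


def stack_blur(x, r):
    """One stack-blur pass of radius r (weights 1, 2, ..., r+1, ..., 2, 1)."""
    total_weight = (r + 1) ** 2
    acc = sum((r + 1 - abs(k)) * at(x, k) for k in range(-r, r + 1))
    out_half = sum(at(x, k) for k in range(-r, 1))     # x[i-r .. i]
    in_half = sum(at(x, k) for k in range(1, r + 2))   # x[i+1 .. i+r+1]
    result = []
    for i in range(len(x)):
        result.append(acc / total_weight)
        # Combine the two halves: two operations.
        acc += in_half - out_half
        # Each half has one pixel entering and one leaving: four operations.
        out_half += at(x, i + 1) - at(x, i - r)
        in_half += at(x, i + r + 2) - at(x, i + 1)
    return result


if __name__ == "__main__":
    row = [3, 1, 4, 1, 5, 9, 2, 6]
    print(box_blur(row, 1))
    print(stack_blur(row, 1))
```

The pointer-increment argument does not show up in Python, of course; the sketch only makes the per-pixel addition count visible.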
Register allocation is not on the roadmap, I think it's too hard for this series... But maybe I just haven't figured out an easy enough way to do it yet :^)
Memory allocation? You mean making a heap allocator? In the spirit of lowering the bar for creating new languages, I lean more towards calling into libc's malloc instead of building a custom solution haha
Thanks for the response! I was joking a bit (though not entirely - I really like dynamic code generators and am inspired by tiny, powerful systems) but as I mentioned I like the approach and hope that you get the chance to finish the series.
> There are push and pop instructions which are perfectly suited to this nested value use. In the past, lamenting how many compilers don't seem to realise that (and the fact that push/pop are specially optimised on x86 via hardware known as the stack engine)
That's fair, but in the article I also store local variables relative to %rsp, so I wouldn't want it changing at all.
I could use %rbp for that instead, but then I'd have to explain it in the article :^)
By the way r=6 is also a solution, if we treat the diagram a bit more abstractly.