More

sebastianmestre · 2024-10-13T14:15:23 1728828923

Is there a more direct or elegant solution than reducing it to a quadratic equation?

By the way r=6 is also a solution, if we treat the diagram a bit more abstractly.

beardyw · 2024-10-13T14:54:13 1728831253

Yes, couldn't visualise that!

pm2222 · 2024-10-13T15:06:13 1728831973

sebastianmestre · 2024-06-28T12:24:36 1719577476

Width in a flame graph is directly proportional to runtime. Optimizing a block that covers x% of the graph will only speed up the program by x% or less, so probably dont bother with blocks less than 0.5% wide.

This by itself should already tell you what NOT to optimize.

But really, you should be looking for operations that take a long time but shouldn't (wide blocks that should be thin). To find it you need to have an intuitive idea of how fast things should be beforehand.

If you have no idea how fast things should be, no amount of graphs will help you with this.

sebastianmestre · 2024-06-20T16:27:46 1718900866

Yeah, Sean Barrett has a good writeup about them and optimizing them for branch-prediction, etc

It's framed in the context of programming-language tokenization, but the principles are the same.

https://nothings.org/computer/lexing.html

sebastianmestre · 2024-06-20T15:17:13 1718896633

They meant that different characters don't change the performance, unlike the Linux and MacOS versions, where different characters go through different code paths.

But yes, a bigger file does take longer to process.

sebastianmestre · on Nov 10, 2023

In my experience, 3 passes of box blur looks good enough for small kernels (less than 50px). Is this faster than that?

danbruc · on Nov 10, 2023

Will be essentially the same performance. A box filter has to add one pixel entering the kernel and remove one pixel exiting the kernel. This one has one pixel entering and leaving each half of the kernel and then combining the two halfs. So three box blurs have six additions and once this filter is also six additions. Maybe the box filter could be somewhat slower because it has to increment two pointers three times while stack blur has to increment three pointers once and also incurs the loop overhead only once. For a fixed radius you could however use only one pointer and realize the offsets between the pointers with an appropriate addressing mode if available on the target architecture.

This is the box filter inner loop.

  sum += data[right++] - data[left++]

This is the stack blur inner loop.

  lsum += data[middle]  - data[left++]
  rsum += data[right++] - data[middle++]

  sum += rsum - lsum

sebastianmestre · on Nov 10, 2023

Parsing and symbol resolution on simple languages is kinda trivial in my opinion, but I'll probably write some articles about that stuff later

Control flow and pointers are coming in the next part, and it's still surprisingly easy

phtrivier · on Nov 10, 2023

Is memory / register allocation in your roadmap too ? (honest question, I'm not sure it's in the scope of your articles)

sebastianmestre · on Nov 13, 2023

Register allocation is not on the roadmap, I think it's too hard for this series... But maybe I just haven't figured out an easy enough way to do it yet :^)

Memory allocation? You mean making a heap allocator? In the spirit of lowering the bar for creating new languages, I lean more towards calling into libc's malloc instead of building a custom solution haha

sebastianmestre · on Nov 10, 2023

That's right!

sebastianmestre · on Nov 8, 2023

The article describes AST -> Assembly transformation, no?

sebastianmestre · on Nov 8, 2023

Author here. I wanted to get the bar for writing a compiler as low as possible, and emitting assembly code seemed better in that regard.

Also I don't have time to learn x86 instruction encoding right now :(

musicale · on Nov 10, 2023

Thanks for the response! I was joking a bit (though not entirely - I really like dynamic code generators and am inspired by tiny, powerful systems) but as I mentioned I like the approach and hope that you get the chance to finish the series.

sebastianmestre · on Nov 8, 2023

> There are push and pop instructions which are perfectly suited to this nested value use. In the past, lamenting how many compilers don't seem to realise that (and the fact that push/pop are specially optimised on x86 via hardware known as the stack engine)

That's fair, but in the article I also store local variables relative to %rsp, so I wouldn't want it changing at all

I could use %rbp for that instead, but then I'd have to explain it in the article :^)

userbinator · on Nov 8, 2023

That's fair, but in the article I also store local variables relative to %rsp, so I wouldn't want it changing at all

You only need to keep track of how much you've put on the stack (which can be a single counter) and adjust the offsets appropriately.

But then you'd have to explain that too...