Web server with DB backend. Pause times can exceed 2000ms. No doubt some code ch...

straight-shoota · on Sept 29, 2024

I'm a core developer of Crystal. Looks like something went very wrong there. The GC may not be super optimised, but it's still practical. I have never heard about such drastic performance issues. And I'm aware of quite a few companies who use Crystal in heavy production loads for exactly the web server + db use case without such issue reports. So I'd suggest the root cause might be something other than the GC implementation.

cogman10 · on Sept 29, 2024

What is roughly the gc algorithm?

yxhuvud · on Sept 29, 2024

It is using Boehm/libgc. A simple web server on its own should not show the described behavior. The GC is not incremental though, so a big heap could cause long pauses like that. But a big heap is not typical for the described use case. Likely the issue is with doing something that involves more allocations than necessary.

There is ongoing work in libgc to allow incremental collection, but it is not yet ready for the needs of Crystal (or at least it wasn't the last time I investigated).

cogman10 · on Sept 30, 2024

Can Crystal use a moving GC, or does it suffer from the same issues Python has with C FFI?

Also, is it possible to use reference counting (RC) with Crystal?

straight-shoota · on Sept 30, 2024

Yeah, moving objects would invalidate pointers passed to external code that's not controlled by Crystal.
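
A toy sketch of why that breaks, written in Python rather than Crystal. Everything here is made up for illustration: ToyHeap is a hypothetical heap where a list index stands in for a raw address, and external_ptr plays the role of a pointer that was handed to C code. It is not how Crystal or libgc represent memory.

```python
class ToyHeap:
    """Hypothetical heap: a slot index stands in for a raw memory address."""

    def __init__(self):
        self.slots = []

    def alloc(self, value):
        self.slots.append(value)
        return len(self.slots) - 1  # the "raw address" of the new object

    def free(self, addr):
        self.slots[addr] = None  # None marks a dead slot

    def compact(self):
        """Slide live objects together; return an old -> new address map."""
        forwarding = {}
        live = []
        for old_addr, value in enumerate(self.slots):
            if value is not None:
                forwarding[old_addr] = len(live)
                live.append(value)
        self.slots = live
        return forwarding


heap = ToyHeap()
a = heap.alloc("a")
b = heap.alloc("b")
c = heap.alloc("c")

external_ptr = b  # address copied out to (imaginary) external C code

heap.free(a)
forwarding = heap.compact()  # "b" and "c" each move down one slot

# The runtime can rewrite the references it knows about ...
managed_ref = forwarding[b]
print(heap.slots[managed_ref])   # the managed reference still finds "b"

# ... but it cannot see the copy held by external code, which is now stale.
print(heap.slots[external_ptr])  # the old address now holds a different object
```

A non-moving collector sidesteps this because an address stays valid for as long as the object is alive.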

compumike · on Sept 29, 2024

Conceptually, I think the correct time to do garbage collection is when your web server process is idle.

My Crystal implementation of idle-time garbage collection is here: https://github.com/compumike/idle-gc though please note that its idle detection mechanism only works for single-threaded Crystal programs.

An analogy is to imagine a single employee (thread) operating a convenience store. If there are customers waiting in the checkout line (latency-sensitive requests), the employee should prioritize serving the customers ASAP! But once the line is empty (thread is idle), that might be a good time to start sweeping.

Right now, with automatic garbage collection, the employee only decides to start sweeping the entire store while in the middle of serving a customer! (Because that's when mallocs are happening, which may trigger automatic GC.) Pretty ridiculous!

With idle-time GC, the sweeping happens entirely or mostly while there are no customers waiting. This may not show latency improvements in an artificial benchmark where the system is running flat-out with a full request queue, but in the real world, it changes GC from something that happens 100% of the time while a request is being served (because that's when mallocs happen and trigger automatic GC), to something that only rarely or never happens while a request is being served.
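
A minimal sketch of the idea, using Python's gc module as a stand-in (this is not the idle-gc shard's implementation, and handle and serve_until_idle are hypothetical names). Allocation-triggered collection is switched off while requests are queued, and one explicit collection runs when the queue is empty. One caveat of the stand-in: CPython still frees most objects immediately via reference counting, so only its cycle collector is being deferred here.

```python
import gc
import queue


def handle(request):
    """Hypothetical handler that leaves one piece of cyclic garbage behind."""
    node = {"request": request}
    node["self"] = node  # reference cycle: only the cycle collector frees it
    return f"handled {request}"


def serve_until_idle(requests):
    """Handle everything queued, then collect once nobody is waiting."""
    responses = []
    was_enabled = gc.isenabled()
    gc.disable()  # no allocation-triggered collection in the middle of a request
    try:
        while True:
            try:
                request = requests.get_nowait()
            except queue.Empty:
                break  # the checkout line is empty
            responses.append(handle(request))
        # Idle: this is the moment to pick up the broom.
        unreachable = gc.collect()
    finally:
        if was_enabled:
            gc.enable()
    return responses, unreachable


requests = queue.Queue()
for name in ("a", "b", "c"):
    requests.put(name)

responses, unreachable = serve_until_idle(requests)
print(responses)
print("unreachable objects found while idle:", unreachable)
```

A real server would call something like this each time its event loop runs out of work, and would probably also keep a fallback so the heap cannot grow without bound if the process never goes idle.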

Even better would be to combine idle-time GC with incremental GC, so that the employee could put down the broom when a new customer arrives without finishing sweeping the entire store. :)
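
A toy sketch of "putting down the broom", again in Python with made-up names (sweep_slices, run, and the arrivals map are all hypothetical). It assumes the mark phase has already produced the set of live addresses and only slices up the sweep. A real incremental collector also has to cope with the program changing the heap between slices, which this toy ignores.

```python
from collections import deque


def sweep_slices(heap, live, slice_size):
    """Free dead slots a few at a time, yielding after each slice."""
    for start in range(0, len(heap), slice_size):
        end = min(start + slice_size, len(heap))
        for addr in range(start, end):
            if addr not in live:
                heap[addr] = None  # reclaim the dead slot
        yield start, end  # safe point: the sweeper can be paused here


def run(heap, live, arrivals, slice_size=2):
    """arrivals maps a slice index to requests that show up after that slice."""
    log = []
    pending = deque()
    for i, (start, end) in enumerate(sweep_slices(heap, live, slice_size)):
        log.append(f"swept {start}..{end - 1}")
        pending.extend(arrivals.get(i, []))
        while pending:  # customers first, broom second
            log.append(f"served {pending.popleft()}")
    return log


heap = ["a", "b", "c", "d", "e", "f"]
live = {0, 3, 4}  # addresses the (assumed) mark phase found reachable
log = run(heap, live, arrivals={0: ["req1"]})

for line in log:
    print(line)
```

The request that arrives after the first slice waits for at most one slice of sweeping rather than for the whole heap, which is the latency win the analogy describes.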

See also "Idle Time Garbage Collection Scheduling" in Google Chrome (2016): PDF at https://static.googleusercontent.com/media/research.google.c...