Collections-C, generic data structures for C (github.com/srdja)
115 points by marxo on March 13, 2016 | hide | past | favorite | 60 comments


I try and plug http://concurrencykit.org/ whenever I see someone looking for something like this. It's built around providing portable atomic intrinsics for parallel code, but the data structures are also very well done and work just as nicely for single threaded code.


Yes, ck and preshing's junction (https://github.com/preshing/junction) are state of the art. This one really isn't.


Does it include any sort of a concurrent heap into shared memory?


Nice, but it's rather naive and, because of that, needlessly wasteful.

It's C.

Why on Earth would you want to allocate a separate list node for each piece of data when you can embed the node directly into the data and then use container_of, or a similar offsetof()-based derivative, to get a pointer to the data from a pointer to a list item? It saves you at least sizeof(void*) per item and eliminates any chance of list_add failing, among other things.

The same goes for custom allocators - why would you drag around pointers to malloc, calloc and free when all you need is a single pointer to realloc?

This sort of thing. The code is nice, but it's not how one would write a container library after a few years of hands-on C experience.


> allocate a separate list node for each piece of data

So many programmers don't even consider "overhead" of data when writing things. But, most of the time it doesn't matter. Do you need a list of ten things? Great. Do whatever. Do you need a list of a billion things? Then you need to rethink everything from the bottom up.

> when all you need is just a pointer to realloc

Wrong! https://github.com/Tarsnap/libcperciva/commit/cabe5fca76f6c3...


> So many programmers don't even consider "overhead" of data when writing things.

Because they're using managed languages where the amount of overhead per object can't be reduced to zero. And also because, in exchange for that overhead, they get other benefits, like compacting garbage collection. But you don't get this in C.


> Wrong! https://github.com/Tarsnap/libcperciva/commit/cabe5fca76f6c3....

No, of course not "Wrong!"

If a user of a library wants to supply their own allocator, they must provide a version of realloc that acts as free() in the (p, 0) case. Those are the semantics the library expects, so observe them. Internally the library can check whether a custom realloc is set and fall back to malloc/free if it isn't.


"Why on Earth you'd want to allocate a separate list node for each piece of data when you can embed this node directly into the data"

It's a good approach when the items may need to be in several lists at once.


You can certainly put the payload into multiple lists. From http://www.crashcourse.ca/introduction-linux-kernel-programm...:

> More to the point, a payload is welcome to participate on several lists, but only after defining an internal list_head object for each list. In other words, unlike the payloads you're used to that contain only pure data and could (theoretically) be on an infinite number of lists at once, objects on kernel lists can be only on those lists for which they have an internal set of links. It's an interesting way to lock down what lists you're willing to be a member of.


Sorry, yes, rather the problem comes from a sufficiently dynamic set of list memberships.


A quick look shows this stores void* pointers only. My excitement quickly faded, as I was expecting something with _Generic macros or the ability to manage memory for any type.

https://github.com/srdja/Collections-C/blob/master/src/hasht... shows the container is manually configurable. Look at the defines at the bottom. This means the container's source-header pair would have to be duplicated for every type, which is somewhat manageable.

Why isn't this written in the readme?


<shameless self promotion> Take a look at this instead. Work started on this before C-generics came about, so there is pretty heavy macro usage, but all data structures are type safe: https://github.com/mgrosvenor/libchaste


I see you use C macro templates. I have been experimenting with containers that use those. The advantages are speed and full type safety; the disadvantages are long compilation times (which have to be done only once, though), the need to define a new template for every type in (only one) separate file, code bloat, and somewhat long function names (I don't use member function pointers).

I'm not sure if all that is worth it.


I've not noticed the compile time increase to be honest. But I have noticed the type safety. A few times it has saved me from bugs that would have taken days to find if I was using void* casts. In my world (mostly high perf networking) I'm happy to pay a (small ~= unnoticeable) compile time cost if I get both safety and performance.

To make it more user friendly, I've provided (yet more) macros to make defining new types super easy. Using the macros does not necessarily involve adding a separate file, although the code is much cleaner if you do. And I don't have anything against adding extra files. There's almost no cost to me. I often group a couple of vectors or linked lists into a single .c/.h file pair to keep things manageable.

The code bloat you get is on par with using C++ templates. In fact, I'd say that it is probably better because the type-safe macros all devolve to a void* underlying implementation rather than generating separate instances. So it's pretty thin. You can always throw away the standard types list that I provide so that you don't have to pay for anything you don't use. And if you really want to, you can interface with the void* underlying implementation directly and lose the indirection costs, the function pointers and the macro costs. Basically, you can choose which world you want to live in, or live in both at the same time.

The long function names are handled by macros again. And since I use function pointer indirection you almost never see them. The costs with this sort of thing are minimal. A typical invocation looks something like

  CH_VECTOR(MYVECCLASS)* myvec = NEW_CH_VECTOR(MYVECCLASS);
  myvec->append(myvec, object);

There are certainly tradeoffs, the implementation is not nearly as mature as I'd like and debugging problems inside the containers is a HUGE PITA. But in my experience it lets me keep the benefits of working in C (can pull inside the kernel, can port to different machines, easy control over memory and performance) and gives me the flexibility of doing higher level things when I want.


Would be nice to have parallel structures like "Intel threading building blocks" provides.

https://en.m.wikipedia.org/wiki/Threading_Building_Blocks

I recommend this book to understand such structures:

https://www.goodreads.com/book/show/14788830-structured-para...


I'll definitely look into it.


Hello everyone, I'm the author of this library. I was quite surprised to find it on the front page of HN! I must thank you for all the great feedback, it's been very helpful. I agree that it has rough edges and that it also lacks some useful features but I guess this is mostly because I was the only user of it.

I very much welcome your suggestions on how to make it better and more useful. :)



klib [1] is another good one I recently came across. In fact, I'll just leave this great curated list of free software libraries right here:

https://notabug.org/koz.ross/awesome-c

[1] https://github.com/attractivechaos/klib


<shameless self promotion> If you're interested in this sort of thing, take a look at https://github.com/mgrosvenor/libchaste. Work started on this before C-generics came about, so there is pretty heavy macro usage, but all data structures are type safe rather than the "cast to void*" approach which is common in this space. It's a bit less mature, but reasonably well covered by unit tests where it matters.


I tried reading the documentation, which is awfully sparse, to see how to use the hash table and if it would be a good replacement for the current one in cyrus-imapd.

The hash table in cyrus-imapd has both a destroy function that takes a cleanup function for values, and an iterator:

https://github.com/brong/cyrus-imapd/blob/master/lib/hash.c

I couldn't see how to do either with this library until, reading the source code, I saw you can set a mem_free at creation instead.

The documentation didn't really describe that, and the example is actively wrong.

I see from the code that there's a way to get the keys, from which you can do an external iterator. It's less efficient, but it works. You wouldn't know that from the documentation though, you have to read the source code.


Any chance you're going to submit a pull request to fix the documentation? I'd like to check this library out for use in a project, but I'm not sure I'd get so far as you have with regards to matching the documentation to the sources ..


From hashtable (https://github.com/srdja/Collections-C/blob/master/src/hasht...)

> HashTableConf htc;

Hold on, that structure is allocated on the stack and not via malloc? I'd expect that one to blow up after the second call to hashtable_new.

edit: it seems htc is only read and its values copied over in hashtable_new_conf, but that's downright scary in case one forgets this semantic and accidentally reuses htc for something.


It looks like all collections store void pointers. I don't see any abstractions to help with pointer lifetimes, which remains the main problem I encounter when trying to write large programs in C that dynamically allocate memory.


Start using arenas wherever you can. Makes memory management in C programs substantially simpler.


Any good readings on getting started with arenas?



I'm no expert here, just curious. I've seen this brought up a couple times now: what's wrong with void pointers? How else would you do this in C? If you wanted abstractions to help with pointer lifetimes wouldn't you just be better off using C++?


The limitation with void pointers is that the container can only inline values no larger than a pointer. Anything bigger has to be allocated separately, which may cause cache-efficiency issues.

The main alternative, other than hardcoding an element type, is to use macros and pass the element type in the definition.

Sometimes you just don't have the option of a C++ compiler. Also, and this was a situation I was in recently, I needed a collection in a small area of my C code so it really didn't warrant a toolset change.


Void pointers mean no type safety, preventing a lot of potential compiler optimizations and static analysis checks. 99% of the time the right thing to do is just create a new type:

  typedef struct Strref_t {
    char* str;
    size_t ref;
  } Strref;
There, just pass a pointer of that type around.


> It looks like all collections store void pointers

Does that mean that if you're running 32-bit and want to store a number larger than that (a double or int64_t or so), or a POD, all of these have to be allocated separately?


Has this library been stress tested? I am interested in using this library for a spaceflight application, but reliability is paramount in our situation.


I stopped looking when I saw it doesn't check for malloc failure. I am going to say space flight is a no-go.


This is quite common because the alloc() call itself would panic if out-of-memory.

Remember, a large proportion of C code is for embedded/real-time systems where the only sensible recovery option is to reset.

Attempting to 'handle' an error manually in every case is simply not possible. Better to reset and come up in a clean state than propagate errors.

Having said that, dynamic memory allocation is itself frowned upon in embedded systems.


But for this kind of application, wouldn't you just panic and do a soft restart on malloc failure anyway? Actually recovering from an out-of-memory situation is incredibly fraught and unreliable; most of the realtime operating systems I've used don't even bother to try.


That depends on when a failure were to occur. If we have a failure while the spacecraft is en route to its destination, perhaps we can recover because we may have enough time.

If we have a failure during rendezvous with our target, it could be a very bad day.


...necropost, but:

You always need to be able to recover from a soft reboot, even during maneuvers; you're in a high radiation environment and any passing high energy particle or cosmic ray can trigger this.


Dynamic memory allocation is extremely frowned-upon in embedded/real-time systems such as flight systems for exactly that reason. You simply cannot handle out-of-memory and fragmentation issues in a safe manner.


Even then you'd want proper logging and panicking over a segfault (the likely result).


In such an environment, malloc() will do that.


Thank you. There is somebody else on this fucking planet who understands that memory is not limitless manna from heaven.


Ouch. I'll keep looking then.

Any recommendations for reliable C data structures?


Don't take this personally, but it worries me that someone who is not well versed in things like safe memory structures is tasked with writing software like this.


I'm not sure what aspect of asking about the level of testing performed on an open source library gives you that idea, but sure: no offense taken.


Don't dismiss it because of that; handling OOM errors is best done by having the alloc call panic.

If that situation is not allowed, I would suggest you shouldn't be performing memory allocation (or any other resource allocation) dynamically.


Some of the functions do check for malloc failure and others don't. It's on github--Have you submitted a pull request? Wow--guess the idea of sending a patch to an open source project is now controversial.


Give Ada a try


[Un]fortunately, we are tied to C/C++ due to mission requirements and our hardware platform.


Nice. I actually built something very similar a while ago, although I only supported lists, queues and stacks: https://github.com/mellowcandle/liblist


Something like this should be in standard C.


Take that, Golang!!!!

J/K


Honestly, why? Most of the time, you want to write C++, with exceptions, and the STL. You can make this kind of programming as robust as you want against memory allocation failure.

If you don't want to use this style of programming, for whatever reason, check out sys/queue.h. It's already on your system, if you're using some kind of Unix.


If you are using an RTOS (which many timing-critical applications run on), you should be worried about memory allocations happening under the hood.

If you don't have full control over when memory is being allocated/reallocated, your system is now non-deterministic.

With pure C, you can know exactly when those few extra instructions for resizing your dynamic array are going to happen.


Same in C++. There's `std::vector::reserve`, which grows the vector's underlying physical buffer without logically adding any new elements. If you `reserve` enough capacity before inserting anything, each subsequent insertion is an O(1) operation with no hidden reallocation.


Vector and possibly string have that. The vast majority of STL structures do not.


All containers from the C++ standard library (please don't call it STL, that's Stepanov's original library) can be parameterized by an allocator. You can use whatever allocation policy you like best. However, most people use the default allocator because it's good enough.

In any case, while C++ has lots of defects, “loss of control relative to what C gives you” isn't one of them.


(a bit off topic - but I'm humbly trying to learn more...) I've got zero experience writing allocators. Are there some common ones you use, provided somewhere? Do you write your own? (In which case, can you point me to where to learn to do that properly?)


I don't write allocators myself, but Boost has a pool allocator library [http://www.boost.org/doc/libs/1_60_0/libs/pool/doc/html/inde...], which conforms to the Allocator concept defined in the standard library [http://en.cppreference.com/w/cpp/concept/Allocator].


Thank you


Yeah, that's true.



