Collections-C, generic data structures for C (github.com/srdja)
115 points by marxo on March 13, 2016 | hide | past | favorite | 60 comments


I try and plug http://concurrencykit.org/ whenever I see someone looking for something like this. It's built around providing portable atomic intrinsics for parallel code, but the data structures are also very well done and work just as nicely for single threaded code.


Yes, ck and preshing's junction (https://github.com/preshing/junction) are state of the art. This one really isn't.


Does it include any sort of a concurrent heap into shared memory?


Nice, but it's rather naive and, because of that, needlessly wasteful.

It's C.

Why on Earth would you want to allocate a separate list node for each piece of data when you can embed the node directly into the data and then use container_of, or a similar offsetof()-based derivative, to get a pointer to the data from a pointer to a list item? It saves you at least sizeof(void*) per item and eliminates any chance of list_add failing, among other things.

The same goes for custom allocators - why would you drag around pointers to malloc, calloc and free when all you need is a single pointer to realloc?

This sort of thing. The code is nice, but it's not how one would write a container library after a few years of hands-on C experience.


> allocate a separate list node for each piece of data

So many programmers don't even consider "overhead" of data when writing things. But, most of the time it doesn't matter. Do you need a list of ten things? Great. Do whatever. Do you need a list of a billion things? Then you need to rethink everything from the bottom up.

> when all you need is just a pointer to realloc

Wrong! https://github.com/Tarsnap/libcperciva/commit/cabe5fca76f6c3...


> So many programmers don't even consider "overhead" of data when writing things.

Because they're using managed languages where the amount of overhead per object can't be reduced to zero. And also because, in exchange for that overhead, they get other benefits, like compacting garbage collection. But you don't get this in C.


> Wrong! https://github.com/Tarsnap/libcperciva/commit/cabe5fca76f6c3....

No, of course not "Wrong!"

If a user of a library wants to supply their own allocator, they must provide a version of realloc that acts as free() in the (p, 0) case. Those are the semantics the library expects, so observe them. Internally the library can check whether a custom realloc is set and fall back to malloc/free if it isn't.


"Why on Earth you'd want to allocate a separate list node for each piece of data when you can embed this node directly into the data"

It's a good approach when the items may need to be in several lists at once.


You can certainly put the payload into multiple lists. From http://www.crashcourse.ca/introduction-linux-kernel-programm...:

> More to the point, a payload is welcome to participate on several lists, but only after defining an internal list_head object for each list. In other words, unlike the payloads you're used to that contain only pure data and could (theoretically) be on an infinite number of lists at once, objects on kernel lists can be only on those lists for which they have an internal set of links. It's an interesting way to lock down what lists you're willing to be a member of.


Sorry, yes, rather the problem comes from a sufficiently dynamic set of list memberships.


A quick look shows this stores void* pointers only. My excitement quickly faded, as I was expecting something with _Generic macros or the ability to manage memory for any type.

https://github.com/srdja/Collections-C/blob/master/src/hasht... shows the container is manually configurable. Look at the defines at the bottom. This means the container's source-header pair would have to be duplicated for every type, which is somewhat manageable.

Why isn't this written in the readme?


<shameless self promotion> Take a look at this instead. Work started on this before C-generics came about, so there is pretty heavy macro usage, but all data structures are type safe: https://github.com/mgrosvenor/libchaste


I see you use C macro templates. I have been experimenting with containers that use those. The advantages are speed and full type safety; the disadvantages are long compilation times (which have to be done only once, though), the need to define a new template for every type in (only one) separate file, code bloat, and somewhat long function names (I don't use member function pointers).

I'm not sure if all that is worth it.


I've not noticed the compile time increase to be honest. But I have noticed the type safety. A few times it has saved me from bugs that would have taken days to find if I was using void* casts. In my world (mostly high perf networking) I'm happy to pay a (small ~= unnoticeable) compile time cost if I get both safety and performance.

To make it more user friendly, I've provided (yet more) macros to make defining new types super easy. Using the macros does not necessarily involve adding a separate file, although the code is much cleaner if you do. And I don't have anything against adding extra files. There's almost no cost to me. I often group a couple of vectors or linked lists into a single .c/.h file pair to keep things manageable.

The code bloat you get is on par with using C++ templates. In fact, I'd say that it is probably better because the type-safe macros all devolve to a void* underlying implementation rather than generating separate instances. So it's pretty thin. You can always throw away the standard types list that I provide so that you don't have to pay for anything you don't use. And if you really want to, you can interface with the void* underlying implementation directly and lose the indirection costs, the function pointers and the macro costs. Basically, you can choose which world you want to live in, or live in both at the same time.

The long function names are handled by macros again. And since I use function pointer indirection you almost never see them. The costs with this sort of thing are minimal. A typical invocation looks something like

  CH_VECTOR(MYVECCLASS)* myvec = NEW_CH_VECTOR(MYVECCLASS);
  myvec->append(myvec, object);

There are certainly tradeoffs, the implementation is not nearly as mature as I'd like and debugging problems inside the containers is a HUGE PITA. But in my experience it lets me keep the benefits of working in C (can pull inside the kernel, can port to different machines, easy control over memory and performance) and gives me the flexibility of doing higher level things when I want.


Would be nice to have parallel structures like "Intel threading building blocks" provides.

https://en.m.wikipedia.org/wiki/Threading_Building_Blocks

I recommend this book to understand such structures:

https://www.goodreads.com/book/show/14788830-structured-para...


I'll definitely look into it.


Hello everyone, I'm the author of this library. I was quite surprised to find it on the front page of HN! I must thank you for all the great feedback, it's been very helpful. I agree that it has rough edges and that it also lacks some useful features but I guess this is mostly because I was the only user of it.

I very much welcome your suggestions on how to make it better and more useful. :)



klib [1] is another good one I recently came across. In fact, I'll just leave this great curated list of free software libraries right here:

https://notabug.org/koz.ross/awesome-c

[1] https://github.com/attractivechaos/klib


<shameless self promotion> If you're interested in this sort of thing, take a look at https://github.com/mgrosvenor/libchaste. Work started on this before C-generics came about, so there is pretty heavy macro usage, but all data structures are type safe rather than the "cast to void*" approach which is common in this space. It's a bit less mature, but reasonably well covered by unit tests where it matters.


I tried reading the documentation, which is awfully sparse, to see how to use the hash table and if it would be a good replacement for the current one in cyrus-imapd.

The hash table in cyrus-imapd has both a destroy function that takes a cleanup function for values, and an iterator:

https://github.com/brong/cyrus-imapd/blob/master/lib/hash.c

I couldn't see how to do either with this library until, reading the source code, I saw you can set a mem_free at creation instead.

The documentation didn't really describe that, and the example is actively wrong.

I see from the code that there's a way to get the keys, from which you can do an external iterator. It's less efficient, but it works. You wouldn't know that from the documentation though, you have to read the source code.


Any chance you're going to submit a pull request to fix the documentation? I'd like to check this library out for use in a project, but I'm not sure I'd get so far as you have with regards to matching the documentation to the sources ..


From hashtable (https://github.com/srdja/Collections-C/blob/master/src/hasht...)

> HashTableConf htc;

Hold on, that structure is allocated on the stack and not via malloc? I'd expect that one to blow up after the second call to hashtable_new.

edit: it seems htc is only read and its values copied over in hashtable_new_conf, but that's downright scary in case one forgets this semantic and accidentally reuses htc for something.


It looks like all collections store void pointers. I don't see any abstractions to help with pointer lifetimes, which remains the main problem I encounter when trying to write large programs in C that dynamically allocate memory.


Start using arenas wherever you can. Makes memory management in C programs substantially simpler.


Any good readings on getting started with arenas?



I'm no expert here, just curious. I've seen this brought up a couple times now: what's wrong with void pointers? How else would you do this in C? If you wanted abstractions to help with pointer lifetimes wouldn't you just be better off using C++?


The limitation with void pointers is that the container can only inline values no larger than a pointer. Anything bigger has to be allocated separately, which may cause cache-efficiency issues.

The main alternative, other than hardcoding an element type, is to use macros and pass the element type in the definition.

Sometimes you just don't have the option of a C++ compiler. Also, and this was a situation I was in recently, I needed a collection in a small area of my C code so it really didn't warrant a toolset change.


Void pointers mean no type safety, preventing a lot of potential compiler optimizations and static analysis checks. 99% of the time the right thing to do is just create a new type:

  typedef struct Strref_t {
    char* str;
    size_t ref;
  } Strref;
There, just pass a pointer of that type around.


> It looks like all collections store void pointers

Does that mean that if you're running 32-bit and want to store a number larger than that (a double or int64_t or so), or a POD, all of these have to be allocated separately?


Has this library been stress tested? I am interested in using this library for a spaceflight application, but reliability is paramount in our situation.


I stopped looking when I saw it doesn't check for malloc failure. I am going to say space flight is a no-go.


This is quite common because the alloc() call itself would panic if out-of-memory.

Remember, a large proportion of C code is for embedded/real-time systems where the only sensible recovery option is to reset.

Attempting to 'handle' an error manually in every case is simply not possible. Better to reset and come up in a clean state than propagate errors.

Having said that, dynamic memory allocation is itself frowned upon in embedded systems.


But for this kind of application, wouldn't you just panic and do a soft restart on malloc failure anyway? Actually recovering from an out-of-memory situation is incredibly fraught and unreliable; most of the realtime operating systems I've used don't even bother to try.


That depends on when a failure were to occur. If we have a failure while the spacecraft is en route to its destination, perhaps we can recover because we may have enough time.

If we have a failure during rendezvous with our target, it could be a very bad day.


...necropost, but:

You always need to be able to recover from a soft reboot, even during maneuvers; you're in a high radiation environment and any passing high energy particle or cosmic ray can trigger this.


Dynamic memory allocation is extremely frowned-upon in embedded/real-time systems such as flight systems for exactly that reason. You simply cannot handle out-of-memory and fragmentation issues in a safe manner.


Even then you'd want proper logging and panicking over a segfault (the likely result).


In such an environment, malloc() will do that.


Thank you. There is somebody else on this fucking planet who understands that memory is not limitless manna from heaven.


Ouch. I'll keep looking then.

Any recommendations for reliable C data structures?


Don't take this personally, but it worries me that someone who is not well versed in things like safe memory structures is tasked with writing software like this.


I'm not sure what aspect of asking about the level of testing performed on an open source library gives you that idea, but sure: no offense taken.


Don't dismiss it because of that; handling OOM errors is best done by having the alloc call panic.

If that situation is not allowed, I would suggest you shouldn't be performing memory allocation (or any other resource allocation) dynamically.


Some of the functions do check for malloc failure and others don't. It's on github--Have you submitted a pull request? Wow--guess the idea of sending a patch to an open source project is now controversial.


Give Ada a try


[Un]fortunately, we are tied to C/C++ due to mission requirements and our hardware platform.


Nice. I actually built something very similar a while ago, although I only supported lists, queues and stacks: https://github.com/mellowcandle/liblist


Something like this should be in standard C.


Take that, Golang!!!!

J/K


Honestly, why? Most of the time, you want to write C++, with exceptions, and the STL. You can make this kind of programming as robust as you want against memory allocation failure.

If you don't want to use this style of programming, for whatever reason, check out sys/queue.h. It's already on your system, if you're using some kind of Unix.


If you are using an RTOS (which many timing-critical applications run on), you should be worried about memory allocations happening under the hood.

If you don't have full control over when memory is being allocated/reallocated, your system is now non-deterministic.

With pure C, you can know exactly when those few extra instructions for resizing your dynamic array are going to happen.


Same in C++. There's `std::vector::reserve`, which grows the vector's underlying physical buffer without logically adding any new elements. If you `reserve` enough capacity before inserting anything, each subsequent insertion is an O(1) operation with no hidden reallocation.


Vector and possibly string have that. The vast majority of STL structures do not.


All containers from the C++ standard library (please don't call it STL, that's Stepanov's original library) can be parameterized by an allocator. You can use whatever allocation policy you like best. However, most people use the default allocator because it's good enough.

In any case, while C++ has lots of defects, “loss of control relative to what C gives you” isn't one of them.


(a bit off topic - but I'm humbly trying to learn more...) I've got zero experience writing allocators. Are there some common ones you use, provided somewhere? Do you write your own? (In which case, can you point me to where to learn to do that properly?)


I don't write allocators myself, but Boost has a pool allocator library [http://www.boost.org/doc/libs/1_60_0/libs/pool/doc/html/inde...], which conforms to the Allocator concept defined in the standard library [http://en.cppreference.com/w/cpp/concept/Allocator].


Thank you


Yeah, that's true.



