I recently pulled DuckDB out of a project after hitting a memory corruption issue in regular usage. Upon investigating, they had an extremely long list of fuzzer-found issues. I just don't understand why someone would start something in a memory unsafe language these days. I cannot in good conscience put that on a customer's machine.
We ended up rewriting a component to drop support for Parquet and to just use SQLite instead. I love the idea of being able to query Parquet files locally and then just ship them up to S3 and continue to use them with something like Athena.
The other thing that rubbed me the wrong way was that rather than fix the issue, they just removed functionality. DuckDB (unironically) needs a rewrite in Rust or a lot more fuzzing hours before I come back to it. While SQLite is not written in a memory safe language, it is probably one of the most fuzzed targets in the world.
Starting in this release, the DuckDB team invested significantly in adding memory safety throughout catalog operations. There is more on the roadmap, but I would expect this release and all following to have improved stability!
That said, at my primary company, we have used it in production for years now with great success!
> why someone would start something in a memory unsafe language these days
You might like what we (Splitgraph) are building with Seafowl [0], a new database which is written in Rust and based on Datafusion and delta-rs [1]. It's optimized for running at the edge and responding to queries via HTTP with cache-friendly semantics.
Hmm... thanks. Maybe it was a hiccup? Does it happen when you click these links in my comment? We haven't been able to replicate on Firefox mobile, but we do have an issue with 500 errors in Firefox when double clicking links in the sidebar of the docs (I know, I know...)
>The other thing that rubbed me the wrong way was that rather than fix the issue, they just removed functionality.
It is a small team. If they feel a feature is causing too much grief, I would rather they drop it than post a "Here be dragons" sign and let users pick up the pieces.
Edit: missed an obvious opportunity to take a shot at MySQL
I think the critique is not that they should have left the thing broken, but that a limited team should limit its work to match the team size, so that it does not release broken things in the first place.
I wouldn't consider any of those written in a memory safe language. Although SQLite has been battle hardened over many years, while DuckDB is a relatively new project.
That being said, there have been efforts to reimplement SQLite in a more memory safe language like Rust.
At the level of engineering of SQLite, the choice of language is almost immaterial. Suggesting that a low-effort transpilation is a competitive peer seems unserious and vaguely disrespectful.
>Finally, one of the best written software paired with one of the best writable programming language‽ Fearless and memory safe, since the uncountable amount of unsafe {} blocks makes you not care anymore.
Plus, it seems the project is a parody of the RiiR trend.
The CVE list would dispute that assertion. There's a reason Microsoft is rewriting parts of the Windows kernel in Rust, and it isn't trendiness or because the kernel is at a trivial level of engineering.
It's the same reason Torvalds refused to have C++ anywhere near the Linux kernel but is now accepting patches in Rust. The advantage of C is its transparency and simplicity, but its safety has always been a thorn in the industry's side.
RiiR has become a bit of a parody of itself, but there is a large grain of truth from where that sentiment was born.
Life is too short for segfaults and buffer overruns.
In most C++ environments you will have std::string, STL vectors, unique_ptr, and RAII generally. Cleaning up memory via RAII discipline is standard programming practice. Manual frees are not typical these days. std::string manages its own memory, and isn't vulnerable to the same buffer overflow/null terminator safety issues that C-style strings are.
In C, by contrast, you will probably be using null-terminated strings and your own hand-rolled linked lists and vectors. You will not have RAII or destructors, so you will have manual frees all over.
Perhaps the big difference is that due to the nature of the language, C developers on the whole are probably more careful.
I started coding professionally in C++ back around 2000. Many things have improved in C++ such as the items you list above, but C++ remains a viciously complicated language that requires constant vigilance while coding in it, especially when more than one thread is involved.
CloudFlare is not lacking in good engineers with tons of both C and C++ experience. They still chose Rust for their replacement of Nginx. Now their crashes are so few and far between, they uncover kernel bugs rather than app-level bugs.
> Since Pingora's inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.
I have never heard anything close to that level of reliability from a C or C++ codebase. Never. And I've worked with truly great programmers before on modern C++ projects. C++ may not have limits, but humans writing C++ provably do.
I'm not sure who you're arguing against, but it's not me, or it's off topic completely. The discussion and my reply was not between C/C++ and Rust, but between C and C++.
I'm a full-time Rust developer FWIW. But I also did C++ for 10 years prior and worked in the language on and off since the mid-90s. Nobody is arguing in this sub-thread about Rust vs C++, nor am I interested in getting into your religious war.
If you are looking for a query engine implemented in a safe language (Rust), I definitely suggest checking out DataFusion. It is comparable to DuckDB in performance, has all the standard built-in SQL functionality, and is extensible in pretty much all areas (query language, data formats, catalogs, user-defined functions, etc.)
On the other hand, there are garbage collectors available for C and C++ programs (they are not part of their standard libraries, so you have to choose whether to use them). The C++ standard library has had smart pointers for some time; they existed in the Boost library beforehand, and the RAII pattern is even older.
Don't put all the blame for memory bugs on languages. C and C++ programs are more prone to memory leaks than programs written in "memory-safe" languages, but the latter are not immune to memory bugs either.
Disclaimer: I like C (plain C, not C++, though that's not as bad as many people claim) and I hate soydevs.
Memory leaks are annoying and, yes, you can get them in memory safe languages.
But they are way less severe than memory corruption. Memory unsafe languages are liable to undefined behaviour, which is actively dangerous, both in theory and practice.
> I just don't understand why someone would start something in a memory unsafe language these days. I cannot in good conscience put that on a customer's machine.
We ended up rewriting a component to drop support for Parquet and to just use SQLite instead.
I am not sure that you realize that SQLite is written entirely in C -- a quintessential memory unsafe language. I guess quality of software depends on many things besides a choice of language.
SQLite also lives under a literal mountain of automated tests, an engineering effort I'm not sure I've ever seen elsewhere. The library code is absolutely dwarfed by the test code.
...and CVEs still pop up occasionally. The point about memory-safe languages still holds, but it can be rendered mostly moot if you throw enough tests at the problem.
Blob to bitstring type casting for Parquet. They were doing a straight reinterpret cast on it which was causing an allocation of 18446744073709551503 bytes.
I was wanting to take a blob from Parquet and bitwise-and it against a bitstring in memory.
A serious and curious question. Are we close to the point with LLMs where we can just point to the source of something like DuckDB and its suite of tests and say "rewrite this in Rust with this set of libraries, and make sure all these tests pass"?
Even if it's not 100% complete and produces unidiomatic code, could it work?
I don’t even have access to regular Claude so can’t confirm this but the 100K token model they released should in theory be able to handle this to a certain degree.
I haven't tried Claude but I have been tinkering with a lot of this in my home lab and there are various theories I have:
- GPT4 is not a model, it's a platform. I believe the platform picks the best model for your query in the background and this is part of the magic behind it.
- The platform will also query multiple data sources depending on your prompt if necessary. OpenAI is just now opening up this plugin architecture to the masses but I would think they have been running versions of this internally since last year.
- There is also some sort of feedback loop that occurs before the platform gives you a response.
This is why we can have two different entities use the same open source model yet the quality of the experience can vary significantly. Better models will produce better outputs "by default", but the tooling and process built around it is what will matter more in the future when we may or may not hit some sort of plateau. At some point we're going to have a model trained on all human knowledge current as of Now. It's inevitable right? After that, platform architecture is what will determine who competes.
Interesting speculation, but I don't think GPT-4 chooses any model; I'm pretty sure it's just how good that one model is. I played with a lot of local models, but the reality is that even with Wizard-Vicuna, we're at least an order of magnitude away from the size of GPT-4.
It is open source; you can do whatever you want with it, like fix it or rewrite it in Rust. I don't see the point of complaining that it doesn't work for you because it is not in Rust while not doing anything about it yourself.
> The other thing that rubbed me the wrong way was that rather than fix the issue, they just removed functionality.
Yeah, DuckDB has some very cool features, but I wish the community were less abrasive. I remember someone asking for ORC columnar format support, and DuckDB replied "that is not as popular as Parquet so we're not doing it, issue closed". Same story with Delta vs Iceberg.
Meanwhile ClickHouse supports both, and if you ask for things they might say "that is low priority but we'll take a look". clickhouse-local can work as a CLI (though not in-process) DuckDB too.
> DuckDB has some very cool features, but I wish the community were less abrasive.
I’m on the Discord and the community in my experience has been anything but “abrasive”. I’m just a random guy and yet I’ve received stellar and patient help for many of my naive questions. Saying they are abrasive because they’re not willing to build something seems so entitled to me.
Focused engineering teams have to be willing to say no in order to achieve excellence with limited bandwidth. I’m glad they said no when they did so they could deliver quality on DuckDB.
I certainly think ORC is a good thing to say no to. In my years of working in this space I've only rarely encountered ORC files (technically ORC is superior to Parquet in some ways, but adoption has never been high).
Also realize that the team is not being paid by the people who ask for new features. If you're willing to pay them on a retainer through DuckDB Labs then you can expect priority; otherwise the sentiment expressed in your comment just seems uncalled for.