I recently pulled DuckDB out of a project after hitting a memory corruption issue in regular usage. Upon investigating, they had an extremely long list of fuzzer-found issues. I just don't understand why someone would start something in a memory unsafe language these days. I cannot in good conscience put that on a customer's machine.
We ended up rewriting a component to drop support for Parquet and to just use SQLite instead. I love the idea of being able to query Parquet files locally and then just ship them up to S3 and continue to use them with something like Athena.
The other thing that rubbed me the wrong way was that rather than fix the issue, they just removed functionality. DuckDB (unironically) needs a rewrite in Rust or a lot more fuzzing hours before I come back to it. While SQLite is not written in a memory safe language, it is probably one of the most fuzzed targets in the world.
Starting in this release, the DuckDB team invested significantly in adding memory safety throughout catalog operations. There is more on the roadmap, but I would expect this release and all following to have improved stability!
That said, at my primary company, we have used it in production for years now with great success!
> why someone would start something in a memory unsafe language these days
You might like what we (Splitgraph) are building with Seafowl [0], a new database which is written in Rust and based on Datafusion and delta-rs [1]. It's optimized for running at the edge and responding to queries via HTTP with cache-friendly semantics.
Hmm... thanks. Maybe it was a hiccup? Does it happen when you click these links in my comment? We haven't been able to replicate on Firefox mobile, but we do have an issue with 500 errors in Firefox when double clicking links in the sidebar of the docs (I know, I know...)
>The other thing that rubbed me the wrong way was that rather than fix the issue, they just removed functionality.
It is a small team. If they feel a feature is causing too much grief, I would rather they drop it than post a "Here be dragons" sign and let users pick up the pieces.
Edit: missed an obvious opportunity to take a shot at MySQL
I think the critique is not that they should have left the thing broken, but that a limited team should limit its work to match the team size, so that it does not release broken things in the first place.
I wouldn't consider any of those written in a memory safe language. Although SQLite has been battle hardened over many years, while DuckDB is a relatively new project.
That being said, there have been efforts to reimplement SQLite in a more memory safe language like Rust.
At the level of engineering of SQLite, the choice of language is almost immaterial. Suggesting that a low-effort transpilation is a competitive peer seems unserious and vaguely disrespectful.
>Finally, one of the best written software paired with one of the best writable programming language‽ Fearless and memory safe, since the uncountable amount of unsafe {} blocks makes you not care anymore.
Plus, it seems the project is a parody of the RiiR trend.
The CVE list would dispute that assertion. There's a reason Microsoft is rewriting parts of the Windows kernel in Rust, and it isn't trendiness or because the kernel is at a trivial level of engineering.
It's the same reason Torvalds refused to have C++ anywhere near the Linux kernel but is now accepting patches in Rust. The advantage of C is its transparency and simplicity, but its safety has always been a thorn in the industry's side.
RiiR has become a bit of a parody of itself, but there is a large grain of truth from where that sentiment was born.
Life is too short for segfaults and buffer overruns.
In most C++ environments you will have std::string, STL vectors, unique_ptr, and RAII generally. Cleaning up memory via RAII discipline is standard programming practice. Manual frees are not typical these days. std::string manages its own memory, and isn't vulnerable to the same buffer overflow/null terminator safety issues that C-style strings are.
In C, by contrast, you will probably be using null-terminated strings and your own hand-rolled linked lists and vectors. You will not have RAII or destructors, so you will have manual frees all over.
Perhaps the big difference is that due to the nature of the language, C developers on the whole are probably more careful.
I started coding professionally in C++ back around 2000. Many things have improved in C++ such as the items you list above, but C++ remains a viciously complicated language that requires constant vigilance while coding in it, especially when more than one thread is involved.
CloudFlare is not lacking in good engineers with tons of both C and C++ experience. They still chose Rust for their replacement of Nginx. Now their crashes are so few and far between, they uncover kernel bugs rather than app-level bugs.
> Since Pingora's inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.
I have never heard anything close to that level of reliability from a C or C++ codebase. Never. And I've worked with truly great programmers before on modern C++ projects. C++ may not have limits, but humans writing C++ provably do.
I'm not sure who you're arguing against, but it's not me, or it's off topic completely. The discussion and my reply was not between C/C++ and Rust, but between C and C++.
I'm a full-time Rust developer FWIW. But I also did C++ for 10 years prior and worked in the language on and off since the mid-90s. Nobody is arguing in this sub-thread about Rust vs C++, nor am I interested in getting into your religious war.
If you are looking for a query engine implemented in a safe language (Rust), I definitely suggest checking out DataFusion. It is comparable to DuckDB in performance, has all the standard built-in SQL functionality, and is extensible in pretty much all areas (query language, data formats, catalogs, user-defined functions, etc.)
On the other hand, there are garbage collectors available for C and C++ programs (they are not part of their standard libraries, so you have to choose whether to use them). The C++ standard library has had smart pointers for some time; they existed in the Boost library beforehand, and the RAII pattern is even older.
Don't put all the blame for memory bugs on languages. C and C++ programs are more prone to memory leaks than programs written in "memory-safe" languages, but the latter are not immune to memory bugs either.
Disclaimer: I like C (plain C, not C++, though that's not as bad as many people claim) and I hate soydevs.
Memory leaks are annoying and, yes, you can get them in memory safe languages.
But they are way less severe than memory corruption. Memory unsafe languages are liable to undefined behaviour, which is actively dangerous, both in theory and practice.
> I just don't understand why someone would start something in a memory unsafe language these days. I cannot in good conscience put that on a customer's machine.
We ended up rewriting a component to drop support for Parquet and to just use SQLite instead.
I am not sure that you realize that SQLite is written entirely in C -- a quintessential memory unsafe language. I guess quality of software depends on many things besides a choice of language.
SQLite also lives under a literal mountain of automated tests, an engineering effort I'm not sure I've ever seen elsewhere. The library code is absolutely dwarfed by the test code.
...and CVEs still pop up occasionally. The point about memory-safe languages still holds, but it can be rendered mostly moot if you throw enough tests at the problem.
Blob to bitstring type casting for Parquet. They were doing a straight reinterpret cast on it which was causing an allocation of 18446744073709551503 bytes.
I was wanting to take a blob from Parquet and bitwise-and it against a bitstring in memory.
A serious and curious question. Are we close to the point with LLMs where we can just point to the source of something like DuckDB and its suite of tests and say "rewrite this in Rust with this set of libraries, and make sure all these tests pass"?
Even if it's not 100% complete and produces unidiomatic code, could it work?
I don’t even have access to regular Claude so can’t confirm this but the 100K token model they released should in theory be able to handle this to a certain degree.
I haven't tried Claude but I have been tinkering with a lot of this in my home lab and there are various theories I have:
- GPT4 is not a model, it's a platform. I believe the platform picks the best model for your query in the background and this is part of the magic behind it.
- The platform will also query multiple data sources depending on your prompt if necessary. OpenAI is just now opening up this plugin architecture to the masses but I would think they have been running versions of this internally since last year.
- There is also some sort of feedback loop that occurs before the platform gives you a response.
This is why we can have two different entities use the same open source model yet the quality of the experience can vary significantly. Better models will produce better outputs "by default", but the tooling and process built around it is what will matter more in the future when we may or may not hit some sort of plateau. At some point we're going to have a model trained on all human knowledge current as of Now. It's inevitable right? After that, platform architecture is what will determine who competes.
Interesting speculation, but I don't think GPT-4 chooses any model; I'm pretty sure it's just how good that one model is. I played with a lot of local models, but the reality is that even with Wizard-Vicuna, we're at least an order of magnitude away from the size of GPT-4.
It is open source; you can do whatever you want with it, like fix it or rewrite it in Rust. I don't see the point of complaining that it doesn't work for you because it is not in Rust while not doing anything about it yourself.
> The other thing that rubbed me the wrong way was that rather than fix the issue, they just removed functionality.
Yeah, DuckDB has some very cool features, but I wish the community were less abrasive. I remember someone asking for ORC columnar format support, and DuckDB replied "that is not as popular as Parquet so we're not doing it, issue closed". Same story with Delta vs Iceberg.
Meanwhile ClickHouse supports both, and if you ask for things they might say "that is low priority but we'll take a look". clickhouse-local can work as a CLI (though not in-process) DuckDB too.
> DuckDB has some very cool features, but I wish the community were less abrasive.
I’m on the Discord and the community in my experience has been anything but “abrasive”. I’m just a random guy and yet I’ve received stellar and patient help for many of my naive questions. Saying they are abrasive because they’re not willing to build something seems so entitled to me.
Focused engineering teams have to be willing to say no in order to achieve excellence with limited bandwidth. I’m glad they said no when they did so they could deliver quality on DuckDB.
I certainly think ORC is a good thing to say no to. In my years of working in this space I've only rarely encountered ORC files (technically ORC is superior to Parquet in some ways, but adoption has never been high).
Also realize that the team is not being paid by the people who ask for new features. If you're willing to pay them on a retainer through DuckDB Labs then you can expect priority; otherwise the sentiment expressed in your comment just seems uncalled for.