MillenniumDB: Property graph and RDF engine, still in development

j-pb · 2025-01-31T18:49:46 1738349386

These guys write really great papers!

We implemented a simplified version of their ring index for our data space (https://github.com/triblespace/tribles-rust/blob/master/src/...), and it's a really simple and cool idea once you wrap your head around it. Funnily enough, we build this even before the paper was officially published, because we found a preprint on one of the authors blogs. The idea itself was published by them before but their new paper made this a lot easier to understand. (burrows wheeler transforms vs. stable column sorting).

It's really too bad that the whole linked-data space is completely gunked up with RDF.

Ps: If anyone plans on implementing their ring index, using 0 based offsets makes the formulas much more streamlined, their paper uses 1 based indexing and they have to +/-1 all over the place.

bawolff · 2025-02-01T00:20:24 1738369224

The whole ring index thing is one of the more fascinating ideas i've read about (i didnt realize milleniumDB was same authors). Sent me down a whole rabbit hole of learning about succinct data structures and burrows-wheeler transform.

Sometimes you encounter a computer science idea that just sounds like pure magic.

smarx007 · 2025-01-31T17:26:50 1738344410

I think if someone is just trying out RDF, it is better to start with Apache Jena/Fuseki or Eclipse RDF4J. Maybe https://github.com/oxigraph/oxigraph if you like to live dangerously (i.e. to use pre-1.0 DBMSs).

Use of other systems involves factoring tradeoffs and considerations that are probably not the best for the newcomers. For example, qLever mentioned here is good in query performance and relative disk use but once the import is done, it's essentially a read-only DB and completely unsuitable for a typical OLTP scenario.

Having said that, the Chilean research group that is driving the development of MilleniumDB is very well-regarded in the RDF/semantic web querying space.

FjordWarden · 2025-01-31T18:41:12 1738348872

If you expect Jena to be more battle-tested because it is older, forget it, if the process is killed by a unexpected shutdown or some other reason it results in data corruption. At least this was my experience a few years ago.

I found graph databases a beguiling idea when I first learned about them, and this is a welcome addition, but I've since temperated my excitement. They are not as flexible and universal a modal as is often promised. Everything is a graph, sure but the result of your SPARQL query not necessarily.

I found classical DBMS based on sets/multisets to be much easier to compose from a querying point of view. A table is a set/multiset and a result of a query is also a set/multiset, SPARQL guarantees no such composability. Maybe, if you want to start mucking around with inference engines, but you'll either run into problems of undecidability.

PaulHoule · 2025-01-31T18:53:55 1738349635

Jena lets you make little in-memory triple stores that you can use the way people use the list-map-scalar trinity. I've been working on this publication about that (RDF for difficult cases and when ordering counts) for years and it just got published last week

https://www.iso.org/standard/76310.html

I'll call out my collabortor Liju Fan for being the only person I've met who knew how to do anything interesting with OWL. (Well, I can do interesting things now but I owe it all to her.)

(For the research for that paper I used rdflib under PyPi because CPython was not fast enough.)

When I needed big persistent triple stores (that you use the way you might use postgres) I used to use

https://en.wikipedia.org/wiki/Virtuoso_Universal_Server

and had pretty good luck if I loaded a billion triples if I used plenty of 'stabilizers' (create a new AWS instance with ample RAM, use scripts to load a billion triples starting from an empty database, shut it down, make an AMI, start a new instance with the AMI, expect it to warm up for 20 minutes or so before query performance is good)

I don't regularly build systems on SPARQL today because of problems with updating. In particular, SQL has an idea of a "record" which is a row in a table, document oriented databases have an idea of a "record" which is a bit more flexible. Updating a SPARQL database is a little bit dangerous because there is no intrinsic idea of what a record is; i mean, you can define one by starting at a particular URI and traversing to the right across blank nodes and saying it is a 'record' and it works OK. But it's a discipline that I impose on it with my libraries, it ought to be baked into standards, baked into the databases, wrapped up in transactions, etc. For anything OLTP-ish I am still using SQL or document-oriented databases, but I hate the lack of namespaces and similar affordances that make SPARQL scalable in terms of "smash together a bunch of data from different sources" in document-oriented databases wheras SPARQL is missing the affordances you have in document-oriented databases for handling ordered collections. We badly need a SPARQL 2 which makes the kind of work that I talk about in that technical report easy.

smarx007 · 2025-01-31T20:21:17 1738354877

> Updating a SPARQL database is a little bit dangerous because there is no intrinsic idea of what a record is

SPARQL has a notion of a transactional boundary just like SQL has. You can combine multiple SPARQL queries in one transaction, they will all succeed or all fail just like you'd expect.

PaulHoule · 2025-01-31T20:39:51 1738355991

Sorta kinda.

Your code has to put the right things in a transaction all the time for transactions for transactions to work right. If there is some flow of information like

   application does query -> application thinks -> application does update

you have to wrap the whole sandwich in a transaction, people frequently don't do that. If I'm writing 20 of those for an application I want something that I know is bulletproof.

My experience with SQL is that the average SQL developer doesn't really understand how to do transactions right but their ass gets saved (in a probabilistic sense) by the grouping of updates that is implicit by running an INSERT or an UPDATE against a table.

There's also the fact that a lot of triple stores are seriously half baked research-quality code if that. Many triple stores struggle if you just try to load 100,000 triples sequentially, for an application like my YOShInOn RSS reader which I expect to use every day and not have to patch or maintain anything for 18+ months. (Ok, a 20GB database that needs to be pruned crept up on me gradually, but that's an arangodb problem, I'd expect the average triple to store to have crumbled 17 months ago.)

I'd love to have something that updates like a document-oriented database but lets you run a SPARQL query against the union of all the documents. Database experts though always seem to change the subject when it comes to having a graph algebra that lets you UNION 10 million graphs.

(For that matter, I sure as hell couldn't pitch any kind any kind of "boxes-and-lines" query tool [1] etc. that passed JSON documents/RDF graphs over the lines between the operators to the VCs and private equity people who were buying up query engines circa 2015 because they were hung up on the speed of columnar query engines... Despite the fact that the ones that pass relational rows over the lines require people who really aren't qualified to do so create analysis jobs that look like terrible hairballs because of all the joins they do.)

[1] Alteryx, KNIME

smarx007 · 2025-01-31T20:57:41 1738357061

> you have to wrap the whole sandwich in a transaction

True, SPARQL does not allow "opening" transactions such that you can run one query, do some logic, and run another query while doing commit. Which was a pain for me. RDF4J has a non-standard API to do that, I think they are trying to upstream it to SPARQL 1.2.

> There's also the fact that a lot of triple stores are seriously half baked research-quality code if that.

Also true. Although excellent researchers who wrote one of the best reasoners (Pellet) decided to leave academia and make a production grade system. They succeeded with Stardog but you don't want to know how much a license costs.

> couldn't pitch any kind any kind of "boxes-and-lines" query tool [1] etc. that passed JSON documents/RDF graphs

I really enjoy this talk from one of the creators of OWL [1]. There, he makes a point that OWL is unpopular not because it's too complex but because it's not advanced enough to solve real problems people care about (read: ready to pay money for). I think the case you described involves VCs having clarity on how to make money off one thing but not the other. I do think that the Semantic Web 3.0 (if we count Linked Data as a Semantic Web 2.0 aka Semantic Web Lite attempt) will need a better (appealing to business) case than the one presented in the 2001 SciAm paper.

[1]: https://videolectures.net/videos/eswc2011_hendler_work

kendallgclark · 2025-01-31T23:39:53 1738366793

OWL ontologies making a big comeback as part of Knowledge Graph groundings for LLM outputs. And several SPARQL and RDF knowledge graph startups are VC-baked and thriving. The world is a big place.

smarx007 · 2025-02-01T00:06:31 1738368391

Well, there is the new use case that appeals to VCs! And I guess it's a good reminder that I should re-subscribe to your blog :)

PaulHoule · 2025-01-31T21:08:12 1738357692

Personally I thought Stardog was trash, but if I'd had different requirements I might be happy with it.

The trouble w/ OWL as I see it (talked about in that TR) is that people don't really want "first order logic", but they want "first order logic + arithmetic" which is a nightmare that Kurt Godel warned you about. (That ISO 20022 which that TR is related to is about the financial domain which is all about arithmetic)

After Doug Lenat's death a lot of stuff came out that revealed the problems w/ Cyc, not least that even if you try to build something that is "knowledge based" it can't practically solve all the problems you want using a SMT-based strategy but you have to build a library of special purpose algorithms for everything you want to do and it turns out to be a godawful mess.

I'm disappointed that the semweb community hasn't made a serious crack at usable and efficient production rules (dealing w/ problems like negation, controlling execution order, RETE execution, retraction) instead we get half-answers like SPIN with fixed-point execution (used an even more half-baked version of that to research that TR, gets you somewhere). Of course, production rules never got standardized in any domain because nobody can agree on the way to address those four issues even though it usually isn't hard to find an answer that's fine for a particular application.

(It's a frequently problem that experts on a technology can get by on half-baked specific answers that would need a general solution if they were going to be useful for a general audience. One reason why parser generators are so bad is that if you understand parser generators enough to write a parser generator you aren't bothered by the terrible developer experience of parser generators.)

kendallgclark · 2025-01-31T23:35:49 1738366549

You seem nice.

PaulHoule · 2025-02-01T03:18:16 1738379896

Sorry for the negativity Kendall but the semweb didn't return the love that I gave it. I did hundreds of sales calls that went nowhere, but my phone kept ringing for people who wanted me to work on neural nets.

kendallgclark · 2025-02-02T03:48:40 1738468120

That’s tough. Not sure what that has to do with Stardog. Biggest companies in the world rely on it daily and you say it’s trash. I couldn’t find an email from you using it since 2013. I guess we figured something out. NNs are cool too; at last count we use half a dozen different ones including GNNs… NeSy is hot and I can hardly read a paper these days that doesn’t talk about triples.

PaulHoule · 2025-02-02T13:46:52 1738504012

(1) I'll grant it was a long time ago. Things could have changed a lot.

(2) It's generic that a new database comes out, gets hyped, but turns out to be "trash" when you try to use it. If a new database was actually good that would be exceptional. (Probably in 2013 it satisfied somebody's requirements but the hype for Stardog in 2013 seemed to be entirely out of line with what I needed for the project I was doing at the time)

I thought Postgres was trash in 2001 and called it CrashGreSlow, now I swear by it. Early on people were making big claims for it that were not substantiated but people did the hard work over a long time to make it great.

I thought mongodb was trash when it came out, then I worked for a place that used it despite the engineers believing it was trash and begging me not to use it for a spike prototype. It never got better. Now it is common knowledge that mongodb is trash.

(3) Maybe it's not fair but I was hurt by the experience, my wife was furious at the balance I'd run up on the HELOC chasing my Moby Dick. As an applications programmer who was accustomed to getting things right I had a terrible opinion of most of the luminaries in the semantic web field at the time many of whom were shipping code that was academic quality at best.

UltraSane · 2025-02-01T11:25:52 1738409152

You mean brutally honest.

svilen_dobrev · 2025-01-31T19:39:28 1738352368

datomic (and partially xtdb /former crux) are OLTPish, and use only such "tuples" , essentially it's up to the user to define what constitutes an entity if at all ("row", "object", "document", whatever) - maybe some entity-id and everything linked to it, but maybe other less-identity-related stuff. Which might feel freeing to extent, but as you said, also expects great responsibility/discipline to cobble the proper properties together.

PaulHoule · 2025-01-31T20:04:06 1738353846

Mathematically the boundaries of a record can be defined by production rules

https://en.wikipedia.org/wiki/Business_rules_engine

which could be written as SPARQL queries, I've used these to cut records out of a big graph, I haven't thought seriously if these could be built into a large scale general purpose systems.

The most fun I ever had with Jena was when I used the rules engine for the control plane of a batch processing system which used stream processing primitives [1]

https://jena.apache.org/documentation/inference/

The Jena folks said my use was completely unsupported, I had looked at the source code and got to understand how the rules engine worked and I knew damn well there was nothing wrong with what I was doing.

I've thought a lot about why production rules have had so little impact on the industry, I mean people really hate drools

https://www.drools.org/

That kind of system is particularly strong at handling deep asynchrony, like when a business process at a bank might involve some steps where you might have to wait for a loan office to approve a loan. It's disappointing to me that nobody has tried to use them (so far as I can tell) to deal with the asynchronous comms problems in Javascript though I've yet to get a clear picture in my mind about how to get started on that. (Funny I am getting an idea now so I'm putting a ticket on my personal Kanban board)

[1] I worked later at a place that had a similar engine written in very awkward Scala that allegedly used Either and Optional for error handling but actually dropped errors most of the time; I knew what algebra my engine supported, they argued whether or not something like that had an algebra; my engine got the same answers every time because it tore down the system properly at the end, their engine gave different answers every time but they didn't seem to care

smarx007 · 2025-01-31T20:17:31 1738354651

I said suitable for newcomers aka people touching RDF for the first time. If you want production-ready, you probably want Stardog, Ontotext GraphDB, or AWS Neptune - neither is cheap. https://github.com/the-qa-company/qEndpoint is also an interesting project that's used in production.

zozbot234 · 2025-01-31T18:43:34 1738349014

> SPARQL guarantees no such composability.

SPARQL has a CONSTRUCT clause which gives you RDF as your query output. Isn't that compositional enough?

FjordWarden · 2025-01-31T19:46:47 1738352807

Ok, that is true, but how do I tell my graph database that the result of the construct query is some other graph in my DB?

smarx007 · 2025-01-31T20:34:57 1738355697

> how do I tell my graph database that the result of the construct query

I am assuming you are asking how to do a CONSTRUCT query that will return you the contents of a given named graph?

https://www.w3.org/TR/sparql11-http-rdf-update/#http-get is a much simpler way to get a graph. As the spec says, it's equivalent to the following query

       CONSTRUCT { ?s ?p ?o } WHERE { GRAPH <graph_uri> { ?s ?p ?o } }

UltraSane · 2025-02-01T03:28:23 1738380503

Property graphs map to relational databases pretty well. Using Neo4j's terminology

table name -> node label

table row -> node

table column -> node property

The result of a query is a sub-graph and is very composable.

AtlasBarfed · 2025-02-02T19:54:42 1738526082

I think it maps much better to document databases.

Nodes are just documents.

You just need to slap on a relations document type for the graph edges, and to store edge properties

I was close to finishing at least version 1.0 of a document/graph database on top of Cassandra and dynamodb.

UltraSane · 2025-02-03T04:09:57 1738555797

With Neo4j properties and list members cannot be complex objects just like DB table rows. I was thinking that my dream DB would be a hybrid of MongoDB documents with Neo4j type relationships between them.

hobofan · 2025-01-31T18:53:10 1738349590

As someone that has built production systems with Oxigraph (and a bit less with Jena), I'd recommend Oxigraph over Jena any day. Especially if you have you are working with a Rust-based tech stack.

You can save so much time and headache based on less operational complexity and the architectural options it opens up. If you only reinvest part of that into building a framework for versioning/backups, etc. you'll have a much better overall package.

Karrot_Kream · 2025-02-01T09:17:20 1738401440

Interesting. Do you think RDF is easier to work with in Rust's ecosystem than Java's ecosystem as a whole in your opinion? I've only touched Jena and worked with Java and Go systems with RDF.

iddan · 2025-01-31T18:54:11 1738349651

Systems like Apache Jena are not production ready for anything serious. It makes total sense to start something different

spothedog1 · 2025-01-31T22:14:26 1738361666

Definitely do not start with Jena/Fuseki, pain in the ass to set up. Start with Oxigraph or rdflib in memory to play around with how to query/interact with the graphs

WhatIsDukkha · 2025-01-31T16:06:42 1738339602

Here is a bug with some back and forth between millenniumdb and qlever in starting a benchmarking attempt but I don't see results, though they managed to build and import.

https://github.com/MillenniumDB/MillenniumDB/issues/10

https://github.com/ad-freiburg/qlever

sunshine-o · 2025-01-31T23:27:17 1738366037

I got very interested in RDF about 20-25 years ago.

Obviously it did not really succeeded but it seems some industries invested a lot into the tech and it is still around. Especially since AWS built a service around it.

I am really curious, what are the top use cases for it today?

bawolff · 2025-02-01T00:22:15 1738369335

I think wikidata (https://query.wikidata.org) is one of the more well known ones.

jerven · 2025-01-31T17:58:25 1738346305

MilleniumDB is an interesting engine, as is Qlever mentioned in other comments. I think both are good candidates at making RDF graphs one or two orders of magnitude cheaper to host as sparql endpoints.

Both seem to have arrived at the stage of transitioning from research to production code.

Very exiting for those of us providing our data in RDF and exposing Sparql.

AWs Neptune analytics is also very interesting, allowing Cypher on RDF graphs. Even the Oracle inbuilt RDF+Sparql seems to have improved greatly in 23ai.

UltraSane · 2025-02-01T11:30:59 1738409459

It seems like writing Cypher to query RDF would be hard.

jitl · 2025-01-31T15:53:16 1738338796

Is it any good?

UltraSane · 2025-01-31T16:11:41 1738339901

What is a domain graph?

throwaway867500 · 2025-02-03T10:40:40 1738579240

"Domain Graph" [1] was renamed to "Multilayer Graphs" [2].

The "Multilayer Graphs Model" was aimed to address the limitations found in prior Graph Models (e.g. RDF, Property Graphs) in representing higher-arity graphs without having to resort to reification or reserved words/vocab.

Skipping over the formal math definitions, a Multi-Layer Graph, in practical terms, is represented by statements of "quads": "{edge id, source, label, target}" -- similar on the surface to RDF-named graphs, but not.

The source and target may also refer other edges in addition to referring to the "entities" of real-world-concepts; targets may also be data types (e.g. strings, ints, etc)-- I believe that MilleniumDB puts edges, sources, targets and some simple values under the same packed int64 namespace.

Useful for MilleniumDB under-the-hood design/architecture, and probably for Wikidata where their data model are qualified facts-- a qualifying statement (e.g. "valid from 2020 to 2024") about a factual statement ("Alice Lived in America").

But this is just me, a non-expert, trying to cut to the core points after discovering and reading the papers just recently since knowledge graphs are hot topics right now.

[1] https://arxiv.org/abs/2111.01540 [2] https://users.dcc.uchile.cl/~ahogan/docs/mutlilayer_graphs.p...

UltraSane · 2025-02-04T01:46:23 1738633583

allowing edges to have edges is something that RDF* allows.

Property graphs DBs like Neo4j don't support it but you can do it by using a node as a relationship. This is called a metanode or a hypernode. The need for this is mitigated somewhat by the fact that property graphs allow edges to have properties themselves. so you would use

(Alice)-[:LIVES_IN{valid_from:2020,valid_to:2024}]-(USA)

Edit: Read the paper and it is actually an attempt to unify the RDF, RDF* and property graph data models which is VERY interesting.

leetrout · 2025-01-31T17:13:11 1738343591

Weird title here. The repo says "Property Graph and RDF engine, still in development" with no mention of domain.

dang · 2025-01-31T17:20:04 1738344004

We've changed the title to that of the page. (Submitted title was "MillenniumDB: A graph database engine using domain graphs")