I’m calling these six conceptualizations “memory models”, even though that term used to mean a specific thing that is related to 8086 addressing modes in C but not really related to this. (I’d love to have a better term for this.)
I would just call them "data models". The relational model is clearly a data model; it's based on tables with homogeneous columns. C++ uses the data model of typed memory; Python and Java use the object graph data model, etc.
Another possible name would be "meta-object model". (That term shouldn't require "classes")
Although I'm not sure that they all go together. The first 3 are definitely related (records, object graph, parallel arrays), and so are the last 2 (relational vs. hierarchical file system).
I don't see how pipes really fit in (and I'm writing a shell [1]). A pipe is a fixed-size buffer inside the kernel, outside the address space of any process. That distinction is orthogonal to how you actually organize your memory within a process, which is what the first 3 things are about.
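A minimal sketch of that distinction, assuming a POSIX-like system where `os.pipe()` is available. The `record` dict is a made-up in-process structure, and `repr()` is just a stand-in for whatever encoding you would really use. The point is that only bytes cross the pipe, so whatever model you use inside the process has to be flattened first:

```python
import os

# os.pipe() returns two file descriptors; the buffer between them
# lives in the kernel, not in this process's address space.
read_fd, write_fd = os.pipe()

# Hypothetical in-process structure (any model would do here).
record = {"name": "alice", "age": 30}

# The pipe carries bytes only, so the structure is flattened first.
payload = repr(record).encode("utf-8")
os.write(write_fd, payload)  # small write, fits in the pipe buffer
os.close(write_fd)

# The reader gets back bytes, with no knowledge of the original structure.
received = os.read(read_fd, 4096)
os.close(read_fd)

print(type(received).__name__, len(received))
```

How the reader rebuilds a structure from those bytes is a separate decision from how either process organizes its own memory.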
And I think that persistence is an orthogonal concern as well. It applies to all of them except pipes. SQL and hierarchical file systems are persisted in practice, and you will have no problem persisting COBOL-style records or parallel arrays.
Object graphs have the problem with serialization that he mentions: JSON and Protocol Buffers can't represent cyclic data structures; they represent trees.
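A small illustration using Python's standard `json` module (the `tree`, `parent`, and `child` dicts are made up for the example). A tree round-trips fine, but as soon as a child points back at its parent the encoder refuses:

```python
import json

# A tree serializes and round-trips without trouble.
tree = {"name": "root", "children": [{"name": "leaf", "children": []}]}
encoded = json.dumps(tree)

# An object graph with a cycle: the child refers back to its parent.
parent = {"name": "parent", "children": []}
child = {"name": "child", "parent": parent}
parent["children"].append(child)

# json.dumps checks for circular references by default and raises ValueError.
try:
    json.dumps(parent)
    cycle_error = None
except ValueError as exc:
    cycle_error = exc

print(type(cycle_error).__name__)
```

The usual workaround is to replace the back-pointer with an ID and resolve it after loading, which is effectively turning the graph into a tree plus a lookup table.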
The analogy is that Zephyr ASDL is to ML as protocol buffers are to C++ (although protobufs now have unions which could be represented as sum types I think).
Maybe the desire to call them "data models" is an argument that the relational model doesn't really belong in this essay, because it really is much more of a data model (abstractly describing an ontology of mathematical objects) than a memory model (describing how to map that ontology onto bytes in memory).
Yes I agree about the relational model. It is higher level than the rest -- one of the key ideas in Codd's paper was to abstract data away from concrete storage. In contrast, the records, parallel arrays, and to some degree the object graph model are pretty closely tied to concrete storage. C/C++/Go all explicitly specify the memory layout and allow the programmer to control it by design.
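To make "closely tied to concrete storage" a bit more tangible, here is a rough sketch using Python's `struct` module to imitate a C-style record. The field layout (a 32-bit id, an 8-byte name, a 64-bit float) is invented for illustration, and `<` asks for little-endian with no padding, which is an assumption a real C compiler would not necessarily share:

```python
import struct

# Hypothetical record layout: int32 id, 8-byte name, float64 balance.
# "<" means little-endian, standard sizes, no alignment padding.
RECORD = struct.Struct("<i8sd")

# Packing places each field at a fixed byte offset in the buffer.
packed = RECORD.pack(42, b"alice", 99.5)

# Unpacking reads the same offsets back; the name is NUL-padded to 8 bytes.
rec_id, name, balance = RECORD.unpack(packed)

print(RECORD.size, rec_id, name.rstrip(b"\x00"), balance)
```

Here the logical record and its bytes are the same thing, which is exactly what the relational model avoids promising.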
And as mentioned, I think the relational model and file system are interesting but orthogonal topics.
I do think this pattern of taking the memory model/data model and "externalizing" into a DSL is interesting (JSON, protobufs and many other schemes, ASDL). That makes it clear that persistence is an orthogonal concern.
One thing I've been thinking about, and which your article helped me home in on, is that scripting languages almost all use the object graph model, but that model is inefficient on modern computers. Pointers are huge and they lead to scattered data.
This seems like a pretty enormous amount of overhead... there is more "metadata" than there is data!
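One rough way to see this kind of overhead for yourself, assuming CPython (the exact numbers depend on the interpreter version and platform, so none are quoted here). It compares a list of integers, where every element is a pointer to a separate object, against a packed `array` of 64-bit values:

```python
import sys
from array import array

n = 1000
values = list(range(1000, 1000 + n))

# Object graph: the list holds one pointer per element, and each
# pointer refers to a separately allocated int object.
pointer_bytes = sys.getsizeof(values) - sys.getsizeof([])
object_bytes = sum(sys.getsizeof(v) for v in values)
graph_bytes_per_item = (pointer_bytes + object_bytes) / n

# Typed memory: the same values stored inline as 64-bit integers.
packed = array("q", values)
packed_bytes_per_item = packed.itemsize

print(graph_bytes_per_item, packed_bytes_per_item)
```

This only counts what `sys.getsizeof` reports, so it is a lower bound on the object-graph cost; it ignores allocator overhead and the cache misses from chasing pointers.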
Another point: Someone else mentioned R and pandas. I've been meaning to write a blog post called "R is the only language without the ORM problem". There's no mismatch, because R's data model is the same as SQL's -- tables with homogeneous columns (this is in the logical sense). It's meant for "measurements and observations" rather than "business data", but I don't see any fundamental reason why these are different. It's more about R's implementation quirks than the logical model.
So that is another argument that persistence is a separate concern. R has non-persistent tables, but SQL has persistent tables.
Another example is Redis. Redis is persistent (although it didn't start off that way), but it doesn't use the relational model. I haven't used it too much, but as far as I know it has dictionaries, sets, and lists. So it looks like a database server but has a different model.
So I think these concerns should be represented in the taxonomy:
- logical vs physical model (logical is what the user sees; physical is concrete storage). You can have an SQL database that is row-oriented or column-oriented. And I noticed that the Jai programming language has this structure-of-arrays vs array-of-structures duality built in.
- Persistence -- each model can be dealt with in-memory or on disk. I didn't know that COBOL dealt with records on disk, which is interesting. A B-Tree is a data structure with pointers, but it's designed for being serialized.
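The logical vs. physical split in the first bullet can be sketched in a few lines. This is only an illustration: the `Point` type and both helper functions are made up, and plain Python lists stand in for real row-oriented and column-oriented storage. The same logical table is stored two ways and the same query gives the same answer:

```python
from collections import namedtuple

# Logical model: a table of points with columns x and y.
Point = namedtuple("Point", ["x", "y"])

# Physical layout 1: array of structures (row-oriented).
rows = [Point(1.0, 2.0), Point(3.0, 4.0), Point(5.0, 6.0)]

# Physical layout 2: structure of arrays (column-oriented).
columns = {"x": [p.x for p in rows], "y": [p.y for p in rows]}

def mean_x_rows(rows):
    # Walks every record and picks one field out of each.
    return sum(p.x for p in rows) / len(rows)

def mean_x_columns(columns):
    # Touches only the one column the query needs.
    xs = columns["x"]
    return sum(xs) / len(xs)

print(mean_x_rows(rows), mean_x_columns(columns))
```

The user-visible answer is identical; only the access pattern differs, which is why the choice can be left to the storage engine (or, in Jai's case, to a declaration on the type).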
"Memory model" made me think of concurrency: https://en.wikipedia.org/wiki/Java_memory_model
I wrote about three domain-specific languages for persisting different data models here: http://www.oilshell.org/blog/2016/12/16.html
[1] http://www.oilshell.org/blog/