This is a really important piece for developers to read.
We get very used to the ability to blow away everything and start from scratch, and that's very often the right answer for build tools, but the database is fundamentally different from that, and you need to internalize this.
A database, in most cases, is the business. It's essentially a collection of all the contracts made with customers, and your mode of thinking "I can blow this away and start over" that you've learned from all sorts of build tools is completely contrary to that.
An infamous example of this is in Active Record migrations[1], and possibly some other migration tools.
> Note that we define the change that we want to happen moving forward in time. Before this migration is run, there will be no table. After, the table will exist. Active Record knows how to reverse this migration as well: if we roll this migration back, it will remove the table.
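(For concreteness, the forward and reverse steps of a migration like the one quoted boil down to a pair of DDL statements along these lines; the `products` table here is hypothetical:)

    -- up: before this runs, there is no table; after, it exists
    CREATE TABLE products (
        id   bigserial PRIMARY KEY,
        name text NOT NULL
    );

    -- down: the automatic reversal is destructive -- every row
    -- customers have written since the migration ran goes with it
    DROP TABLE products;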
These are smart people, and indeed there are some rare cases where this is the right thing to do. But only our strongly ingrained belief that data is ephemeral would cause us to think this should be the normal behavior, that we'd want to blow away customer data to force the database to track what version X of the code thinks it should be.
[1] http://edgeguides.rubyonrails.org/active_record_migrations.h...
> A database, in most cases, is the business. It's essentially a collection of all the contracts made with customers, and your mode of thinking "I can blow this away and start over" that you've learned from all sorts of build tools is completely contrary to that.
Thanks for this metaphor. One of the other big benefits to a database-first design is that as your business becomes more complex, there will inevitably be other applications querying the database. If you don't model your constraints in the database you'll end up with a microservices architecture where everything has to call through facades to talk to the database. If you model your constraints correctly in the database, it's much easier to open access to the data to other applications.
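A minimal sketch of what modeling constraints in the database looks like (hypothetical tables; the point is that every client, in any language, inherits the same rules):

    CREATE TABLE accounts (
        id      bigserial PRIMARY KEY,
        email   text NOT NULL UNIQUE,                  -- no duplicate signups
        balance numeric NOT NULL CHECK (balance >= 0)  -- no client can overdraw
    );

    CREATE TABLE orders (
        id         bigserial PRIMARY KEY,
        account_id bigint NOT NULL REFERENCES accounts (id),  -- no orphan orders
        status     text NOT NULL CHECK (status IN ('pending', 'paid', 'shipped'))
    );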
The idea of code generating the database schema seems absurd because in all but the first steps of a greenfield app, the code is clueless about what the data means. It's the database's job.
One reason I think we are seeing a renaissance of database level access control is the popularity of connecting mobile applications and web clients directly to the database. Even if you don't think it's a good practice, the pressure mobile and single-page-app developers are putting on APIs like Firebase, Cloudant, and my employer FaunaDB to offer rich access control features means developers have more options.
Here is a write up about how FaunaDB's security features can be used for common app paradigms. By using the database's security features, you can model things like user consent that are relevant for GDPR. https://blog.fauna.com/data-security-in-the-age-of-cloud-nat...
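(FaunaDB's mechanism is its own, but to make the idea concrete in generic SQL terms, PostgreSQL's row-level security does the same kind of thing; the table, column, and `app.current_user_id` setting below are hypothetical:)

    -- The database, not the app, decides which rows a user may see.
    ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

    CREATE POLICY owner_only ON documents
        USING (owner_id = current_setting('app.current_user_id')::bigint);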
My point being that if you let the database, not the code, be the source of truth, the benefits multiply.
> My point being that if you let the database, not the code, be the source of truth, the benefits multiply.
I know it seems like a revelation to many but this should be something beyond an axiom. Code is ephemeral, data is the only thing that has any kind of longevity. But you're right!
But businesses change, and then so do your models? And they can change radically. Or different laws force you to change the whole database, like with GDPR.
I do agree with the OP; it's just that this reasoning doesn't resonate with my experience with regard to modeling. Quite frankly, the data is a result of the business, and if the business changes, the data and the models have to change too.
Yeah, that’s the thing. As programmers we often forget that we only exist to support the business. Code comes and goes, but we wouldn’t have jobs if the business didn’t exist.
There are very, very common cases where this is the right thing to do - development. You'll want to be able to check out a branch, apply the migrations, test the code, and roll back the migrations before you switch off to another branch all the dang time.
When you're limiting yourself to discussion of production environments, absolutely you're correct, rollbacks should be essentially prevented, ideally through tooling that makes them nearly impossible to run, especially without performing a backup first.
I think this idea of development environment exceptionalism is one of the "walls" that the original DevOps movement hoped to remove.
Although I agree that for something like rapid prototyping or experimentation or proof-of-concept, it could make sense to have a mechanism for database changes that is even vastly different from what might ever exist in production.
However, for any routine development, it's important that those procedures match what happens in production, as a hedge against the day when it's necessary to troubleshoot a production issue involving database interaction and to avoid some of the pitfalls inherent in mismatches in the first place.
Coming from the Ops side, I think we Ops folks (including DBAs, to the extent that they exist in smaller environments) need to do more to provide developers with usable, production-like databases that actually match their routine workflows. We may not have quite the same breadth of tools or be able to provide quite the same speed/agility in switching branches, but we can get much closer than the status quo.
One way to see it is that the database is a microservice that manages data for all the other microservices and its API is (for example) SQL. After that it should be clear why any of the other microservices should not have the privilege to include the tools to define the data.
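Concretely, the "API" of that service is enforced with roles and grants; a sketch with illustrative names, where no client role gets DDL privileges:

    CREATE ROLE billing_service LOGIN;
    GRANT SELECT, INSERT ON invoices TO billing_service;
    GRANT SELECT ON customers TO billing_service;  -- read-only where appropriate
    -- No CREATE/ALTER/DROP granted: a consuming service cannot redefine the data.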
Unfortunately Rails or Django or any other framework's migrations are so much more convenient compared to maintaining a schema manually. Any database centric migration tool would have to update the models for all the projects using it. That's not easy AFAIK.
I'm working on a project in which different teams are writing services in Elixir (Phoenix) and C# (no idea what's in there). We're using different databases and an HTTP API, so the databases and their schemas are hidden inside every (not so micro) service. We didn't share the database because it would have been a nightmare to mirror changes in both codebases.
> One way to see it is that the database is a microservice that manages data for all the other microservices and its API is (for example) SQL.
That's a good point, especially since, in the common (or at least initial) situation, that's how it's approached.
However, I'm wary of taking the concept to an extreme [1] in that many modern RDBMSes are very much not "micro". Besides the storage optimizations that the article mentions, modern RDBMSes are capable of embedding programmatic business logic (e.g. stored procedures, triggers), but there seems to be a categorical avoidance of using these features by most devs, even if it means moving huge quantities of data to a different microservice (presumably across a network) to perform the processing.
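To illustrate the kind of thing being avoided, here is a sketch of pushing an aggregation into the database as a function instead of shipping the rows to another service (all names hypothetical):

    CREATE FUNCTION order_totals(p_customer_id bigint)
    RETURNS TABLE (order_id bigint, total numeric)
    LANGUAGE sql STABLE AS $$
        SELECT o.id, sum(li.quantity * li.unit_price)
        FROM orders o
        JOIN line_items li ON li.order_id = o.id
        WHERE o.customer_id = p_customer_id
        GROUP BY o.id;
    $$;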
> Unfortunately Rails or Django or any other framework's migrations are so much more convenient compared to maintaining a schema manually.
The article addresses this by recommending code generation, which is, essentially, the converse of SQL-generation. Is either one more manual than the other?
Perhaps the generated SQL seems less manual, because the database, in that situation, isn't a first-class citizen, as it were. A dropped table is, essentially, invisible to the developer. That is, I believe, the author's point, though.. the appearance of convenience is at the expense of actively ignoring pitfalls.
> Any database centric migration tool would have to update the models for all the projects using it. That's not easy AFAIK.
I'm not sure I follow.. Are you describing one database with multiple "client" applications? If so, it's not clear how an app-centric model would be any easier, with multiple apps all wanting to perform duplicate migrations (or.. duplicated/conflicting rollbacks?).
[1] Granted, this could be a critique of the "microservice" model in general
Client applications should not migrate the database. That's up to the database, with any tool fit for the purpose. I'm ignorant in this matter.
Then there is code generation to keep the models up to date. The generation tool should know all the ORMs or whatever used by the clients and modify their models. Saying that this is not mainstream is an understatement. I know some reverse engineering gems for Rails but nothing like that.
> The generation tool should know all the ORMs or whatever used by the clients and modify their models. Saying that this is not mainstream is an understatement.
I'm not sure that there would ever need to be any code-generation tool that knows more than one ORM and is, presumably, bundled with that ORM.
Now, if you're saying that they don't actually exist, currently, for anything other than jOOQ (which is being promoted in the article), I don't doubt that.
However, given something like ActiveRecord's schema discovery functionality, maybe that's good enough when combined with slightly more manual intervention. Other ORMs may be better or worse, of course.
I don't think I would be comfortable doing database-schema-related work in a branch unless I was also essentially branching my database at the same time. Even for my mostly garbage personal developer database that's filled to the brim with nonsense names and keyboard mashing. The chances of breaking something and wasting a large amount of time and data are too great. Of course, Thou Shalt Have Backups, but it's not exactly a frictionless experience if you fry things and have to go back to them.
Unfortunately it's not quite as quick and easy to clone off a copy every time you want to run some tests or fiddle with a spike as it is to jump between Git branches.
> Of course, Thou Shalt Have Backups, but it's not exactly a frictionless experience if you fry things and have to go back to them.
I do wonder how much of this is because of, specifically, PostgreSQL not having enough "market" penetration. For a very long time, it's had very good unclean-shutdown resilience with the way it does its write-ahead logs and has even had point-in-time-recovery (using those same WALs, just stored longer) for several years.
Combine that with a snapshotting filesystem, and that would seem like a pretty low amount of friction for recovery, without even having formal backups.
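(For reference, the moving parts are small; a minimal PITR sketch, with the archive path being whatever your environment provides:)

    # postgresql.conf: ship WAL segments somewhere durable
    wal_level = replica
    archive_mode = on
    archive_command = 'cp %p /var/lib/pgsql/wal_archive/%f'

    # recovery.conf: restore a base backup, then replay WAL up to a point
    recovery_target_time = '2018-06-01 12:00:00'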
If your database is big enough, you need a certain amount of performance out of it, or you need it to be a recent enough copy [1] of the production data (which likely means "all of the above"), then it does start to get legitimately challenging.
[1] or, worse, something ETL'ed (anonymization comes to mind)
If you have a quick script to initialise the db into some known state, and are defining all the schema changes as migration scripts, and maybe have access to something like pg virtual to stand up temporary db test environments, you could probably get pretty close to this by just version controlling all the scripts required to stand up a db and apply all the migrations.
Check out Delphix (https://www.delphix.com). It provides a ZFS-backed copy of your production database per developer using snapshots, with rollback. Disclaimer: a friend worked there, I haven’t used it personally, and it’s not free.
Rolling back on dev only is still bad. That is developing and testing against a version of the database that you know will never be prodlike. The dev database should be created by snapshots or forward migrations from scratch.
I think you've misunderstood what ActiveRecord migration rollbacks are for. Nobody is 'blowing away' customer data. A couple use cases are:
1. During development. Named that column wrong? Included a column you realize you don't need before you merge that branch? No problem! Just rollback, fix the migration, and do it again.
2. During deployment. It gives you a sane way to roll back a botched deploy.
Whatever their intended purpose (by the developers, I assume), that has no bearing on how they're actually used.
> Nobody is 'blowing away' customer data.
That's a remarkable claim. Although it wasn't ActiveRecord specifically, I've seen this happen with a similar migration/rollback mechanism. Granted, it was only a few hours of customer data, and it would not have existed in the first place were it not for the initial upgrade/migration (subsequently rolled back), but the data was accepted and subsequently lost. Good thing there weren't any SLAs (yet).
> 2. During deployment. It gives you a sane way to roll back a botched deploy.
Even this use case can fall into the data loss trap in my example above. What if nobody realizes the deployment was botched until hours later?
Not to mention that there are enterprise tools for rolling back version changes including database migrations, and enterprises which enforce their use, so the rollback scripts are essential.
I remember vividly a conversation with a developer whose proposed rollback was that we just move the data to a temporary table. And when our customers call us and ask where their money's gone, do we tell them it's safely in a temporary table?
Well, in an ideal world applying an active record migration backward and then forward again should be a no-op to database contents. This would require storing removed columns or tables somewhere, but there's no reason that couldn't be done by the ORM.
Almost nobody does this, but perhaps they should and ORMs should make it easy to do or even the default.
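One way to approximate that in plain SQL is for the "down" step to park data instead of destroying it, so a later "up" restores it losslessly (names hypothetical):

    -- down: preserve instead of destroy
    ALTER TABLE users RENAME COLUMN nickname TO nickname_removed_20180601;

    -- up, re-applied later: nothing was lost
    ALTER TABLE users RENAME COLUMN nickname_removed_20180601 TO nickname;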
> This would require storing removed columns or tables somewhere...
Simpler, you have an underlying set of base tables, and each "migration" is simply creating a new schema with a set of views / stored procedures over the base tables.
Then your migrations will be forward only for the base tables, and constructing a new schema of views over them for the new version. Older versions of the app continue reading their original views.
And once you've decided a version has been deprecated long enough and any legacy data or indices can be safely removed, you'd move the earliest valid version forward.
That approach means you have to be able to come up with a view of the data consistent with the existing base tables, of course, and the base table migrations have to be compatible with the oldest valid schema version.
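A minimal sketch of the arrangement (hypothetical names; suppose v2 split `name` into first/last on the base table):

    CREATE SCHEMA v1;
    CREATE SCHEMA v2;

    -- new clients see the new shape...
    CREATE VIEW v2.customers AS
        SELECT id, first_name, last_name FROM base.customers;

    -- ...while v1 clients keep reading the shape they were built against
    CREATE VIEW v1.customers AS
        SELECT id, first_name || ' ' || last_name AS name FROM base.customers;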
And you run into other sticky questions like adding constraints: if v1 doesn't have a constraint, but v2 does, do we:
1. apply the constraint to both and v1 is now going to have to deal with errors
2. apply the constraint only to v2, and it's sometimes going to see invalid data from v1
3. apply the constraint only to v2, and invalid data will be hidden
It's great in theory, but in most databases I've used, if you get too crazy with views your performance goes to crap. The optimizer just can't handle it.
Views are great for specific use cases, but if your interaction with the underlying tables is entirely through views I've never had that work really well.
Chicken and egg problem. It's inefficient because no one does it, not because of fundamental computer science problems. If Active Record et al adopted it, I guarantee you database vendors would make it efficient.
No, you shouldn't use a code-first or a database-first design.
The article explains the problems with the code-first approach very well, but there are just as many problems with the database-first approach:
- There is never just one database. There will be your production database, a staging database, one or more testing databases, local development databases, etc. This makes the database useless as a single source of truth.
- When you write code against your development database, there's no guarantee that it will be the same as the production database, and you shouldn't be able to connect to your production database while developing locally to verify that. Usually you want to deploy identical build artefacts to staging and production, so there's a danger that your staging database gets out of sync with production (eg. after an issue was caught and improperly rolled back on staging) and then you deploy your fully tested code from staging to production and your entire system breaks down.
- When you need to restore a backup of your database, you need to somehow find a compatible version of the code to deploy as well.
My recommendation is to use your migration scripts as your source of truth. Each service should access separate schemas within the database, or at the very least separate tables. If you really need multiple services to share tables then those services should be versioned and deployed as one.
The database should have a table which stores a list of the migrations which have been applied, and you should use a migration tool such that when you deploy, outstanding migrations are applied as required.
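(The shape of that tracking table is roughly the same across tools; a minimal sketch:)

    CREATE TABLE schema_migrations (
        version    text PRIMARY KEY,  -- e.g. '20180601120000_add_orders'
        applied_at timestamptz NOT NULL DEFAULT now()
    );
    -- Deploy step: run, in order, every migration script whose version is
    -- not yet in this table, inserting a row as each one succeeds.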
If you want something easier to read than the history of migration scripts, then you should keep an up-to-date schema as well next to your code, and verify as part of CI that running all the migrations results in that exact schema.
It can help to establish rules such as "new DB migrations must always be compatible with old code" aka "migrations first". This gets everyone into the mindset that a db+code change must be carefully staged in a way that both (new DB, old code) and not just (new DB, new code) are tested. If the migration goes wrong due to different data in prod, it's easier to roll that back before new code is deployed.
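In practice "compatible with old code" usually means expand-then-contract; e.g., renaming a column becomes two separately deployed migrations (hypothetical names):

    -- Step 1, ships before the new code: purely additive, old code unaffected
    ALTER TABLE users ADD COLUMN full_name text;
    UPDATE users SET full_name = name WHERE full_name IS NULL;

    -- Step 2, ships only after the new code is live everywhere: the contract
    ALTER TABLE users DROP COLUMN name;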
In the very early stages of development it doesn't really matter how you build your database schema, but spending some time getting the data model as clean as possible is pretty important.
Agree that scripting is important once data moves to production - once that happens, the database in Prod is master and the deploy script controls not only the code being pushed but also any database changes.
If you are lucky and have planned the initial layout, you'll never need to remove columns or tables, so a single deploy script can do it all (update database schema and push new app)
From a startup perspective I strongly disagree, I think the UI should almost always be designed first. The HTML/CSS/JS frontend that the users see comes first, and have it hooked up to an empty backend that does nothing. Then once the UI requirements are fleshed out and the final specs nailed down, only then the database schema should be designed. The backend glue e.g. Django/Rails comes last.
I find this approach has led to the fewest number of code rewrites and database migrations needed to get from initial mockup to finished product, because UI requirements can change much faster and are more fickle than DB schemas and backend code.
I agree that the UI / user experience needs to be spec'd out first. However, in terms of actual development, writing the DB schema first is usually the right call. I typically feel that things should be designed outside-in, and developed inside-out. Of course those are just general principles.
This is brilliant, totally stealing this. Reading the other threads found myself agreeing with both "design UI first" and "design DB first" and couldn't reconcile the two. But if you split it into design and develop it makes total sense.
I totally agree with this. In my experience that is the approach that works best.
1 - Define the requirements
2 - Design the UI AND its behavior
From seeing the UI and understanding how it is expected to behave you can get a clear idea of what entities and attributes you will need, from there you can derive the database design.
3 - Design the database in a way it fits all the data you need to display on the UI, and all the data you don't need to display but need to make the UI behave in the way you need it.
4 - Code the business logic required to link the database to the UI, in whatever language you choose.
Schema design and UI design should both be done at the beginning and designed _concurrently_. UI requirements can inform schema design. But there's always going to be an impedance mismatch between the two: schema design optimizes for performant storage and retrieval of data - it's data that has been optimized for machines. UI design optimizes a user's ability to conceptualize and use data - it's data that has been optimized for people. Code is the glue that helps mediate between these domains.
I have consistently found that this approach leads to data structures that are insane to deal with in any kind of scalable, performant way, unless the front-end developers are disciplined and reasonably knowledgeable of how algorithms and databases work. Mileage may vary.
I agree. I'm a relatively junior developer (<2 years) and was told to create a proof-of-concept UI before the web api or db were designed. When the team eventually moved to build the supporting service and db, so many headaches ensued as certain data contract decisions that made sense for the UI don't necessarily work as-is for the service/db. Learned this the hard way
My two cents: always source control your databases.
I don't care if you are code first or database first, but I do think database schema and migration scripts are code artifacts and need to be source controlled somewhere.
You should be able to track who made a change and when and what for, and maintaining by hand a "commit list" in a comment at the top of a stored proc doesn't count.
You shouldn't need a full backup of a Production database to spin up a new test server, and you shouldn't need to rifle through every developer's documents and desktop folders looking for random .sql script files to figure out if you are missing a migration/update/refactor script somewhere.
(The argument that you rarely spin up a new Production server is I think tempered by how many developers and testing environments you need. Setup time for developer/testing instances indicates a part of how long onboarding new developers may take; and here I lean to the side of developer productivity and good backups.)
Most code first methods give you source control "for free", which makes them easy to recommend to junior developers / small startup projects.
Visual Studio has decent SQL Server Project support to help with the database-first crowd, as one option I've relied on heavily. Unfortunately, I've yet to see an open-source, more database-agnostic relative of it.
"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."
I've gotten the feeling that flowcharts used to be considered quite important by lots of coders, while today they are only used in particular subcultures. A flowchart is sort of an ungainly artifact. It repeats information that is in code, but is more difficult to produce and save than code is.
I've found them worth the effort pretty much only when trying to explain to stakeholders that they have given me conflicting requirements. When they can follow the logic path with their own eyes they are usually able to figure out what they actually want.
Nowadays code is (relatively) easy to read. Back in the day of thousands of lines of assembler, perhaps on punched cards, less so. I think that's why flowcharts were more important in the past than they are now.
Anecdotally every flowchart loving dev I've worked with has been utterly useless. If someone's first step is to create a flowchart of the entire structure of the project then I immediately know they are going to be a disaster.
This has certainly worked for me on smaller scale, but when things get big, it seems to break down. In theory, you should be able to look at subsets, but I've rarely seen things well factored enough that this worked well in practice.
I can't tell you how many times I point out the data model is flawed (even when it should be obvious) only to have the project turn it into something you don't want to step in.
The most recent one was clearing of multiple work orders with one invoice. Yes you can, but you shouldn't.
If that was a business requirement then developers must find a way to implement it. All we can do is to point out the extra costs of the implementation.
To add to this, usually when I see a mess of a data model it's not because it started that way but rather it's indicative of changing business requirements with not enough time given to engineers (or more rarely lack of experience, I guess) to refactor / redesign / migrate things properly. Or perhaps original creators leaving with no record of why things are built a certain way so no one wants to touch it.
One use case for when you may consider code-first - pure experimentation and fast iteration around the data model that supports your application.
I have used both models, and did database-first extensively, for at least 15 years. However, having been working in .NET lately and having EF Code First migrations (painful, but read on), I did discover a sort of hybrid model that worked.
Eventually, I believe you will always end up at database-first, even if you start code-first, unless your database is relatively small (by # of tables/procs/views count) or limited in scope. But as applications grow, multiple databases become a reality, the database might grow truly huge (# of tables/procs/views), you might need customization to your database platform, etc. - at that point, limiting the database to what the application / code-first approach can express becomes impossible.
Where code-first works well, for me, is limited scope OR, more importantly, when you need really fast iteration for prototyping and don't want to think about building the tables. I've found that in that use case, code-first can be really helpful: think in code and let the system generate the database, at least enough to get to an MVP or PoC-level work product. Then, once it's working enough, to really take it all the way, switching to a database-first approach to do all the customizations that play to a particular database platform's strengths can work - I've done it myself on several projects successfully.
The tooling is almost there to do this... just start code first, and then once you have your system bootstrapped, generate the database first model and switch. But, you have to be careful with naming things so that names of things in your code match up (or can easily be updated) if you switch from code-first to database-first - something that not all code generators let you control completely (most template-driven ones do, though).
If you're using PostgreSQL, I invite you to try PostgREST https://postgrest.org/en/v5.0/intro.html which takes the database-first approach even further. With PostgREST you develop all your business logic in SQL; there's no code generation, no ORM - which, no matter how elegant, will always impose a performance and abstraction overhead (to do an aggregate you have to read the ORM docs, learn/understand the abstractions, and add the code, all of this in addition to what you already know how to do in SQL).
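To make that concrete: with PostgREST a piece of business logic is just a SQL function, which it exposes over HTTP as an RPC endpoint (the function and table below are made up):

    -- Exposed by PostgREST as POST /rpc/add_to_cart
    CREATE FUNCTION add_to_cart(p_item_id bigint, p_quantity int)
    RETURNS void LANGUAGE sql AS $$
        INSERT INTO cart_items (item_id, quantity)
        VALUES (p_item_id, p_quantity);
    $$;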
I couldn't agree more. The code-first design is what kept me away from Rails. I'm sure they have sensible approaches as well, but unfortunately I couldn't get over my prejudice that anyone who teaches the creation of the database through the app's code, instead of actually designing the database, is someone I don't want to follow or learn from and whose framework I don't want to use, even if it provides alternatives (I hope it does). Database design in a CRUD app is by far the single most important task. "Show me your code and I know nothing; show me your data structures and I understand the whole app" has never applied more.
The consequences of not following this advice I see every day at my current job, where we are stuck with an ORM that needs at least 3-4 files modified for each entity change and that requires management using an external GUI tool. The ORM adds 10x overhead to already unoptimized SQL queries, so query optimization is beside the point: the system's overhead is so large that even if it were possible to make the queries run in zero time, the overhead would still kill us. 10x. The query is 10ms and the overhead is 100ms. And the trade-off for this? Nothing but a shitty ORM that makes no sense (Doctrine 2/PHP) and slows down development a few times over as well. Changing even the name of a column requires editing multiple files. We have camelCase, snake_case, and kebab-case for all our entities, and depending on where you are in the system, a different case is required. Using the ORM to generate entities after you're in production? Don't make me laugh. That's impossible. And this isn't even the worst ORM I've used (though it's close). Everything the author mentions, and then a whole bunch of stuff that's a million times worse, has resulted from this drive to abstract away the most important component, the database, and not have it be the single source of truth. Frankly, it's fucking stupid and beyond maddening, not to mention slow and due for a rewrite. Never again will I tolerate this kind of design that ORMs encourage, or the use of ORMs at all, especially if they replace solid database design. I'll leave that kind of fuckery and stupidity to the rookies.
What’s stopping you from designing your database in Rails? You can spend all the time you need designing it, and then just write your design down in migrations/models.
Rails wants to generate the database for me rather than generate the metadata it needs from the db. That's how it's presented by default anyway. I'm sure one can use it the other way, it was just never obvious how and I never bothered to pursue it. I shouldn't have to write a class for each db table or maintain other such boilerplate nonsense that easily gets out of sync with the db. The database already has sql which is more powerful and useful than any db framework I've encountered so far. Why would I turn away from that for a framework that has its own dsl? That's crazy. Rails is just one example of such a design, a design I try to avoid as it leads to pain and in my experience terrible architecture and code.
I've never used Rails, but I will be surprised to learn this "migrations/models" abstraction allows the full range of foreign keys, indexes, triggers, constraints, stored procs, etc. without which we might as well just be on NoSQL... Not that there is anything wrong with that in principle, but in many situations that's not what we want.
Rails provides a Ruby DSL for migrations, but you can always put plain SQL in those migration files. I do this most of the time, and Rails conveniently tracks which scripts need to be executed against a particular database state and runs them for me.
I find myself rarely using triggers. Not because I can't; it's just handier to do it at the ActiveRecord layer, with access to all my Ruby code.
Indexes, indexes on PL/pgsql procedures, foreign keys and constraints – they absolutely have their place in Rails applications.
Although you can do rather complicated SELECTs with ActiveRecord, sometimes, for particular queries, I prefer to create a database VIEW and then point the ActiveRecord model to it.
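(The database side of that pattern is just DDL; a sketch with hypothetical names - the ActiveRecord model then maps to the view as if it were a read-only table:)

    CREATE VIEW order_summaries AS
        SELECT o.id, c.name AS customer_name, count(li.id) AS item_count
        FROM orders o
        JOIN customers c ON c.id = o.customer_id
        LEFT JOIN line_items li ON li.order_id = o.id
        GROUP BY o.id, c.name;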
> Although you can do rather complicated SELECTs with ActiveRecord, sometimes, for particular queries, I prefer to create a database VIEW and then point the ActiveRecord model to it.
Rails migrations give you `foreign keys, indexes, constraints` but not `stored procs` - there are hacks to make it work with stored procs if you really really want to. `triggers` are rarely used in rails applications, but nothing to stop you from using them.
The rails model is an ORM (Active Record) and makes working with relational databases a breeze.
Ultimately you can write Rails migrations in SQL. That's a fact.
There are two problems with that.
One is that I discovered with surprise that some developers, not only in Rails projects, don't know SQL.
The other is that you still have to synchronize those migrations with changes to the models of the framework (example: add belongs_to, has_many, etc.) even if ActiveRecord autodiscovers the fields of the tables.
The latter is true even if you write the migrations in Ruby and is true also for any pair of framework / language I know. Think about a project with microservices in many different languages and it can easily become unmanageable. "We added a foreign key from here to there, everybody update their representation of the schema." Multiply that by a few times per day.
> The other is that you still have to synchronize those migrations with changes to the models of the framework (example: add belongs_to, has_many, etc.) even if ActiveRecord autodiscovers the fields of the tables.
Yeah, this is the part that's the problem, not the migrations themselves. Keeping this boilerplate in sync with the db is the bane of my existence in my framework and is a completely unnecessary task. Using the ORM itself is even worse, but I digress. In apps I started from scratch, I dropped all that garbage in favor of repositories that send straight SQL to the db and all these issues disappeared, including the slow 10x query overhead.
> The latter is true even if you write the migrations in Ruby and is true also for any pair of framework / language I know.
Maybe it's true for Rails, but there are plenty of frameworks that work fine without ORMs, ActiveRecord, or other such nonsense. To me, such systems are a step back from straight SQL and add unnecessary overhead that only gets in the way and slows down development. Having the SQL in a file and auto-generating functions on top like HugSQL does in Clojure (https://github.com/layerware/hugsql) is by far the most optimized system for db access I've seen. Other languages have these types of libraries as well. Another option is query builders. It's simply false that a db access framework needs to be configured with boilerplate based on the db schema. Yes, that's the case for the idiotic ORMs, but certainly not the case for other libraries which are available in pretty much every language.
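(For a flavor of the HugSQL approach: the SQL lives in a .sql file annotated with comments, and the library generates a Clojure function per query. The schema below is made up; see the HugSQL docs for the exact annotation syntax:)

    -- :name user-by-id :? :1
    SELECT * FROM users WHERE id = :id;

    -- :name insert-user :! :n
    INSERT INTO users (email, name) VALUES (:email, :name);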
> The latter is true even if you write the migrations in Ruby and is true also for any pair of framework / language I know. Think about a project with microservices in many different languages and it can easily become unmanageable. "We added a foreign key from here to there, everybody update their representation of the schema." Multiply that by a few times per day.
Isn't the point of microservices to wrap the database so that other applications don't have access to it? And if you're at a stage where you're rapidly iterating on your data model, maybe start with a monolith where iteration is less costly?
ActiveRecord isn't really an ORM. It's a way of building queries against relational data, and associating some methods with the tuples you get back (which might not even be full rows). It also has a bit of support for C, U and D, but it's strongest in R. Although many queries won't fit in its model, it's trivial to inject snippets of SQL where needed.
What it is not is a way of mapping databases to models. The models pretty directly model the database, rather than the other way around. Quite different to something like JPA.
Can you just not use Doctrine and make hand coded models that are optimized for your use case? Or make a microservice that accesses the data, and query that API?
In our API, I ended up rewriting most of the db access to not use it, but it's not as performant as it could be due to needing a custom serializer and the naming requirements in place. Also, the code is now half ORM, half SQL. In the other app, I couldn't do even that; it's just too ingrained into the app, sadly. At some point some of it will have to be rewritten, especially the API, and I plan to use Clojure, where this issue won't exist.
Not at all. Sitting down, thinking through, and designing an actual schema vs. writing a bunch of code first for entities without thought to the big picture, design of the database, or how this will be implemented in the actual db.
What I want is a doctrine that gives an ideal approach for data management, similar to how 12 factor gives an ideal approach for overall application architecture management. Data architecture, even though it's a subfield of application architecture, is important and vulnerable enough to need its own doctrine.
What I like about 12 factor is that you don't have to get it all done at once. The concepts start high level and you can dive into the details as they become necessary. Initially, you can get by with dotenv for configuration management, and slowly build out more tooling as you need it.
But data management seems to need a new lexicon and set of organizing principles. How do I think about the varied aspects of database design in a way that I can be reasonably sure won't bite me in the ass before I can get around to scaling it? How do I pick a database, what are the relevant decision criteria? At what point do you want to introduce different kinds of databases, like say a graph database or ElasticSearch?
Undoubtedly, Designing Data-Intensive Applications is a great place to start: https://dataintensive.net/
It’s not as purely distilled as the 12 factor app by any means, but amongst all the details about B trees and such, a number of very solid principles around storing data creep out.
jOOQ has become my favorite ORM with how bare-metal it is. I have it set up in GitLab pipelines to auto-build my schema into an ORM package. Just kick off a build on a SQL change, and then inherit the new package.
I saw an ancillary remark about prod connections being bad. I generally agree. I usually build mine off of my staging environment, assuming the staging database schema was promoted to production. Then I have a sister build job that kicks off my jOOQ build, triggered on the promotion task.
The biggest thing is the professional license for several databases. But it's not that bad really.
It also goes to one of my thoughts as well: the data model is crucial to the application. It's like the foundation for your home. I always try to hammer in talk about the data model first. Where did it come from, where to next, and finally, where do you want it to go?
If JOOQ is anything like LINQ-to-SQL (and it certainly looks inspired by it), then it isn't an ORM. It's a typed query language with a one-to-one mapping to SQL database tables/views. This gives us the low-level control of SQL, but with the ability to build our own abstractions and compositions using the host language. ORMs are frameworks (bad); JOOQ is a library (good).
You're exactly right; it's very reminiscent of LINQ. It's pretty much raw SQL exposed as methods. I just say ORM because saying "near-bare-metal SQL" bothers a lot of developers.
Note the above is not totally correct; LINQ-to-SQL can be used as an ORM if required (i.e., it is not restricted to one-to-one mappings from tables to classes).
Sometimes, I get the feeling that we're all in the story of the "blind ones and the matter of the elephant". Design UI first? Design data model first? Design code first? Write documentation first? Write tests first?
How many things can we actually do "first"? All of them are important. Writing documentation tells you about what you might need implemented. So does designing UI, but that might still leave out what users need. Which means we probably don't know enough about the data model anyway. Or the code.
So ... here is one more blind person's view - iterate as quickly as you can on everything.
I'm not sure that the statement "All generated DDL is only an approximation of that." is necessarily correct; it's database engine/server-dependent.
It's typically pretty easy to reverse-engineer the production database schema, including any vendor-specific extensions, into a SQL file that can be used to recreate an empty version of the same database schema. Furthermore, it's just as simple with most database engines/servers to use an automated tool to compare two schemas and generate a DDL script that will upgrade the old schema to the new schema. It helps immensely because such tools will avoid circular dependencies with foreign keys, etc. that may trip you up if you're trying to manually write such upgrades.
Other than that, this article is spot-on. The database should be the source of all truth and yes, you need to become proficient in SQL to be a good general application developer.
I know in the HN comments there'll be a whole bunch of "you don't do design when you're trying to iterate fast". I wholly disagree. Design phases expose problems in your thinking that you'll be tripping over very shortly, except now you didn't waste the time implementing something that wouldn't work anyway. Skipping the design phase is "productivity theater", where it appears you're being productive because code is being written, but in reality, you're just being busy.
The database really is an API, in that client code depends on it and its structure will be difficult to change once you go live, whether you want that to be true or not. Writing the API before code that depends on it seems like an inherently good idea to me.
You can pretend like you don't need to figure these things out first and get away with it for a while, but at some point the lack of a solid foundation is going to bite you.
I think all devs should answer the question, "What if they don't want to use my UI or interface, or they want to automate something?" That's a good scenario to support for any client, including yourself.
I think it's really interesting that I find this insightful:
> The database really is an API, in that client code depends on it and its structure will be difficult to change once you go live, whether you want that to be true or not.
And then have a different conclusion than this:
> Writing the API before code that depends on it seems like an inherently good idea to me.
I'd guess it comes down to differences about the value of top-down and bottom-up approaches to systems.
My experience is that top-down thinking usually is a better guide to creating the API you actually want to call -- writing the API calls you wish you had in the flow of the code, and then making them real.
But then again, that's application-specific context, and may not lead one to answer the question "What if they don't want to use my UI or interface, or they want to automate something?"
Perhaps I am too new to back-end development, but what strikes me in this article is the misconception that code-first is going to give you a bad DB schema, and that migrations will lose data.
I think part of it could be blamed on the tools: from the code I see, the ORM doesn't seem well-built, if it needs that much code to generate a proper DB schema.
For the record, I'm using Django (for about a year; I was developing apps before), and my workflow is:
1) code, while thinking about what the DB schema will be as a result
2) generate the migrations
3) apply them and check if what's generated is good and usable when you need to directly query the database (mostly for dev purposes). If need be, either change your code, or even manually modify the migrations. Migrations are code too; they are there to be overridden when required.
Sure, your rollbacks can lose data, but only if you don't pay attention to them and don't change them to avoid that. As a developer, it's your responsibility to think about that if it's of critical importance to your business.
Feel free to enlighten me if something I say seems very wrong, after all I'm still quite junior at this, and have not experienced huge scales / large team.
The article seems to advocate generating code from a live connection to the production database.
This seems very risky. Typically you don't want developers to even have access to the production database most of the time, let alone make it a hard dependency to do a build. Development work should be done using temporary databases.
Also, database schema migrations need to be tested somewhere other than the prod database before going live. All the code changes needed (including to applications) should be written and tested before actually doing the migration.
But this approach would be reasonable with a bit of indirection: one developer (DBA?) makes a database schema change, resulting in a change to a schema file that's checked in (after testing), and code gets generated from the schema file. That way developers all have access to a history of the prod database schema, without access to the database itself.
(This is pretty close to how things are done using protocol buffers.)
I don't think anyone advocates for generating based on a production database connection. I can only tell you how I've used jOOQ, using the same general setup for 6 years despite job changes. I wrote the original Gradle plugins for jOOQ and Flyway, the latter being a migration tool. These were then contributed back and I use the official variants now.
I write SQL migrations for H2, an in-memory db. This is used by the build tool to code generate prior to compilation. It is also used by unit tests, when the Postgres queries can be emulated (majority of the time). The Postgres migrations are written separately, with similar code but minor dialect differences. That might be less duplicated (but trivial) work if I used LiquiBase instead of Flyway. When the application starts the migrations are run. Someday I might drop H2 for a containerized Postgres, especially if they add in-memory support for faster testing. This would avoid some codegen differences, like jsonb or (soon to be released) enum support.
The work setting things up wasn't too bad. It was more when I originally started, since I had to build part of the ecosystem. But repeating the process elsewhere has been straightforward. It definitely pays off and is a joy to use day to day.
Definitely no one runs jOOQ's code generation from the production database. Apart from the security concerns you've mentioned, that wouldn't make any sense, technically. When you implement a database increment, you will want to update your client logic immediately. That only happens on the development environment. Then you check in both the DB increment and the client logic change in version control for others to apply to their development databases, and ultimately to ship to production (with the generated code being pre-generated)
A couple of fundamental principles here:
Systems tend to outlast their developers
Data tends to outlast the systems operating on it
This is why I'd almost always go with database first design.
As a general rule (and test), a well designed database is useful outside of the context of the application that accesses it for persistence.
As an example from the kind of apps I've worked on: suppose you are writing a system for tracking and managing inventory, which will be used through a web interface. If you have direct access to the database, can you get useful information about orders, lead times, existing inventory levels, and so forth? If not, you don't have a useful database, and you are at risk. Migrating to a new system, which is probably inevitable down the road, will be very difficult as you no longer have control over your data.
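(The test is concrete: armed with nothing but a SQL prompt, questions like "how much of each item is on hand?" should be answerable; hypothetical schema:)

    SELECT i.sku, i.description,
           sum(m.quantity) AS on_hand  -- receipts positive, issues negative
    FROM items i
    JOIN stock_movements m ON m.item_id = i.id
    GROUP BY i.sku, i.description;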
Now, apps are useful, and code is useful. I wouldn't want to try to code up various supply chain management algorithms in SQL directly. The estimated lead time can be quite a calculation, and yeah, you are probably going to want to write python or ruby or java code to do that. But in the ruby world (with rails in particular), I have noticed databases used almost as a kind of object serialization and persistence, where the information makes almost no sense and can't be queried in any meaningful manner through the DB. It is simply used to store and retrieve information that needs to be drawn into and reassembled through various bits of code and configuration files in a rails app before it takes a form that makes any sense.
In short, the database has no meaning, almost, outside the rails app.
Now, I'm not saying this to knock rails. I've used it, and I enjoyed the experience. Metaprogramming, in fact, makes it very easy to map an object to a database table without a ton of irritating configuration or extra code. You don't have to generate your schema through rails to map models to tables, you can create them independently of the app and use rails to do the mapping for you. Also, it's entirely possible to use rails generators to create a very sensible database design. The problem is probably that a lot of people using these generators never considered databases or SQL or schemas as something that had meaning outside the context of a rails app. My guess is that this anti-pattern happens in almost all frameworks that allow developers to quickly push objects with relationships off into a database schema that they never took the time to consider or design. It's almost no different than if they serialized the objects for persistence, and used a hash to find and retrieve them.
I also think this emphasizes why clarity is so important in code and database design. Sometimes people talk about upgrading an existing code base, and say "well, it's already written in [java/ruby/python/...] , so it'll be easier if we stick with a [java/ruby/python/...] based framework". Nah. If the code lacks clarity, it'll be hell to port it from python to python. If it has clarity, it won't be hell to port it from python to ruby.
If you can use your database to answer meaningful questions about your data outside the context of an app, your database is not only useful, it also probably has clarity of design and purpose. If you can easily read your code and know what it is intended to do and why, you have clarity.
I'm amazed with how often developers are willing to sacrifice clarity to meet the demands of a testing framework, or a particular architecture, or to accommodate a cutting edge UI/javascript approach. These can be good things, but you know that shit ain't lasting, and then, without clarity, where will you be?
Well, we all know, the developer who gained experience with the nifty new framework will be: at a new, better paid job.
Beyond that... eh, I suppose a code generator from a database can be useful and save you some typing, but if it's all cluttered and clunky and lacks clarity I'd rather just do it myself. It's not really that hard.
Really great post. Code generation is seen by many as old school, but this post shows it's still alive and well. We are seeing a revival of code generation with tools like jOOQ, Protocol Buffers, static website generators, and some Go tools, for example.
My favorite version of this, is when a new version of something is greenlit by the business, and a bunch of enthusiastic devs look at the old system and decide to rewrite. Instead of starting with the existing data, and determining how it's going to fit in the new design, they just make a new design and figure, "we accounted for most business requirements, so the data should be able to fit...". Then they set a launch date with the business, and just need to do that data migration. What could go wrong?
This goes with any system. Get the data pipelines hot then drive in the business logic. Otherwise you are surprised with latency and have to do major refactors.