> but few … nay, none who could tell me what the properly normalized data model was
I guess we have an entire generation of software engineers who never had to cram the degrees of normal forms for an exam? [0] The term "Boyce-Codd Normal Form" is etched in my neurons.
For anyone who seriously studied relational databases, the normal forms are very clearly defined. This is not some "unknown" used to justify arbitrary design.
Likewise, anyone who has seriously worked at delivering data at scale in an efficient way has consciously considered the trade-offs between normalized and denormalized data.
Furthermore, the reason for normalization isn't that it's pretty: it can drastically reduce the complexity of updates when you have duplicated but mutable data, and it reduced overall storage space back in an era when that was a major issue.
0. https://en.wikipedia.org/wiki/Database_normalization#Normal_...
I don't know about an entire generation. A database class which was mostly about normalization and relational algebra was a requirement for my degree, and I graduated fairly recently. I wouldn't be able to recall every form and its requirements, but the general principles are etched into my thinking. It's almost "common sense" to structure data in BCNF, even if one doesn't recognize it as such.
"Database Management" was an elective in my Bachelors in Computer Science program (Class of 2018). That's the only time we discussed normalization, iirc.
There is some confusion in this thread about what normalization means. Here's a quick intro for the young 'uns. It refers to a bunch of constraints or suggestions about how you should organise information in a relational database.
The basic requirement is that each logical attribute goes to a separate scalar-valued column. Keep zip code separate from address, for example. Don't have arrays or json-valued fields. If all you do is obey this constraint, the db is in a "first normal form".
The next set of constraints refer to the keys. As a mnemonic, you can think of it as a variation on "the truth, the whole truth, nothing but the truth". As in, "the key, the whole key, and nothing but the key".
"The key" refers to a constraint that all other columns must functionally depend on a singular ("_the_") primary key column. There is no point having two columns that are each unique ("emp_seq_number_when_joining", "emp_id") and can be used to look up the rest of the related data.
"The whole key". The dependency must need the whole key, not partial. For example, both employee-id-year and employee-id are unique, but the latter is sufficient to look up the rest of the employee info. Only have so much info in the key that makes the row unique; any extra goes into a separate column.
"Nothing but the key". There should not be any other dependencies on any other columns. For example, one can have "department-name and "department-id" as columns. This is redundant. Department-id is sufficient to obtain the name. In practice one has extra columns to avoid the expense of joins. This is classic denormalisation.
The is the basic set of requirements (called "third normal form" or 3NF) to call the database normalized. For the most part, most dbs are in 3NF. Denormalization is done for performance reasons, typically to avoid expensive joins. For example, one could squirrel away a tuple or an entire json document in a db cell. Or have redundant column info as illustrated earlier.
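A minimal sketch of that department example in SQL (table and column names are made up for illustration):

```sql
-- 3NF: department_name lives only in departments;
-- employees reference it through department_id.
CREATE TABLE departments (
    department_id   INTEGER PRIMARY KEY,
    department_name TEXT NOT NULL
);

CREATE TABLE employees (
    emp_id        INTEGER PRIMARY KEY,
    emp_name      TEXT NOT NULL,
    department_id INTEGER NOT NULL REFERENCES departments (department_id)
    -- A denormalized design would also copy department_name into this
    -- table to avoid a join, at the cost of update anomalies.
);
```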
The article is simply a reminder to be serious about normalization, and cautiously introduce optimizations with appropriate justifications in documentation.
I have had pretty good results over the years following the mantra “normalize until it hurts, denormalize until it works”. Start with as little redundancy as possible, then carefully consider anything you want to denormalize and plan around the complexity it introduces. Materialized views can also help a lot with this.
The main issue I’ve run into following this approach has been with ORMs that don’t provide nice raw SQL escape hatches: many times, something you’d jump to denormalize is easily solved with a better query, if your tools let you write one.
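For instance (a made-up posts/comments schema), a counter you might be tempted to denormalize can often be computed in a single query, provided the ORM lets you drop down to SQL:

```sql
-- Instead of maintaining a denormalized posts.comment_count column,
-- compute the count on read; an index on comments(post_id) usually
-- keeps this cheap.
SELECT p.post_id,
       p.title,
       COUNT(c.comment_id) AS comment_count
FROM posts p
LEFT JOIN comments c ON c.post_id = p.post_id
GROUP BY p.post_id, p.title;
```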
The mantra is a good reminder but poor at explaining the why.
It's sort of like lego blocks or pre-fabricated wall / roof sections for homes.
Fully normalized, everything is broken down to the smallest possible unit of data (a value) that can be accessed by a key. One 'key' might happen to be the same (shared) among many different kinds of values, in different tables.
That's like taking something apart until you have individual lego pieces or beams of lumber. This is a good logical 'what if' step to help someone ask if data __really__ belongs together.
Does the data really belong together? If a query wants (id, x), how often does it also want (id, y) or (id, z), etc.? These pieces should be combined into larger tuples / object fragments if frequently accessed together.
De-normalize returns the true data to a format that is efficient for the system and its use cases by preparing what's most commonly going to be seen together in that form.
Please explain how it's wrong __in the context__ where 'system' means the entirety of the database plus the consumers (applications / etc.) of the data in that database producing results? Note the qualifying portion you truncated, "by preparing what's most commonly going to be seen together in that form."
> Does the data really belong together? If a query wants (id, x), how often does it also want (id, y) or (id, z), etc.? These pieces should be combined into larger tuples / object fragments if frequently accessed together.
> De-normalize returns the true data to a format that is efficient for the system and its use cases by preparing what's most commonly going to be seen together in that form.
> DB guy with 25+ years experience. Summary: it depends.
>> joins are never cheap
> it depends. On table size, indexes/table size vs how expensive the alternative is. Always!
>> tables with billions of rows crossed with millions of rows just to find a single row with data is not something i would call cheap
> indexes
>> more often than not it is better to avoid joining large tables if you can live with duplicate data
> 1E9 x 1E6 = 1E15 (at worst anyway). A join via an index will save you colossal amounts of IO (though as ever, it depends).
> Problem here isn't this mostly clueless advice (discarding/archiving unnecessary data is the only good idea here, and it's not used as often as it should be). Problem is strong opinions put forth by someone who doesn't have the necessary experience, or understanding of what's going on under the hood. Denormalising is a useful tool that IME rarely gains you more than it loses you, but this 'advice' is just going to lead people down the wrong alley, and I'm tired of suchlike n00b advice strongly (and incorrectly and arrogantly) expressed on HN.
> (edited to fix maths error)
First, your target audience is humans reading this forum. It is better both for the flow of the conversation and for digital preservation to JUST COPY IT AGAIN. Even if it's an exact quote, it saves everyone the bother of loading an old comment from outside of our current work context. That doesn't advocate for the duplicate data (in a database) that you mention in your post. We (as humans) aren't accessing this like a database following a query plan, we're parsing it as a tree structured conversation.
This is also exactly the sort of thing I was talking about with 'de-normalizing' the data structure back towards what's optimal for the overall system as a whole, and maybe that's what you're missing for the forest and the trees. Maybe we're even trying to say the same thing, but the words sound too similar for that to come across clearly, as the same idea just stated differently, to someone who doesn't already get it the way you do.
Your nebulous "it depends ... examine what the different costs and alternatives are ... always" (paraphrased slightly) is very similar to what I said as "... data to a format that is efficient for the system and its use cases by preparing what's most commonly going to be seen together in that form."
Though where we differ is on the duplicate data issue. Duplicate data might be tolerable for extremely read-heavy, LOW COHERENCY pressure environments. An example that comes to mind is email, or a stream of changes where eventual consistency is acceptable, such as a large filesystem with mostly long-lived files, some temporary / growing files, and new files, along with some sort of change set that can accommodate versioned file references and update operations. A bank account with a lot of headroom and mostly small transactions might also be a good case. All three examples work much more poorly where ACID compliance might matter, such as when there isn't that headroom and where the exact sequence of operations matters for system / organizational rules.
Even in systems where materialized views (an in-database cache of a view) are usable, it still makes sense to figure out what the most frequent queries, and maybe the worst performing ones (if it matters; it might not if it's a report run once a day that's still fast enough!), want from the database, and group the data in a way that logically works best for the relevant use cases, rather than the formal and ideological fully normalized representation of the data. The more general advice of break it all down (fully normalize in theory) then build it back up (de-normalize) gives what's often a good enough first approximation to the solution.
Both of us, probably the GP poster, and probably most of HN readers, agree:
I think what threw me is this bit: "De-normalize returns the true data to a format that is efficient for the system". It seemed unqualified, i.e. there was no 'it depends' that I could see. Your longer answer makes it clearer what you were intending.
So yes we can definitely agree on measuring and de-normalising where appropriate. Thanks for a nice reply!
Denormalization and nested data structures. Sometimes we need to spend a little more time thinking about data models rather than just dumping the original json into mongo and working around that for eternity.
Mongo is two helpings of Chesterton’s Fence and a cup of JBoss astroturfing.
They bought a company that had a better architecture and renamed it MongoDB, true, but you can’t reward origin stories like this. Permanent banishment.
Oddly, I'd argue that normalization is often the "simplification" that is not accurate to how things are. Mayhap an example could show that I'm thinking of different ideas, here?
My gut is that I'm thinking of what is normally called "materialization" more so than just "denormalization." The idea being that what is denormalized is largely read data, and doesn't offer direct editing of the items that are copied. I view this as a bit more realistic for a lot of scenarios as well, as that is likely what is happening to the items being represented.
This is especially true for any inventory style database. There are physical things backing the data, such that if you have normalized the representation, you are doing a simplification for yourself in the data. You are not more accurately representing it.
Normalized means free[er] of redundancies. If you have a shopping basket, then you need the user it belongs to, the products in it and their quantities. Everything else like shipping costs and total costs can be derived from that. If you explicitly store the total costs because it is faster to show a value you already have instead of calculating it from the product prices and the default delivery address each time you need it, then you have a denormalized data model. It is now faster for the common operation of showing the total costs of a shopping basket, but at the cost that you now have to invalidate shopping cart total costs when you change the price of a product, or else the total costs might not reflect the current price of all the products. In a certain sense denormalization is just a form of caching and comes with all the same problems as other caches, most notably of course cache invalidation. Materialization - as in materialized database views - is a mechanism to automatically denormalize your data that [hopefully] also ensures that the materialized data is always up to date and consistent with the rest of the data.
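A rough sketch of that shopping basket example (names invented for illustration):

```sql
-- Normalized: the total can always be derived from items and prices.
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    price      NUMERIC(10, 2) NOT NULL
);

CREATE TABLE baskets (
    basket_id INTEGER PRIMARY KEY,
    user_id   INTEGER NOT NULL
);

CREATE TABLE basket_items (
    basket_id  INTEGER NOT NULL REFERENCES baskets (basket_id),
    product_id INTEGER NOT NULL REFERENCES products (product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (basket_id, product_id)
);

-- Denormalized: cache the derived value, and accept the invalidation
-- problem whenever a product price (or the delivery address) changes.
ALTER TABLE baskets ADD COLUMN total_cost NUMERIC(10, 2);
```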
I added something about materialization while you replied. Materialization is controlled automatic denormalization. I would guess it is usually not labeled as denormalization because the materialized data is not really part of your data model, it is more a cache of precomputed data that is automatically kept up to date and consistent. In contrast denormalization in the data model proper usually means that the code has to ensure that things remain consistent.
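Reusing the hypothetical basket tables sketched above, the materialized-view version might look like this in PostgreSQL syntax (note that PostgreSQL refreshes on demand; some engines, e.g. SQL Server indexed views, maintain the data automatically):

```sql
-- Precompute basket totals from the normalized tables.
CREATE MATERIALIZED VIEW basket_totals AS
SELECT bi.basket_id,
       SUM(bi.quantity * p.price) AS total_cost
FROM basket_items bi
JOIN products p ON p.product_id = bi.product_id
GROUP BY bi.basket_id;

-- Bring the cached totals back in sync with the underlying tables.
REFRESH MATERIALIZED VIEW basket_totals;
```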
I got confused, as I thought you meant you edited the linked article. :D
So, yeah, this all makes sense. My assertion for it being more like reality was not that all data is denormalized. But if you have things like "color" or "configuration" of an item normalized out so that the "shopping cart" is purely product ID listings, then that is just as inaccurate in other ways.
> But if you have things like "color" or "configuration" of an item normalized out so that the "shopping cart" is purely product ID listings, then that is just as inaccurate in other ways.
Here I would disagree. Normalization cannot become inaccurate; you just keep getting rid of more and more redundancies the further you push your normalization efforts. If you have a product with different color options, then you probably need a two-level model, one for the product and one for its variants. If you are selling computers and allow configuring them, it might become quite complex: different options for different aspects, constraints like incompatible options or options that require others, prices that depend non-trivially on the different choices, and so on. So in practice the complexity of really normalizing everything and then piecing everything back together in code is probably what will eventually stop you from going after the very last redundancies.
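A minimal sketch of that two-level product/variant model (standalone, made-up names):

```sql
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);

-- Each color/configuration option is its own variant row, so nothing
-- about a variant is duplicated across products.
CREATE TABLE product_variants (
    variant_id INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products (product_id),
    color      TEXT NOT NULL,
    price      NUMERIC(10, 2) NOT NULL
);
```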
I don't want us to just hash over silly examples that are easy to nitpick. My point is that if you have an inventory, you don't have normalized items. You have "realized" items that have, in fact, copied the colors and other design choices out in a way that cannot cascade on edit.
Stated differently, if the data can always cascade out with editing, then yeah, it should probably be normalized. If it is a design choice that gets fixed on application, it can be argued that it should be copied out when it can't be changed.
A materialized view is absolutely denormalized. The only distinction between something like a materialized view, an index, and just an ad hoc representation of the same data in multiple places is that with an index and a materialized view, it's explicitly considered to be a read-only reflection of some underlying data, whereas some other structure is more likely to have common pitfalls such as not properly having a single source of truth.
But, yeah. My question for the thread is if that is considered in this case or not? I'm assuming the poorly done denormalizations are of the editable kind?
Depends what you mean by editable. Even though a materialized view may not be editable, many of its underlying values may be, and that's the root cause of the problem with denormalization.
Denormalized data effectively involves caching some source value in a durable way. As the saying tells us, cache invalidation is one of the two hard problems in computer science.
Of course in the denormalized DB context, detecting invalidation is typically simple: it's assumed to occur as soon as one of the source values that was copied gets updated somewhere. At that point you may need an ACID update of all "cached" values, which is conceptually and computationally expensive. In the materialized view case, you may not do an update, but then you have an out-of-date cache.
So, we do see the effects of denormalization in a materialized view, but the pros and cons are a little different than in the case where data is stored in a primary table.
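To illustrate that "ACID update of all cached values": with a hypothetical schema where order_items carries a copy of products.name, a rename has to touch every copy in the same transaction:

```sql
BEGIN;

-- Update the source of truth.
UPDATE products
SET    name = 'Widget Pro'
WHERE  product_id = 42;

-- Update every denormalized copy, or readers will see stale names.
UPDATE order_items
SET    product_name = 'Widget Pro'
WHERE  product_id = 42;

COMMIT;
```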
I mean editable in the view. That you may have to cascade an edit out makes sense. But that fits how things are physically, too. Any store that puts prices on items is under a similar physical constraint.
> Oddly, I'd argue that normalization is often the "simplification" that is not accurate to how things are.
Normalization is what makes sure that the data you are storing matches exactly the model you have for it. Without it, it isn't even meaningful to talk about a data model.
If the real data actually doesn't match the model, then yes, your model is inaccurate. That means your software is also probably buggy, and that if you didn't normalize your data structure, you wouldn't even know about it.
My point is that a lot of data in the real world is not normalized. Simple stuff like "branding" and "manufacturer." In a normalized world, you update the name of a product, and all units are magically updated. In a materialized world, they are what they were when they shipped. We may pencil over updates, but that has to be done on each "row," as it were.
Unfortunately people see the relational model starting from a "source of truth" database schema and then only know about "(de)normalization" in the form of "NORMAL forms".
The relational model is ALL about normalizing and denormalizing relations (aka tables).
All the ways you can combine, recombine, add, remove, change, extend, ... a relation are doing (de)normalization. It means: add/remove columns, add/remove rows.
So, `SELECT field FROM table` and `SELECT * FROM table WHERE id=1` are examples of this.
---
`NORMAL` forms are just observations about the shape the data is in ("does it have duplicated rows?"), and their properties are useful to understand, similar to how it is good to know what is different between a `set` and a `vector` of things.
And from this, you see how this is useful for modeling your RDBMS ("so, I don't need to duplicate all my customer information, I can just use a "pointer"/id for it!").
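That "pointer"/id idea, as a minimal made-up sketch:

```sql
-- Customer details live in exactly one place ...
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    address     TEXT NOT NULL
);

-- ... and every order just points at them instead of duplicating them.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
    ordered_at  TIMESTAMP NOT NULL
);
```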
This article seems to ignore common denormalization patterns that don't cause problems, e.g. denormalizing immutable data such as row ids. These reduce the number of joins needed, and make code simpler and faster, refuting one of the article's claims about code complexity due to denormalization.
The article seems to be saying that the trade-offs of denormalization often aren't being correctly evaluated. Without examples or specific advice, this is not particularly useful information.
I was trying to refer to something more general than a primary key, but primary keys are probably the best example. But the same kind of optimization can be used on any value which is essentially immutable.
The presence of company_id in the transactions table is a denormalization, since the source of truth in that case is actually the `company_users` table. But assuming the ids in the system are immutable, this optimization doesn't suffer from any of the issues raised in the OP article. It makes for simpler code that executes faster. It doesn't "deprive us of the correct answer to the problem."
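Roughly the shape I mean, with made-up columns:

```sql
-- Source of truth for company membership.
CREATE TABLE company_users (
    company_id INTEGER NOT NULL,
    user_id    INTEGER NOT NULL,
    PRIMARY KEY (company_id, user_id)
);

CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    user_id        INTEGER NOT NULL,
    -- Denormalized copy of company_id; harmless because ids are immutable.
    company_id     INTEGER NOT NULL,
    amount         NUMERIC(12, 2) NOT NULL
);

-- A company's transactions without joining through company_users:
SELECT * FROM transactions WHERE company_id = 7;
```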
You know, given the math-y introduction, this rant would read even better with "denormalization" replaced with "numerical methods". Euler's beautiful identity becomes ugly, not at all the same thing, and certainly not "simpler". They only give you approximate solutions! All in the name of performance! And many number crunchers can offer the "numerical solution" option, but few... nay, none could tell what the precise analytical model would have been. Oh, the humanity!
Well, it sure does suck that instead of perfect analytical solutions/perfectly normalized data models we have to deal with numerical approximations/denormalized schemas that may or may not have dubious properties, which we usually tend to ignore in the hope that they are actually fine, but... that's just how it is. Deal with it, I guess.
Every problem that denormalization purports to solve has had a (better) solution for decades now. If the data isn't loading fast enough, it can be cached (either in an edge cache or a materialized view). Those solutions have costs - but so does denormalization!
> in the name of “simplification” and/or “performance,” it deprives us of the correct answer to the problem,
is not a universally useful way to think. For many, many problems, computers cannot give you the right solution, and thus the most relevant question is what makes a good approximation (a trade-off between fast, accurate, small, implementable, etc.).
In fact Euler's Identity is very hard to get out of a computer. Just telling it what 'pi' is, is not trivial in the slightest. Certainly 'pi' isn't a floating point number and if you import the mathematics library of your favorite language the thing called 'pi' is just a slightly more sophisticated version of 22/7.
I may be "denormalizing" the question itself a bit, but I think of it thusly:
- The 3rd normal form is when I'm thinking about the entities in a top-down way.
- In a key-value store, I'm thinking of a GUID with a related document, typically JSON, with some arbitrary structure. It's a bottom-up approach. The individual records matter more than the entity.
Argue the theoretical, top-down view all day long, and watch me agree. But, like as not, your MongoDB / Elasticsearch / DynamoDB tables are doing a lot in production in the Real World(TM). They get the job done.
Probably driving the point too far here, but it all boils down to Plato vs. Aristotle.
I remember when Mongo was first coming out: people would install it, then shove a third normal form schema in it and then complain loudly that it didn't support transactions, foreign key constraints or other RDBMS-specific stuff.
Like cold-welding 2' wings on a '55 Thunderbird and then complaining loudly that it makes a lousy airplane.
KNOW YOUR TOOLS. Just like programming languages, each one is likely good at some things and lousy at others; it is up to YOU to know which is the right one for a given need or scenario.
> The tricky part of this proposal is that I’ve encountered many data modellers in my career who can offer the “denormalized” option, but few … nay, none who could tell me what the properly normalized data model was.
We have a data model where a User has many Xs and many Ys, and both Xs and Ys have a foreign key to the User via user_id. The normalized version would be: User has many Ys through Xs. We denormalize and store user_id on both for performance reasons. This seems a) so common, and b) so straightforward a refutation of this claim, that I must be misunderstanding the author.
Could you explain your comment further? I don't understand it. Specifically, I don't understand what "has many Ys through Xs" means. That phrase just doesn't syntactically make sense to me. And then you talk about storing "user_id on both" as a denormalized solution---but what that makes me think of is having two independent one-to-many relations (user-to-X and user-to-Y), with a table for each, each having a user_id column linking back to the user---but that's perfectly well normalized as far as I know, so I assume you probably intend to convey something different.
If I understood it correctly, the GP is talking about those normalized tables:
U | X
X | Y
and claiming that if you need the relations of U | Y often, you are better off storing them as:
U | X
U | X | Y
Or just the latter table, since the first is now completely redundant.
So yes, that happens, and depending on the size of those tables, it may be true that the denormalized form is faster. But it's just wrong to claim that one or the other is faster without any further information, and IMO, the more common case is that the normalized one is faster.
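In made-up DDL, the two options would be something like:

```sql
-- Normalized: a y reaches its user only through its x.
CREATE TABLE xs (
    x_id    INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL
);

CREATE TABLE ys (
    y_id INTEGER PRIMARY KEY,
    x_id INTEGER NOT NULL REFERENCES xs (x_id)
);

-- Denormalized: ys also carry user_id, so "all of a user's ys"
-- needs no join through xs.
ALTER TABLE ys ADD COLUMN user_id INTEGER;
```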
The author is all about "the correct answer to the problem", but is never explicit about what is meant by "the problem". In the case of software development, the problem includes performance constraints and resource usage management. It is not a correct answer if it cannot be supplied within the time needed and using no more than the available memory and disk.
If I am permitted to sacrifice correctness for performance and resource usage, I can provide an answer to any problem in fewer than 100 bytes of memory and 100 cycles of CPU time.
The target audience is those who do. It's a complex enough topic that it's reasonable the author chose not to explain it (an early question of "where do I begin?" doesn't have an easy answer).
"Denormalization" is saving data in the "wrong" place in a database so you don't have to do a join to a different table. It's "normalized" if the textual data[0] (e.g. a name) is all in the "correct" place (e.g. the user profile) and not in the "wrong" place (e.g. the user's comment). In a normalized database, the comment displaying the username would have a reference to the user record, which has a reference to the user profile record (assuming those are separate; they're often not).
The reverse of normalization. You normalize by moving redundant or non-related columns into other tables. For instance, instead of having a table with (cart id, product id, price) you'd normalize by creating (cart id, product id) and (product id, price) tables. The (product id, price) table is a single source of truth, and, now, no matter what cart something is in the price is known to be correct (or as correct as the product price table). And updating a price means one edit, not many edits.
Denormalization takes that normalized form and reverses it. Often for performance reasons. If updating prices is a rare occurrence, for this example, then getting the price total for a cart is slowed by forcing an extra join. Denormalizing brings the price back into the (cart id, product id, price) table and now the total price of a cart is a single query, no joins required.
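In SQL terms, a sketch of exactly those tables:

```sql
-- Normalized: price lives only in the product price table.
CREATE TABLE product_prices (
    product_id INTEGER PRIMARY KEY,
    price      NUMERIC(10, 2) NOT NULL
);

CREATE TABLE cart_items (
    cart_id    INTEGER NOT NULL,
    product_id INTEGER NOT NULL REFERENCES product_prices (product_id),
    PRIMARY KEY (cart_id, product_id)
);

-- Cart totals need a join:
SELECT ci.cart_id, SUM(pp.price) AS total
FROM cart_items ci
JOIN product_prices pp ON pp.product_id = ci.product_id
GROUP BY ci.cart_id;

-- Denormalized: copy the price back onto the cart row and skip the join,
-- at the cost of many edits whenever a price changes.
ALTER TABLE cart_items ADD COLUMN price NUMERIC(10, 2);
SELECT cart_id, SUM(price) AS total FROM cart_items GROUP BY cart_id;
```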
In the best cases, denormalization is caching sub-queries within the database by storing the same (or equivalent after transformation) data in multiple places. However, that requires more complex transactions to keep everything consistent every time you update data. If done carefully and correctly, you can sometimes get better read performance from denormalized data.
As others have pointed out, materialized views give you the same advantages as denormalization, except the database takes care of keeping everything consistent.
In the worst cases, denormalization is an excuse to not think too hard about organizing your data, hand-waving that the database will perform better with redundancies.