From relational DB to a single DynamoDB table (trek10.com)
195 points by forrestbrazeal on Jan 4, 2019 | hide | past | favorite | 66 comments


This is a nice talk explaining what a document database is and how it is supposed to be used. Many think it is about not having a schema, but you can have the database enforce a schema in a document database.

Document storage freezes access patterns in a way that creates data locality for the preferred patterns. The major downside is that you might not know all the access patterns up front, and even if you do, new ones can arise. Either way, it can be extremely difficult to accommodate unexpected access patterns.

I have seen document storage work well for aggregating data from external APIs. The whole point is that you are creating preferred access patterns for your data rather than using the external API directly. And you should be prepared to rebuild your database if your access patterns change.

I would also expect document storage to work well in an architecture of microservices under the conditions that:

* The schema is small so there aren't many possible access patterns

* Data is shipped to a SQL (warehouse) database for more complex processing anyway.


I tend to believe that one of the most powerful things about Dynamo is the streaming capabilities. It means that you can cleanly use Dynamo as the source of truth for your data, reap all of the performance benefits it has, and automatically ship that data to another database for more advanced BI querying. Moreover, you can do this offline; queue up every mutation that hits Dynamo, then process them at a rate that your data warehouse can ingest without worrying about overprovisioning for spikes.

I've worked on a team of 8 engineers that spent 6 months building a streaming replication system from MySQL to Redshift. It's crazy how simple this would have been had the originating database been Dynamo; it's very possible one engineer could have done it in a fraction of the time. It also seems likely to me that an open-source solution already exists.
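
A sketch of what the Dynamo side of such a pipeline might look like, as a Streams-triggered Lambda. The record shape follows the DynamoDB Streams event format, but the queue/warehouse integration is stubbed out, and the tiny deserializer only covers the attribute types used here (boto3's TypeDeserializer handles the full set in a real deployment):

```python
def _from_ddb(av):
    """Decode one DynamoDB attribute value, e.g. {'S': ...} or {'N': ...}.
    Minimal on purpose; only the types used in this sketch are handled."""
    if "S" in av:
        return av["S"]
    if "N" in av:
        n = av["N"]
        return float(n) if "." in n else int(n)
    if "BOOL" in av:
        return av["BOOL"]
    raise ValueError(f"unhandled attribute type: {av}")

def handler(event, context):
    """Flatten each stream record to (event_name, keys, new_image)."""
    rows = []
    for record in event["Records"]:
        ddb = record["dynamodb"]
        keys = {k: _from_ddb(v) for k, v in ddb["Keys"].items()}
        new_image = {k: _from_ddb(v) for k, v in ddb.get("NewImage", {}).items()}
        # In production, push these onto SQS/Kinesis here instead of
        # returning them, and let the warehouse drain at its own pace.
        rows.append((record["eventName"], keys, new_image))
    return rows
```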


> I would also expect document storage to work well in an architecture of microservices under the conditions that:

> * The schema is small so there aren't many possible access patterns

> * Data is shipped to a SQL (warehouse) database for more complex processing anyway.

That's pretty much exactly how I've seen it used.

edit: formatting


Especially with the new(ish) DynamoDB features mentioned in this article (e.g. adaptive scaling), the backup and restore capabilities including Point-in-Time Recovery, and the on-demand pricing model, it's a really effective database when going with the single-table design.

Wrapping your head around this takes some pretty serious time, along with trial and error. Expect to work and rework your data model multiple times. Turning to spreadsheets actually helps tremendously with defining and moving data around to play that game of Tetris mentioned in the article.

In addition, you'll probably want to avoid the common ORM-type clients that exist for DynamoDB; they tend to have made design decisions that turn a single-table design into an uphill battle. Write your own little wrapper for the native AWS DynamoDB client in your language of choice and you'll be much happier long term.


" Write your own little wrapper for the native AWS DynamoDB client in your language of choice and you'll be much happier long term."

That seems to be the secret with a lot of 3rd-party components: write your own wrapper that encapsulates the features you use, and don't let the component dominate your codebase.
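
A minimal sketch of such a wrapper (the table name and pk/sk key scheme are invented for illustration). The rest of the codebase talks to `AppStore`, never to the SDK directly, and the injected client makes it trivial to swap in a fake for tests:

```python
class AppStore:
    """Thin application-level wrapper; hides the single-table key scheme."""
    TABLE = "app-data"  # hypothetical table name

    def __init__(self, client):
        # Inject the client (e.g. boto3.client("dynamodb")) so nothing
        # else in the codebase depends on the AWS SDK.
        self._client = client

    def put_order(self, customer_id, order_id, attrs):
        # Key composition lives here, in exactly one place.
        item = {
            "pk": {"S": f"CUSTOMER#{customer_id}"},
            "sk": {"S": f"ORDER#{order_id}"},
            **attrs,
        }
        self._client.put_item(TableName=self.TABLE, Item=item)
        return item

    def get_order(self, customer_id, order_id):
        resp = self._client.get_item(
            TableName=self.TABLE,
            Key={
                "pk": {"S": f"CUSTOMER#{customer_id}"},
                "sk": {"S": f"ORDER#{order_id}"},
            },
        )
        return resp.get("Item")
```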


And don’t forget transactions...


When breaking up your records into multiple rows by attribute, you then require multiple read units, because DynamoDB charges you per row read, with a minimum of one read unit each. So each complete record would incur a cost per attribute, as if each attribute were sized at 4KB (the minimum size of a read unit) on read, and 1KB on write.

You also very likely need to use DynamoDB transactions (not really mentioned in the article) or else you're taking big integrity risks with partially incoherent records and so you're paying more for their transaction bookkeeping as well.

Doesn't this seem to map badly to DynamoDB's cost model?


Not exactly. If you find yourself needing multiple reads for the same record, then put the other data into whichever row you loaded first. It's all about planning your access patterns ahead of time.
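
The billing arithmetic behind both comments can be sketched like this (figures from DynamoDB's pricing model: strongly consistent reads are billed in 4 KB chunks and writes in 1 KB chunks; GetItem rounds each item up separately, while a single Query sums the matched item sizes before rounding, which is another reason keeping a record's rows in one partition and reading them with one Query keeps costs down):

```python
import math

RCU_CHUNK = 4096  # one read capacity unit covers up to 4 KB (strongly consistent)
WCU_CHUNK = 1024  # one write capacity unit covers up to 1 KB

def get_item_rcus(item_sizes):
    """GetItem/BatchGetItem: each item is rounded up separately."""
    return sum(math.ceil(s / RCU_CHUNK) for s in item_sizes)

def query_rcus(item_sizes):
    """Query: sizes of all matched items are summed before rounding."""
    return math.ceil(sum(item_sizes) / RCU_CHUNK)

def write_wcus(item_sizes):
    """Writes are rounded up per item, in 1 KB chunks."""
    return sum(math.ceil(s / WCU_CHUNK) for s in item_sizes)
```

So a record split into five 100-byte rows costs 5 RCUs via individual GetItems but only 1 RCU via one Query on its partition.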


In case you missed it, the code needed to load the Northwind database into DDB with appropriate indexes is here [0].

One fun thing: this was my first time playing with the new DynamoDB on-demand capacity features. If you do run the code, check out the CloudWatch metrics on the table. It's pretty cool to see the WCUs shoot up to 150 and back down to 0 with no throttling whatsoever.

[0] https://github.com/trek10inc/ddb-single-table-example


My biggest fear with On-Demand is that there's no cap to the cost, and an unruly service or access pattern could blast up the RCUs/WCUs instead of getting a simple throttle and timeout. Is this a valid fear, and is there any way around it if so?


It’s a valid fear. Dynamo can get really expensive really fast with on demand or auto scaling.

I have a Dynamo addon for Heroku apps and need to keep cost under control. I use AWS Budgets and put a fixed-price monthly budget on every table. If users go over the budget, their access keys get disabled.

https://elements.heroku.com/addons/autodyne

https://aws.amazon.com/aws-cost-management/aws-budgets/


Why would there be throttling? A WCU is 1KB per second.


Prior to the release of on-demand scaling, you had to provision your capacity units in DDB, and even once they introduced autoscaling there was always the risk that a burst of activity over your set threshold would lead to some throttled calls. My point isn't about the specific numbers so much as that it's nice to see that scaling finally being managed smoothly - one less thing to worry about.


https://aws.amazon.com/blogs/database/how-amazon-dynamodb-ad...

They initially implemented capacity units to be spread evenly at the partition level, and now they have finally made it work at the table level (which is how they sell it and how they talk about it).

Big whoop, you still have to ask daddy if you want to write at more than 80MB/s per account.


I was first playing with dynamo when the autoscaling feature was released... it wasn't great at the time (neither was manually scaling). It's nice to see the feature is working better now.


To be clear, auto-scaling with provisioned capacity still lags behind demand quite a bit. The new on-demand pricing makes this a non-issue, but you pay a much higher cost under any sort of sustained load. https://serverless.com/blog/dynamodb-on-demand-serverless/ has a good breakdown of when to use on-demand vs. provisioned capacity.


Paying in terms of write units adds so much complexity to what would otherwise be dead simple code.

Normally, you just write to a data store until it's done. The faster the disks the less time it takes.

Now, writes can "fail", so you need to wait/retry or put them back on a queue to reprocess when you have write units available (both of those, and the associated logging, will cost you $ on AWS).


I would offer that, depending on your querying needs, with a little added complexity you could replicate the data to an Elasticsearch cluster for more complicated queries and use Dynamo as the source of record/control.

IIRC, you can set up events based on Dynamo records (DynamoDB Streams) that trigger a Lambda function to update the search DB.


This is what I do: all writes go to Dynamo, and it replicates to ES for querying/filtering. It's the best of both worlds for operational stuff, IMHO, although still not perfect for ad-hoc querying like SQL.


No matter what, all data has to land in a data lake for BI and general reporting.

And always define reporting requirements up front. The last thing you want to find out is that you need a real-time dashboard and you don’t have the services or storage to support it.


This kind of design seems to make certain rows be accessed very often, no matter what data ends up actually being read. Doesn't this imply a hard bottleneck? DynamoDB IO operations can only scale "infinitely" if they are spread over a random distribution of keys, over which the backend can be partitioned and thus scaled horizontally, right?


That limit is going to be pretty high. Adaptive capacity (or on-demand) gives you up to 3000 read capacity units / second and 1000 write capacity units / second per partition.

The general solution, if you think you'll reach a volume where that is not sufficient, is to do something like adding an additional integer (or other known symbol) to the end of the key, e.g. `key.1`, `key.2`, `key.3`. When doing operations you then have to run N operations across the key + symbol space. Not as elegant, and you'll pay the additional computation for those queries, but since they can all be run in parallel, it's not too bad in terms of latency.
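
A minimal sketch of that suffix-sharding scheme (the shard count and key format are arbitrary choices for illustration, not anything DynamoDB prescribes):

```python
import random

N_SHARDS = 4  # tune to the expected peak throughput on the hot key

def shard_key(base_key):
    """On write, pick a random suffix so one hot logical key is spread
    across several physical keys (and thus partitions)."""
    return f"{base_key}.{random.randint(1, N_SHARDS)}"

def all_shard_keys(base_key):
    """On read, fan out one query per suffix (these can run in
    parallel) and merge the results client-side."""
    return [f"{base_key}.{i}" for i in range(1, N_SHARDS + 1)]
```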

Now, if all you are doing is hammering away at a few rows, you are stuck for sure. DAX [1] or some other caching layer may be your only way out of that problem.

[1] https://aws.amazon.com/dynamodb/dax/


Yes, that was my reaction on reading this, that it will only scale as long as queries are well distributed across the customers, products, etc.


I can't wait for the day when we figure out how to "scale to zero" with database engines and also still provide quick starts. The serverless model is fantastic for smaller projects that can't justify the monthly cost of RDS. I was super excited about Aurora Serverless until I found out cold starts take about 30 seconds, and even longer if you use a VPC to connect to it from a Lambda function. The single table approach with DynamoDB would work, but the amount of cognitive effort necessary compared to just using a relational database isn't worth it for small toy projects.


FaunaDB (my employer) scales from zero to worldwide, with no pre-provisioning or capacity planning required. In addition to serverless database pricing and no cold start, you get full ACID transactions as well as relational features. None of the overloading of shard keys etc as featured in the DynamoDB video.



(I work for Google Cloud)

Several GCP databases (most notably Firestore [0], and Datastore [1] before it) do this today. No provisioning, pre-warming, or capacity planning needed.

[0] https://firebase.google.com/docs/firestore/
[1] https://cloud.google.com/datastore/


Thanks for sharing! I really should have been more specific and said "SQL database engines". There's definitely a lot of interesting work going on in the document-store space, but as someone who's more familiar with SQL relational databases, it's always a challenge to fit my data model into the constraints of document-based datastores. For example, I'm used to being able to efficiently query on almost anything as long as it's in an index, whereas services like DynamoDB and App Engine Datastore (not sure if this is what Firebase Datastore is now) always seemed to want me to build my data around their unique constraints, making it difficult to move data between platforms. That's the real benefit of SQL databases: they're (largely) compatible with one another (as long as you're not using really niche features, that is).


The easiest solution, at least for Aurora Serverless + Lambda: cron a warming function that calls something in the cold service every 5 minutes.

This should add literally pennies, if that, to the total cost, and it's relatively effective. Of course, it isn't "scale to zero"; it's just never letting it go to zero. But the practical and financial difference is almost nothing.
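
A sketch of the warming function itself, with the database call injected as a stand-in (the real thing would execute against Aurora, e.g. via the RDS Data API, and be invoked by a scheduled CloudWatch/EventBridge rule):

```python
def keep_warm(execute_sql):
    """Run a no-op statement against the cold service so it never pauses.

    execute_sql is any callable that runs a statement and returns the
    result rows; it's a placeholder for your actual DB client."""
    result = execute_sql("SELECT 1")
    # Sanity-check the round trip so monitoring can alarm on failures.
    return result == [(1,)]
```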


If I remember correctly, the pricing model for Aurora Serverless is different from Lambda in that you pay as long as the engine is running, whereas with Lambda you only pay when the function is executing. Keeping an Aurora Serverless instance warm would then cost $43/month.


I smell a tool in the making! One where you enter your access patterns in a SQL-ish format up front, and you get your document DB design as output.

Then it should also wrap your new table up in an ORM to protect you from yourself.

Could also support generating migrations to support adding new access patterns, at the cost of modifying the entire database.



It’s called coder + coffee. :)


My only NoSQL experience was with document databases like Mongo, and I'd always thought that DynamoDB was too limiting. When I first watched this talk, it single-handedly changed my opinion of DynamoDB's usefulness and of how to properly architect a system around it.


This seems like they just want to remove joins with string concats (composite keys), and that doesn't sound like a good idea to me. Though all of this looks pretty slick, it seems to be built on a shaky foundation.


I was also very surprised by this. I'm having a hard time accepting that I'd have to manually concat my columns to create sort keys and keep track of the corresponding queries for them, e.g. "BEGINS WITH ...". Granted, I only watched the video, so maybe I'm missing something and that was just to simplify the presentation.
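
For what it's worth, the manual concatenation ends up looking roughly like this (table and key names are invented; `begins_with` is DynamoDB's actual key-condition function):

```python
def orders_in_month_query(customer_id, year, month):
    """Build Query parameters for 'all orders for a customer in a given
    month', assuming sort keys shaped like 'ORDER#2019-01-04#<id>'."""
    return {
        "TableName": "app-data",  # hypothetical single table
        "KeyConditionExpression": "pk = :pk AND begins_with(sk, :prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"CUSTOMER#{customer_id}"},
            # Zero-padding the date components is what makes the prefix
            # match (and sort order) work.
            ":prefix": {"S": f"ORDER#{year:04d}-{month:02d}"},
        },
    }
```

The dict is what you would pass to the native client's `query` call; the key-assembly logic is exactly the kind of thing worth hiding behind your own wrapper.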


Yeah, I'm with you. I've been reading up today on GSIs in Dynamo, but they really seem like a hack to get you a usable database. I need to come up with a good reason to use it at work; probably migrating some older databases so that we can multi-DC them.


Anyone else have trouble scrolling that page? It's scrolling very slowly, and page-up/down don't work. Firefox 64/Linux


Yes, FF/Linux. They must be doing some magic with scrolling.


Seemed to work fine for me. FF64/MacOS.


I just finished a big, critical project at work that uses this exact approach with an adjacency list in DynamoDB with 5 GSI's for different access patterns. Not only is it super fast, it's also very extensible. As this article states, however, the most important and hardest part is ensuring your design is correct because you can't really change it once you implement it. You can add features but it's basically impossible to modify an existing implementation. As long as you do the right planning, DynamoDB works great!


I love dynamodb in theory. It's a great C* replacement for simple users who don't need some of C*'s features.

What I don't like about the auto scaling is that it seems to be limited to either scaling reads or writes automatically. That seems short-sighted; I don't understand the technical reasoning.


There are two sets of settings for auto-scaling reads and writes, but you can turn them both on.

Or you can turn on the very recent “on-demand pricing” and not worry about capacity settings.


Isn’t there a partitioning issue on the GSI here?

I mean, all customers have something like pk=customer1, sk=customer, so all customers are in the same partition of the GSI. I know it's a big deal to distribute the table partition key, but doesn't that also apply to the GSI?


It does apply to the GSI. This point is briefly addressed in the sidebar about sharding farther down in the post. I'm not sure the "customer" records would have a problem, but the static GSI partition keys for orders, products, and suppliers could potentially run into issues.


Dynamo as a replacement for RDS is great until you come across one of those simple things that are easy to implement in an RDS, like a paging model for row results where you need to jump to a page in the middle. It's the biggest PITA.


>Step 1: Define the access patterns you think you’ll need

What happens when they change?


It depends. You might get away with just adding a Global Secondary Index or you might be totally screwed. DynamoDB sacrifices flexibility in favor of scalability.


You won't be totally screwed since you can always export/import your data into a new table with new indexes, but it could certainly take some work. That said, re-indexing large amounts of production data will take some planning with any database.


You can add GSIs to live tables, but not LSIs.


I wonder how all those column-oriented databases deal with scalability. I've heard some miracle stories about their performance, and they seem to be used for really, really large data sets. I've never heard anyone complain about lack of flexibility, but of course there are many different flavors of column-oriented storage; some are relational and support SQL.


Yea... they can't really change. You can add but you can't really change. Not easily anyway.


I love the idea of this, but it is so hard for me to wrap my head around; I can never get my mind out of the relational model. Of course, for the personal project I am working on, I am not sure NoSQL makes sense, but that could just be me not knowing anything other than relational. It seems a lot like some functional languages: just so foreign compared to what I have always known.


And what's wrong with a relational database for this use case? I don't see the point of the conversion. NoSQL has its place, but it's not for replacing the Northwind database.


The way I usually think about it is that I start by designing the queries. SQL has you go backwards: you design the data per how it lays out normalized, then write complex queries to answer the questions you want. NoSQL has you think about answering your queries and modeling to that. Think of most APIs as basically just hitting one table and one backing index, max. If you need to do ad hoc querying, do what others have said and stream to a different data store. In other words, design for scale first, based on the most common user actions.
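
A toy illustration of that query-first modeling, using an in-memory stand-in for the table (entity names and key layout are invented): one partition holds a customer row plus its order rows, so "customer with their orders" is a single Query rather than a join.

```python
# Pre-joined items in one partition, laid out for the read path.
items = [
    {"pk": "CUSTOMER#1", "sk": "PROFILE",          "name": "Ada"},
    {"pk": "CUSTOMER#1", "sk": "ORDER#2019-01-02", "total": 20},
    {"pk": "CUSTOMER#1", "sk": "ORDER#2019-01-04", "total": 35},
]

def query(items, pk, sk_prefix=""):
    """In-memory stand-in for a DynamoDB Query on (pk, sort-key prefix)."""
    return [
        item for item in items
        if item["pk"] == pk and item["sk"].startswith(sk_prefix)
    ]
```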


That does make sense. I am just trying to think of it in terms of my current personal project and how I could lay it out, but the problem is that it's mostly ad hoc queries, since it is historical data, like storing game results and statistics. I think relational works best for my current situation. However, this is pretty cool for future projects where I would have just used relational before. Watching the videos about it has been a great help; it helps me figure out the use cases.


What's the TPS on any of your ad hoc queries? If it's low and very specialized, then a relational, Lucene-, or Redshift-type system is likely what you want, e.g. more like ERP.

If, say, you're doing Dotabuff-type stuff, then just chunking out the records you want by player, or even an S3 blob that gets updated, might be better. Then the rest of your tables are literally just indexes into the large amounts of data.

It totes depends on what you're doing.

The big issue with relational really is that you hit a point where transactions just can't keep up with locking across the cluster. If your system doesn't do this, then your next problem is just sharding tons of data. If you don't have enough data, then relational is fine. If you do have that much data, then you're going to have to figure out a sane sharding and joining strategy across a cluster, which is more work. At some point in data size, you either have each user on their own shard of boxes, or you are keeping only indexes in the databases to retrieve other data. Both take a bunch of work.

Much of the motivation behind that talk and Amazon's drive away from SQL is really just the business risk of hitting that scaling limit while you have a business that is doubling in revenue. Your options when you hit a limit on SQL (especially with transactions) are to heavily re-architect, and that takes a while, so you have to pause growth during that time. At which point a second mover can come roaring past you.

If I was building something like Dotabuff, I'd go with a relational database sharded by users, or some type of columnar database for the ad-hoc querying. Though really, for Dotabuff you could build almost the entire site to be statically served off of CloudFront and just re-crunch data when a new match comes in.

However I'd have dynamo as the front end data ingest record keeper. Things like "did I see this game data already" to avoid using things like FIFO queues.

Happy to talk about it more if you provide more examples of what you're trying to map.

And again, if you don't have these problems, relational is just totally fine. You can scale quite high on relational; it's just that you run the risk of hitting that asymptotic tipping point where the system goes down and you can't get bigger hardware. If you're not going to hit it, the tool you know will serve you well.

Another random thought: I find that people who think in SQL forget how damn awkward it is. For instance, ask someone who uses relational regularly how to create tables for users' shipping addresses. They'll spout a denormalized form that works in many, many places. Ask a college hire the same question, and you might get more than one table. The thing that is obvious to the regular user will be non-obvious and confusing to the college hire. And I would argue that the table layout is something that evolved over 30+ years of the industry doing it over and over.


I just got enlightened. I've used both SQL and NoSQL for years now. In my simplistic view of the world, I was convinced relational data goes in SQL and denormalized KV data with a single access pattern goes in NoSQL. The tools I was using don't offer the secondary index option. It blew my mind that I could support a full relational model in DynamoDB. Thanks for sharing this!


This "index overloading" is an interesting term that I haven't heard before (though it seems very similar to an entity-attribute-value model).

The only document store I've used "in anger" has been DynamoDB, does this technique apply well to other document stores?


Curious: when would you choose DynamoDB over RDS / Aurora?


Rick Houlihan has another talk from re:Invent 2018 where he goes into great detail about this -- see the following Twitter thread for a jumping off point.

https://twitter.com/alexbdebrie/status/1081237474930769920


I just finished watching that video 10 minutes ago and my mind is blown. I've been using DynamoDB wrong for most of 7 years.


That's an awesome diagram, and I look forward to the video.

I think this nicely illustrates why most projects shouldn't use a high-scalability NoSQL database: they benefit much more from flexible data models and efficient queries.

All that being said, there are other factors when choosing a database aside from those three. For some of our microservices, my team doesn't need scalability or flexible queries, just efficiency. However, we've found that it is cheaper/easier for us to use DynamoDB than RDS.


If you write a lot, you should probably use Aurora because it's less $$$.


The video is very good, everyone should watch it.


OMG! Why put yourself through this NoSQL silver bullet? NoSQL has a very niche application (extreme high load and low-dimensional data with few relationships). That automatically excludes most business cases! A relational DB (plain old SQL) is what you're looking for 9 times out of 10. The amount of stupid workarounds I've seen just to say "we use DynamoDB" is epic...



