We did a similar migration (somewhat larger database) with ~20 seconds of downtime and much less work... using the magic of AWS RDS Blue-Green Deployments [1]. Surprised they aren't mentioned in the thread yet.
Basically, you spin up a new Blue/Green deployment with any desired changes (in our case, we were upgrading Postgres from major version 13 to 15). While your blue configuration continues to serve traffic, AWS uses logical replication to keep the "green" deployment in sync. You can keep modifying (or testing) the "green" deployment (e.g. you could load-test it if you wanted to), as long as you don't do any writes to it (writes still have to go to your live, blue configuration, and are replicated to green).
When you're ready, you run the "switch" command, and AWS does a few things for you: runs checks to ensure blue/green are in sync, stops writes and connections, waits a few seconds to ensure replication is caught up, renames your database, then allows connections/writes again. We had less than 20 seconds of downtime, by our count. And we had a primary and several read replicas, and AWS successfully switched the full configuration over with no hiccups. You don't even need to change your configuration, because AWS swaps the database URLs for you. Green becomes blue, blue becomes old blue, and when you're ready, you delete "old blue".
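For reference, the whole lifecycle is three RDS calls. A minimal AWS CLI sketch (identifiers, ARN and version are made up; check the current flags before relying on this):

    # create the green environment alongside blue
    aws rds create-blue-green-deployment \
        --blue-green-deployment-name pg13-to-pg15 \
        --source arn:aws:rds:eu-west-1:123456789012:db:prod-db \
        --target-engine-version 15.4

    # once green is in sync and tested, switch over
    aws rds switchover-blue-green-deployment \
        --blue-green-deployment-identifier bgd-EXAMPLE123

    # after verifying, remove the deployment object (the old blue
    # instances stick around until you delete them separately)
    aws rds delete-blue-green-deployment \
        --blue-green-deployment-identifier bgd-EXAMPLE123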
Highly recommend! They do have some restrictions (for instance, I'm not sure it would work if you're switching accounts, etc.).
+1 for B/G. That said, I imagine they couldn’t use it due to the cross-account shift. I’ve used it for both MySQL (with much higher QPS than TFA, by two orders of magnitude) and Postgres, and they both went flawlessly.
Read the docs, especially the limitations. Re-read them. Do a test run in the dev environment, under load. Do it again in staging.
Or just YOLO into prod ‘cause it’ll probably be fine, I guess.
I also used RDS Blue/Green deployment to apply a MySQL major engine version upgrade from 5.7 to 8.0. With respect to downtime it worked fantastically, I think we measured 13 seconds of observable downtime from the API.
However we did learn the hard way that RDS Blue/Green cannot be used to apply arbitrary changes. In our case, we discovered RDS Blue/Green can only be used to move up engine versions, not down.
We discovered on MySQL 8.0 one of our stored procedures had very occasional failures, and considered the option of using RDS Blue/Green again to move back down to 5.7. Turns out that's not an option.
Database changes are typically one-way. If your new change includes creating or modifying a table, such that there are new additional columns, and you populate those with data, then downgrading would destroy the changed columns and the data in them. Hence you can't downgrade once you upgrade or you'd potentially be breaking things. To downgrade safely you'd need to backup or snapshot the old database, and then restore your database back to the backup/snapshot, but that's not blue/green.
DB schema migration script frameworks (at least in Python, Ruby & Java lands) do typically support both upgrade and downgrade directions. People skip implementing and testing the downgrade side if the development model doesn't need it but the problem of what happens to the data is controlled by what you put in the "down" migration script.
I'd guess if you can't throw the data away, you won't do a down migration, you'll do an up migration that changes the db to save that data in your preferred way before undoing or reworking the previous schema change.
This is solved more cleanly in declarative schema management systems, where you have a schema repo of CREATE statements, and the tool can auto-generate the correct DDL. You never need to write any migrations at all, up or down. If you need to roll back, you use `git revert` and then auto-generate from there. The history is in Git, and you can fully leverage Git like a proper codebase.
A key component is that the schema management tool must be able to detect and warn/error on destructive changes -- regardless of whether it's a conceptual revert or just a bad change (e.g. altering a column's data type in a lossy way). My declarative tool Skeema [1] has handled this since the first release, among many other safety features.
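For a flavour of the workflow, a sketch (environment name is illustrative; treat this as an outline, not a full guide):

    # repo holds CREATE statements; preview the DDL that would be generated
    skeema diff production

    # apply; destructive DDL is refused unless explicitly allowed
    skeema push production
    skeema push production --allow-unsafe   # opt in to lossy changes

    # "rollback" = revert the commit, then regenerate the DDL
    git revert HEAD
    skeema push production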
That all said, schema changes are mostly orthogonal to database version upgrades, so this whole subthread is a bit different than the issue discussed several levels above :) The root of the blue/green no-rollback-after-upgrade issue discussed above is that MySQL logical replication officially supports older-version-primary -> newer-version-replica, but not vice versa. Across different release series, the replication format can change in ways that the older version replicas do not understand or support.
I have two theories where people end up wanting them:
(1) Circumstances that for some reason enforce the requirement that sysadmin-type people must always be able to downgrade / roll back deployments without "the developers" producing new builds or software artifacts. A separation of ops and dev teams, where you decide you need to survive an inability to make or procure new software builds on demand, and just dig up old deployment artifacts to use after a down migration. There are a lot of wrong reasons to do this in in-house software settings, but there's also the classic "we bought a 3rd-party server app and plugged it into our on-prem database" case, like Jira or something.
(2) Systems that are technically unable to recover from errors happening during db migrations (a missing transactional schema change feature in the db, and/or the application doing db-related stuff that can't be rolled back at deployment time). So the down migration is more like a hand-coded rollback for the migration, run automatically when a deployment fails.
In both cases I can see how the "what happens to data in new columns" situation might still work out. In case (2) it's sort of obvious: there's no new data yet. In case (1) you live with it or choose the backup-restore path. I can see scenarios where you decide it'd be much worse to restore from backup and lose people's entered data for a couple of days (or however long it took to find the showstopper for the upgrade) than to run the down migration and just lose the new-feature-related data. (Which you could also rehearse and test beforehand with backups.)
Our in-house schema migration tool supports downgrading, but it won't remove non-empty tables or columns etc.
For us this isn't a big deal though, because we're writing our software so it should be able to function as expected on a DB with a newer schema. This makes upgrades much easier to handle, as users can run new and old software side-by-side.
I wonder if that could be because MySQL 8's replication is backwards compatible but MySQL 5.7's isn't forwards compatible. If so, it makes sense that you're only able to move forward.
We did the exact same thing not too long ago, but we couldn't do it using blue/green. We were able to minimize downtime quite a bit, but it was on the order of minutes, not seconds. I wrote a little bit about the process here. I spent a lot of time prepping and running the migration, so happy to share any details if it's helpful.
We recently did this on my team, over Christmas this year. We opted not to use Blue/Green for this but instead spun up an encrypted snapshot and began replication from the old database to the new database using MySQL's tools. Once traffic on our platform was sufficiently low, we cut connections to the database, waited for replica lag to reach 0, and relaunched the servers with the new database's host info. Our downtime was around a minute.
This is how we're planning to encrypt our RDS Postgres DBs, but with PG's built-in logical replication.
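For anyone curious, the core of that approach is a publication/subscription pair. A minimal sketch (host and database names are hypothetical; on RDS you also need the rds.logical_replication parameter enabled, and the schema copied to the target first with pg_dump --schema-only):

    -- on the source (unencrypted) instance
    CREATE PUBLICATION enc_migration FOR ALL TABLES;

    -- on the target (encrypted) instance
    CREATE SUBSCRIPTION enc_migration_sub
      CONNECTION 'host=old-db.example.internal dbname=app user=replicator'
      PUBLICATION enc_migration;

    -- watch catch-up before cutting over
    SELECT * FROM pg_stat_subscription;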
I tried it with DMS and it was so cumbersome. We would've needed to make a lot of schema changes to get DMS to function. Specifically, DMS can't replicate a number of character column types because they're treated as CLOB.
Good question. This was a pain point for my small team (me, myself, and I) a little while back. We had an unencrypted db deployed with CDK, and then tried to get it encrypted without losing data.
How did you go about stopping and restarting applications which reach out to the database? We have a number of tasks running in ECS which can take a minute to spin down and a few minutes to spin back up.
For our web service, we didn't stop anything. They had a few seconds of errors, though it seems like some sessions were just buffered or paused and experienced high latency.
We also had background worker services. For the very high throughput ones, we spun down the # of tasks to a bare minimum for <5 minutes and let the queue build up, rather than have a massive amount of errors and retries. For the other ones where throughput wasn't high, we just let them be, and during the downtime they errored and retried and the retries mostly succeeded.
+1 for Route53 Groups and B/G setups. We did something similar with PG upgrades, no downtime with AWS R53 groups & retry inflight queries with a custom Rails ActiveRecord transaction patch.
Trade off: For a few seconds some requests were slower.
DNS Groups w/ retries is a nifty mechanism for these things.
I'm planning to do this soon. Is there any problem skipping versions when upgrading? I would like to go from Postgres 12 to 16. Does Postgres handle that without worries?
There are various ways to 'pause' incoming postgres queries, for example using pgbouncer: i.e. don't fail them, simply delay them until the replication has caught up and then let them continue on the new database.
If anything goes wrong and replication doesn't catch up, you can unpause and let those queries happen on the old database.
Therefore, your 11 seconds of downtime becomes 0 to 11 seconds of added page load time. More importantly, among the thousands of users of the database who have never seen a query fail before, some will have buggy error-handling codepaths or batch jobs that a single failed query ruins, so this approach leads to a lot less collateral damage.
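The PgBouncer side of that is just the admin console's PAUSE/RESUME. A rough sketch (database name is illustrative; repointing the backend means editing pgbouncer.ini between the two commands):

    -- connect to the admin console, e.g.: psql -p 6432 -U pgbouncer pgbouncer
    PAUSE app_db;    -- new queries queue; in-flight ones drain
    -- edit pgbouncer.ini to point app_db at the new primary, then:
    RELOAD;
    RESUME app_db;   -- queued queries continue on the new database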
A lot of the discussion boiled down to 'this is a lot of complexity to avoid a few minutes of downtime'. I guess this is the proof: just use AWS Database Migration Service, swap the DNS entries to go live, and live with 11 seconds of downtime.
There is no "just" about it. The absolute key takeaway is in "what we learned":
> We chose to use DMS because it was well supported by the GOV.UK PaaS and we could also get support from AWS. If we were doing a PostgreSQL to PostgreSQL database migration in the future, we would invest more time in trying alternative tools such as pglogical. DMS potentially added more complexity, and an unfamiliar replication process than what we may have found with other tools. This backs up what AWS say themselves on PostgreSQL to PostgreSQL migrations.
Even AWS's own docs say to use the native tools when migrating from Postgres to Postgres [1]. They don't go into too much detail and point to pg_dump rather than pglogical, but it's interesting to see that they don't recommend using DMS for it.
They do, but those recommendations are buried quite deep in the documentation, well behind all the marketing guff that suggests that DMS is all things to all people, and wonderful magic that is ideal for all situations.
There are a lot of gotchas with using DMS (which seems to use pglogical under the hood). Since it’s not hardware-level replication, you can run into issues with large rows/columns/tables and it doesn’t really handle foreign keys. It may not handle some special data types at all. You also need to update the sequences after the migration or you’ll get errors about duplicate primary keys. You can also have issues if you don’t have proper primary keys, because it doesn’t always copy the entire row at once.
If the databases are within the same AWS account, it’s likely easier to use hardware-level replication with global database or snapshots to do migrations if you’re ok with 4-5 mins of downtime.
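The sequence fix-up mentioned above is a one-liner per serial column on the target after cutover; something like this, with a hypothetical table name:

    SELECT setval(
      pg_get_serial_sequence('notifications', 'id'),
      COALESCE((SELECT MAX(id) FROM notifications), 1)
    );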
There are many options available with PostgreSQL: you could also do a physical full backup + WAL-level replication to keep the target in sync AND get low downtime.
What might have oriented their choice is that they wanted to upgrade from major version 11 to 15 during the migration process. This is only possible using logical replication. Otherwise you'd have to chain the upgrade process through each major version (and possibly the OS, because 11 is EOL on some architectures), and that is neither trivial nor quick.
We recently completed the migration of a self-hosted 3 TB PostgreSQL database from version 12 to 16, transitioning from Ubuntu 18 to Ubuntu 22. Concurrently, we had to upgrade various extensions, most notably Timescale, for which a compatible version did not exist across all scenarios. We performed the upgrade by updating a replica in the following sequence:
- Start: PG12, Ubuntu 18, TS2.9
- Step 1: Set up a read-only replica with PG12 on Ubuntu 22, maintaining TS2.9.
- Step 1.5: Enter maintenance mode and halt all services.
- Step 2: Detach the read-only replica, upgrading from PG12 to PG15 on Ubuntu 22 with TS2.9.
- Step 3: Upgrade from PG15 with TS2.9 to TS2.13 on Ubuntu 22.
- Step 4: Upgrade from PG15 to PG16 on Ubuntu 22 with TS2.13.
- Step 4.5: Reconnect services to the new database server, resume all services, and exit maintenance mode.
All the database upgrade steps were well-tested and automated using Ansible. Nonetheless, we did encounter an issue that had not arisen during testing. This extended our downtime to approximately half an hour, which, for our use case, was perfectly acceptable.
Employing logical replication could have mitigated the last-minute surprise. So we will consider this approach for our next upgrade cycle.
We recently followed an almost identical upgrade path (however at the time PG16 wasn't yet supported by Timescale, so stopped at PG15 + TS 2.12).
We did look into using logical replication to reduce the downtime of the upgrade, but because database schema and DDL commands aren't replicated, it seems it isn't recommended with Timescale in the loop. (I suppose the underlying schema changes that Timescale makes under the hood are mostly a function of your hypertable chunk sizing and what your incoming writes look like, so this could be planned around / timed well, but we felt it added too much potential complexity & risk compared to simply opting for a small maintenance window while pg_upgrade completed.)
For a software engineering project, you probably want to limit your transactions to much less than that (set statement_timeout is your friend). If you've got extremely long transactions, you can probably avoid doing the switch-over when they run (hopefully they are not a random occurrence but a result of a scheduled job or similar).
In combination with a transaction time limit and fail-over configuration (where you fail the old primary), you can control the slowdown (instead of downtime, e.g. with pgbouncer) very precisely.
I would be more concerned with the DNS TTL being respected in the entire stack (and external caching DNS servers you rely on), tbh.
But it is usually not critical to avoid a dozen seconds of downtime for an app, so whatever is simpler for you should be your go-to solution.
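Concretely, the statement_timeout setup mentioned above might look like this (role name and values are illustrative):

    -- cap ordinary application statements
    ALTER ROLE app_user SET statement_timeout = '30s';

    -- relax it only inside a transaction that legitimately runs long
    BEGIN;
    SET LOCAL statement_timeout = '10min';
    -- ... scheduled batch job ...
    COMMIT;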
No, but all the code generation was happening in the server itself, so I don't know how hard and fast it applies; iirc SQL Server's limits are like 64k x packet size.
Well, you've certainly introduced a teaching moment to me! What is the nature of the writes you deal with that last 30+ minutes? What kind of data/what kind of people are involved with such DB writes, where you need to rely on the DB engine to work so hard instead of something more split up by queues at a higher layer?
Postgres queries are streaming. I.e. if the query result has 1 billion rows, but the client only has enough RAM for 1 million rows, then the query will be slowed down until the client is ready to accept more rows.
If the client is slowly processing through the result set, then the query can take many minutes/hours.
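You can see the same client-paced behaviour with an explicit cursor (table name hypothetical); the server roughly only does the work needed to satisfy each FETCH:

    BEGIN;
    DECLARE big_scan CURSOR FOR SELECT * FROM events;
    FETCH FORWARD 100000 FROM big_scan;  -- repeat until it returns no rows
    CLOSE big_scan;
    COMMIT;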
It's usually badly engineered things. However, just because it's badly engineered doesn't mean it's fine for it to break :-P
Things like a super complex n^3-complexity query to find all users who were 3 hops away from a known fraudster, where a 'hop' means 'shared any piece of account data in common'.
> The second change was to create a DNS record in AWS Route53 for `database.notifications.service.gov.uk` with a 1 second TTL (time to live) [..] our migration script just needed to update the DNS weighting in AWS to 100% of results being sent to the target database location and wait 1 second for the TTL to expire. Then, when our apps next try to query our database they will be querying our target database.
Wait. Their (or Python's default?) db ORM blocks while it performs a DNS lookup for each and every query!? It doesn't cache resolved addresses for any length of time? No connections are pooled and reused?
It would probably be the OS' `getaddrinfo` or `gethostname` that does this: Python rarely reimplements system level calls, which means it relies on the system's configuration.
If TTL of 1s was respected, they would be cached for 1s, but it's not uncommon for DNS query libraries and especially caching DNS servers to not fully respect TTL anyway: tbh, that might explain some of the downtime they've seen.
I didn't mean it was directly implementing the networking call to the dns server -- just that it wasn't directly caching the result.
getaddrinfo(3) and getnameinfo(3) (guessing that's what you meant) don't implement caching, at least not explicitly in the spec and not normally in practice. On Windows, DNS results are cached by the OS, but on Linux that would be distro-dependent behavior and usually requires setting up a local caching DNS server (Ubuntu uses unbound out-of-the-box, iirc. Other choices include nscd and dnsmasq).
Even if they implemented caching at the syscall level, this still assumes no connection stays open for more than 1s or is reused across queries. That seems like a big assumption (at least I hope it is, because I certainly wouldn't want my app to initialize a new db connection, let alone perform a DNS lookup, for every query).
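With SQLAlchemy (which this service reportedly uses) you normally do get pooling, and DNS is only consulted when the pool opens a new connection. A sketch of settings that bound how long a stale resolution can survive a cutover (URL and numbers are made up):

    from sqlalchemy import create_engine

    engine = create_engine(
        "postgresql+psycopg2://app:secret@database.example.internal/app",
        pool_size=10,
        pool_recycle=300,    # retire connections after 5 minutes
        pool_pre_ping=True,  # probe before use so failovers become reconnects
    )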
Lovely! We just migrated from postgres 14 to 16 for 3 postgres clusters (servers) on RDS containing about 2TB of data across 8 databases. We were down from 00:00 to 04:00. Steps we took:
* enabled our fallback "maintenance mode" site. It's a super lightweight version of our site running on CF workers.
* scaled down all apps using the db to 0 in terraform
* hit the upgrade button in the aws web ui, which runs pg_upgrade. 14->15
* waited for it to finish
* hit the upgrade button again. 15->16
* waited for the dbs to start accepting connections (they do before they're marked as ready, I think aws does more stuff than pg_upgrade)
* Started `VACUUM ANALYZE; REINDEX DATABASE CONCURRENTLY`. The idea is to avoid performance issues between versions and make use of performance improvements from new versions.
* Started spinning up the apps again
* Waited until all apps had a handful of containers running
* Started accepting traffic (disabled maintenance site)
* Went to bed
The REINDEX CONCURRENTLY happily chugged along for the biggest db for another 18 hours without blocking anything. Next time we're doing aws blue/green deploys to avoid downtime. We didn't this time since we weren't on 14.9 yet (the minimum minor version of 14 supported by blue green).
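For reference, the post-upgrade maintenance amounts to something like this (database name hypothetical); the progress view lets you watch the rebuild from another session:

    VACUUM ANALYZE;
    REINDEX DATABASE CONCURRENTLY app_db;

    -- in another session:
    SELECT relid::regclass, phase, blocks_done, blocks_total
    FROM pg_stat_progress_create_index;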
If I was doing this myself I wouldn't pay the AWS tax, instead do blue/green myself with logical replication and a load balancer.
FWIW GOV.UK Notify is part of a suite of services offered by GDS to UK public sector bodies (along with GOV.UK Pay and GOV.UK PaaS) that was originally known as "Government As A Platform".
Interesting, though I have no idea why the government is using AWS in the first place. This isn't a startup hacking away trying to find PMF, or dealing with unpredictable marketing-driven traffic spikes. We know we need these services running long term, and can make solid predictions about usage patterns.
We could build a public sector cloud and/or adopt a sensible on-prem approach. This requires funding, coordination and technical leadership, but would save the taxpayer an enormous amount over the long term.
Public sector IT is a disaster in general ofc, but I know there are good engineers working there.
As a startup, my reason for using "cloud" (PaaS) is not to catch spikes, but because of focus. Every hour that I spend running around with HDDs, screwdrivers (or the cloud version thereof - storage, ansible, etc) is an hour that I'm not spending on stuff my customers need.
Why would this be any different for a government?
We don't expect our government to build their own cars, but to buy them from Volkswagen or Renault. Even when the government has a clear need for transport. Why do we then insist they build their own IT infrastructure?
So should a government (e.g. the police) produce its own cars, rather then buy them? Or set up a factory to make laptops?
I'm not trying to be pedantic. I'm trying to understand where the line lies (and whether that line is universal for all governments on all layers in all countries).
Your examples do not make sense: on-prem existed long before cloud, and some shops were perfectly fine with it. If the police were producing perfectly fine cars and then went for a car-share program, then I would argue that yes, it may be reasonable for them to consider going back to producing cars.
> This isn't a startup hacking away trying to find PMF, or dealing with unpredictable marketing-driven traffic spikes. We know we need these services running long term, and can make solid predictions about usage patterns.
If you think government needs and demands are predictable, you don't follow politics (particularly uk politics in the last decade).
And then there are these things like pandemics that completely come out of left field. Being able to scale things on demand over the pandemic was one of the key demonstrators for use of the public commercial cloud by the public sector.
Why do we assume scaling can happen only on cloud? Gov sites are usually very low tech. We are not mining crypto. I would guess even if they outright bought servers and just scrapped them later on, they would come out ahead.
One thing about (at least UK) government is that procurement requirements mean they go to market for quotes around usage every few years. If, e.g., Oracle Cloud was 1/10th the price, it would likely win the deal, and so they would have to migrate to Oracle for the duration of the contract, and then potentially do the same to another cloud if that was cheaper.
I'm sure a lot of countries in the EU did this. I know first hand they did this in Croatia, as I was one of the developers who had to use it to deploy on it.
The worst thing I have ever seen in my life. And I worked on a lot of legacy apps written in VB.NET, Web Forms, old SharePoint, Basic, and even one where the whole app was one big mess of stored procedures.
AWS, Azure, GC are at least written with thought for end users (us, developers), while the government cloud was architected, designed and built by the lowest bidder, whose first goal was to cut and cut and cut his costs wherever possible.
I don't know about the UK, but AWS has had GovCloud in the US for a long time, and to be honest, compared to a lot of infrastructure I have seen there, it's a blessing. On the flip side, I've met some really amazing infrastructure and ops people in a German gov healthcare institution running an in-house DC, where the problem was neither the tech nor the people, but 100% the management and their processes, and their desire to be the bottleneck for every single interaction between infrastructure and engineering teams.
If you really think the public sector could build anything closely resembling any cloud, you are dreaming. Imagining one cloud working for the entire public sector, we are entering delusional territory here.
Public sector projects are insanely expensive, take ages to develop and often flat out fail.
Not only that, we are starved for engineers even now. If we ran such a custom cloud, we would shoot ourselves in the foot by not being able to hire from a pool of experienced engineers.
Oh I totally understand that. Public sector IT is littered with failed projects, massive overspends and lasting functional problems. There are numerous examples from the last few years alone - hundreds of millions spent on broken software. I used to work in public affairs and followed a lot of those projects closely.
I don't think this means "government = bad at tech" though. You sometimes see smaller in-house teams do really good work. The biggest issue seems to be with contracting and procurement policy. For example, on the Police Scotland i6 program they hired a bunch of consultancies to write a tender document, and then hired CGI for the final project. That turned out to be a copy of the Spanish system, which turned into a huge and expensive disaster as the requirements differed.
Feels like government has a similar problem to a lot of legacy non-tech companies. They don't have much technical leadership, don't understand the problem, and decide to hand it off to the lowest bidder. Doesn't help that they are often legally obliged to act this way. But the underlying engineering problems aren't unsolvable, and don't need to become a huge mess every time. (Your point about recruitment is fair though)
If you think the public sector could improve its operational efficiency by building, operating and using their own cloud, I've got a bridge to sell you.
Let's ignore the technicalities completely. Here's a government providing a detailed look into their engineering practices, complete with graphs and configuration snippets. How many governments manage to publish such a thing?
Minor detail, but did anyone else notice they are using eu-west-1? Kind of weird for the UK to be hosting their sites in another country (Ireland). I'm sure this isn't super sensitive data, but still.
They've been on AWS since before eu-west-2 was added, and it's still not at the scale of eu-west-1. Capacity issues are still a thing (but getting much better), and only recently has it approached feature parity.
I think one of the bits of carry-over legislation in the Withdrawal Agreement after Brexit was the UK continuing to comply with GDPR. So using an Irish DC is still ok.
Very clear and concise blog! Loved reading it. I'd be very curious to see how Amazon DMS performs at scale. Scale includes any of these situations:
a) Larger dataset - 2+TB
b) Higher throughputs - WAL Spikes (at least 2-3K TPS)
c) Reliably and efficiently replicating TOAST Columns (ex: large JSONBs).
d) Advanced data types - BLOB, CLOB, HSTORE, ARRAYs etc.
In my previous role with the Azure Postgres team, we observed that Azure DMS worked pretty seamlessly in migrations that did not involve the above situations. Once any of the above came in, the replication experience was hairy. Interestingly, in my current gig at PeerDB, many use cases we are seeing have one or more of the above situations. We have been doing quite a few surgical optimizations to handle those situations, and they keep coming! That is the beauty of Postgres, enabling a myriad of workloads, each unique in their own way! :)
I've also used DMS to migrate smaller datasets (~200GB) from AWS RDS MySQL 5.7 to RDS PostgreSQL 11. 10 seconds of downtime here was actually incurred not due to the migration itself, but to enable MySQL binary logging on the RDS instance, which requires a restart, and which AWS DMS uses to replicate changes from the replication source (MySQL) to the target (PostgreSQL).
Traffic was steered to the new PostgreSQL instance not with DNS/Route 53 weighted records, but an application-level percentage enroller (based on user ID). Our particular set-up did in fact have apps talking to both databases simultaneously, each with a percentage share of traffic, and so we did not need to incur additional downtime to cut off traffic to the original source database - though now you do have to worry about eventual consistency.
I wouldn't recommend using their Schema Conversion Tool. We instead migrated the data as 1-to-1 as possible into the PostgreSQL target table, and then used plpgsql triggers on the replication target table to clean/validate the data and write it to another table with our desired (and more strongly typed - no more storing JSON as TEXT) schema.
There were also some issues with the replication task seeming to stall and stop during the change data capture (CDC) phase. As mentioned, support is kind of spotty in this area, though we learned it may have to do with sizing the DMS replication instance correctly.
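The trigger approach mentioned above looks roughly like this (all names hypothetical): DMS writes 1-to-1 into a staging table, and the trigger validates/re-types each row into the real schema:

    CREATE FUNCTION clean_notification() RETURNS trigger AS $$
    BEGIN
      INSERT INTO notifications (id, payload)
      VALUES (NEW.id, NEW.payload_text::jsonb);  -- e.g. TEXT becomes JSONB
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER clean_on_insert
    AFTER INSERT ON notifications_staging
    FOR EACH ROW EXECUTE FUNCTION clean_notification();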
I love that where the rubber meets the road it doesn’t matter if you’re a government organisation or a web agency, it’s still just plain old Postgres, python, sqlalchemy and dns with a low ttl.
Back when we first tried out DMS there was a fun bug where booleans were interpreted as strings, so all false records were interpreted as “false” and translated to true on the target DB. It was fixed shortly after but was a good reminder to validate your data during a migration.
On a similar case AWS DMS did not work for my team and we proceeded with pglogical.
Although it has some limitations, it worked pretty well, and personally I like that it is cloud-agnostic. I would definitely recommend considering pglogical first.
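A minimal pglogical setup, for anyone evaluating it (DSNs and names are placeholders; both sides need the extension installed):

    -- on the provider (source)
    SELECT pglogical.create_node(
      node_name := 'provider',
      dsn := 'host=old-db dbname=app'
    );
    SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

    -- on the subscriber (target), after loading the schema
    SELECT pglogical.create_node(
      node_name := 'subscriber',
      dsn := 'host=new-db dbname=app'
    );
    SELECT pglogical.create_subscription(
      subscription_name := 'migration',
      provider_dsn := 'host=old-db dbname=app'
    );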
They don't just send the notifications but store them? Sounds like it might contain PII, as it records passport extension data, etc. Might be minimal, though over 1 billion rows (400GB) sounds like a massive trove to keep around.
What is this weird fantasy that programming looks like this? It's always the same: a couple of people, always of diverse backgrounds, physically interacting, usually pointing and laughing at something, with perhaps another group, representing different demographics, looking on with a mixture of curiosity and awe.
Do an image search for programming to see more examples of this.
I guarantee this work was the result of guys sitting at desks, possibly in the dark, thinking. Are we collectively embarrassed that this is what programming looks like? Or are these just fantasies of management types who have never written a line in their lives?
The primary goal of most large organisations' dev blog is to attract potential candidates to the top of the recruitment funnel.
This picture appears to be (presumably) the actual team doing a team building activity.
This is both a reasonable choice given the inferred instrumental goals of the blog, and an honest reflection of (one part of) the team's life, assuming it's a photo of the real team.
Nah, this is the thing that leads to open offices etc. I want people to understand what programming actually looks like and these silly pictures need to go.
No. But I wish government agencies depended less on other entities. A government agency isn't in a business to make money. They should be concerned less with economic efficiency and more with accomplishing their goals.
Out of all the entities to depend on, AWS is tried and battle tested with US gov.
Additionally, a pandemic has shown that government services actually need to scale to arbitrary amounts, seemingly at random.
Don't you think that in this case using AWS is better, simply to guarantee availability? Imagine taxpayers' money being spent on servers, and then a pandemic happens and, while the economy is screeching to a halt, more taxpayers' money is needed to expand the cluster, money that would really be more useful elsewhere.
> A government agency isn't in a business to make money.
True, but it's not in the business of wasting money either. If they don't take care about economic efficiency, then someday voters are going to judge them for wasting taxpayers' money.
We tried to work with DMS to migrate MySQL to Postgres, and it was a nightmare. Support was useless and would often just not get back to us without prodding, then give us canned responses unrelated to our queries. The whole thing is nigh on un-debuggable.
I worked on migrating our MySQL system to PostgreSQL using pgloader ( https://pgloader.io/ ).
There were some hiccups, things that needed clarification in the documentation, and some additional processes that had to be done outside of the tool to get everything we needed in place, but it was an amazing help. Not sure the project would've been possible without it.
Data mapping from PostgreSQL to PostgreSQL as in the article isn't nearly as bad as going between systems. We took a full extended outage and didn't preload any data. There were many dry runs and validation beforehand, but the system wasn't so mission-critical that we couldn't afford to shut it off for a couple of hours.
We also ended up using pgloader. It's not without its friction either. For example, the concurrency/number-of-rows settings seem broken out of the gate, and it's like playing darts with a blindfold on to get it to run without running out of memory. But between being able to read the source and the GitHub activity, at least I could debug my problems, or find others who had hit the same issue.
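For context, pgloader runs from a small command file (executed with something like `pgloader migrate.load`), and the memory behaviour mentioned above is mostly governed by the WITH knobs. A sketch with made-up connection strings and numbers:

    LOAD DATABASE
         FROM mysql://app:secret@old-host/app_db
         INTO postgresql://app:secret@new-host/app_db
    WITH workers = 4, concurrency = 1,
         prefetch rows = 10000;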
We also tried to use DMS for a few things (Rackspace-to-AWS migration, replication out to data lakes, etc.) and it has been consistently under-supported and buggy, and it ate months of time before we went to other solutions. While a lot of AWS support has been good, not for DMS. It feels entirely half-baked.
+1, DMS is very half-baked. Silent, un-debuggable errors. Tons of unsupported LOB and CLOB data types. Built-in Postgres logical replication is way easier.
This was my experience as well. We thought it was a nice managed way to move pg to pg with minimal setup but we ran into so many issues we just did logical replication with bespoke fixes for things it didn't handle well.
Surprised by that. I've used AWS DMS quite a lot to do both on-prem-to-AWS and AWS (MySQL)-to-AWS-Postgres migrations and long-term ongoing replication. Whilst there are some complexities/gotchas, it's always been more than up to the task. It takes a little bit of validation/testing to understand, but it's very well documented too.
What sort of issues did you hit? In all honesty I'm not sure I've been more impressed by another AWS service.
It seemed to try to load data into the wrong table on Postgres. That's the one that immediately comes to mind. Honestly, poor support is what really killed it for us. But we had other technical problems with it too.
We burned 3 weeks just trying to get support to provide a sensible response. I never got the sense anyone replying to us knew any more than the surface level about the infrastructure of how DMS worked.
We kept our table mappings pretty much mirrors of the source tables. Any data transformation we managed on the target cluster, not through AWS DMS.
I've used it fairly frequently over a number of years so maybe the issues we hit on the learning curve have dimmed.
We also deliberately kept things as simple as possible at first and focused on DMS's ability to move the data from source to target and really only tweaked settings that increased security or reliability. We stayed well away from any of the data transformation or more complex functionality.
One issue we hit was that any schema change totally messed it up. I don't have my notes in front of me, but we were constantly hitting data that wouldn't migrate, or things would suddenly break whenever something changed.
Interesting; we managed several schema changes when using AWS DMS for replicating data over a long period between MySQL and Postgres clusters.
We treated these carefully and tested as we made them but never had any real issues with them. From memory DMS could cope with adding columns pretty transparently. One setting we invested in configuring and understanding was to allow DMS to attempt to recover from replication failures. This allowed it to error on DDL changes and attempt to recover. This usually involved restarting the task, but it would do this transparently as part of the recovery.
AWS has a lot of pre-audited compliance built into their services. Being able to inherit their certification for services can save an organization a lot of time and effort.
It's not an organisation, it's a blucking government, it handles citizen data, and it's sending that data to a company of a foreign country, because it can't hire some system administrators? A GOVERNMENT? What are they doing? Still looking for their product-market fit and can't afford the headcount? Is it a joke?
EDIT: If they are looking for money, I'd like to participate a bit in the seed round.
Sysadmins are cheaper than many people seem to think.
I had a person I trust a lot telling me that "if we go with a bare-metal provider like GCore we'd have to hire someone"; his reason for bringing that up was that the cost difference would be justified by not having to hire someone.
However, a GCore €400,000/y bill becomes roughly €6,000,000/y if you were to use a public cloud, even with scaling up and down when not in use (we are an extreme case of needing a lot of dumb, unreliable compute that's geographically distributed).
I can hire a lot of sysadmins for that money, but I probably don't even need one because public clouds also need devops staff to manage the complexity anyway.
That would indeed be a risk, but the circular logic of this means no new company could ever have any competence outside of its founders. Which feels shortsighted.
Anyway, I am a former sysadmin. I am confident that I can identify competence in the requisite areas.
Governments tend to be far less competent at determining technical competence. Due to a wide variety of factors, governments tend to be completely uncompetitive in salary for technical positions meaning they're already hiring from the lower end of the pool (not including a few altruistic folks willing to forgo their market value).
At a company if a department isn't working out you just restructure and move on, but in the government, that team is going to retire in your org and collect pension from you, and there's very little you can do about that.
yeah, every company I know of that uses cloud has a team responsible for managing it anyway, and they don't seem much smaller than the team needed to manage on-prem. I don't really think this 'savings' exists in most cases.
If you are worried about that, start with your government's use of Microsoft Office and Windows, both of which send MBs of data per minute to a US-based company.
> (...) it's a blucking government, it handles citizen data, and it's sending that data to a company of a foreign country, because it can't hire some system administrators? A GOVERNMENT? What are they doing?
This is a very good question, and bears repeating.
It's not a massive database either. 400GB with 1k inserts/second.
> because it can’t hire some system administrators?
Spoken like someone who has never worked in the public sector. Hiring can easily take 6+ months due to an ever-increasing list of requirements that government HR is required to fulfill, not least of which is passing a security clearance, which takes even more time. The best people on the market rarely have the patience for this. Once your employees do get hired, onboarding can take another few/several/more months: getting various permissions, technical documentation, etc. Everything is out-of-date because making changes requires committee consensus, in a culture that is risk-averse, because nobody notices when you out-perform (after all, the requirements were also set by a committee that doesn't know who you are) but something going wrong is grounds for termination. Public sector work over-relies on hiring contractors precisely to shift blame for failure to the contractors. Managed database services are excellent tools to shift this kind of catastrophic risk of data loss to a contractor/vendor (who is managing the database).
Governments not owning their data isn't due to technical or budgetary limitations - it's strictly cultural.
Fully agree with this. I'd also add that a lot of IT is buy not build, in general. That includes support. Particularly true for the public sector and has been in place well before AWS existed.
Outsourcing the complexity to run and maintain a secure reliable database cluster really is making good use of the managed service model.
AWS has government specific regions (called GovCloud). Many services or features make it to GovCloud later than other regions because of the certification requirements.
AWS has US based GovCloud regions: AWS GovCloud (US-East) and AWS GovCloud (US-West). It does not have a UK specific GovCloud region that I am aware of.
Why can't the UK government build their own cloud?
It's just completely insane to me that they would make the gov internet infrastructure completely (geopolitically) dependent on another country AND just literally give all their (citizens') data away AND pay for that "privilege"?!
I mean if the government can't host the government's websites using tech from the government's country, maybe it would be better to just forget about the whole cyberweb thing altogether? Just turn it off?
I don't think you have any idea just how much it costs to run infrastructure at the reliability levels provided by AWS, and just how much investment it would require to get the ball rolling on this.
A lot of people have a very unrealistic picture of what government budgets are like.
My point is that NOT hosting it yourself (as a government) costs WAY more in the long run. See my points above.
The same goes for companies in Europe who literally host their trade-secrets (designs, sales, the entire company) on US-servers (OneDrive, Google Drive, etc). The US is the home of their competitors. Who cares about infrastructure costs if you're PAYING to give your trade secrets away to your competitor(s)?!
They'd still be outsourcing to a firm to do this. They wouldn't hire a load of people to do it in-house. See also Fujitsu in the recently-popular Horizon scandal, or the NHS National Programme for IT debacle [0].
Why can't the UK government build their own cars? Their own boots? Their own pens, paper? How wasteful and pathetic that they wouldn't make all those things themselves. If it's possible to do it yourself, by golly, you should do it yourself, and there's absolutely no reason in the entire world to purchase those things from someone else instead.
I've always wondered how beholden the world is to Microsoft. I was once surprised to learn the US military (and probably virtually all others) don't have their own OS to avoid being tied to a particular company.
you want every government to build their own cloud? what in the world? the whole world is interlinked, should they also manufacture their own government laptops in the UK?
How would you feel if the US government ran on servers from a European company, one which also works very hard to avoid paying taxes on US soil?
All those reasons to go AWS hold for a private company, not for a government service of a first-world country and G7 member. AWS has a lot of compliant services, but it's not like they're doing rocket science that one of the top 5 richest countries in the world cannot afford to develop or contract within its borders.
The simple reason is that the UK has been on a long trend of selling out to the highest bidder, whether they are US tax avoiding companies, chinese or managed by Russian oligarchs. We have chosen AWS for the same reason Post Office chose Fujitsu.
There's no GovCloud in the UK; unless there are specific terms, the terms of service state that you are licensing either the Irish entity or the American entity to have access to and dominion over your data.
I had to spend a lot of time writing my privacy policy (perks of being CTO... yay), and part of that privacy policy was an admission that we transfer ownership of data to a US company (by using public cloud) despite using European datacenters.
This is because our agreement is with a US entity.
eu-west-2 is a bit misleading: most of it is nowhere near London; they've got DCs right up into the Midlands. One of their newer ones, for example, is out in Didcot, Oxfordshire, and they've also got a few up towards Peterborough. All classed as 'London' despite being a fair distance away from it.
> This is an AWS RDS PostgreSQL database and it lives in the PaaS’ AWS account. Our apps that run in the PaaS talk to this database. We are going to call this database our ‘source database’.
I don't know what it's like in UK but it may be the case that government has a hard time a{ttract,fford}ing talent to administer everything in house. Not that AWS is great for cost saving but if its between paying 50k/year for cloud services and not being able to find an engineer who will competently do the job for less than 50k, then the cloud is your only move really.
They require various clearances (digging into your life and past relationships to a miserable degree), don't allow someone to have ever smoked pot and pay half or less of what you can make in the pvt sector here (usa).
Everyone I know working FedRAMP jobs is prior military/g-level.
They wouldn't need that. And having been SC cleared in the UK, and known a few DV-cleared ones, at least in the UK they don't care if you've smoked pot. They just care that if you have, that you don't mind your family knowing one day. They don't want people who can be blackmailed.
Here it's like this: Don't ever lie to them, "no matter what it is they'll find out."
So, some people don't lie, say they smoked pot in high school and none of them make it to the next step.
I had a Twitter convo last year (or pre-X, whenever) with the CTO of some org I can't remember (I don't think CENTCOM, something much smaller) and he mentioned that they've lightened up quite a bit, or at least his program, which was a software engineering group, was more lenient. He was looking for engineers via Twitter on his official account.
Once you're past the emerging-startup stage, running in the cloud involves as many engineers and as much complexity as running on-prem, if you want to follow best practices.
The "let's be managed and only hire developers" is a huge myth. All large organizations involve tons of "cloud engineers" or "devops" depending on how they want to call them and are just sysadmins with a different name and a bigger paycheck.
Having actual datacenters doesn't add a ton of complexity, and datacenters themselves are often managed by people who don't even earn an engineer's paycheck. The main difference between being on-prem vs cloud is that you have to plan (how many servers/storage/network equipment you have to buy and replace in the following year) and pay for stuff (like space, racks) more in advance, plus take into account delays in delivery. This is where cloud makes the job much faster for companies, but given the slow pace at which gov stuff usually happens, I don't think this is a problem for them.
Remember, it's not just about being able to find one single engineer; then they become a key-person risk. You need multiple engineers to be able to handle the loss of any one of them, either temporarily (vacation) or permanently (suddenly hit by a bus). Then you end up having a team of DBAs. Then you have functional rather than feature teams. Then you need multiple managers to align to get anything done, and you have internal politics.
Being able to consume databases as a product has non-trivial value.
As somebody who has worked for the European Commission and a European national government, I agree with your sentiment, but the harsh reality is that government divisions generally work on a shoestring budget when it comes to decisions like these. I wouldn't be surprised if this was a "best effort given the circumstances" move.
I've worked on a number of UK government projects, including some with particularly sensitive security and data requirements.
Having some knowledge of their on-prem data centres and the UK Cloud offering they have also used, I can say moving to AWS has so many operational, security and resilience benefits that aren't available elsewhere. It's not a free lunch by any means and certainly needs thought and governance, but the procurement simplification benefits alone make going to the public cloud a no-brainer for a lot of government services.
It is worth knowing that even the on-prem data centres are usually operated by 3rd parties such as HP, BT and IBM. There was an initiative to have "Crown-managed" data-centers but it's not particularly scalable.
If you saw how non-tech companies run datacenters, well let's just say they're not exactly working with NATO like the big 3 cloud providers do when designing their DCs and backbone.
Honestly you should be frightened when you see someone NOT using a cloud provider, because it is hard work to properly run and secure a datacenter. Even Equinix fucks up HARD regularly and they are considered the gold standard (shout out to those I saw at 350 E Cermak over the weekend).
Yes. RDS is a very reasonable choice if you are a tech company, let alone a govt org. The alternative isn’t “let’s host this ourselves” it is “let’s host this with Oracle at a much higher cost”.
It isn't? AWS is crazy expensive and you don't have as much control over things as you may occasionally need. The best decision we took in the past few years with regards to infrastructure was moving away from AWS and doing everything ourselves.
On RDS we had inexplicable spikes in cost, deteriorating support and no real support for any of our issues. When we tried using DMS, it just didn't work as expected, even after spending two days on the phone with their support.
The alternative - at government scale - is absolutely 'let's host this ourselves' and that's what they should be doing, to ensure that institutional expertise remains. They should also own and operate their own datacentres which should be physically secure, not shared with commercial ventures and guarded by the armed forces, not civilian security.
Why doesn't the government manufacture their own cars? They're going to lose institutional expertise in building cars! They should also own and operate their own manufacturing facilities which should be physically secure, not shared with some 'civilian commercial venture'.
By golly, the government can't do business if it isn't a datacenter operations company, a software vendor, and a car manufacturer.
'Moving to AWS' (or any cloud provider) is not 'hiring experts'; it's just outsourcing the risk to an entity that you, in the event of a genuine crisis, have no leverage over beyond 'we're going to stop paying you (once we migrate away from you, which will take ten years)'.
AWS are not experts at providing computing services? Holy cow. This is news to me! I thought they were the most popular and highly regarded computing infrastructure and PaaS company in the world, managing both hardware and software and providing subject matter experts to work with customers on architecture and implementation, along with official partners and a marketplace of turn-key products.
Boy, am I embarrassed! I need to start building my own datacenter right away, all my shit is on AWS!!!
"The PaaS team offered us the ability to migrate databases using AWS Database Migration Service (DMS)."
And I wouldn't be surprised if they got some kickback, discount, etc. in some way to promote AWS on their blog. Not claiming it's so, but I would not be surprised at all. It reads as one big advertisement.
It incentivizes public-private cooperation: If the government cracks down on Amazon, Amazon turns off the government's AWS accounts and deletes the data. The government finds that subpoenaing a wiped hard drive is utterly nugatory, and thereby learns humility.
There's an absolute ton of stuff on AWS. There used to be G-Cloud, which allowed smaller clouds to tender for government contracts, but there was a big pull to AWS, at least from my experience with it.
1. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/blue-...