We did a similar migration (somewhat larger database) with ~20 seconds of downtime and much less work... using the magic of AWS RDS Blue-Green Deployments [1]. Surprised they aren't mentioned in the thread yet.
Basically, you spin up a new Blue/Green deployment with any desired changes (in our case, we were upgrading Postgres from major version 13 to 15). While your blue configuration continues to serve traffic, AWS uses logical replication to keep the "green" deployment in sync. You can keep modifying (or testing) the "green" deployment (e.g. you could load-test it if you wanted to), as long as you don't do any writes to it (writes still have to go to your live, blue configuration, and are replicated to green).
When you're ready, you run the "switch" command, and AWS does a few things for you: runs checks to ensure blue/green are in sync, stops writes and connections, waits a few seconds to ensure replication is caught up, renames your database, then allows connections/writes again. We had less than 20 seconds of downtime, by our count. And we had a primary and several read replicas, and AWS successfully switched the full configuration over with no hiccups. You don't even need to change your configuration, because AWS swaps the database URLs for you. Green becomes blue, blue becomes old blue, and when you're ready, you delete "old blue".
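For reference, the whole lifecycle is three RDS calls. A minimal AWS CLI sketch (identifiers, ARN and version are made up; check the current flags before relying on this):

    # create the green environment alongside blue
    aws rds create-blue-green-deployment \
        --blue-green-deployment-name pg13-to-pg15 \
        --source arn:aws:rds:eu-west-1:123456789012:db:prod-db \
        --target-engine-version 15.4

    # once green is in sync and tested, switch over
    aws rds switchover-blue-green-deployment \
        --blue-green-deployment-identifier bgd-EXAMPLE123

    # after verifying, remove the deployment object (the old blue
    # instances stick around until you delete them separately)
    aws rds delete-blue-green-deployment \
        --blue-green-deployment-identifier bgd-EXAMPLE123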
Highly recommend! They do have some restrictions (for instance, I'm not sure it would work if you're switching accounts, etc.).
+1 for B/G. That said, I imagine they couldn’t use it due to the cross-account shift. I’ve used it for both MySQL (with much higher QPS than TFA, by two orders of magnitude) and Postgres, and they both went flawlessly.
Read the docs, especially the limitations. Re-read them. Do a test run in the dev environment, under load. Do it again in staging.
Or just YOLO into prod ‘cause it’ll probably be fine, I guess.
I also used RDS Blue/Green deployment to apply a MySQL major engine version upgrade from 5.7 to 8.0. With respect to downtime it worked fantastically, I think we measured 13 seconds of observable downtime from the API.
However we did learn the hard way that RDS Blue/Green cannot be used to apply arbitrary changes. In our case, we discovered RDS Blue/Green can only be used to move up engine versions, not down.
We discovered on MySQL 8.0 one of our stored procedures had very occasional failures, and considered the option of using RDS Blue/Green again to move back down to 5.7. Turns out that's not an option.
Database changes are typically one-way. If your new change includes creating or modifying a table, such that there are new additional columns, and you populate those with data, then downgrading would destroy the changed columns and the data in them. Hence you can't downgrade once you upgrade or you'd potentially be breaking things. To downgrade safely you'd need to backup or snapshot the old database, and then restore your database back to the backup/snapshot, but that's not blue/green.
DB schema migration script frameworks (at least in Python, Ruby & Java lands) do typically support both upgrade and downgrade directions. People skip implementing and testing the downgrade side if the development model doesn't need it but the problem of what happens to the data is controlled by what you put in the "down" migration script.
I'd guess if you can't throw the data away, you won't do a down migration, you'll do an up migration that changes the db to save that data in your preferred way before undoing or reworking the previous schema change.
This is solved more cleanly in declarative schema management systems, where you have a schema repo of CREATE statements, and the tool can auto-generate the correct DDL. You never need to write any migrations at all, up or down. If you need to roll back, you use `git revert` and then auto-generate from there. The history is in Git, and you can fully leverage Git like a proper codebase.
A key component is that the schema management tool must be able to detect and warn/error on destructive changes -- regardless of whether it's a conceptual revert or just a bad change (e.g. altering a column's data type in a lossy way). My declarative tool Skeema [1] has handled this since the first release, among many other safety features.
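For a flavour of the workflow, a sketch (environment name is illustrative; treat this as an outline, not a full guide):

    # repo holds CREATE statements; preview the DDL that would be generated
    skeema diff production

    # apply; destructive DDL is refused unless explicitly allowed
    skeema push production
    skeema push production --allow-unsafe   # opt in to lossy changes

    # "rollback" = revert the commit, then regenerate the DDL
    git revert HEAD
    skeema push production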
That all said, schema changes are mostly orthogonal to database version upgrades, so this whole subthread is a bit different than the issue discussed several levels above :) The root of the blue/green no-rollback-after-upgrade issue discussed above is that MySQL logical replication officially supports older-version-primary -> newer-version-replica, but not vice versa. Across different release series, the replication format can change in ways that the older version replicas do not understand or support.
I have two theories where people end up wanting them:
(1) Circumstances that for some reason enforce the requirement that sysadmin-type people must always be able to downgrade / roll back deployments without "the developers" producing new builds or software artifacts. A separation of ops and dev teams, where you decide you need to survive an inability to make or procure new software builds on demand, and just dig up old deployment artifacts to use after a down migration. There are a lot of wrong reasons to do this in in-house software settings, but there's also the classic "we bought a 3rd-party server app and plugged it into our on-prem database" case, like Jira or something.
(2) Systems that are technically unable to recover from errors happening during db migrations (a missing transactional schema change feature in the db, and/or the application doing db-related stuff that can't be rolled back at deployment time). So the down migration is more like a hand-coded rollback for the migration, run automatically when a deployment fails.
In both cases I can see how the "what happens to data in new columns" situation might still work out. In case (2) it's sort of obvious: there's no new data yet. In case (1) you live with it or choose the backup-restore path. I can see scenarios where you decide it'd be much worse to restore from backup and lose people's entered data for a couple of days (or however long it took to find the showstopper for the upgrade) than to run the down migration and just lose the new-feature-related data. (Which you could also rehearse and test beforehand with backups.)
Our in-house schema migration tool supports downgrading, but it won't remove non-empty tables or columns etc.
For us this isn't a big deal though, because we're writing our software so it should be able to function as expected on a DB with a newer schema. This makes upgrades much easier to handle, as users can run new and old software side-by-side.
I wonder if that could be because MySQL 8's replication is backwards compatible but MySQL 5.7's isn't forwards compatible. If so, it makes sense that you're only able to move forward.
We did the exact same thing not too long ago, but we couldn't do it using blue/green. We were able to minimize downtime quite a bit, but it was on the order of minutes, not seconds. I wrote a little bit about the process here. I spent a lot of time prepping and running the migration, so happy to share any details if it's helpful.
We recently did this on my team, over Christmas this year. We opted not to use Blue/Green for this but instead spun up an encrypted snapshot and began replication from the old database to the new database using MySQL's tools. Once traffic on our platform was sufficiently low, we cut connections to the database, waited for replica lag to reach 0, and relaunched the servers with the new database's host info. Our downtime was around a minute.
This is how we're planning to encrypt our RDS Postgres DBs, but with PG's built-in logical replication.
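For anyone curious, the core of that approach is a publication/subscription pair. A minimal sketch (host and database names are hypothetical; on RDS you also need the rds.logical_replication parameter enabled, and the schema copied to the target first with pg_dump --schema-only):

    -- on the source (unencrypted) instance
    CREATE PUBLICATION enc_migration FOR ALL TABLES;

    -- on the target (encrypted) instance
    CREATE SUBSCRIPTION enc_migration_sub
      CONNECTION 'host=old-db.example.internal dbname=app user=replicator'
      PUBLICATION enc_migration;

    -- watch catch-up before cutting over
    SELECT * FROM pg_stat_subscription;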
I tried it with DMS and it was so cumbersome. We would've needed to make a lot of schema changes to get DMS to function. Specifically, DMS can't replicate a number of character column types because they're treated as CLOB.
Good question. This was a pain point for my small team (me, myself, and I) a little while back. We had an unencrypted db deployed with CDK, and then tried to get it encrypted without losing data.
How did you go about stopping and restarting applications which reach out to the database? We have a number of tasks running in ECS which can take a minute to spin down and a few minutes to spin back up.
For our web service, we didn't stop anything. They had a few seconds of errors, though it seems like some sessions were just buffered or paused and experienced high latency.
We also had background worker services. For the very high throughput ones, we spun down the # of tasks to a bare minimum for <5 minutes and let the queue build up, rather than have a massive amount of errors and retries. For the other ones where throughput wasn't high, we just let them be, and during the downtime they errored and retried and the retries mostly succeeded.
+1 for Route53 Groups and B/G setups. We did something similar with PG upgrades, no downtime with AWS R53 groups & retry inflight queries with a custom Rails ActiveRecord transaction patch.
Trade off: For a few seconds some requests were slower.
DNS Groups w/ retries is a nifty mechanism for these things.
I'm planning to do this soon. Is there any problem skipping versions when upgrading? I would like to go from Postgres 12 to 16. Does Postgres handle that without worries?
There are various ways to 'pause' incoming postgres queries, for example using pgbouncer: i.e. don't fail them, simply delay them until the replication has caught up and then let them continue on the new database.
If anything goes wrong and replication doesn't catch up, you can unpause and let those queries happen on the old database.
Therefore, your 11 seconds of downtime becomes 0 to 11 seconds of added page load time. More importantly, among the thousands of users of the database who have never seen a query fail before, some will have buggy error-handling codepaths or batch jobs that a single failed query ruins, so this approach leads to a lot less collateral damage.
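The PgBouncer side of that is just the admin console's PAUSE/RESUME. A rough sketch (database name is illustrative; repointing the backend means editing pgbouncer.ini between the two commands):

    -- connect to the admin console, e.g.: psql -p 6432 -U pgbouncer pgbouncer
    PAUSE app_db;    -- new queries queue; in-flight ones drain
    -- edit pgbouncer.ini to point app_db at the new primary, then:
    RELOAD;
    RESUME app_db;   -- queued queries continue on the new database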
A lot of the discussion boiled down to 'this is a lot of complexity to avoid a few minutes of downtime'. I guess this is the proof: just use AWS Database Migration Service, swap the DNS entries to go live, and live with 11 seconds of downtime.
There is no "just" about it. The absolute key takeaway is in "what we learned":
> We chose to use DMS because it was well supported by the GOV.UK PaaS and we could also get support from AWS. If we were doing a PostgreSQL to PostgreSQL database migration in the future, we would invest more time in trying alternative tools such as pglogical. DMS potentially added more complexity, and an unfamiliar replication process than what we may have found with other tools. This backs up what AWS say themselves on PostgreSQL to PostgreSQL migrations.
Even AWS's own docs say to use the native tools when migrating from Postgres to Postgres [1]. They don't go into too much detail and point to pg_dump rather than pglogical, but it's interesting to see that they don't recommend using DMS for it.
They do, but those recommendations are buried quite deep in the documentation, well behind all the marketing guff that suggests that DMS is all things to all people, and wonderful magic that is ideal for all situations.
There are a lot of gotchas with using DMS (which seems to use pglogical under the hood). Since it’s not hardware-level replication, you can run into issues with large rows/columns/tables and it doesn’t really handle foreign keys. It may not handle some special data types at all. You also need to update the sequences after the migration or you’ll get errors about duplicate primary keys. You can also have issues if you don’t have proper primary keys, because it doesn’t always copy the entire row at once.
If the databases are within the same AWS account, it’s likely easier to use hardware-level replication with global database or snapshots to do migrations if you’re ok with 4-5 mins of downtime.
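The sequence fix-up mentioned above is a one-liner per serial column on the target after cutover; something like this, with a hypothetical table name:

    SELECT setval(
      pg_get_serial_sequence('notifications', 'id'),
      COALESCE((SELECT MAX(id) FROM notifications), 1)
    );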
There are many options available with PostgreSQL: you could also do a physical full backup + WAL-level replication to keep the target in sync AND get low downtime.
What might have oriented their choice is that they wanted to upgrade from major version 11 to 15 during the migration process. This is only possible using logical replication. Otherwise you'd have to chain the upgrade process through each major version (and possibly the OS, because 11 is EOL on some architectures), and that is neither trivial nor quick.
We recently completed the migration of a self-hosted 3 TB PostgreSQL database from version 12 to 16, transitioning from Ubuntu 18 to Ubuntu 22. Concurrently, we had to upgrade various extensions, most notably Timescale, for which a compatible version did not exist across all scenarios. We performed the upgrade by updating a replica in the following sequence:
- Start: PG12, Ubuntu 18, TS2.9
- Step 1: Set up a read-only replica with PG12 on Ubuntu 22, maintaining TS2.9.
- Step 1.5: Enter maintenance mode and halt all services.
- Step 2: Detach the read-only replica, upgrading from PG12 to PG15 on Ubuntu 22 with TS2.9.
- Step 3: Upgrade from PG15 with TS2.9 to TS2.13 on Ubuntu 22.
- Step 4: Upgrade from PG15 to PG16 on Ubuntu 22 with TS2.13.
- Step 4.5: Reconnect services to the new database server, resume all services, and exit maintenance mode.
All the database upgrade steps were well-tested and automated using Ansible. Nonetheless, we did encounter an issue that had not arisen during testing. This extended our downtime to approximately half an hour, which, for our use case, was perfectly acceptable.
Employing logical replication could have mitigated the last-minute surprise. So we will consider this approach for our next upgrade cycle.
We recently followed an almost identical upgrade path (however at the time PG16 wasn't yet supported by Timescale, so stopped at PG15 + TS 2.12).
We did look into using logical replication to reduce the downtime of the upgrade, but because database schema and DDL commands aren't replicated, it seems it isn't recommended with Timescale in the loop. (I suppose the underlying schema changes that Timescale makes under the hood are mostly a function of your hypertable chunk sizing and what your incoming writes look like, so this could be planned around / timed well, but we felt it added too much potential complexity & risk compared to simply opting for a small maintenance window while pg_upgrade completed.)
For a software engineering project, you probably want to limit your transactions to much less than that (set statement_timeout is your friend). If you've got extremely long transactions, you can probably avoid doing the switch-over when they run (hopefully they are not a random occurrence but a result of a scheduled job or similar).
In combination with a transaction time limit and fail-over configuration (where you fail the old primary), you can control the slowdown (instead of downtime, e.g. with pgbouncer) very precisely.
I would be more concerned with the DNS TTL being respected in the entire stack (and external caching DNS servers you rely on), tbh.
But it is usually not critical to avoid a dozen seconds of downtime for an app, so whatever is simpler for you should be your go-to solution.
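Concretely, the statement_timeout setup mentioned above might look like this (role name and values are illustrative):

    -- cap ordinary application statements
    ALTER ROLE app_user SET statement_timeout = '30s';

    -- relax it only inside a transaction that legitimately runs long
    BEGIN;
    SET LOCAL statement_timeout = '10min';
    -- ... scheduled batch job ...
    COMMIT;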
No, but all the code generation was happening in the server itself, so I don't know how hard and fast it applies; iirc SQL Server's limits are like 64k x packet size.
Well, you've certainly introduced a teaching moment to me! What is the nature of the writes you deal with that last 30+ minutes? What kind of data/what kind of people are involved with such DB writes, where you need to rely on the DB engine to work so hard instead of something more split up by queues at a higher layer?
Postgres queries are streaming. I.e. if the query result has 1 billion rows, but the client only has enough RAM for 1 million rows, then the query will be slowed down until the client is ready to accept more rows.
If the client is slowly processing through the result set, then the query can take many minutes/hours.
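You can see the same client-paced behaviour with an explicit cursor (table name hypothetical); the server roughly only does the work needed to satisfy each FETCH:

    BEGIN;
    DECLARE big_scan CURSOR FOR SELECT * FROM events;
    FETCH FORWARD 100000 FROM big_scan;  -- repeat until it returns no rows
    CLOSE big_scan;
    COMMIT;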
It's usually badly engineered things. However, just because it's badly engineered doesn't mean it's fine for it to break :-P
Things like a super complex n^3-complexity query to find all users who were 3 hops away from a known fraudster, where a 'hop' means 'shared any piece of account data in common'.
> The second change was to create a DNS record in AWS Route53 for `database.notifications.service.gov.uk` with a 1 second TTL (time to live) [..] our migration script just needed to update the DNS weighting in AWS to 100% of results being sent to the target database location and wait 1 second for the TTL to expire. Then, when our apps next try to query our database they will be querying our target database.
Wait. Their (or Python's default?) db ORM blocks while it performs a DNS lookup for each and every query!? It doesn't cache resolved addresses for any length of time? No connections are pooled and reused?
It would probably be the OS' `getaddrinfo` or `gethostname` that does this: Python rarely reimplements system level calls, which means it relies on the system's configuration.
If TTL of 1s was respected, they would be cached for 1s, but it's not uncommon for DNS query libraries and especially caching DNS servers to not fully respect TTL anyway: tbh, that might explain some of the downtime they've seen.
I didn't mean it was directly implementing the networking call to the dns server -- just that it wasn't directly caching the result.
getaddrinfo(3) and getnameinfo(3) (guessing that's what you meant) don't implement caching, at least not explicitly in the spec and not normally in practice. On Windows, DNS results are cached by the OS, but on Linux that would be distro-dependent behavior and usually requires setting up a local caching DNS server (Ubuntu uses unbound out-of-the-box, iirc. Other choices include nscd and dnsmasq).
Even if they implemented caching at the syscall level, this still assumes no connection stays open for more than 1s or is reused across queries. That seems like a big assumption (at least I hope it is, because I certainly wouldn't want my app to initialize a new db connection, let alone perform a DNS lookup, for every query).
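With SQLAlchemy (which this service reportedly uses) you normally do get pooling, and DNS is only consulted when the pool opens a new connection. A sketch of settings that bound how long a stale resolution can survive a cutover (URL and numbers are made up):

    from sqlalchemy import create_engine

    engine = create_engine(
        "postgresql+psycopg2://app:secret@database.example.internal/app",
        pool_size=10,
        pool_recycle=300,    # retire connections after 5 minutes
        pool_pre_ping=True,  # probe before use so failovers become reconnects
    )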
Lovely! We just migrated from postgres 14 to 16 for 3 postgres clusters (servers) on RDS containing about 2TB of data across 8 databases. We were down from 00:00 to 04:00. Steps we took:
* enabled our fallback "maintenance mode" site. It's a super lightweight version of our site running on CF workers.
* scaled down all apps using the db to 0 in terraform
* hit the upgrade button in the aws web ui, which runs pg_upgrade. 14->15
* waited for it to finish
* hit the upgrade button again. 15->16
* waited for the dbs to start accepting connections (they do before they're marked as ready, I think aws does more stuff than pg_upgrade)
* Started `VACUUM ANALYZE; REINDEX DATABASE CONCURRENTLY`. The idea is to avoid performance issues between versions and make use of performance improvements from new versions.
* Started spinning up the apps again
* Waited until all apps had a handful of containers running
* Started accepting traffic (disabled maintenance site)
* Went to bed
The REINDEX CONCURRENTLY happily chugged along for the biggest db for another 18 hours without blocking anything. Next time we're doing aws blue/green deploys to avoid downtime. We didn't this time since we weren't on 14.9 yet (the minimum minor version of 14 supported by blue green).
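For reference, the post-upgrade maintenance amounts to something like this (database name hypothetical); the progress view lets you watch the rebuild from another session:

    VACUUM ANALYZE;
    REINDEX DATABASE CONCURRENTLY app_db;

    -- in another session:
    SELECT relid::regclass, phase, blocks_done, blocks_total
    FROM pg_stat_progress_create_index;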
If I was doing this myself I wouldn't pay the AWS tax, instead do blue/green myself with logical replication and a load balancer.
FWIW GOV.UK Notify is part of a suite of services offered by GDS to UK public sector bodies (along with GOV.UK Pay and GOV.UK PaaS) that was originally known as "Government As A Platform".
Interesting, though I have no idea why the government is using AWS in the first place. This isn't a startup hacking away trying to find PMF, or dealing with unpredictable marketing-driven traffic spikes. We know we need these services running long term, and can make solid predictions about usage patterns.
We could build a public sector cloud and/or adopt a sensible on-prem approach. This requires funding, coordination and technical leadership, but would save the taxpayer an enormous amount over the long term.
Public sector IT is a disaster in general ofc, but I know there are good engineers working there.
As a startup, my reason for using "cloud" (PaaS) is not to catch spikes, but because of focus. Every hour that I spend running around with HDDs, screwdrivers (or the cloud version thereof - storage, ansible, etc) is an hour that I'm not spending on stuff my customers need.
Why would this be any different for a government?
We don't expect our government to build their own cars, but to buy them from Volkswagen or Renault. Even when the government has a clear need for transport. Why do we then insist they build their own IT infrastructure?
So should a government (e.g. the police) produce its own cars, rather then buy them? Or set up a factory to make laptops?
I'm not trying to be pedantic. I'm trying to understand where the line lies (and whether that line is universal for all governments on all layers in all countries).
Your examples do not make sense: on-prem existed long before cloud, and some shops were perfectly fine with it. If the police were producing perfectly fine cars and then went for a car-share program, then I would argue that yes, it may be reasonable for them to consider going back to producing cars.
> This isn't a startup hacking away trying to find PMF, or dealing with unpredictable marketing-driven traffic spikes. We know we need these services running long term, and can make solid predictions about usage patterns.
If you think government needs and demands are predictable, you don't follow politics (particularly uk politics in the last decade).
And then there are these things like pandemics that completely come out of left field. Being able to scale things on demand over the pandemic was one of the key demonstrators for use of the public commercial cloud by the public sector.
Why do we assume scaling can happen only on cloud? Gov sites are usually very low tech. We are not mining crypto. I would guess even if they outright bought servers and just scrapped them later on, they would come out ahead.
One thing about (at least UK) government is that procurement requirements mean they go to market for quotes around usage every few years. If, e.g., Oracle Cloud was 1/10th the price, it would likely win the deal, and so they would have to migrate to Oracle for the duration of the contract, and then potentially do the same to another cloud if that was cheaper.
I'm sure a lot of countries in the EU did this. I know first hand they did this in Croatia, as I was one of the developers who had to use it to deploy on it.
The worst thing I have ever seen in my life. And I worked on a lot of legacy apps written in VB.NET, Web Forms, old SharePoint, Basic, and even one where the whole app was one big mess of stored procedures.
AWS, Azure, GC are at least written with thought for end users (us, developers), while the government cloud was architected, designed and built by the lowest bidder, whose first goal was to cut and cut and cut his costs wherever possible.
I don't know about the UK, but AWS has had GovCloud in the US for a long time, and to be honest, compared to a lot of infrastructure I have seen there, it's a blessing. On the flip side, I've met some really amazing infrastructure and ops people in a German gov healthcare institution running an in-house DC, where the problem was neither the tech nor the people, but 100% the management and their processes, and their desire to be the bottleneck for every single interaction between infrastructure and engineering teams.
If you really think the public sector could build anything closely resembling any cloud, you are dreaming. Imagining one cloud working for the entire public sector, we are entering delusional territory here.
Public sector projects are insanely expensive, take ages to develop and often flat out fail.
Not only that, we are starved for engineers even now. If we ran such a custom cloud, we would shoot ourselves in the foot by not being able to hire from a pool of experienced engineers.
Oh I totally understand that. Public sector IT is littered with failed projects, massive overspends and lasting functional problems. There are numerous examples from the last few years alone - hundreds of millions spent on broken software. I used to work in public affairs and followed a lot of those projects closely.
I don't think this means "government = bad at tech" though. You sometimes see smaller in-house teams do really good work. The biggest issue seems to be with contracting and procurement policy. For example, on the Police Scotland i6 program they hired a bunch of consultancies to write a tender document, and then hired CGI for the final project. That turned out to be a copy of the Spanish system, which turned into a huge and expensive disaster as the requirements differed.
Feels like government has a similar problem to a lot of legacy non-tech companies. They don't have much technical leadership, don't understand the problem, and decide to hand it off to the lowest bidder. Doesn't help that they are often legally obliged to act this way. But the underlying engineering problems aren't unsolvable, and don't need to become a huge mess every time. (Your point about recruitment is fair though)
If you think the public sector could improve its operational efficiency by building, operating and using their own cloud, I've got a bridge to sell you.
Let's ignore the technicalities completely. Here's a government providing a detailed look into their engineering practices, complete with graphs and configuration snippets. How many governments manage to publish such a thing?
Minor detail, but did anyone else notice they are using eu-west-1? Kind of weird for the UK to be hosting their sites in another country (Ireland). I'm sure this isn't super sensitive data, but still.
They've been on AWS since before eu-west-2 was added, and it's still not at the scale of eu-west-1. Capacity issues are still a thing (but getting much better), and only recently has it approached feature parity.
I think one of the bits of carry-over legislation in the Withdrawal Agreement after Brexit was the UK continuing to comply with GDPR. So using an Irish DC is still ok.
Very clear and concise blog! Loved reading it. I'd be very curious to see how Amazon DMS performs at scale. Scale includes any of these situations:
a) Larger dataset - 2+TB
b) Higher throughputs - WAL Spikes (at least 2-3K TPS)
c) Reliably and efficiently replicating TOAST Columns (ex: large JSONBs).
d) Advanced data types - BLOB, CLOB, HSTORE, ARRAYs etc.
In my previous role with the Azure Postgres team, we observed that Azure DMS worked pretty seamlessly in migrations that did not involve the above situations. Once any of the above came in, the replication experience was hairy. Interestingly, in my current gig at PeerDB, many use cases we are seeing have one or more of the above situations. We have been doing quite a few surgical optimizations to handle those situations, and they keep coming! That is the beauty of Postgres, enabling a myriad of workloads, each unique in their own way! :)
I've also used DMS to migrate smaller datasets (~200GB) from AWS RDS MySQL 5.7 to RDS PostgreSQL 11. 10 seconds of downtime here was actually incurred not due to the migration itself, but to enable MySQL binary logging on the RDS instance, which requires a restart, and which AWS DMS uses to replicate changes from the replication source (MySQL) to the target (PostgreSQL).
Traffic was steered to the new PostgreSQL instance not with DNS/Route 53 weighted records, but an application-level percentage enroller (based on user ID). Our particular set-up did in fact have apps talking to both databases simultaneously, each with a percentage share of traffic, and so we did not need to incur additional downtime to cut off traffic to the original source database - though now you do have to worry about eventual consistency.
I wouldn't recommend using their Schema Conversion Tool. We instead migrated the data as 1-to-1 as possible into the PostgreSQL target table, and then used plpgsql triggers on the replication target table to clean/validate the data and write it to another table with our desired (and more strongly typed - no more storing JSON as TEXT) schema.
There were also some issues with the replication task seeming to stall and stop during the change data capture (CDC) phase. As mentioned, support is kind of spotty in this area, though we learned it may have to do with sizing the DMS replication instance correctly.
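The trigger approach mentioned above looks roughly like this (all names hypothetical): DMS writes 1-to-1 into a staging table, and the trigger validates/re-types each row into the real schema:

    CREATE FUNCTION clean_notification() RETURNS trigger AS $$
    BEGIN
      INSERT INTO notifications (id, payload)
      VALUES (NEW.id, NEW.payload_text::jsonb);  -- e.g. TEXT becomes JSONB
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER clean_on_insert
    AFTER INSERT ON notifications_staging
    FOR EACH ROW EXECUTE FUNCTION clean_notification();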
I love that where the rubber meets the road it doesn’t matter if you’re a government organisation or a web agency, it’s still just plain old Postgres, python, sqlalchemy and dns with a low ttl.
Back when we first tried out DMS there was a fun bug where booleans were interpreted as strings, so all false records were interpreted as “false” and translated to true on the target DB. It was fixed shortly after but was a good reminder to validate your data during a migration.
On a similar case AWS DMS did not work for my team and we proceeded with pglogical.
Although it has some limitations, it worked pretty well, and personally I like that it is cloud-agnostic. I would definitely recommend considering pglogical first.
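A minimal pglogical setup, for anyone evaluating it (DSNs and names are placeholders; both sides need the extension installed):

    -- on the provider (source)
    SELECT pglogical.create_node(
      node_name := 'provider',
      dsn := 'host=old-db dbname=app'
    );
    SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

    -- on the subscriber (target), after loading the schema
    SELECT pglogical.create_node(
      node_name := 'subscriber',
      dsn := 'host=new-db dbname=app'
    );
    SELECT pglogical.create_subscription(
      subscription_name := 'migration',
      provider_dsn := 'host=old-db dbname=app'
    );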
They don't just send the notifications but store them? Sounds like it might contain PII, as it records passport extension data, etc. Might be minimal, though over 1 billion rows (400GB) sounds like a massive trove to keep around.
What is this weird fantasy that programming looks like this? It's always the same: a couple of people, always of diverse backgrounds, physically interacting, usually pointing and laughing at something, with perhaps another group, representing different demographics, looking on with a mixture of curiosity and awe.
Do an image search for programming to see more examples of this.
I guarantee this work was the result of guys sitting at desks, possibly in the dark, thinking. Are we collectively embarrassed that this is what programming looks like? Or are these just fantasies of management types who have never written a line in their lives?
The primary goal of most large organisations' dev blog is to attract potential candidates to the top of the recruitment funnel.
This picture appears to be (presumably) the actual team doing a team building activity.
This is both a reasonable choice given the inferred instrumental goals of the blog, and an honest reflection of (one part of) the team's life, assuming it's a photo of the real team.
Nah, this is the thing that leads to open offices etc. I want people to understand what programming actually looks like and these silly pictures need to go.
No. But I wish government agencies depended less on other entities. A government agency isn't in a business to make money. They should be concerned less with economic efficiency and more with accomplishing their goals.
Out of all the entities to depend on, AWS is tried and battle tested with US gov.
Additionally, a pandemic has shown that government services actually need to scale to arbitrary amounts, seemingly at random.
Don't you think that in this case using AWS is better, simply to guarantee availability? Imagine taxpayers' money being spent on servers, and then a pandemic happens and, while the economy is screeching to a halt, more taxpayers' money is needed to expand the cluster, money that would really be more useful elsewhere.
> A government agency isn't in a business to make money.
True, but it's not in the business of wasting money either. If they don't take care about economic efficiency, then someday voters are going to judge them for wasting taxpayers' money.
We tried to work with DMS to migrate MySQL to Postgres, and it was a nightmare. Support was useless and would often just not get back to us without prodding, then give us canned responses unrelated to our queries. The whole thing is nigh on un-debuggable.
I worked on migrating our MySQL system to PostgreSQL using pgloader ( https://pgloader.io/ ).
There were some hiccups, things that needed clarification in the documentation, and some additional processes that had to be done outside of the tool to get everything we needed in place, but it was an amazing help. Not sure the project would've been possible without it.
Data mapping from PostgreSQL to PostgreSQL as in the article isn't nearly as bad as going between systems. We took a full extended outage and didn't preload any data. There were many dry runs and validation beforehand, but the system wasn't so mission-critical that we couldn't afford to shut it off for a couple of hours.
We also ended up using pgloader. It's not without its friction either. For example, the concurrency/number-of-rows settings seem broken out of the gate, and it's like playing darts with a blindfold on to get it to run without running out of memory. But between being able to read the source and the GitHub activity, at least I could debug my problems, or find others who had hit the same issue.
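For context, pgloader runs from a small command file (executed with something like `pgloader migrate.load`), and the memory behaviour mentioned above is mostly governed by the WITH knobs. A sketch with made-up connection strings and numbers:

    LOAD DATABASE
         FROM mysql://app:secret@old-host/app_db
         INTO postgresql://app:secret@new-host/app_db
    WITH workers = 4, concurrency = 1,
         prefetch rows = 10000;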
We also tried to use DMS for a few things (Rackspace-to-AWS migration, replication out to data lakes, etc.) and it has been consistently under-supported and buggy, and it ate months of time before we went to other solutions. While a lot of AWS support has been good, not for DMS. It feels entirely half-baked.
+1, DMS is very half-baked. Silent, un-debuggable errors. Tons of unsupported LOB and CLOB data types. Built-in Postgres logical replication is way easier.
This was my experience as well. We thought it was a nice managed way to move pg to pg with minimal setup but we ran into so many issues we just did logical replication with bespoke fixes for things it didn't handle well.
Surprised by that. I've used AWS DMS quite a lot to do both on-prem-to-AWS and AWS (MySQL)-to-AWS-Postgres migrations and long-term ongoing replication. Whilst there are some complexities/gotchas, it's always been more than up to the task. It takes a little bit of validation/testing to understand, but it's very well documented too.
What sort of issues did you hit? In all honesty I'm not sure I've been more impressed by another AWS service.
It seemed to try to load data into the wrong table on Postgres. That's the one that immediately comes to mind. Honestly, poor support is what really killed it for us. But we had other technical problems with it too.
We burned 3 weeks just trying to get support to provide a sensible response. I never got the sense anyone replying to us knew any more than the surface level about the infrastructure of how DMS worked.
We kept our table mappings pretty much mirrors of the source tables. Any data transformation we managed on the target cluster, not through AWS DMS.
I've used it fairly frequently over a number of years so maybe the issues we hit on the learning curve have dimmed.
We also deliberately kept things as simple as possible at first and focused on DMS's ability to move the data from source to target and really only tweaked settings that increased security or reliability. We stayed well away from any of the data transformation or more complex functionality.
One issue we hit was that any schema change totally messed it up. I don't have my notes in front of me, but we were constantly hitting data that wouldn't migrate, or things would suddenly break whenever something changed.
Interesting; we managed several schema changes when using AWS DMS for replicating data over a long period between MySQL and Postgres clusters.
We treated these carefully and tested as we made them but never had any real issues with them. From memory DMS could cope with adding columns pretty transparently. One setting we invested in configuring and understanding was to allow DMS to attempt to recover from replication failures. This allowed it to error on DDL changes and attempt to recover. This usually involved restarting the task, but it would do this transparently as part of the recovery.
AWS has a lot of pre-audited compliance built into their services. Being able to inherit their certification for services can save an organization a lot of time and effort.
It's not an organisation, it's a blucking government, it handles citizen data, and it's sending that data to a company of a foreign country, because it can't hire some system administrators? A GOVERNMENT? What are they doing? Still looking for their product-market fit and can't afford the headcount? Is it a joke?
EDIT: If they are looking for money, I'd like to participate a bit in the seed round.
Sysadmins are cheaper than many people seem to think.
I had a person I trust a lot telling me that "if we go with a bare-metal provider like GCore we'd have to hire someone"; his reason for bringing that up was that the cost difference would be justified by not having to hire someone.
However, a GCore €400,000/y bill becomes roughly €6,000,000/y if you were to use a public cloud, even with scaling up and down when not in use (we are an extreme case of needing a lot of dumb, unreliable compute that's geographically distributed).
I can hire a lot of sysadmins for that money, but I probably don't even need one because public clouds also need devops staff to manage the complexity anyway.
That would indeed be a risk, but the circular logic of this means no new company could ever have any competence outside of its founders. Which feels shortsighted.
Anyway, I am a former sysadmin. I am confident that I can identify competence in the requisite areas.
Governments tend to be far less competent at determining technical competence. Due to a wide variety of factors, governments tend to be completely uncompetitive in salary for technical positions meaning they're already hiring from the lower end of the pool (not including a few altruistic folks willing to forgo their market value).
At a company if a department isn't working out you just restructure and move on, but in the government, that team is going to retire in your org and collect pension from you, and there's very little you can do about that.
yeah, every company I know of that uses cloud has a team responsible for managing it anyway, and they don't seem much smaller than the team needed to manage on-prem. I don't really think this 'savings' exists in most cases.
If you are worried about that, start with your government's use of Microsoft Office and Windows, both of which send MBs of data per minute to a US-based company.
> (...) it's a blucking government, it handles citizen data, and it's sending that data to a company of a foreign country, because it can't hire some system administrators? A GOVERNMENT? What are they doing?
This is a very good question, and bears repeating.
It's not a massive database either. 400GB with 1k inserts/second.
> because it can’t hire some system administrators?
Spoken like someone who has never worked in the public sector. Hiring can easily take 6+ months due to an ever-increasing list of requirements that government HR is required to fulfill, not least of which is passing a security clearance, which takes even more time. The best people on the market rarely have the patience for this. Once your employees do get hired, onboarding can take another few/several/more months: getting various permissions, technical documentation, etc. Everything is out-of-date because making changes requires committee consensus, in a culture that is risk-averse, because nobody notices when you out-perform (after all, the requirements were also set by a committee that doesn't know who you are) but something going wrong is grounds for termination. Public sector work over-relies on hiring contractors precisely to shift blame for failure to the contractors. Managed database services are excellent tools to shift this kind of catastrophic risk of data loss to a contractor/vendor (who is managing the database).
Governments not owning their data isn't due to technical or budgetary limitations - it's strictly cultural.
Fully agree with this. I'd also add that a lot of IT is buy not build, in general. That includes support. Particularly true for the public sector and has been in place well before AWS existed.
Outsourcing the complexity to run and maintain a secure reliable database cluster really is making good use of the managed service model.
AWS has government specific regions (called GovCloud). Many services or features make it to GovCloud later than other regions because of the certification requirements.
AWS has US based GovCloud regions: AWS GovCloud (US-East) and AWS GovCloud (US-West). It does not have a UK specific GovCloud region that I am aware of.
Why can't the UK government build their own cloud?
It's just completely insane to me that they would make the gov internet infrastructure completely (geopolitically) dependent on another country AND just literally give all their (citizens') data away AND pay for that "privilege"?!
I mean if the government can't host the government's websites using tech from the government's country, maybe it would be better to just forget about the whole cyberweb thing altogether? Just turn it off?
I don't think you have any idea just how much it costs to run infrastructure at the reliability levels provided by AWS, and just how much investment it would require to get the ball rolling on this.
A lot of people have a very unrealistic picture of what government budgets are like.
My point is that NOT hosting it yourself (as a government) costs WAY more in the long run. See my points above.
The same goes for companies in Europe who literally host their trade-secrets (designs, sales, the entire company) on US-servers (OneDrive, Google Drive, etc). The US is the home of their competitors. Who cares about infrastructure costs if you're PAYING to give your trade secrets away to your competitor(s)?!
They'd still be outsourcing to a firm to do this. They wouldn't hire a load of people to do it in-house. See also Fujitsu in the recently-popular Horizon scandal, or the NHS National Programme for IT debacle [0].
Why can't the UK government build their own cars? Their own boots? Their own pens, paper? How wasteful and pathetic that they wouldn't make all those things themselves. If it's possible to do it yourself, by golly, you should do it yourself, and there's absolutely no reason in the entire world to purchase those things from someone else instead.
I've always wondered how beholden the world is to Microsoft. I was once surprised to learn the US military (and probably virtually all others) don't have their own OS to avoid being tied to a particular company.
you want every government to build their own cloud? what in the world? the whole world is interlinked, should they also manufacture their own government laptops in the UK?
How would you feel if the US government ran on servers from a European company, one which also works very hard to avoid paying taxes on US soil?
All those reasons to go AWS hold for a private company, not for a government service of a first-world country and G7 member. AWS has a lot of compliant services, but it's not like they're doing rocket science that one of the top 5 richest countries in the world cannot afford to develop or contract within its borders.
The simple reason is that the UK has been on a long trend of selling out to the highest bidder, whether they are US tax avoiding companies, chinese or managed by Russian oligarchs. We have chosen AWS for the same reason Post Office chose Fujitsu.
There's no GovCloud in the UK; unless there are specific terms, the terms of service state that you are licensing either the Irish entity or the American entity to have access to and dominion over your data.
I had to spend a lot of time writing my privacy policy (perks of being CTO... yay), and part of that privacy policy was an admission that we transfer ownership of data to a US company (by using public cloud) despite using European datacenters.
This is because our agreement is with a US entity.
eu-west-2 is a bit misleading: most of it is nowhere near London; they've got DCs right up into the Midlands. One of their newer ones, for example, is out in Didcot, Oxfordshire, and they've also got a few up towards Peterborough. All classed as 'London' despite being a fair distance away from it.
> This is an AWS RDS PostgreSQL database and it lives in the PaaS’ AWS account. Our apps that run in the PaaS talk to this database. We are going to call this database our ‘source database’.
I don't know what it's like in UK but it may be the case that government has a hard time a{ttract,fford}ing talent to administer everything in house. Not that AWS is great for cost saving but if its between paying 50k/year for cloud services and not being able to find an engineer who will competently do the job for less than 50k, then the cloud is your only move really.
They require various clearances (digging into your life and past relationships to a miserable degree), don't allow someone to have ever smoked pot and pay half or less of what you can make in the pvt sector here (usa).
Everyone I know working FedRAMP jobs is prior military/g-level.
They wouldn't need that. And having been SC cleared in the UK, and known a few DV-cleared ones, at least in the UK they don't care if you've smoked pot. They just care that if you have, that you don't mind your family knowing one day. They don't want people who can be blackmailed.
Here it's like this: Don't ever lie to them, "no matter what it is they'll find out."
So, some people don't lie, say they smoked pot in high school and none of them make it to the next step.
I had a Twitter convo last year (or pre-X, whenever) with the CTO of some org I can't remember (I don't think CENTCOM, something much smaller) and he mentioned that they've lightened up quite a bit, or at least his program, which was a software engineering group, was more lenient. He was looking for engineers via Twitter on his official account.
Once you're past the emerging-startup stage, running in the cloud involves as many engineers and as much complexity as running on-prem, if you want to follow best practices.
The "let's be managed and only hire developers" is a huge myth. All large organizations involve tons of "cloud engineers" or "devops" depending on how they want to call them and are just sysadmins with a different name and a bigger paycheck.
Having actual datacenters doesn't add a ton of complexity, and datacenters themselves are often managed by people who don't even earn an engineer's paycheck. The main difference between being on-prem vs cloud is that you have to plan (how many servers/storage/network equipment you have to buy and replace in the following year) and pay for stuff (like space, racks) more in advance, plus take into account delays in delivery. This is where cloud makes the job much faster for companies, but given the slow pace at which gov stuff usually happens, I don't think this is a problem for them.
Remember, it's not just about being able to find one single engineer; then they become a key-person risk. You need multiple engineers to be able to handle the loss of any one of them, either temporarily (vacation) or permanently (suddenly hit by a bus). Then you end up having a team of DBAs. Then you have functional rather than feature teams. Then you need multiple managers to align to get anything done, and you have internal politics.
Being able to consume databases as a product has non-trivial value.
As somebody who has worked for the European Commission and a European national government, I agree with your sentiment, but the harsh reality is that government divisions generally work on a shoestring budget when it comes to decisions like these. I wouldn't be surprised if this was a "best effort given the circumstances" move.
I've worked on a number of UK government projects, including some with particularly sensitive security and data requirements.
Having some knowledge of their on-prem data centres and the UK Cloud offering they have also used, I can say moving to AWS has so many operational, security and resilience benefits that aren't available elsewhere. It's not a free lunch by any means and certainly needs thought and governance, but the procurement simplification benefits alone make going to the public cloud a no-brainer for a lot of government services.
It is worth knowing that even the on-prem data centres are usually operated by 3rd parties such as HP, BT and IBM. There was an initiative to have "Crown-managed" data-centers but it's not particularly scalable.
If you saw how non-tech companies run datacenters, well let's just say they're not exactly working with NATO like the big 3 cloud providers do when designing their DCs and backbone.
Honestly you should be frightened when you see someone NOT using a cloud provider, because it is hard work to properly run and secure a datacenter. Even Equinix fucks up HARD regularly and they are considered the gold standard (shout out to those I saw at 350 E Cermak over the weekend).
Yes. RDS is a very reasonable choice if you are a tech company, let alone a govt org. The alternative isn’t “let’s host this ourselves” it is “let’s host this with Oracle at a much higher cost”.
It isn't? AWS is crazy expensive and you don't have as much control over things as you may occasionally need. The best decision we took in the past few years with regards to infrastructure was moving away from AWS and doing everything ourselves.
On RDS we had inexplicable spikes in cost, deteriorating support and no real support for any of our issues. When we tried using DMS, it just didn't work as expected, even after spending two days on the phone with their support.
The alternative - at government scale - is absolutely 'let's host this ourselves' and that's what they should be doing, to ensure that institutional expertise remains. They should also own and operate their own datacentres which should be physically secure, not shared with commercial ventures and guarded by the armed forces, not civilian security.
Why doesn't the government manufacture their own cars? They're going to lose institutional expertise in building cars! They should also own and operate their own manufacturing facilities which should be physically secure, not shared with some 'civilian commercial venture'.
By golly, the government can't do business if it isn't a datacenter operations company, a software vendor, and a car manufacturer.
'Moving to AWS' (or any cloud provider) is not 'hiring experts'; it's just outsourcing the risk to an entity that you, in the event of a genuine crisis, have no leverage over beyond 'we're going to stop paying you (once we migrate away from you, which will take ten years)'.
AWS are not experts at providing computing services? Holy cow. This is news to me! I thought they were the most popular and highly regarded computing infrastructure and PaaS company in the world, managing both hardware and software and providing subject matter experts to work with customers on architecture and implementation, along with official partners and a marketplace of turn-key products.
Boy, am I embarrassed! I need to start building my own datacenter right away, all my shit is on AWS!!!
"The PaaS team offered us the ability to migrate databases using AWS Database Migration Service (DMS)."
And I wouldn't be surprised if they got some kickback, discount, etc. in some way to promote AWS on their blog. Not claiming it's so, but I would not be surprised at all. It reads as one big advertisement.
It incentivizes public-private cooperation: If the government cracks down on Amazon, Amazon turns off the government's AWS accounts and deletes the data. The government finds that subpoenaing a wiped hard drive is utterly nugatory, and thereby learns humility.
There's an absolute ton of stuff on AWS. There used to be G-Cloud, which allowed smaller clouds to tender for government contracts, but there was a big pull to AWS, at least from my experience with it.
1. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/blue-...