How can it take 3-4 months to get an eCommerce site back online? I assume you could redeploy everything from scratch in less time if you have source code and release assets. With backups and failover sites, I can’t think of any world where this would happen.
It isn't surprising at all. There's a reason why tech companies have insanely large engineering teams even though it feels to an outsider (and inept management) that nobody is doing anything. It takes a lot of manpower and hours to keep a complex system working and up to date. Who validates the backups? Who writes the wikis? Who trains new hires? Who staffs all the on-call rotations? Who organizes disaster recovery drills? Who runs red team exercises? After the company has had repeated layoffs and fired, outsourced or otherwise pushed out all this "overhead" eventually there's no one remaining who actually understands how the system works. One small outage later, this is exactly the situation you end up in.
Sure, but for every efficiently run company, there’s another with 80% of its engineers working on a “new vision” with zero customers, while the revenue-generating software sits idle or attended by one or two developers…
And maybe this is an intentional, rational strategy - why not reinvest profits in R&D? But just because an organization is large does not mean that it’s efficient.
- Everybody in their right mind agreed that, for what they were achieving, Twitter was completely over-staffed. Like most of the big tech companies in this period. And like most of those companies, they went through a leaning-out program with mass layoffs.
- If the service is running fine with only 10% of the staff, it doesn't necessarily mean that the 90% that got fired were useless. I can get a 6yo to heat their food using a microwave. Does it mean that the kid is a genius, or that the people who made the microwave designed it in a way that allows a kid to operate it, even though it's a complex system at its core?
- Comparing Twitter to an international eCom website is disingenuous. If "design Twitter" is a common system design interview question, it's not because the website is popular, it's because the basics are quite simple. Whereas, behind an eCom website, there are dozens of moving parts at any time, with hundreds of interoperability issues. You're not mainly relying on your main DB for your data; most of it is coming from external systems.
"The basics are simple" ... hrm.
I think the concept of how Twitter works is simple; but during their "fail whale" days, they had to reinvent things on the fly to achieve scale and reliability.
Twitter used to have lots of moving parts, and money flowed from various ads and placements, and that was much more complex than I think people appreciate. With their new head twit, they destroyed most of their ad revenue stream and are now hyping paying for APIs and posting privileges, and it shows.
My main irritation is that people say it "works fine", when lots of crap is broken all over, and it now has regular outages. Since the takeover they have shipped like ONE feature, and it was mostly done earlier.
Yep. It takes way fewer people to operate a working system than to build a new one. And the nature of capitalism is that you will pare down your numbers until you have the absolute minimum staffing you need to keep the lights on. Then when everything explodes, you completely lack the know-how to fix it. Then the CEO yells at the tech executive, who responds by demanding hourly updates from the two junior devs who operate the site, and nobody wants to admit that they aren't capable of fixing it, and nobody's gonna OK a really expensive "we're gonna spend a month emergency-building a new thing" plan because a month is obviously way too much time when you need it fixed right now, and then three months go by and here you are.
A friend of a friend told me about an organization that has a steady income from existing products maintained by just enough engineers to keep the lights on, while the other 80% of the organization is building the “new version” that no customer asked for and that nobody is currently paying for. There’s one product that is used by more than 80% of customers that’s maintained by 2 developers and that the CEO isn’t aware even exists.
Ya I've been there. I even tried pitching to management that a small team of us wanted to move to the legacy product and iteratively improve it because it had customers and revenue and we could make an impact while the new product was under development. They said no. I left about 6 months after. 9 years later the legacy product is still running. I can't find any evidence that they launched a new one.
I get the opposite impression. Stale software organisations with steady operating products seem to use massive headcounts, whereas startups building new products often get by with relatively few people.
Startups don't have to run a software stack for decades, or deal with hardware refreshes, SKU updates and replatforms, multiple types of turnover and reorgs, knowledge transfer, etc.
Plus system patching at least monthly, if not daily or even hourly.
I'm sorry but if an enterprise team can't at least get a stopgap ecommerce site up and running in a week, what are you even doing? Literal amateurs can launch a WooCommerce site from nothing in a weekend; two Stanford grads in YC can do a hundred-fold better than that. Yes, a big site is more complicated, maybe there will be some frazzled manual data entry in Excel sheets while your team gets the "real" site back up, but this is total madness.
> Literal amateurs can launch a WooCommerce site from nothing in a weekend
Selling low-volume horseshit out of your garage is in no way comparable to running a major eCommerce site.
> two Stanford grads in YC can do a hundred-fold better than that.
No they literally can't.
> Yes, a big site is more complicated, maybe there will be some frazzled manual data entry in Excel sheets while your team gets the "real" site back up
Great idea, we'll have Chloe in Accounts manage all the orders in a million-row Excel sheet. Only problem might be they come in at 50 orders a minute, but don't worry I hear she's a fast typist.
Your comment suggests that you're not familiar with the diversity in M&S' operation.
Marks and Spencers started as a department store; they still have this operation. They sell clothes, beauty products, cookware, homeware and furniture. All these things are sold in physical shops and online. Most of this is straightforward for an e-commerce operation, but the furniture will involve separate warehousing and delivery systems.
They also offer financial services (bank accounts, credit cards and insurance). These are white labelled products, but they are closely linked to their loyalty programme (the Sparks card).
Finally, they have their food operation: M&S is also a high-end supermarket. You can't do your food shop on the M&S website (although their food products are available from online-only supermarket Ocado), but you can order some food products (sandwich platters and party food) and fresh flowers from the website.
So M&S is a mid-tier department store and a high-end supermarket. These are very different styles of retail operation: supermarkets require a lot of data processing to ensure the right things get to the right shops at the right time to ensure that food doesn't go to waste but also shoppers aren't annoyed by the unavailability of staples like bread and milk.
On top of that, M&S is traditionally fairly strong in customer service; it's not exactly Harrods or Fortnum and Mason, but their bra-fitting service, for example, has a legendary reputation. The internet isn't their natural home.
So all-in-all, you have a business doing complicated things online because they think they have to, not because they want to: a pretty clear recipe for disaster.
How do you know it's safe to redeploy? If your entire operation may be compromised, how can you trust the code hasn't been modified, that some information the attackers have doesn't present a further threat, or that flaws that allowed the attack aren't still present in your services? It's a large company so likely has a mess of microservices and outsourced development where no-one really understands parts of it. Also, if they get compromised again it would be a PR disaster.
They're probably having to audit everything, invest a lot of effort in additional hardening, and re-architect things to try and minimise the impact of any future attack. And all of that via some bureaucratic organisational structure/outsourcing contract.
You literally have some of your team buy new laptops and hang out in a temporary WeWork to set it up on entirely new infra, air-gapped from your ongoing forensic exercise. You just need to make sure none of the people you send are dumb enough to reuse their password. You do need to take the domain name with you, but they will be using one of the high-end domain registrars, so that can be handled.
Bear in mind that this is a company which still sells physically and has retail and warehouse staff. All that the e-commerce side needs to do is issue orders of which SKUs to send to which addresses, and pause items that are out of stock. M&S is not Amazon and doesn't have that many SKUs; 5 people could probably walk round the store in a few days and photograph all of them for the new shopping site.
Sure, customers will need to make a new account or buy as a guest. But this stuff is not hard on the technical side. There is no interaction between customers like a social media site, so horizontal scaling is easy.
Now I get that there are loads of refinements that go into maximising profit, like analytics, price optimization, etc. But to start bringing in revenue these guys don't even need to set up advertising on day one, because they have customers that have been buying from them for decades. The time to set up all that stuff is when your revenue is nonzero.
> M&S is not Amazon and doesn't have that many SKUs, 5 people could probably walk round the store in a few days and photograph all of them for the new shopping site.
I can't speak for M&S, but all the big physical retail brands which started selling online operate exactly like Amazon, with SKUs coming from various third-party entities. The offering is much bigger than what is sold at the physical shops.
I had the impression that M&S wasn't, but if that's the case then yeah, that would invalidate my analysis. Especially if even their retail stock goes through that route when bought online.
HN posters love talking gangster shit when something goes offline but never walked a mile in their boots.
I most recently remember sifting through gloating that 4chan - a shoestring operation with basically no staff - was offline for a couple weeks after getting hacked.
I've worked at a shop that had DR procedures for EVERYTHING. The recovery time for non-critical infra was measured in months. There are only so many hands to go around, and stuff takes time to rebuild. And that's assuming you have procedures on file! Not to mention if there was a major compromise you need to perform forensics to make sure you kick the bad guys out and patch the hole so the same thing doesn't happen again a week after your magical recovery.
And if you don't know, you shut it down till it's deemed safe. How do you know the backups and failover sites aren't tainted? Nothing worse than running an e-commerce site processing customer payment card data when you know you're owned. That's a good way to get in deeper trouble.
I'm not that surprised, though 3-4 months does feel like a long time.
When I was at early Twilio (2011? 2012? ish), we would completely tear down our dev and staging environments every month (quarter? can't remember), and build them back up from scratch. That was everything, including databases (which would get restored from backup during the re-bring-up) and even the deployment infrastructure itself.
At that point we were still pretty small and didn't have a ton of services. Just bringing my product (Twilio Client) back up, plus some of the underlying voice services, took about 24 hours (spread across a few days). And the bits I handled were a) a small part of the whole, and b) some of the easier parts to bring up.
We stopped doing those teardowns sometime later in 2012, or perhaps 2013, because they started taking way too much time away from doing Actual Work. People can't get things done when the staging environment is down for more than a week. Over the following 10 years or so, Twilio's backend exploded in complexity, number of services, and the dependencies between those services.
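For anyone who hasn't seen this done: the exercise only works if the re-bring-up is itself a script rather than tribal knowledge. Here's a minimal sketch of the shape of such a script; every step name and command below is a hypothetical placeholder (standing in for your real provisioning/deploy tooling), not Twilio's actual setup.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("rebuild")

# Ordered steps. Each command is a placeholder ("echo ...") standing in for a
# call to real provisioning/deploy tooling. Steps must be idempotent so a
# failed run can simply be re-executed from the top.
STEPS = [
    ("teardown staging",     ["echo", "destroy the staging stack"]),
    ("provision base infra", ["echo", "create network, instances, load balancers"]),
    ("restore databases",    ["echo", "restore the latest verified backup"]),
    ("deploy services",      ["echo", "deploy the service fleet in dependency order"]),
    ("run smoke tests",      ["echo", "hit health-check endpoints"]),
]

def run(steps):
    for name, cmd in steps:
        log.info("step: %s", name)
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            log.error("step failed: %s\n%s", name, result.stderr)
            sys.exit(1)  # stop; fix the step, then re-run the whole script
        log.info("done: %s", result.stdout.strip())

if __name__ == "__main__":
    run(STEPS)
```

The point isn't the code, it's that the list of steps lives in version control and gets exercised regularly, so it can't silently rot the way a wiki page does.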
I left Twilio in early 2022, and I wouldn't have been surprised if it had taken several months to bring up Twilio (prod) from scratch at that point, though in their case some products and features would have become available earlier than others, so it's not really the same as an e-commerce site. And that was when I left; I'm sure complexity has increased further in the past 3 years.
Also consider that institutional knowledge matters too. I would guess that for all the services running at Twilio, the people who first brought up many (most?) of them are long gone. So I wouldn't be surprised if the people at M&S right now just have no idea how to bring up an e-commerce site like theirs from scratch, and have to learn as they go.
The Co-Op (grocery store chain) was hacked around the same time in likely the same incident. It took three weeks for them to get food back on the shelves at my local store. I don’t understand how that’s even possible… what happened to all the meat and vegetables in the supply chain? They just stopped flowing? They rotted? Why couldn’t they use pen and paper? It’s unbelievable to me that a business would go three weeks without stocking inventory.
You could (and people did) run this in the pre-internet days with basically just phone calls and a desk to receive them. The problem is that by now this represents an incredible increase in manpower required overnight.
And you need a process to follow. You can't just have nearly 4000 supermarkets ringing up HQ at random and reading out lists of 1000 items each. Then what? Back when a supermarket chain did operate like that, the processes were like "fill in form ABC in triplicate, forward two copies to department DEF for batching, then forward one to department GHI for supplier orders, and they produce forms XYZ to send to department JKL for turning into orders for dispatch from warehouses". And so on and so on. You can't just magic up that entire infrastructure and knowledge even if you could get the warm bodies to implement it. Everyone who remembers how to operate a system like that is retired or has forgotten the details, all the forms were destroyed years ago, and even the buildings with the phones and vacuum tubes and mail rooms don't exist.
Of course you could stand up a whole new system like that eventually, but you could also use the time to fix the computers and get back to business probably sooner.
But I imagine during those 3 weeks, there were a lot of phone calls, ad-hoc processes being invented and general chaos to get some minimal level of service limping along.
I agree, although it seems like a failure of imagination that this is so difficult. The staff will have a good understanding of what usually happens and what needs to happen. What they are lacking are some really basic things that are the natural monopoly of "the system".
Perhaps we need fallback systems that can rebuild some of that utility from scratch...
* A communication channel of last resort that can be bootstrapped. Like an emergency RCS messaging number that everyone is given or even a print/mailing service.
* A way to authenticate people getting in touch using photo ID, archived employee data or some kind of web of trust.
* A way to send messages to everyone using the RCS system.
* A way to commission printing, delivery and collection of printed forms.
* A bot that can guide people to enter data into a particular schema.
* An append-only data store that records messages. A filtering and export layer on top of that.
* A way to give people access to an office suite outside of the normal MS/Google subscription.
* A reliable third party wifi/cell service that is detached from your infrastructure.
* A pool of admin people who can run OCR and do data entry.
Basically you onboard people onto an emergency system. And have some basic resources that let people communicate and start spreadsheets.
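To make the data-store item above concrete, here's a minimal sketch of an append-only log with schema-guided entry and a filtering/export layer. All the field names and the file path are invented for illustration.

```python
import json
import time

# The "particular schema" a guiding bot would walk people through.
SCHEMA = {
    "store_id": str,
    "item": str,
    "quantity": int,
}

def validate(record):
    """Return a list of problems the bot could feed back to the sender."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems

def append(record, path="emergency_log.jsonl"):
    """Append-only: records are never edited in place, only added with a timestamp."""
    problems = validate(record)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"ts": time.time(), **record}) + "\n")

def export(path="emergency_log.jsonl", **filters):
    """The filtering/export layer: yield records matching simple equality filters."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if all(record.get(k) == v for k, v in filters.items()):
                yield record

# e.g. append({"store_id": "0042", "item": "milk 1L", "quantity": 120})
#      list(export(store_id="0042"))
```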
Part of the problem with emergency systems is that the emergency takes you from zero to over capacity on whatever system it is, particularly if you are requiring communication from suddenly over-burdened human staff working frantically, and those processes may break down because of that.
> Everyone who remembers how to operate a system like that is retired or has forgotten the details
Anyone who’s experienced the sudden emergence of middle management might feel otherwise :) please don’t teach those people the meaning of “triplicate,” they might try to apply it to next quarter’s Jira workflows…
I remember when I was a teenager working the register at a local store. The power went out one day, and we processed credit cards with a device that imprinted the embossed card number onto a paper for later reconciliation.
That wouldn’t work today for a number of reasons but it was cool to see that kind of backup plan in place.
In the UK the credit / debit cards I've had issued in the last few years have been flat, with details just printed, so that level of manual processing is presumably defunct here.
Don't forget chip & PIN is state of the art novel tech in the US. (From memory I think it was required here in the UK from Valentine's day^ in something like 2005.)
(^I remember the day better than the year because the ad campaign was something like 'I <3 PIN'.)
That is mostly because major US retailers sued Visa/Mastercard to make it unenforceable via lower interchange fees, since they would otherwise have had to change tens of thousands of point-of-sale systems each.
In my case all the perishable shelves were empty - no fruit, no vegetables, no meat, no dairy. I checked every few days for multiple weeks and it wasn’t until three weeks after the incident I was able to buy chicken again.
It’s possible they were ordering some default level of stock and I just didn’t go at the right time to see it, but it sure looked like they were missing the inventory… when I first asked the lady “is the food missing because of the bank holiday?” and she said “no because of the cyber attack” I thought she was joking! It reminded me of the March 2020 shelves.
Interestingly Co-Op is so-called because it’s a cooperative business, which vaguely means it’s owned by its employees, and technically means it’s a “Registered Society” [0].
If you check CompaniesHouse [1], which normally has all financial documents for UK corporations, it points you to a separate “Public Register” for the Co-Op [2].
So, your comment has more basis in reality than simply being snark… the fact that “nobody is incentivized to care” is actually by design. That has some positive benefits but in this case we’re seeing how it breaks down for the same reasons nobody in a crowd calls an ambulance for someone hurt… it’s the bystander effect applied to corporate governance with diluted accountability.
I’m not following your logic. The co-op is designed for everyone to care _more_ because they are part-owners and because the organisation is set up for a larger good than simple profit-making.
In practice the distinction has long been lost both for employees and members (customers), but the intent of the organisational structure was not for nobody to care; quite the opposite
But there are millions of part-owners. Every “member” of co-op (i.e. a customer in the same membership program that just lost all their data to this hack) is an owner of it. Maybe the employees get more “shares” but it’s not at all significant.
And at the executive governance level, there are a few dozen directors.
There is a CEO who makes £750k a year, so it has elements of traditional governance. I’m not saying the structure is entirely to blame for the slow reaction to the hack, or that there is zero accountability, but it’s certainly interesting to see the lack of urgency to restore business continuity.
My family used to own a local market, and as my dad said when I told him this story, “my father would have been on the farm killing the chickens himself if that’s what he had to do to ensure he had inventory to sell his customers.”
You simply won’t get that level of accountability in an organization with thousands of stakeholders. And a traditional for-profit corporation will have the same problems, but it will also have a stock price that starts tanking after half a quarter of empty shelves. The co-op is missing that sort of accountability mechanism.
Exactly, the bystander effect. But it’s not strictly due to the large size. Other big companies get hacked too. But if they have a stock price then there’s an obvious metric to indicate when the CEO needs to be fired. It’s the dilution of responsibility combined with a lack of measurable accountability that causes the dysfunction.
The problem is that cutting IT and similar functions to the bone is really good for CEOs. It juices the profits in the short/mid term, the stock price goes up because investors just see line go up, money goes in, and the CEO gets plaudits. There's only one figure of merit: stock price. What you measure is what you get.
It's only much later that the wheels fall off and it all goes to hell. The hack isn't a result of the CEOs actions this quarter, it's years and years of cumulative stock price optimisation for which the CEO was rewarded.
And you can't even blame all the investors because many will be diluted and mixed through funds and pensions. Is Muriel to blame because her private pension, which everyone told her is good and responsible financial planning, invested in Co-Operative Group on the back of strong growth and "business optimisation initiatives"? Is she supposed to call up Legal and General and say "look I know 2% of my pension is invested in Co-Op Group Ltd and it's doing well, and yes I'm with you guys because you have good returns, but I'm concerned their supermarket division is outsourcing their IT too much, could you please reduce my returns for the next few years and invest in companies that make less money by doing the IT more correctly?"
There is a serious crisis of competence and caring all throughout society and it is indeed frightening. It’s this nagging worry that never goes away, while little cracks keep appearing in the mechanisms we usually take for granted…
Buying and distributing vegetables for stores is not remotely a simple thing. It includes statistical analysis with estimates of demand for every store, seasonal scheduling, weather awareness, complicated national and/or international logistics, plus accounting and payments.
Some or all of those may be broken during a cyberattack.
That’s a good point but perhaps you underestimate the ingenuity borne from constraints.
If you’ve got trucks arriving with meat that’s going to expire in a week, and all your stores have empty shelves, surely there is a system to get that meat into customer mouths before it expires. It could be as simple as asking each store, when they call (which they surely will), how much meat they ordered last week, and sending them the same this week. You could build out more complicated distribution mechanisms, but it should be enough to keep your goods from perishing until you manage to repair your digital crutch.
The suppliers will know and be able to predict what a large customer like M&S is likely to order. They will probably be preparing items before they are even ordered. And surely there must be some kind of understanding of what a typical store will receive.
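The fallback heuristic above ("send them the same as last week") really is spreadsheet-simple. A toy sketch, with invented store IDs and quantities:

```python
# Invented numbers; the point is only that last week's order becomes the default.
last_week_orders = {
    "store_0001": {"chicken": 80, "milk": 200},
    "store_0002": {"chicken": 45, "milk": 120},
}

def fallback_order(store_id, scale=1.0):
    """Repeat last week's order, optionally scaled if demand has obviously shifted."""
    return {item: round(qty * scale)
            for item, qty in last_week_orders.get(store_id, {}).items()}

print(fallback_order("store_0001"))       # same as last week
print(fallback_order("store_0002", 1.2))  # bump 20% ahead of a bank holiday
```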
So you haven’t dealt with ransomware gangs yet? Because they have gotten sophisticated enough to nuke source code repos and backups and replicated copies.
It’s part of the reason tape is literally never going to die for organizations with data that simply cannot be lost, regardless of RTO (recovery time objective).
For this particular audience, it's one of those things that could be rewritten in Rust over a weekend and then deployed on the cheap via Hetzner. At least then it'll be memory safe!
Of course, if you redeployed everything from the source code, you could very well still have the same vulnerabilities that caused the problem in the first place...
There are no backups. There are no failovers. There is no git. There are no orchestration or deployment strategies. Programmers ssh into the server and edit code there. Years and years of patchwork on top of patchwork with closely coupled code.
Such is a taste of what needs to be done if you wish to have a service that takes months to set back up after any disruption.
This is a perfect description of how things work at one of the largest health care networks in the northeast US (speaking as someone who works there and keeps saying "where's the automation? where are the procedures?" and keeps being told to shut up, we don't have TIME for that sort of thing).
lol the healthcare industry was definitely in my mind as I wrote this. Never worked there but I read a lot of postmortems and it shows whenever I use their digital products. Recent example is CVS.
Somehow, at some point, they decided that my CVS pharmacy account should be linked to my Mom's Extracare. Couldn't find any menu to fix it online. So the next time I went to the register I asked to update it. They read the linked phone number. It was mine. Ok, it is fixed, I think. But then the receipt prints out and it is my mom's Extracare card number. So the next time I press harder. I ask them to read me the card number they have linked from their screen. They read my card number. Ok, it is fixed, I think. But then the receipt prints out and the card number is different: it is my mom's. Then I know the system is incredibly fucked. Being an engineer, I think about how this could happen. I'm guessing there are a hundred database fields where the Extracare number is stored, and only one is set to my mom's or something. I poke around the CVS website and find countless different portals made with clearly different frameworks and design practices. Then I know all of CVS's tech looks like this and a disaster is waiting to happen.
Goes like this for a lot of finance as well.
E.g. I can say with confidence that Equifax is still as scuffed as it was back in 2017 when it was hacked. That is a story for another time.
Nobody bothers to keep things clean until it is too late. The features you deliver give promotions, not the potential catastrophes you prevent. Humans have a tendency to be so short sighted, chasing endless earnings beats without anticipating future problems.
Sorry if I phrased it poorly. I wasn’t definitively saying that all these things are the case. But what always is the case is that when an attack takes down an organization for months, it was employing a tremendous number of horrendous practices. My list was supposed to be some of them.
M&S isn’t down for months because of something innocuous like a full security audit. As a public company losing tens of millions of dollars a week, their only priority is to stop the bleed, even if that means a hasty partial restoration. The fact they can’t even do that suggests they did stuff terribly wrong. There’s an infinite amount of things I didn’t list that could also be the case. Like if Amazon gave them proprietary blobs they lost after the attack and Amazon won’t provide again. But no matter what they are, things were wrong beyond belief. That is a given.
To be fair, I would bet that nearly every organization employs a tremendous number of horrendous practices. We only gasp at the ones who get taken down for some reason.
Horrendous practices exist on a spectrum. Every org has bad code that somebody will fix someday™. It is reasonable to expect that after a catastrophic event like this, a full recovery takes some time. But at a "good" org, these practices are isolated. Not every org is entirely held together with masking tape. For the entire thing to be down for so long, the bad practices need to be widespread, seeping into every corner of the product. Ubiquitous.
For instance, when Cloudflare all went down a while ago due to a bad regex, it took less than an hour to roll back the changes. Undoubtedly there were bad practices that led to a regex having the ability to take everything out, but the problem was isolatable, and once addressed, partial service was quickly restored, and shortly after preventative measures were employed. This bug didn't destroy Cloudflare for months.
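For anyone who hasn't seen how a single regex takes out a service: the failure mode is catastrophic backtracking. The snippet below is not Cloudflare's actual rule, just the classic illustrative shape of the problem:

```python
import re
import time

# Nested quantifiers like (a+)+ make the engine try exponentially many ways to
# split the input once the overall match is doomed to fail.
pattern = re.compile(r"^(a+)+$")

for n in (16, 20, 24):
    text = "a" * n + "b"  # the trailing "b" forces the failed match
    start = time.perf_counter()
    pattern.match(text)
    print(n, f"{time.perf_counter() - start:.3f}s")

# Runtime roughly doubles with every extra character; at the length of a real
# HTTP header this pins a CPU core, which is the class of thing that cascaded.
```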
P.S. in anticipation of the "but Cloudflare has SLAs!!": that isn't really a distinction worth making, because M&S has an implicit SLA with their customers; they are losing 40 million each week they can't offer service. There are plenty of non-B2B companies that invest in quick recovery as well, like Netflix with its Chaos Monkey testing.
No, best practice is that you have a checklist to bring up a copy of your system; better yet, that checklist is "run a script". In the cloud age you ought to be able to bring up a copy in a new zone with a repeatable procedure.
Makes a big difference in developer quality of life and improves productivity right away. If you onboard a new dev you give them a checklist and they are up and running that day.
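Even the onboarding half of that can be a script rather than a wiki page. A tiny sketch of a checklist-as-code preflight check; the tool names are just examples, not a specific stack:

```python
import shutil
import sys

# Checklist-as-code: each entry is (command, why a new dev needs it).
CHECKLIST = [
    ("git",       "version control"),
    ("docker",    "runs local copies of the services"),
    ("terraform", "brings up a personal copy of the stack"),
    ("psql",      "talks to the dev database"),
]

# Report anything not found on PATH, and fail so CI or the new hire notices.
missing = [(tool, why) for tool, why in CHECKLIST if shutil.which(tool) is None]

for tool, why in missing:
    print(f"missing: {tool:<10} ({why})")

sys.exit(1 if missing else 0)
```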
I had a coworker who taught me a lot about sysadmining, (social) networking, and vendor management. She told me that you'd better have your backup procedures tested. One time we were doing a software upgrade and I screwed up and dropped the Oracle database for a production system. She had a mirror in place so we had less than a minute of downtime.