> Once one scales beyond a single node the system becomes distributed. Then ...

benblack · on Oct 22, 2010

I purposefully ignored the tornado and so it did not hit my datacenter, tear off a section of the roof, kill all power sources, and drench my servers. Hard problems: solved.

sophacles · on Oct 22, 2010

Thanks! My app was in that datacenter too. I mean, it had replicated mongodb instances, and well balanced app servers, and nodes going away have no discernable affect on users. Turns out tho, that with all that distributed engineering, I didn't find out that the hosting company doesn't put your nodes in various data-centers. That tornado would have taken down my service during peak hours.

I know you will try to write that off as a "you get what you diserve" but I challenge you to go ask people if their apps would survive a tornado to the data-center. Many of them will say "sure its in the cloud!" Then drop the killer question on them... "How many different data centers are your nodes running on right now". Most will say "i dont know". Some will say "My host has many data centers" (note this doesn't answer the question). A few will actually have done the footwork.

Also, the scenario you describe is as easily mitigated with hot failovers and offsite backups. This probably qualifies as distributed engineering, but only is only the same as the above discussions in the most pedantic senses.

benblack · on Oct 22, 2010

As MongoDB did not exist at the time, it seems unlikely. Such things happen more than we might like, of course!

"Also, the scenario you describe is as easily mitigated with hot failovers and offsite backups."

This is a sadly wrong, though common, belief. There is exactly one way to know that a component in your infrastructure is working: you are using it. There is no such thing as a "hot failover": there are powered on machines you hope will work when you need them. Off-site backups? Definitely a good idea. Ever tested restore of a petabyte?

Here's a simple test. If you believe either of the following are true, with extremely high probability you have never done large-scale operations:

1) There exists a simple solution to a distributed systems problem.

2) "Failover", "standby", etc. infrastructure can be relied upon to work as expected.

Extreme suffering for extended periods burns away any trust one might ever have had in either of those two notions.

sophacles · on Oct 22, 2010

First and foremost, the opposite of simple is not hard, it's complicated. There is not a correlation between the simple-complicated spectrum and easy-hard spectrum.

If you don't regularly press the button swapping your hot failovers and live systems, you don't have hot failovers, you have cargo-cult redundancy. It's like taking backups and not testing them. If you don't test them, you have cargo cult sysadmining.

Distributed is complicated, but not that hard, there are well understood patterns in most cases. Sure, the details are a bit of a pain to get right, but the same is true of alot of programming. I have done distributed systems. Maybe not google scale, but bigger than a lot, hundreds of nodes geographically distributed all over the place, and each of these nodes was itself a distributed system. I've dealt with thundering herds (and other statistical clustering issues), can't-happen bottlenecks, and plenty more. But each and every one of these problems had a solution waiting for me on google. Further, each and every one of these problems was instantly graspable.

A lot of distributed stuff just isn't that hard. Sure, things like MPI and distributed agents/algorithms can get pretty hard, but this isn't the same as multi-node programs, which isn't the same as redundancy.

Keep the smug attitude, I'm sure it helps you convince your clients that you are "really good" or some crap.

gruseom · on Oct 23, 2010

You're talking to the guy who invented EC2. I think he knows what's hard.

http://blog.layer8.net/ec2_origins.html

sophacles · on Oct 23, 2010

Im not intimidated. If anything, he should be more far, far more understanding of the point then: A simple fail-over model for a smallish database is completely different than a full blown EC2. In fact given that EC2 is a freakishly large example of distributed system, I would put it in a totally different class of product than the average "big distributed system".

ieure · on Oct 22, 2010

Cool story, bro.

moonpolysoft · on Oct 22, 2010

Sharding puts you in the same risk category as any other distributed database. Just ask the engineers and ops people at twitter, foursquare, and any number of other web companies that have dealt with sharded databases at scale. Also: sharding is eventually consistent, except you don't get any distributed counter primitives to help figure out which replica is telling the truth.

jhugg · on Oct 22, 2010

There's a difference between what ops people see and what developers see. If the ops people have the same headache no matter what, why force the developers to think about consistency?

benblack · on Oct 22, 2010

Good question! The answer is that the "If..." evaluates to false: ops (and by that I hope you mean something more than facilities staff) has a far simpler set of problems with which to deal when the software is designed, implemented, and tested in accordance with physical constraints. I, too, would like a gold-plated, unicorn pony that can fly faster than light, but in the mean time, I'm writing and using software that produces fewer operational and business headaches. Some of that includes eventually consistent databases.

jhugg · on Oct 22, 2010

I totally concede that most consistent databases at scale are crap on the ops people.

I would echo Mike's argument and claim that a nice ops story and true consistency are not totally orthogonal. There are a lot of practical reasons why the big database vendors have boiled their frogs and ended up where they are today.

VoltDB certainly isn't the realization of that dream, so I'll put my money where my mouth is and spend the next few years helping it get there.