
"Amazon EC2, Microsoft Azure and Google Gloud Platform are all seriously screwing their customers over when it comes to bandwidth charges."

Disagree. There's no false advertising here; you're paying for the service and convenience of a combined platform (PaaS, IaaS, SaaS, etc.). It's unfair to view these services as a single function; in production you typically touch MANY features/products. The cost includes the convenience of having everything under one roof, because, face it, doing everything yourself at the SLAs the giants provide is no trivial task.

Unless you're a BIG company that likes to distract itself with infrastructure instead of building and sharpening its core offerings, chances are you will NEVER build anything as reliable, interoperable, configurable, and manageable at that cost.



This is true, but I think there's another point that's even more significant: not all network access is the same. In particular, putting some machines in a rack in a colo and connecting to the nearest Internet exchange is really not comparable to Google's networking, which is essentially a worldwide second Internet that is faster and more reliable than the normal Internet, and which carries your packets as far as possible towards their destination before dropping them off at a POP.

(Note: I worked for Google until 3 days ago.)


Couldn't agree more. I balked at the cost (and still sort of do, as a GCP user) until I joined Google and was able to peek behind the curtain. The network quality is absurd. GCP customers pay a premium for having the same network reach as Google. Their network works extremely hard to get user traffic onto the internal network as early as possible.

Not only that, but Google's datacenters have 24/7 attention, lots of redundant providers, etc. A bargain-basement colo won't be nearly as reliable.


> Their network works extremely hard to get user traffic onto the internal network as early as possible.

The benefit of this can't be overstated. In SE Asia, connecting to their Taiwan region from most places, I'm literally at the mercy of just a few local hops at level 3, thanks to their extensive remote peering. It's also interesting to observe how traffic stays on the Google network when connecting between Google Compute data centers.

AWS and Azure behave considerably differently.


Yes, you're right, and you mentioned something else that I forgot to. A lot of people here are comparing Big 3 network costs to the cost of hiring a single network administrator. The correct comparison is to the cost of hiring a team big enough to staff a 24-hour on-call rotation with a 5-minute response time SLA.


...and how many people need that? I would gladly pay 1/3 of the cost for four nines instead of five nines (or whatever SLA they offer).

Not everyone needs a ritzy, ultra-reliable network. It costs way too much, and hardly any customer is going to get comparable value out of it.


Their business customers probably really do care. Companies run off of AWS and GCP. I'm sure that Spotify, Snapchat, and Netflix get considerable value from those guarantees.


I agree! This is a real issue for cloud providers currently. You shouldn't have to pay for the super-reliable fast network if you don't want it.


And people don't have to! People can self-host instead :-)


I disagree strongly with this. Before all these cloud providers, websites were not noticeably less reliable than they are now. I was around back when Apache and CGI were all there was; even then, uptime was so good that it was rare to hit a website that was down.

There's a lot of Kool-Aid being thrown around by the companies with cloud to sell. Unfortunately, these also happen to be the big "market leaders", so it's hard to dispute what they're saying and still get taken seriously.

It's like Six Sigma, agile, or stack ranking. The big guys are doing it, so it must be the right thing to do... right? Until everyone realizes it's mostly a ploy to make money selling books and conference tickets, or in this case renting out a bunch of excess capacity at huge profit.

I disagree on reliability as well. Most of my bare-metal and colocated machines have uptimes of many years. Most AWS VMs die after a year or two. With the cloud you have to worry a lot more about fault tolerance, whereas with dedicated equipment, simple offline backups are often enough to meet any reasonable SLA.


> Before all these cloud providers, websites were not noticeably less reliable than they are now.

Before all of these providers existed:

1) The volume of internet traffic that exists today didn't exist then. Mobile devices didn't exist. Mobile devices that are always connected to the internet and consume hours of our day didn't exist. Large downloads (1GB) didn't exist. They couldn't exist, because only the infrastructure that exists now can properly support them... at scale.

2) Websites that had large volumes of traffic had top-tier, expensive admins to maintain them (surprise, surprise: Amazon did, and turned it into a service).

> Most of my bare-metal and colocated machines have uptimes of many years. Most AWS VMs die after a year or two.

3) What's your definition of "most"? Sources for those numbers, please?


As we've gotten more devices, computers have gotten more powerful. Server software has gotten better. HTTP has keepalive and multiplexing now. Encryption and networking are offloaded to hardware, and we have a lot more cores.

I would guess a single server can handle thousands of times as many users as it could years ago. HAProxy, Netty, Nginx, and others can handle over a million (simple) HTTP requests per second. That's more requests than Google.com gets.

Most as in: I've been watching over 100 AWS VMs and maybe 30 on Azure for years, and they die or crash far more often than the VMs hosted here, at the colo, or on our old bare-metal machines. It's anecdotal, but it seems like AWS doesn't really care about warning you before shutting off your machine. Azure is slightly better but still goes down regularly.

I know everyone says "it's okay! Just make your servers fault tolerant!". Well, that works great for load balancers and frontends, but it doesn't work at all for SQL databases. ACID-compliant transactions require a single source of truth, and a true multi-master SQL database is impossible. Failover, yes, but you always risk losing data in the switchover unless you use two-phase commit, which actually makes your multi-master database slower than a single system. In practice the failover almost always causes some data loss and log conflicts you have to diddle with later. And God help you if the replica falls more than a couple of seconds behind.

Anyway, for SQL databases system reliability is as essential as ever, and it's a lot easier to get high SLA numbers when you control the hardware and the power switch. The closest you can get to the Holy Grail is running KVM VMs locally and doing live migrations when hardware starts to fail, but even that won't keep your database running if something really bad happens.
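
To make the two-phase commit overhead concrete, here's a minimal sketch using psycopg2's TPC API. The db1/db2 hosts and the accounts table are made up, and both Postgres servers would need max_prepared_transactions > 0; the point is that every node pays an extra prepare round trip and disk flush before anything commits:

    import psycopg2

    # Hypothetical DSNs for two masters (illustration only).
    conns = [psycopg2.connect("dbname=app host=db1"),
             psycopg2.connect("dbname=app host=db2")]
    xids = [c.xid(0, "txn-42", "node-%d" % i) for i, c in enumerate(conns)]

    try:
        # Phase 1: every node must prepare (an extra round trip and
        # disk flush per node) -- latency a single primary never pays.
        for c, x in zip(conns, xids):
            c.tpc_begin(x)
            c.cursor().execute(
                "UPDATE accounts SET balance = balance - 10 WHERE id = %s",
                (1,))
            c.tpc_prepare()
        # Phase 2: commit only once every node has prepared.
        for c in conns:
            c.tpc_commit()
    except Exception:
        for c in conns:
            try:
                c.tpc_rollback()
            except psycopg2.ProgrammingError:
                pass  # this node never reached tpc_begin/tpc_prepare
        raise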


Thanks for the info, very useful; I didn't realise how unreliable AWS is for database servers. You're correct, random unannounced shutdowns of DB servers are just not acceptable for critical data. I will be launching a new business based on Postgres soon, and the thought of this is terrifying. I'm not keen on RDS-type services or colo, so this is an unexpected problem I need to overcome. Do you know whether a VPS provider such as Digital Ocean or CloudSigma would be more reliable?


I would say colocation is the most reliable. I'm sure a dedicated VPS is better, but most providers still reserve the right to pull the plug for hardware replacements. Colo isn't terribly expensive if you buy used equipment.

Really consider how important 100% uptime is, though. Google and S3 have gone down multiple times without killing the internet or losing a ton of customers. Plenty of large SaaS providers still use maintenance windows. Heck, GitHub went down today. Not sure if you use ADP, but that goes down for a couple of days a week!

I know it's not the popular thing to do, but you can get much better relative reliability by running a single database per tenant and a limited number of tenants per VM.
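
As a sketch of what that can look like (the tenant names and hosts below are hypothetical), the application just routes each tenant to its own database, so one noisy or corrupted tenant can't take the rest down:

    import psycopg2

    # Hypothetical tenant -> DSN map: each tenant has a dedicated database,
    # and each VM hosts only a handful of tenants.
    TENANT_DSNS = {
        "acme":    "dbname=acme    host=db-vm-1",
        "globex":  "dbname=globex  host=db-vm-1",
        "initech": "dbname=initech host=db-vm-2",
    }

    def connect_for_tenant(tenant):
        """Open a connection to the tenant's dedicated database."""
        return psycopg2.connect(TENANT_DSNS[tenant])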


Yeah, well, my BBS runs just fine over 2400 baud and 640KB of RAM. I'll see your ass in the Pit.


Call waiting will take you down every time! :)


I think you're wearing rose-colored glasses here if you're implying that Apache/CGI sites had some sort of incredible uptime record back when "the Slashdot effect" was still a thing.

> Most of my bare-metal and colocated machines have uptimes of many years.

Nobody cares about host uptime; that just means you're not applying security updates. You're lamenting that you have to worry about fault tolerance in the cloud when the entire point is to have throwaway instances!


> Nobody cares about host uptime; that just means you're not applying security updates.

Um... Only kernel updates used to require a reboot, and production security-critical ones are fairly rare.

Solutions like ksplice, kexec, kpatch, or kGraft have existed since about 2011, and the Linux kernel now has first-class support for "no reboot" updates (livepatch).
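
On kernels built with CONFIG_LIVEPATCH you can see any loaded live patches under sysfs. A quick check (this is the standard sysfs location, nothing vendor-specific):

    import os

    # Kernels with livepatch support expose one directory per loaded patch.
    LP_DIR = "/sys/kernel/livepatch"

    if os.path.isdir(LP_DIR):
        patches = os.listdir(LP_DIR)
        print("live patches loaded:", patches or "none")
    else:
        print("kernel has no livepatch support (or none enabled)")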


You're comparing the simplest possible scenario with a platform that can replace a large data center.

How do you handle segmentation of workload among thousands of servers? Provide firewall services? Meter bandwidth? Provide redundancy to protect against failure of a switch or any other single network component? Provide redundant transit via multiple ISPs?

The answer, historically, is that you would buy a bunch of stuff from Cisco, pay through the nose for TAC, and hire a team of network engineers who may or may not be idiots. One place I worked lived with a 30-day ETA on firewall changes.

Your server with a long uptime is a risk in a large org. It's obviously not patched, and nobody knows how to configure it from scratch again. Different customers have different needs.


You don't need anything but regular PCs these days: two front-end IPVS servers doing MAC-layer direct return. Those can handle close to 40 gigabits each. If that's not fast enough, hack something together with DPDK or netmap.

The firewall runs on the front-end load balancers. Linux is one of the best firewalls you can get if you configure it right.

Redundant L2 switches past that, running RIPv2 or OSPF for routing. I've found that crappy consumer-grade switches usually work fine, as long as you wire them in parallel. You can do dumb things like wiring them quad-redundant and it just works.

Redundant transit is handled by datacenter-level multihoming.

Extra redundancy using dual data centers, if you want, with DNS round robin (sketched below).

Buy dedicated lines to avoid bandwidth costs.

You can rent a cage and do all this with used last-gen Dell workstations running Ubuntu. Get some last-gen fibre NICs off eBay too. Total cost of maybe $2k in hardware to run top-500-site levels of traffic... as long as you're not using something piss-slow like PHP.
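
For the DNS round-robin piece above: you publish multiple A records, one per datacenter front end, and resolvers rotate the order they hand back. A quick way to see what clients will get (the hostname is a placeholder):

    import socket

    # Placeholder hostname; substitute the round-robin record you publish.
    # Each address is a different datacenter's front-end VIP.
    for family, _, _, _, sockaddr in socket.getaddrinfo(
            "www.example.com", 80, type=socket.SOCK_STREAM):
        print(sockaddr[0])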

Historically, how shitty your setup was depended on how good your network guys were; the cloud just took that out of the equation.

Now everybody can have a great setup, as long as they pay through the nose.

Edit: Also, you can patch Linux without reboots, even the kernel. I don't remember the last time I had to reboot after patching.

You do need to reboot for major version upgrades, but if you stick with LTS releases you're usually good for 5-10 years.


> Disagree. There's no false advertising here

I don't see any claims about false advertising. Just complaints about one aspect of the service being very overpriced. Sounds like you consider it worth it for your application, and that's great. But it is simply not priced appropriately for a lot of services, which naturally go elsewhere.

It seems obvious to me that bandwidth is Amazon's proxy measure for "utility". Lots of companies do this: Oracle uses core count as a proxy for the "utility" you get from their software, for example. But like any proxy for a different variable, it is going to be wrong; sometimes a little, sometimes a lot.

And you hardly need to be a "BIG" company to choose something other than cloud hosting. My last two employers looked at and rejected them. Neither qualifies as even medium-sized. Both are extremely cost-aware. Running our own data centers is cost-competitive at my current gig and was much, much cheaper at the last one, which was very bandwidth-heavy.


If every cloud service arguably overprices one essential part of delivery (bandwidth, in this case), my instinct (no more, no less, and no pretension to analysis) is that there's a design intent: make sure some difficult-to-price component of the service gets paid for.

I'm merely speculating, but does anyone have an idea what that difficult-to-price component might be? Any suggestions welcome.

Edit, to suggest a parallel: my job is to help people understand the price of (print) advertising. This is often intractable. To a certain extent, a good number of the people involved just throw costs into commission rates and other factors, which makes the pricing models hard to understand and sometimes completely opaque. Is it impossible to imagine that the cost of cloud bandwidth is nothing less innocent than "put our non-itemized expenses here"?


This is simply untrue for bandwidth charges if you're pushing a lot of data.

A streaming video service, for example, might end up paying $0.10/user/hour. So a movie-a-night customer would cost about $6/month. What multiple of reasonable do you think that is?
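
For reference, the $6 falls out of straightforward arithmetic; the bitrate and per-GB price below are ballpark assumptions, not quoted rates:

    # Assumed numbers: a ~2.5 Mbps stream and ~$0.09/GB egress
    # (ballpark public-cloud list price), not quoted rates.
    bitrate_mbps = 2.5
    gb_per_hour = bitrate_mbps / 8 * 3600 / 1000       # ~1.1 GB/hour
    cost_per_hour = gb_per_hour * 0.09                 # ~$0.10/user/hour

    movie_hours, nights = 2, 30
    print(round(cost_per_hour * movie_hours * nights, 2))  # ~$6/month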


About 2-3x. I'm a big fan of cloud computing in general (and AWS in particular, as the clear leader, IMO) and helped lead our transition out of colos and into AWS, but egress charges are the biggest area of AWS that I feel is egregiously overpriced.

Fortunately, it's relatively easy to locate some of your very high-egress services outside of AWS. I discourage that as casual optimization, but when you get to Netflix/20 scale, it's worth the trouble.


Speaking of Netflix:

“The best way to express it is that everything you see on Netflix up until the play button is on AWS, the actual video is delivered through our CDN,” Netflix spokesperson Joris Evers said.[1]

[1] http://www.networkworld.com/article/3037428/cloud-computing/...


A good case study is Dropbox. They didn't build any of their own infrastructure until they became big enough (2 years ago).


The facts here aren't quite right. (I'm a Dropbox employee.)

Primarily: until 2 years ago we did lean on S3 for all block storage, but most of the rest of the infrastructure (metadata storage, etc.) ran in our own datacenters.

The point I think you're getting at sounds like something I'd agree with, though -- you can wait until cost efficiency starts to be what's important/impactful to work on before shifting your usage away from some of these providers.


I heard a story once that Dropbox started to move their data out of S3 and AWS rate-limited them so they couldn't.

I don't know if it's true or not, but I heard the story.


Dropbox built their own infrastructure over five years ago. They just took several years to go from the infra sitting idle to it actually being used.


Would a streaming customer at scale really pay that much?


This is definitely true, but I think that as the big cloud players devote more and more of their revenue to developing all kinds of (vendor lock-in) APIs, none of which concern the average user, rather than to the generic and reliable infrastructure that benefits all users, the value proposition may get worse and worse.


This comes off as a bit of a red herring. No one is denying the value of their other offerings; it's the cost of bandwidth that is clearly disproportionate.

Let them charge extra for the "value adds"; what is the justification for loading these costs onto bandwidth specifically?

I think people raising the issue are just seeking more transparent pricing.


Considering my experience with Google's AdWords service, I wouldn't be surprised if the teams providing these services were under heavy pressure to over-promise, under-deliver, and sneak in as many hidden charges as they possibly could.



