Compute Engine machine types with up to 96 vCPUs and 624GB of memory (googleblog.com)
234 points by ramshanker on Oct 6, 2017 | 167 comments



Why would you use that instead of bare metal? You'll come out at >$4k per month even with sustained use discounts; for that price don't you get the same power in bare metal?

Preemptible is a bit cheaper, but is 600GB of memory really worth it for short-running applications? By the time you've loaded everything into memory, your machine has probably already been destroyed...

EDIT: Not sure about the exact CPU performance, but it should be quite close to what OVH offers here. With the same memory configuration and 2TB NVMe this still costs <$1,500/month (https://www.ovh.co.uk/dedicated_servers/hg/180bhg1.xml).


You can find a cheaper bare metal system with comparable performance for every instance type. That is not an argument specific to high vCPU/memory instances.

If (monthly) price is the main concern "the cloud" is probably not for you. Other people obviously massively value the benefits of it and pay the premium, and that is not suddenly going to stop for a new instance size.


In addition to that, those instances might be deployed for a short time where a lot of power is needed at once, and powered off after that. In that case the cloud might even be cheaper, as Google offers pricing by the minute (or was it second now?) whereas most bare metal providers bill by the month.


Per second, with a one minute minimum [1].

[1] https://cloudplatform.googleblog.com/2017/09/extending-per-s...
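
So billing is roughly this (a quick sketch; the hourly rate below is just illustrative):

    # Per-second billing with a one-minute minimum, as described in [1]
    def instance_cost(hourly_rate, seconds_used, minimum_seconds=60):
        billable = max(seconds_used, minimum_seconds)
        return hourly_rate / 3600 * billable

    print(instance_cost(4.94, 45))    # a 45-second run is still billed as 60s
    print(instance_cost(4.94, 3600))  # a full hour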


"Most bare metal providers bill by the month." Maybe you mean dedicated hosting providers?

Any provider letting you spin up bare-metal through an API almost certainly bills at much finer grain than monthly, although they may quote monthly prices to make it easier for customers to assess cost.


There is overhead for space, reliability, networking, and maintenance for bare metal. If you have an existing datacenter the marginal costs are tiny, but if you are only in the cloud or not planning to need these beasts for years on end, the net costs are much lower to just harness the economies of scale provided by GCP, AWS, et al. As for OVH, the amount you’d save in networking between that box and the rest of your infrastructure by having everything on one cloud provider probably pays for the difference.

Add in flexible network storage options and integration with existing security infrastructure. You are thinking too small by comparing a single box to a single box.


You are thinking too big by talking about having an existing datacenter. Many companies colocate anywhere from 1U to a private room with many racks in a shared datacenter.


And most people that have a tiny part of a shared datacenter still call it their datacenter.


I used GCE to test some image processing software I wrote a while ago (it runs on a very large dataset). I configured a 64 core machine with 128GB of memory. It ran perfectly, although it cost about $200 to run the test for a day.

Sure, it wasn't the highest performance per CPU, but I didn't have to buy the bare metal, I can scale up the number of cores if need be, and I can fire one up whenever I want one.


You do realize that it was not actually a 64 core machine?


I wonder why you were downvoted. A 64 core machine would have 128 vCPUs


Not necessarily. Depends on the CPU architecture and whether hardware threading is enabled.


In the context of GCE they're documented specifically as a single hyperthread


For the price of 2-3 days you should be able to get a dedicated OVH server for a month. The 64 GCP cores are 32 real cores, so monthly rent of $600 should get you there.


For one reason, if you only need the one system, you almost certainly need to compare the cost to owning and maintaining two.

If you can't tolerate the server being down for more than the lead time it takes to get a new one, you need to already have one on standby. The lead time is probably at least a couple weeks, but there's no guarantees since you're depending on vendor availability and hundreds of other things out of your control.

Depending on what you're running on it, you'll also probably want to test software upgrades and have fallback plans when you deploy.

The thing the "cloud" version gets you is zero lead time, along with the ability to spin up a second instance (or ten, if you want) while you deploy a new version or just want to do some testing.


Maybe you need the capacity only for some hours here and there?

If all your data and other processing is in cloud A, moving some processing to B might not be feasible (moving lots of data takes time, security requirements may complicate setup)


One of the biggest data-lockin factors is network I/O costs. These have been kept artificially high by all cloud providers and act both to deter import/export and also to subsidize other functionality.


Yeah, you're not kidding.

I run semi-bandwidth intensive applications and DigitalOcean and LightSail are actually better deals than EC2 for the amount of bandwidth. $5/mo for 1 TB on DO/LS vs $90 for 1 TB on EC2.

We use a mix of dedicated hardware and DO/LS to meet our needs as bandwidth on the major cloud providers was just too expensive.
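
Rough comparison for 1 TB/month (2017-ish list prices, treat the numbers as approximate):

    # Egress cost for 1 TB/month: DigitalOcean/Lightsail bundle vs EC2 internet egress
    EGRESS_GB = 1000
    EC2_PER_GB = 0.09      # EC2 internet egress, first 10 TB tier (approximate)
    DO_DROPLET = 5.00      # $5/mo droplet, 1 TB transfer included

    print(f"EC2 egress:  ${EGRESS_GB * EC2_PER_GB:.2f}")
    print(f"DO droplet:  ${DO_DROPLET:.2f} (transfer bundled)")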


There are providers that offer unlimited bandwidth that HN users have tested: https://news.ycombinator.com/item?id=14247795

> notamy: I have an application on OVH (on the USD $3.50/month plan) that pushes/pulls >10TB/month

That entire discussion is recommended for anyone looking for a cheap VPS.


Yeah, OVH is great, I highly recommend them, we use their cheap dedicated servers via Kimsufi and SoYouStart at a few locations.

I seem to recall a friend having stability issues with their VPSes several years ago, so we stuck to the dedicated stuff from them, but it's been extremely good especially considering the price. Have you had good stability with their VPS services?


Unfortunately all I can do right now is point to the anecdata of others as in the link above and additional pointers within that project[1][2]. If you have time and can share any more details on your experience I would tremendously appreciate it.

I am debating starting a Twitch -> YouTube video stream duplicator/archiver that would initially make money by auctioning available capacity with the long-term goal of being acquired by Twitch since their integration is so unreliable.

[1] OVH 10TB traffic throttling https://github.com/joedicastro/vps-comparison/pull/25

[2] Scaleway truly unlimited https://github.com/joedicastro/vps-comparison/issues/9#issue...


I'm doing a similar (but different) kind of thing going from Twitch to YouTube. At the moment, Google Cloud Platform doesn't charge for bandwidth egress to their own services, so YouTube uploads are free if you use GCP.


Thanks for the specific tip! Link for the lazy: http://gamebot.gg/ A Show HN would probably do well (with a bit of behind-the-scenes in the comment), as would in-depth blog posts if you're doing machine learning.

I actually tried the Hearthstone one and the very first clip in the current example 'Greatest Clips' (BJwDyxrplpo) appeared to miss the actual action (clicking "Disenchant" - which may actually have been the point since it could have been just a tease) but the rest of the clips seemed complete (and interesting).

I've thought about stream-jumping/recording based on simple indicators like increases in chat comments, viewers, followers, etc. How much of this could be built off Twitch's own 'clip' functionality (whether initiating them yourself or aggregating the manual curation of others -- neither of which AFAIK has an API right now) and collecting them later? Separate note I'm trying to hide in this paragraph: don't overfit if you want to apply this tech to other streaming sites where real money is flowing (aka NSFW).

Personally I don't care so much about specific games on Twitch (except Street Fighter, which gets relatively little love, but your videos are a real time-saver) as about personalities. It might be worth offering this service to them, focused solely on collecting their highlights. I had a tough time with the non-English streams, but I'm not sure what options you have there. I'm also interested to see how this will turn out for you using Twitch content if they notice that what you're doing is catching on. Twitch seems to be leaving a lot of low-hanging fruit behind for others to capitalize on.

Feature-wise: more playlists, maybe monthly, and/or collecting the highlights of the highlights based on the most comments/views/thumbs-up on previous YouTube videos. If there's a way to incorporate chat, you should, since most streamers don't include it in their videos. https://github.com/PetterKraabol/Twitch-Chat-Downloader

Bug-wise, it seems like something is going wrong with the links at the end of this video: approx. 30 seconds of moving images but no links in Firefox with ad block [disabled as legacy]. (8ql3id1lJoM, ilkKuvuna10)


Ahh you found it! No machine learning at the moment, just using the Twitch clips api. Machine learning would be very helpful for some problems however, specifically to weed out clips in which the broadcaster specified the incorrect game.

You're right about offering the service to the streamers - that's definitely the way to go to make a business out of it and it's something I've considered. However, I was mostly interested in doing the project for fun, and for some passive income, and making it a service would definitely not be passive.

The clips API returns language information about the clips, which you can use to filter them. Before that, I had to manually maintain a blacklist of non-English streamers.

I do monthly highlight videos, but they're solely based on clip views on Twitch - it doesn't use YouTube analytics, which I'm sure would improve the videos.

It is a cool idea to include chat - another thing I've considered but haven't implemented, though I've noticed some Twitch highlight channels (that do manually edited videos) do it. Thanks for the link to the downloader.

The links at the end of the videos are tricky - there's no api for that, so currently they're populated by a Firefox macro on a desktop that's supposed to be run every day - looks like there's an issue with it running! The better version would be to use a webscraper or headless browser to automate those clicks via the render server. That's what I'm supposed to be working on next, in fact...


YouTube automation seems like a relatively untapped (if niche) market.


These are still 4 to 5 figure prices. Larger companies really don't care about these tiny fees, especially compared to the licensing costs of the software running on these servers. It gets the job done faster and easier, so it's worth it, especially when everything else might already be in Google Cloud.


Here's a server with 512GB RAM, 40 cores for $1,850:

http://www.ebay.com/itm/122593732313


Cool. So, where do you put that server? does it have sufficient power, AC, generators, UPS? How much does that cost per month?

Who monitors that PC, and does preventative maintenance, etc? If a part looks like it's going to fail, where do you migrate your workload to, so you can take that server offline for repairs? (You need multiple servers.)

Since you have multiple servers, how much does your 40Gb networking cost (with 100Gb uplinks) so you aren't constrained by the network? And what kind of storage network do you have, so that you can live-migrate these running machines around to different hosts?

Lastly, if you co-locate the server somewhere, what does it cost for multiple redundant internet connections to the facility? And where is your failover facility, that is at least a few hundred miles away?


I think this is a very important point and one that many folks don't fully appreciate:

The total cost of ownership of a computing asset is several times greater than the cost of the actual asset.

Think of it this way: a dog can be obtained for a very nominal cost (or free) but the cost to house, feed, entertain, and provide healthcare for is non-trivial.

It's not unheard of for just the costs of deploying a new device into a large organization to be something like EIGHT TIMES the cost of the actual asset. That's just to get the hardware deployed and NOT the cost to keep it running.

Cutting down on TCO and streamlining the deployment of resources is a big part of the sell for cloud deployments. Particularly for computing assets that may otherwise spend a lot of their time idle.


Right, and all the bare metal providers like SoftLayer are pro bono orgs?


Softlayer's network is crap compared to google and aws. That being said in my previous company we used a combination of Softlayer (for dedicated machines) and AWS for cloud. There is definitely a use case for each, but as Nrsolis mentions there is additional cost in things beyond the cost of the initial hardware itself.


Not a Softlayer customer but hard to imagine network crappier than AWS.


At least with AWS you get placement groups, which can help a lot. With Softlayer we saw entirely too much packet loss on a regular basis, and they try to upsell you on things to "fix" it.


Default port speed was 100Mbps as of 10 months ago.


It's also a very old CPU generation, and slower memory.

They're great boxes for cheap on-prem etl clusters.

I prefer Dell R810s over those HPs because they're 2U and have better power consumption with the same specs.


Thank you, I was looking for this comment. That is the value of the cloud: the reduction in TCO, the predictable pricing, up-time guarantees, and bandwidth availability.


...ok? As I just said, we (as a company) don't care about a few thousand per month in exchange for letting Google Cloud handle everything for us. We're definitely not interested in buying some used server from ebay and then figuring out where to run it.


How old is that server? I do not see ECC RAM if you care... also are the other pieces of the server going to fail anytime soon?

How loud is it? How much energy does it consume while running? How hard is it to configure and keep running? What kind of firmware does it have and will it be a problem updating?

These are all the questions I would have before buying a beast like that...


People tend to ignore even quite big costs if they are spending the company's money, not their own :)

Also, this machine you link to might cost $1,850 but:

- it is used, not new, so it can break any minute, and you don't have any warranty.

- on GCP, just $2,100 per month buys you a similarly specced machine AND peace of mind.

- running this machine 24/7 poses some significant electricity cost.

EDIT: - $2,100 is probably lower than a salary you would pay an IT technician maintaining your machine(s).


The whole “electricity costs a lot” argument is getting tiresome. Where I live, electricity is 11 cents per KWh. That means even if you run that machine full tilt 24x7 (which you won’t), and even if it draws 1KW (which it won’t), it’s still only $79/mo in electricity cost.
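
For reference, that's just the worst-case arithmetic:

    # Electricity cost under deliberately pessimistic assumptions
    kw_draw = 1.0           # assume the box somehow draws a full kilowatt 24/7
    price_per_kwh = 0.11    # the local rate cited above
    hours_per_month = 24 * 30

    print(f"${kw_draw * price_per_kwh * hours_per_month:.2f}/month")  # ~$79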


> electricity costs a lot

Well, it's not just electricity. To run your server you at least need a network, a UPS, and probably other things. This stuff gets especially ugly when you want a network with many servers; most dedicated server providers charge a ton of money for interconnecting servers. At least OVH actually provides a vRack for dedicated servers, but it's not always free.


Power in a lot of places is 2-3x that, and for a naively designed data centre you can double that to include the cost for air conditioning (yes you can do a lot better but on a small scale, basic AC is going to equal your workload draw).


Data centers aren’t built in such places. And for just a few machines a simple building HVAC will do fine.


Then you would need multiple of them (for redundancy) if you use it in a production environment, plus additional maintenance of hard drives and so on. So what would the end price be for bare bones?


What if you don't have sustained usage? What if you are developing software for scientific data processing, and you usually work on small data sets for testing, and once in a while you have huge computing needs?


>What if you don't have sustained usage?

Well, that's where cloud servers are great.

You'll need to break out Excel and calculate whether it's worth it with regards to usage.


Isn't that what Beowulf clusters are for?

It's never cheaper in the cloud.


No. Scientific workloads rarely scale well when having to do a lot of communication over a commodity network. You're also assuming people would rather have a bunch of machines lying around that they had to pay upfront for than just paying for an occasional single instance? You're oversimplifying things if you literally think it never makes financial sense to use the cloud. It's the same as saying it's never cheaper to have health insurance. Objectively, that has to be true on average, yeah, but then why do so many people buy health insurance? You're managing complex risk at the expense of overhead. Even when it sucks, it's not feasible for everyone to keep enough cash laying around for when they get hit by a bus.


We used a config not quite this big for a Nominatim database rebuild. It takes weeks on an underpowered server, but hours (or a day?) on something with enough resources.

Once rebuilt, using the database is fine on a normal server.


If your data is in Google/AWS cloud, it's expensive to process it outside Google/AWS.


There are quite a few problems where you need a lot of memory and CPU performance for just a relatively short amount of time like a few hours per day or even just a few hours per week. Forecasting or complex optimization problems for example.

In these cases the amount of money you spend on hardware virtual or otherwise is negligible. Depending on what you do it might just as well be a rounding error.


Not only that but due to the usual NUMA mismatch, additional page tables, iommu, poor storage connectivity/sharing, etc between the bare metal and the VM, the VM is likely losing a significant chunk of perf vs the bare metal.

Frankly, I have a hard time understanding why the convenience of being able to call an API to get a VM (vs using an API to get bare metal) continues to be an advantage. I am reminded of the Reddit articles about all the effort they went through to re-optimize their app (by batching queries) for the longer database latencies at AWS... It's like they never considered that all that work might also apply to bare metal and save them even more money...


The problem with this isn't the price for the computing power / memory, the killer are the traffic costs in the cloud that are going to bankrupt you before this thing is even at half utilization.


> >$4k per month with sustainable use, for that price you get the same power in bare metal?

Not so sure about that. Plus, a hoster who provides such a machine as bare metal wants a setup fee, needs time to set it up, and requires a minimum contract duration much longer than one month.

I guess there are not many hosters who have such a beast in stock as bare metal and available in a few minutes (are there any at all?); they will order such a machine themselves and you will wait at least a week.


You could order 16 of these https://www.hetzner.com/dedicated-rootserver/px121-ssd

Minimum contract length: 1 month, total cost (including setup): $4,384 (on-going month to month cost thereafter: ~$2,191.20).

For that you'd get an aggregate total of 4TB RAM, 7.6TB HDD (SSD), 96 real Intel E5-1650 v3 cores (or 192 vCPUs) and 800TB of bandwidth.

Sprinkle with terraform/ansible/k8s/docker and you have a resilient, massively powerful compute cloud with no long term obligation that's about half the price of GCE if you keep it around beyond 30 days. Or another way to look at it: if you needed such a platform for two years, your second year would be free compared to GCE.

One major issue with this approach (versus GCE's "all in one" box) could be network performance bottlenecks depending on what task(s) you were using such a cluster for.
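
Sanity-checking the aggregate (per-server specs and prices are taken from the comment above, not verified):

    # 16x Hetzner PX121-SSD, rough totals
    servers = 16
    ram_gb, real_cores, ssd_gb = 256, 6, 480          # per server (approximate)
    monthly_usd_per_server = 2191.20 / servers        # ongoing month-to-month cost

    print(f"RAM:   {servers * ram_gb / 1024:.1f} TB")
    print(f"Cores: {servers * real_cores} real ({servers * real_cores * 2} vCPUs)")
    print(f"SSD:   {servers * ssd_gb / 1000:.2f} TB")
    print(f"Cost:  ${servers * monthly_usd_per_server:,.2f}/month")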


Bare metal 48 ht cores for $1000/mo: https://www.packet.net/bare-metal/servers/type-2-virtualizat...

So we've established that AWS is far more expensive now, since the TCO is taken into account in both cases.


There's IBM/Softlayer for that. Their hourly bare metal offering goes up to just 256 GB of RAM, though. More than that and you'll have to do a monthly commitment.


> https://www.ovh.co.uk/dedicated_servers/hg/180bhg1.xml

Yeah, the main reason to go with Amazon in this case is if you only need the box for a few hours (ie: you're doing data science or similar). For long term high load use, bare metal, even managed bare metal like that is almost always cheaper.


Everyone jumps to OVH for dedicated server price comparisons, but what about pricing for similar US coastal datacenter locations? Are there even US dedicated server providers with pricing lower than “call us?”

I totally believe that OVH is cheapest for European companies serving European customers, but that’s apples to oranges when we’re talking about American cloud providers.


We host at IBM/SoftLayer and our bare metal pricing comes in at around 33% of these high mem GCE instances.


OVH Montreal is 8ms RTT from NYC.


>"Why would you use that instead of bare metal?"

I would think lead time. To get the quote from your hardware vendor, get the PO approved by finance, get the box built and shipped, and get it racked and stacked in the DC. This whole sequence can easily take longer than a month.


You can rent a bare metal server from Codero and get it provisioned in about an hour. Prices from about $100 to $1000 a month. At the high end, you get roughly what Google is offering here.


I had a big simulation to run that required lots of memory and lots of cores. I rented a machine for some 10 hours that it took and happily paid the 30-40 dollars that were charged.

Those machines have a lot of value for specific workloads.


There's the high-cpu variant with only 86.4 GB of RAM


But that one costs 6.4x the 16 core variant for 6x the cores. Are there any applications where you're heavily dependent on having all cores in one machine?


Anything where the workers have to synchronize their work with each other often. Having everybody talking to each other over the network quickly kills performance.


Yes.

XGBoost will use all the cores you throw at it, and despite the recent work on GPU versions most of the time CPU cores are best.

I'll commonly run a model for 48 hours on an i7. I'd love to be able to try more models.
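
Not my exact setup, just a rough sketch of what "use all the cores" looks like with the xgboost Python API (synthetic stand-in data, illustrative params):

    import os
    import xgboost as xgb
    from sklearn.datasets import make_regression

    # synthetic stand-in dataset; swap in your own DMatrix
    X, y = make_regression(n_samples=100_000, n_features=50, random_state=0)
    dtrain = xgb.DMatrix(X, label=y)

    params = {
        "objective": "reg:squarederror",
        "max_depth": 8,
        "nthread": os.cpu_count(),  # one worker thread per available vCPU
    }
    booster = xgb.train(params, dtrain, num_boost_round=500)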


Interesting that GCP is usually one-upping AWS on some metric, but in this case it isn't touching the current largest compute/memory instance on EC2, the X1 family with up to 128 vCPUs and 4TB of memory. Though the blog does allude to them testing such types in a closed beta, it's still a game of catch-up.


Yikes, I'm a little out of the loop. I didn't realize you could get 4TB of RAM in a single machine on EC2.

I've been seeing that medium data keeps getting bigger (i.e. the features of traditional RDBMSs are eating away at the need for specialized/distributed stores for data analysis). But it appears that small data is getting a lot bigger too: just load that dataset into memory for analysis. 4TB of memory allows for pretty big "small data."

"I remember back when we used to do gradient descent to estimate linear models; back in the long ago when we didn't have 900 exabytes of memory attached to our NVidia Matrix Crusher 9000 linear algebra accelerator unit."


Is it possible that data isn't getting bigger - but that the people who work with it just want to process larger data sets than before?

I mean before they'd train a model of 1,000 inputs and then test it against another 50 and call it a day. Now they want to train it against 1,000,000 inputs.

Am I completely off base? It's not my area, though I work with databases, my observation is that developers always want to use the most data possible even when it doesn't really provide any benefit.


Sorry, I was being a bit playful with language. What I mean is that, if you roughly define small, medium, and large data in terms of the strategies required to process, then the absolute size of the data that can be processed using simpler methods grows.

And whether or not more data is needed or collectible varies by discipline. Astrophysics collects way more data than they used to because 1. they need it. 2. instrumentation allows it.

Some kinds of data collection haven't scaled up, however. Surveying humans is expensive and labor intensive. And for many things that you might want to study about humans, you can't simply affix a sensor to them. So what might have been accomplished only through big data or medium data methods a few years ago can now be loaded into memory (i.e. small data strategies).


That is my experience recently. Developers storing 500GB on a database (pre-launch), with < 1GB of meaningful data. A bunch of json logs that they knew data science would want eventually, but couldn't be bothered to either pare down or put in a more sensible place.

The thing is, it didn't really matter; Postgres still had a ton of performance left over even after the product went live. If you can still fit it in RAM, why waste $$$ of dev time over the $$ cost of a bigger instance.


It’s also unsurprising that this thread isn’t flooded with “disclosure: I work at google” posts.

When GCP announces a feature that’s 1% better than AWS their employees flood the HN post, but when this story hits no one shows up.


Could be useful if you wrote your neural net with Applescript, I guess.


I agree. Once we passed the 640k mark it all became an excuse for lazy tool development.


Or if you use SAP HANA, or Microsoft SQL Server, or R, or Oracle, or anything that you pay "per instance" licensing on.

There are many applications where a single addressable memory space is still required/preferred.

Look into "TidalScale" if you want to see the logical extension of this.


If a VM with 96 cpus is being offered, how many cpus does the bare metal behind the VM have? What is the hardware like here?


I imagine you get to 96 if you stick 4x Xeon CPUs with 24 cores each into one server.

https://www.intel.com/content/www/us/en/products/processors/...


You only need 2 processors, since each core gives you 2 vCPUs. "For the n1 series of machine types, a virtual CPU is implemented as a single hardware hyper-thread" --https://cloud.google.com/compute/docs/machine-types
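
Easy to sanity-check from inside an instance (a quick sketch; needs psutil installed):

    import psutil

    logical = psutil.cpu_count(logical=True)    # what GCE counts as vCPUs
    physical = psutil.cpu_count(logical=False)  # actual cores
    print(f"{logical} logical CPUs, {physical} physical cores")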


so the bare metal might still have 4, such that it could host 2 of these VMs at once I guess.


There aren't quad socket Skylake yet :). You can also see that the largest public single package is 28 cores (56 threads). I'll sadly have to let you figure out how many because silly.

Disclosure: I work on Google Cloud.


Quad-socket 24-core would fit. Skylake has up to 28 cores and 8-socket support.


Likely quad socket 24 core with 768GB of memory to accommodate 2 regular instances or 1 memory + 1 compute instance.


I honestly wonder why Google cited "SAP HANA" as the _reference_ workload for such setups. I'd never heard of that product. Noob asking here: is it the _reference_ workload for such setups? And are there more demanding workloads?


If you can afford SAP you can afford these instances. I think it's a form of signalling to enterprises that the cloud offerings are suitable for them.

HPC people are much more technical and less conservative, in general, and can sort themselves out.


I guess because SAP HANA is an in memory database which requires, well, a lot of memory. Other providers use HANA as a reference as well.


I assume that it's one of very few in-memory databases that are a) in actual production use and b) not already trivially distributed/distributable

I.e. just like Azul Systems' system, it's for running bad enterprise applications that you cannot scale with regular means.


AWS do the same: https://aws.amazon.com/ec2/instance-types/x1/

Interestingly, AWS have X1 and X1E-class EC2 offerings, two of which are significantly larger than this new Google offering.


But only Haswell, where GCP offers Skylake


So -80% memory for 0-5% more compute throughput?


...but curious to see the performance gain on SAP (or similar) workloads with a new CPU architecture. RAM is the bottleneck typically.


I think their latest instances were 4 socket Broadwell.

Disclosure: I work on Google Cloud.


AWS used it as their reference workload for similarly sized boxes too. SAP HANA is a huge deal in certain sectors. I assume there were plenty of clients they couldn't land without support for it.


SAP HANA is an in-memory database. It's probably one of the more recognizable workloads to require significant amounts of ram.


Maybe because SAP HANA requires vertical scaling. For horizontal scaling, I don't see a point in using such big instances.


On Google, the pricing is by CPU/RAM, so 1x 96 vCPU instance = 12x 8 vCPU instances in cost. Often easier to just run fewer, more powerful instances.


Unless you are single process bound, say mysql or postgres instance.


Sure, legacy software is still catching up to multiple cores. Postgres is much better now or you can use SQL Server instead. None of them are native multi-master anyway so scale up is still better than scale out unless you only need reads.


I am not sure it is correct to call Postgres or MySql "legacy software"

To me, legacy software would be something like using Novell PeopleManager 2002 or WordPerfect 7.

Postgres 10.0 is < a week old, and is a perfectly fine path to go for a brand new application. You are fooling yourself if you think you should just throw every new application in MongoDB or Cassandra because they are "web scale".

If you are using "Legacy Software" to mean "software that has existed for a long time", then I guess sure. But there are many pieces of software that are very new, and could benefit from a single instance with a lot of cores.

The most common use of "Legacy Software" is "Old crusty stuff which needs to be replaced", which is NOT at all the case for MySql or Postgres.


It seems you've misunderstood my comments. Perhaps read them again? I'm saying that it's easier to have fewer more powerful servers than many more smaller ones, but even then some software designed for single-node operations still has problems utilizing all the cores (like postgres until recently). I never mentioned mongodb or cassandra.


And that was only for a single query to use multiple cores (OLAP). For OLTP, multi-core is no problem.


If you don't care about inter-node transactions you can use Citus for PostgreSQL scaling.
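
For the curious, distributing a table is roughly this (just a sketch: the connection string and table/column names are made up, but create_distributed_table is Citus's function):

    import psycopg2

    # placeholder coordinator connection; assumes the Citus extension is installed
    conn = psycopg2.connect("dbname=app host=coordinator.example.com")
    with conn, conn.cursor() as cur:
        # shard the table across worker nodes by its distribution column
        cur.execute("SELECT create_distributed_table('events', 'user_id');")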



Yes, Citus is good stuff. We use MemSQL though as it's a better fit for our (data warehouse) needs. There's also TimescaleDB, PipelineDB, CockroachDB, and more, lots of interesting options today for distributed "newsql" databases.


You can scale HANA vertically and horizontally. Vertical makes sense for ERP workloads where you want low latency (avoiding communication). The ERP workloads of most companies can fit in RAM easily nowadays. Horizontal scaling is best for data warehousing/analytical workloads, which tend to be more CPU bound due to lots of calculations.


Horizontal scaling costs more in base licensing, not to mention whatever 3rd party stuff you're running in your HANA. Enough to encourage scaling up as far as you can before scaling out.


> You can scale HANA vertically and horizontally

Nah, with scaling I'm referring more to approaches like e.g. Cassandra's, instead of "add another read replica".


I wasn't talking about replicas either. HANA can distribute data over many clusters.

https://blogs.saphana.com/2014/12/10/sap-hana-scale-scale-ha...

HANA's roots lie in BWA (SAP BW accelerator), whose main game was distributing large amounts of data across commodity clusters.


I would honestly like to hear of another reference workload for machines of this type. I can't find very many uses other than databases for that much memory.

I think you can store every single uncompressed frame of a bluray movie in memory of an AWS X1... but even then, so what?


Did they increase the per-instance network bandwidth caps? Previously instances had 2Gb/s/core of network bandwidth capped at 16Gb/s, which makes 8-core nodes the sweet spot for network bandwidth.

I never quite understood why this is necessary if they're cutting up larger machines. Why should the total network cap matter if I have two 8-core instances or one 16-core instance on the same physical machine?

edit:

https://cloud.google.com/docs/compare/data-centers/networkin...

"The egress traffic from a given VM instance is subject to maximum network egress throughput caps. These caps are dependent on the number of cores that the VM instance has. Each core is subject to a 2 Gbps cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each instance. The actual performance you experience will vary depending on your workload. All caps are meant as maximum possible performance, and not sustained performance."


Pretty simple to understand why: they're "guaranteeing" that bandwidth. The hosts have a finite amount of network bandwidth available, and they also have a finite number of cores.

Judging by the ratio it's likely something such as 2x 56GbE network ports on the host (=112Gbps), which has 56 vCPUs (2x Xeon 14C/28T).


Right. So that's 2Gb/s for every core, so why top out at 16Gb/s? I get the per-core limit, not the instance limit.


I doubt the average MySQL server needs more than 16Gb/s of bandwidth.


... let me show you the queries the developers are writing ^^


So you need to fix the developers not upgrade your network cards :)


I don't even know how to reply to this. MySQL isn't the only cloud workload.


Can you really spin one of these up on demand? That obviously means that they have machines of that size (or greater) sitting idle, waiting for someone to use them. That's mind-boggling.


They can still divide them up into smaller preemptible sizes and destroy those once a request for the big instance comes in. That way utilisation should be quite good. The yield is probably not as high (since hardware costs will be higher than for multiple small machines), but the yield of occasionally renting out the largest variant will be worth it.


Yes, I configured a 64 cpu machine in a matter of minutes, ran a test on it for 24 hours, then shut it down and deleted the instance. Total cost was around $200.


*vCPU.

But yes, they probably strive to keep below 10% vacancy on all hardware.


GCP has live migration. Maybe they make space if it doesn't exist.


Hard to make hardware appear out of nowhere.

They only use live migration to do host maintenance iirc.


There are lots of reasons to shift stuff around, and there’s lots of hardware :) -googler


Idle, or running their own internal computations, or running lower-priority preemptible stuff.


Preemptible VMs exist in part to avoid leaving capacity sitting idle.

We hate idle resources. :)


Intel's page says these 28-core hyperthreaded processors work in 8+ socket configurations. Let's see: 8 sockets * 56 virtual processors per socket = 448 virtual processors potentially in one VM.

https://www.intel.com/content/www/us/en/products/processors/...


Does anyone have experience using this (or other) cloud solutions in a scientific HPC context? The only obvious disadvantage I can think of is that every student running something incurs a certain cost, whereas once one has bought the actual computers, running jobs is essentially free.


Depends heavily on the projected utilization. If you know your compute node is going to be computing for the next 3 years with at least medium utilization, then the self hosted metal is probably going to be quite a bit cheaper.

It's amazing how much hardware you can pack into a single machine for 10k€. Last year our group bought two additional high-memory (768GB) nodes for around that price each (including support for a couple of years from the vendor).

A few years before we bought 40 nodes with 128GB RAM each, for a similar price to last years high-memory nodes (and a fast interconnect and a lot of storage).

If you are at a larger research institution, you probably also have an IT department that can co-locate your hardware for next to nothing (compared to cloud). There you also will save a lot of ingress/egress, storage, backup, etc. costs.

Regarding the per-student costs, even with cloud instances I would consider running a traditional HPC job system (Grid Engine, LSF, Torque, ...). MIT had a nice solution with StarCluster [1] to easily deploy an SGE cluster on AWS. It looks a bit dead now though.

[1] http://star.mit.edu/cluster/


> Depends heavily on the projected utilization. If you know your compute node is going to be computing for the next 3 years with at least medium utilization, then the self hosted metal is probably going to be quite a bit cheaper.

Isn't that already the case for 1 month? Bare metal doesn't mean own data centre or colocation. If you go with a hosting provider most offer dedicated hardware on a monthly contract. As long as you need them longer than 1-2 months that should be significantly cheaper than Google/AWS.


That's usually the case for almost everything on AWS/Google. If you're using them for specific features, or for very bursty work (e.g. if you use the instances less than about 6-8 hours a day), they can be cost effective, but the moment you use instances full time and don't leverage/depend on a ton of extra services, you're paying way above the odds.


Kubernetes helps with this a bit via bin packing. It's much easier to keep 3 32-core servers loaded than 32 4-core servers.


But that's the case whether you're using AWS or self-hosted, so it doesn't really alter that calculation much


The biggest downside (other than cost) I've found is that each vCPU core is quite a bit slower than what you'll get on equivalent real hardware. So any code that doesn't scale more or less linearly across an arbitrary number of cores will suffer.


vCPUs are hyperthreads, and Intel packs 2 of those per core, so in actuality you're getting half the number of full cores.


Isn't the cloud amazing ?


I couldn't see much info about how NUMA is handled


It's not exposed through their virtualization currently. I suspect it has something to do with making live migrations a lot harder.


Maybe someone can rent one for one minute and run lstopo.
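
Or skip installing hwloc and just peek at sysfs (a Linux-only sketch):

    # Dump each NUMA node's CPU list and memory size from /sys
    import glob
    import pathlib

    for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        cpus = pathlib.Path(node, "cpulist").read_text().strip()
        mem_kb = pathlib.Path(node, "meminfo").read_text().splitlines()[0].split()[-2]
        print(f"{pathlib.Path(node).name}: cpus {cpus}, MemTotal {mem_kb} kB")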


With that many vCPUs I guess it becomes important to talk about things like how the memory hierarchy is wired.

I'm surprised no benchmarks are mentioned.


I wonder who will use these types of instances?


Scientific computing and simulation. I have had need for machines like this - for a project that was very compute-heavy but embarrassingly parallel, yet also very "chatty" - i.e. it updated the data structures a lot during compute.

I can't be too specific about it, but it involved creating a very large tree structure and updating, pruning and traversing the tree a lot.

If the algorithm is updating and reading a large data structure a lot it's only practical from a speed point of view to hold the whole structure in RAM


Is this really a better solution than just getting hours on NERSC (or some other government supercomputer)?


Private companies want to do simulations as well, and with this type of solution, you can pretty much run them on demand rather than having to wait in line.


...I guess.

My advisor's company straddles the public / private divide, but we've definitely done some simulations for private clients on NERSC, and I assume we weren't misusing hours allocated for some other purpose.


That's interesting. What tools do you use to run your computations in parallel on a cluster? Hadoop?


For this type of problem - it must be done on a single node. The overhead of network communication would have killed the latency requirement - that's why we need a huge machine like this.

The code was written in C

(We also maintain a Hadoop and Cassandra cluster, and I use Spark for distributed computation - but those are different projects)


How does the problem fare on GPUs?


If it requires synchronizing across workers, poorly.


Depends how often you need to synchronize. Look at deep learning: you need to sync the weights as often as possible.


This is correct


price per second?


Pricing is here [0]. The 96 vCPU, 360GB instance costs $4.9405/hr. If you don't mind if it gets killed, $1.0401/hr (pre-emptible [1]).

[0] https://cloud.google.com/compute/pricing#machinetype

[1] https://cloud.google.com/compute/docs/instances/preemptible
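
Quick math on a 24-hour run at those rates:

    # Cost of a day on the 96 vCPU / 360GB machine type, using the prices from [0]
    on_demand = 4.9405    # $/hr
    preemptible = 1.0401  # $/hr
    hours = 24

    print(f"On-demand:   ${on_demand * hours:,.2f}")
    print(f"Preemptible: ${preemptible * hours:,.2f}")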


The dominant model was price per minute, rounded up. Price per second is better for groups that spin up and spin down instances often.


What kind of computer is this? What motherboard supports this much memory?


Anything that can fit e.g. the Xeon Platinum 8180M (https://ark.intel.com/products/120498/Intel-Xeon-Platinum-81...), which supports up to 1.5TB (!) RAM - link two of these on one board and you get 3TB RAM support.

As for the RAM itself - take 12 DDR4 128GB modules, for example https://www.heise.de/preisvergleich/crucial-lrdimm-kit-128gb... , and fit them into three channels of four modules each to get 1536 GB.


The Super Micro SuperServer 8048B-TR4FT lists that it supports up to 12TB DDR4 ECC RAM (which could have 4x E7-8890 v4 for 96 cores / 192 threads). And the 7088B-TR4FT lists that it supports up to 24TB DDR4 ECC RAM (with a corresponding 192 cores / 384 threads).


24 slots x 64GB per slot is 1.5 TB. HP recently announced a workstation that supports up to 3 TB of RAM


It's virtual, so it's up in the air. What motherboard supports 96 CPUs?


A vCPU is a hyperthread, not a core. 2 Xeon Platinum 8180s (28 cores each) get you 112 vCPUs.


Azure's M128 VM size has 128 vCPUs and 2048GB of memory already.


sure, for $20/hour


This is not a SKU to play with. If $20/hr is indeed the price (I don't know), this is the hourly cost of a couple of waiters. You get to run SAP on someone's infra and someone to support it.


And still no IPv6. In 2017.


Holy moly!


Can we stop this whole vCPU nonsense?


Do people really enjoy not knowing what they are buying? Some providers provide info on what a vCPU is, others don't. Many people think it's an actual CPU core.





