I come here for the opposite: I've already picked my disks (HGST), and I read the Backblaze report to raise my fist and yell "Y U still use Seagate!?!"
Because Seagate is cheaper. If a drive dies 5% sooner and costs 10% less, you buy the cheaper drive.
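A quick back-of-the-envelope version of that tradeoff (all prices and lifetimes below are made-up illustrative numbers, not figures for any actual drive):

    # Naive dollars-per-drive-year comparison, ignoring repair labor and data-loss risk.
    price_a, years_a = 250.00, 5.00   # pricier drive, assumed 5-year service life
    price_b, years_b = 225.00, 4.75   # 10% cheaper, dies 5% sooner

    print(price_a / years_a)   # 50.00 dollars per drive-year
    print(price_b / years_b)   # ~47.37 dollars per drive-year -> cheaper drive still wins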
For personal drives in small quantities, you're basically operating on luck. You can get a bad HGST drive just like you can get a bad Seagate drive. I've had drives from HGST, WD, and Seagate and all of them have lasted far beyond the warranty expiration.
When I look into drives for small-scale personal use (NAS, desktop, etc), I check recent reviews about warranty and customer service, then buy based on price and features. I also buy drives based on the use case (NAS drives for NAS, desktop drives for desktop, etc).
(Clarify: I agree with what you're saying, but...)
You need to add "cost to repair" into your equation. At my old job we used to figure $300 to replace a drive (scheduling a maintenance window with the customer, taking a backup, migrating/quiescing services, replacing the drive, verifying services were happy, wiping or destroying the old drive, managing the RMA, qualifying/burning in the new drive).
YMMV depending on policies and procedures. If I were Backblaze, I'd design the system to work equally well with 40 drives as with 45, and consider just leaving bad drives in until a handful of them need replacing in a chassis. And the replacement would be basically automatic.
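As a sketch of what that does to the math, here is the same kind of per-year comparison with a per-incident repair cost folded in; the $300 is the figure above, while the prices and failure rates are invented for illustration:

    # Expected yearly cost per drive slot: amortized purchase price plus
    # (annual failure rate x cost of one repair incident).
    def yearly_cost(price, service_years, afr, repair_cost=300.0):
        return price / service_years + afr * repair_cost

    print(yearly_cost(250.00, 5.00, afr=0.02))   # 56.0  -> pricier, more reliable drive
    print(yearly_cost(225.00, 4.75, afr=0.05))   # ~62.4 -> the repair cost flips the answer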
The failure rates in this version of the report are much closer than they've been in the past. Seems like Seagate might be making some improvements. IIRC, it was not uncommon for Seagates to have a 10x higher failure rate than HGST in the past.
> You need to add "cost to repair" into your equation. At my old job we used to figure $300 to replace a drive
This is TOTALLY true, and Backblaze does add in the cost to repair. At our scale, we have full time datacenter technicians that are onsite 7 days a week for 12 hours a day (overlapping shifts) so that we can replace drives and repair equipment within a certain number of hours.
Every single day our monitoring systems kick out a list of about 10 drives that need to be replaced, and the data center techs try to make sure there are zero failed drives when their shift ends. (We leave failed drives alone for up to 12 hours at night as long as it is just one failed drive in a redundant group of 20.)
But if you are operating with FEWER than our 110,656 drives, then you can't have full-time repair people standing around while you pay their salaries. So you are really going to need to pay much more (per drive) than we do.
One of the absolutely amazing things about "cloud services" like Amazon S3 or Backblaze B2 is that (hopefully) we can actually operate it at a lower cost and higher durability than you can operate yourself. That may not be true at either end of the spectrum - if you have fewer than 10 drives it may be cheaper for you to operate it yourself, and if you have more than 1,200 drives (the deployment size of one of our vaults) you might want to consider cutting out the middle-man (Backblaze) and saving yourself money. But I make the bold claim we can save you money in that sweet middle ground.
One of the factors to consider is the cost of electricity in your area. Backblaze pays about 9 cents/kWh, which is "pretty good". Up in Oregon you can get deals for 2 cents/kWh and beat our operating costs, but in Hawaii you are at 50 cents/kWh and unless you have solar panels you're pretty much better off hosting data on the mainland. :-) I think TONS of people (mistakenly) think that after they purchase a hard drive, operating it is "free". Electricity is one of Backblaze's major costs in providing the service. Our electrical bill is more than $1 million per year right now.
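For anyone who has never run this number, a rough sketch of the per-drive electricity cost at those rates (the average draw per drive and the cooling/PSU overhead are assumptions, not Backblaze figures):

    # Rough yearly electricity cost for one always-on hard drive.
    watts_per_drive = 8.0        # assumed average draw (idle + seeks)
    pue = 1.2                    # assumed overhead for cooling / PSU losses
    kwh_per_year = watts_per_drive * pue * 24 * 365 / 1000   # ~84 kWh

    for rate in (0.09, 0.02, 0.50):   # Backblaze, cheap Oregon power, Hawaii
        print(rate, round(kwh_per_year * rate, 2))
    # -> ~$7.57, ~$1.68, ~$42.05 per drive per year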
Some people think we (Backblaze) get "magical good deals" by purchasing drives in bulk, but it isn't as good as you might think. Sure, we get some bulk discounts, but think 5% or 10% better than the retail price for one unit. And you can probably get THE SAME DEAL as Backblaze if you are purchasing 1,000 drives in one purchase order.
Backblaze is a pretty good deal. I'm currently trying to figure out if I should take my video files that I'm editing for my personal movies and store them on Backblaze (transmission time being the issue there) vs. getting a 10TB drive to put them on or a ZFS array, vs. just deleting the source. The latter has a very competitive price. :-D
I think you're overestimating the work needed to replace a drive -- their storage is redundant using a custom Reed-Solomon error correction setup[1,2] (meaning there's no downtime while replacing a drive, so none of the surrounding work you mentioned is necessary), and at the scale they're operating at they have staff just replacing drives (they are averaging 5 failures a day, and from memory they had higher failure rates a few years ago), bringing up new storage pods, and doing other maintenance work.
While they do have multi-failure redundancy (17 data, 3 parity), it wouldn't be a good idea to push the limits regularly because that means you're risking the loss of actual customer data. And adding more parity to give you extra wiggle-room would result in both storage efficiency losses and performance losses. It probably wouldn't be advisable to expand the "stripe" size to cover an entire vault (to increase the independent disk failure tolerance without reducing the data-to-parity ratio) because you'd kill performance (the matrix multiplications necessary to do Reed-Solomon have polynomial complexity).
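A little binomial arithmetic shows why you don't want to eat into that 3-shard margin as a matter of routine (the per-drive chance of failing within one repair window is an invented number, and failures are assumed independent):

    from math import comb

    def p_data_loss(working, spare_parity, p_fail):
        # Chance that more than `spare_parity` of the still-working drives in a
        # 17+3 group fail within one repair window (independent failures assumed).
        return sum(comb(working, k) * p_fail**k * (1 - p_fail)**(working - k)
                   for k in range(spare_parity + 1, working + 1))

    print(p_data_loss(20, 3, 0.001))  # all 20 healthy, 3 parity to spare: ~4.8e-9
    print(p_data_loss(18, 1, 0.001))  # two dead drives left in place:     ~1.5e-4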
Ah, okay. Unless you have hot spares already in the storage pod (which they might do), I don't think it'd be trivial -- or even a good idea -- to mess with the data/parity shards after the fact.
> If I were Backblaze, I'd design the system to work equally well with 40 drives as with 45, and consider just leaving bad drives in until a handful of them need replacing in a chassis. And the replacement would be basically automatic.
They designed their own replication scheme, which allows quite a few disks to fail. Considering that, using RAID would be absurd, so I'm pretty sure the drives are all configured as JBOD, which allows this kind of replacement.
Not necessarily. You lose a bit of data reliability (which is an important quality for a backup provider) by using a drive that fails more often, and you have to pay wages to workers who replace those failing drives. At scale, dying 5% sooner means additional employees just to replace those drives.
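Roughly how that scales at a Backblaze-sized fleet (the drive count is from elsewhere in the thread; the failure-rate delta and hands-on time per swap are guesses):

    # Extra swaps per year if the fleet-wide annual failure rate creeps up.
    fleet = 110_656
    extra_afr = 0.01               # hypothetical extra percentage point of failures/year
    hours_per_swap = 1.0           # assumed pull, RMA paperwork, burn-in of replacement

    extra_swaps = fleet * extra_afr                 # ~1,107 extra swaps/year
    print(extra_swaps * hours_per_swap / 2000)      # ~0.55 of a full-time tech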
Yev here -> Hah! It's less expensive, and as long as failure rates stay below a threshold for us, we will usually skew less expensive. Now, if failures start creeping up there's a lot more math involved, but since we're built for failure, we are very calm when a drive fails.
In some environments when a drive goes down, everyone goes into "red alert" mode and they have to rush to repair the drive or replace it. In our case we don't have to drop everything and run to it - so it helps keep the overall replacement costs down.