Still waiting for someone to open source an observability db that uses distributions as the core storage abstraction instead of just single values.
I thought Circonus’ IronDB was open source, but they’ve either taken it closed source again or I was mistaken.
In any case it seems weird that everyone is just rehashing the same space with metrics dbs and not moving to where Google/Circonus have been for a while.
I'm not sure I understand. Google, Circonus, etc. all store distributions as scalar values. A histogram is merely a collection of scalar values, where the collection is a set of bounded buckets, where the boundaries are typically adjacent to one another (e.g., 0.1 <= x < 0.2, 0.2 <= x < 0.3, etc.). You can do this with any timeseries DB.
The aggregation functionality is best done in a separate layer so that you have the flexibility to change how you create your distributions without having to change your storage backend.
They store scalar values in the sense that they are binary-encoded data structures. But what they actually encode at the cell level are histogram bins, in a data structure optimized for histogram calculations (in the Circonus case, log-linear histograms encoded as 3 values [0]). These histogram structures are optimized for fast re-aggregation and quantization across time/dimensions. By optimizing for histograms first and plugging things like counts and gauges into them (rather than the reverse, as in something like Prometheus), they get around a lot of the bad statistics and expensive query calculations that other metrics systems have.
This is different from how, say, Prometheus stores histograms, where the bucket boundary is treated as a dimension of a fixed-bucket histogram and the cell holds the count of items in that bucket.
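To make the Circonus-style approach above concrete, here is a minimal Python sketch of the log-linear idea. It is illustrative only (the bucketing function, names, and epsilon handling are mine, not Circonus' actual encoding): each sample maps to a (mantissa, exponent) bucket with two significant decimal digits of precision, and a histogram is just a map of bucket to count, so re-aggregating across time windows or hosts never touches raw samples.

```python
import math

def log_linear_bucket(x):
    # Map a positive sample to a (mantissa, exponent) bucket with two
    # significant decimal digits of precision. A sketch of the log-linear
    # idea, not Circonus' exact on-disk encoding.
    if x == 0:
        return (0, 0)
    exponent = math.floor(math.log10(abs(x))) - 1        # put the mantissa in [10, 100)
    mantissa = int(abs(x) / 10 ** exponent + 1e-9)        # small epsilon absorbs float rounding at bucket edges
    return (mantissa, exponent)

def merge(h1, h2):
    # A histogram is just {bucket: count}, so re-aggregation across time
    # or dimensions is plain counter addition -- no raw samples needed.
    out = dict(h1)
    for bucket, count in h2.items():
        out[bucket] = out.get(bucket, 0) + count
    return out

# log_linear_bucket(1234) -> (12, 2), i.e. the bucket [1200, 1300)
# log_linear_bucket(87)   -> (87, 0), i.e. the bucket [87, 88)
```

Contrast that with the Prometheus-style layout, where each fixed bucket boundary becomes its own series/dimension and the stored value is simply the count of observations in that bucket.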
Veneur [1] can compute approximate global percentiles (among other things) during metric collection, and the percentiles can be stored and queried even in a datastore that doesn't know anything about distributions.
As far as I have seen, 99% of in-house solutions are tailored for exactly one use case: the company that built them. Existing solutions like InfluxDB are made to be far more generic.
Companies like Twitter, Uber, etc. can spend many millions without noticing, so they can build tailored databases for themselves.
Will someone else use it? Probably not (unless it's something genuinely new that differs a lot from existing solutions and provides a lot of benefits).
I think it has (sadly) become common for people/companies to adopt software just/mostly because it came from Twitter/FAANG or whatever company is currently high on hype...
Also, TimescaleDB is a general purpose time-series database, which means that it not only works for metrics data, but also many other datatypes (strings, arrays, JSON, etc). You can also create indexes on any of these fields.
Because TimescaleDB is built on Postgres, it is also a relational time-series database. Ie you can store metadata / business data alongside your time-series data.
While HDFS is indeed used for exporting old data and storing some partition mapping metadata, it's clear from the blog post that MetricsDB is much more reliant on BlobStore as well as MetricsDB-specific services.
> The servers checkpoint in-memory data every two hours to durable storage, Blobstore. We are using Blobstore as durable storage so that our process can be run on our shared compute platform with lower management overhead.
This applies to all time series databases... What is special about time series that makes unique databases attractive? Is it the assumption that only one main index is created (time)? Is it that the data is append-only?
In particular, I'd love to know if there's anything major that generic RDBMSs could do better here.
Modern column-oriented relational data warehouses will handle time-series data fine in 95% of cases.
What "time-series" databases offer is more functionality around time being a primary component of the data. For example, automatic partitoning/sharding on time, various date handling techniques, better bucketing/gap filling/smoothing functions, data retention policies based on time, automatic rollups and aggregations, etc.
Some have custom data stores, some use key/value stores like Hbase or Cassandra, and others use relational databases. Using relational foundations offers more flexibility (like Timescale on top of Postgres) than the others like InfluxDB or OpenTSDB.
Yes, time-dependence and numeric-only data allow a lot of tricks that cannot be done with generic data.
For example, double-delta compression, or Gorilla compression for floating-point numbers. For more, take a look at our open source VictoriaMetrics database, which uses all of these tricks.
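As a rough illustration of the double-delta idea (a sketch of the general technique, not VictoriaMetrics' or Gorilla's actual bit-level format): regularly scraped timestamps collapse into long runs of zeros, which then compress extremely well.

```python
def double_delta_encode(timestamps):
    # Store the first value verbatim, the second as a delta, and every later
    # value as the delta between consecutive deltas. For a fixed scrape
    # interval the output is almost all zeros, which any entropy coder squeezes down.
    out, prev, prev_delta = [], None, None
    for t in timestamps:
        if prev is None:
            out.append(t)
        elif prev_delta is None:
            prev_delta = t - prev
            out.append(prev_delta)
        else:
            delta = t - prev
            out.append(delta - prev_delta)
            prev_delta = delta
        prev = t
    return out

# A 15s scrape interval with one sample arriving a second late:
# double_delta_encode([1000, 1015, 1030, 1046, 1062]) -> [1000, 15, 0, 1, 0]
```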
Yep. The nature of the time series data does allow for storage optimizations like that. But there is the usage pattern aspect as well, that is more interesting in my opinion.
- Availability preferred over consistency (you want your metrics when bad things are happening, even if they may not be 100% accurate)
- Extremely write heavy, with most data never being read (they mentioned that only 2% of data is ever read; in my experience at another large company, it was way less than 2%)
- For the data that is read, most of reads are for most recent points (mostly alarms, but some dashboards as well)
- Different SLAs for queries based on usage - ie. queries for alarms must be fast. Dashboards, and trend analysis - not so much.
ClickHouse allows combining delta, double-delta, and Gorilla codecs with LZ4 or ZSTD for column compression. But it is not often used as a DB for monitoring metrics, so something else is probably expected from a time series DB.
It would work, but Clickhouse is a Russian (Yandex) thing, and SQL isn't really needed for most monitoring and alerting use cases.
If I didn't want to use MySQL or Postgres, I'd rank Prometheus #1 and Clickhouse #2.
The killer thing for Clickhouse is that Percona supports it, so if you want to outsource the installation, mgmt. and support, you can just write a check and get good results.
Also, Clickhouse is a column store with SQL, so you could use an instance for monitoring and another to replace Vertica or Greenplum or whatever so long as it has the client libraries you need.
ClickHouse is licensed under Apache License 2.0 and Yandex is incorporated in the Netherlands. What are your concerns with it being developed by Russians (other than xenophobia)?
It comes down to performance, usability, and cost.
Time-series databases make specific architectural decisions and introduce advanced capabilities that enable orders of magnitude better insert/query performance, while also reducing storage cost via compression.
For example, TimescaleDB (where I work) is a relational time-series database built on top of Postgres, which means it includes all of the goodness of Postgres, but also achieves:
* 96%+ compression (ie only uses ~4% of the storage of Postgres) [1]
* 100x-1000x faster queries than Mongo [2]
* 10x higher inserts and 50-1000x faster queries than Cassandra [3]
If your RDBMS is good enough - then please keep using it. :-) But if query latency, insert performance, or cost are becoming concerns, then I'd suggest looking at a relational time-series database.
While those numbers are good, TimescaleDB does not even attempt to compete with the performance of purpose-built time series DBs (e.g. shakti, clickhouse).
Time series DBs (and OLAP dbs in general) have very different trade-offs/needs than transactional DBs.
> TimescaleDB does not even attempt to compete with the performance of purpose-built time series DBs
Not sure what you mean by this. TimescaleDB outperforms InfluxDB, another purpose-built time-series DB, on most workloads [0], especially on ones with high-cardinality [1].
In particular our key insight, which some may still find heretical and hard-to-believe, is that it is quite possible to produce best-in-class performance characteristics for time-series using a relational database. In particular, we have been able to add columnar compression to our row-oriented format resulting in 96%+ compression rates [2], and multi-node scale-out resulting in 10M+ inserts per second [3].
TimescaleDB is not designed for workloads where most queries touch all data points (ie full table scans). But that's not what time-series workloads look like.
> In particular, I'd love to know if theres anything major that generic RDBMS's could do better here.
Well, everybody with experience outsources monitoring now since it's a non-core cost center, unless there's a compelling scale or secrecy issue.
If RAM and CPU were free, I'd use MySQL or Postgres w/partitions because of their mgmt. features, tested replication and SQL.
But Prometheus or Clickhouse are 10-25x more efficient in terms of space, and often have much faster queries. The tradeoffs are bizarre HA gaps, lack of trained people, and ops groups are stuck supporting it.
I would never recommend monitoring with anything based on HDFS (OpenTSDB), written in Java (Cassandra), or in-memory for large clusters (InfluxDB.)
For monitoring under 200 nodes, anything will work.
If you only have a day to do something, just install Nagios and you'll get 99% of what you really need.
> Well, everybody with experience outsources monitoring now
That has not been my experience. Quite the opposite: every place I’ve been that outsourced monitoring ended up bringing significant portions of their observability stack back in-house, either because they needed more control or different feature sets, or because the outsourced solution was cost prohibitive.
I think it is true that lots of teams continue to outsource storage of metrics data but the outsourced vendors are not incentivized to make it easy to do retention/filtering/aggregation well.
Source: Have run observability stacks across a variety of domains.
> Well, everybody with experience outsources monitoring now since it's a non-core cost center, unless there's a compelling scale or secrecy issue.
In this scenario I find it odd that there are so many open source projects with traction and success (one name above all, Prometheus) if they are aimed only at specialized companies. What you say applies to small companies or very big ones with lots of money to spend. Mid-sized companies, in my experience, prefer to spend money on in-house solutions because at their volumes outsourcing is really costly. But that's just my small experience.
Monitoring how your platform behaves/performs is a cost center? I think it's a quite important feature in every tech company, and depending on the size of the company it may be better to outsource it or do it internally.
I'd be curious to see a couple of organizations that have taken different approaches in this area discuss how they made decisions around investing in observability systems. When metrics gets to be a meaningful fraction of your total data size, and when it gets to be an operational burden on its own, I think it's tempting to ask "Do we really need to emit and store all this?" especially if you acknowledge that only ~2% of writes ever get read.
In some sense, the organization is making an expected-value-of-information estimate, where there's a complex interaction between the data you have and actions you can take, especially to preempt or resolve issues.
I've seen two approaches. (We have a lot of customers who deal with this exact problem.)
1.) Down-sample data to aggregates (see the sketch after this list). Keep the aggregates but drop the source data after a certain period of time. You can still see historical trends down to the sampling interval but cannot drill down to individual observations.
2.) Use tiered storage. Keep recent data on NVMe SSD and migrate to high density HDD or object storage over time. This allows you to keep more data. Not all DBMS support this out of the box.
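A minimal sketch of approach 1, assuming a simple (timestamp, value) stream: the rollup keeps min/max/sum/count per fixed window so averages and extremes survive after the raw points are expired (names here are illustrative, not any particular product's API).

```python
from collections import defaultdict

def downsample(points, interval_s):
    # Roll raw (unix_ts, value) points up into fixed windows. Raw rows can be
    # dropped afterwards; trends stay queryable at the window resolution.
    buckets = defaultdict(lambda: {"min": float("inf"), "max": float("-inf"),
                                   "sum": 0.0, "count": 0})
    for ts, value in points:
        b = buckets[ts - ts % interval_s]
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
        b["sum"] += value
        b["count"] += 1
    return dict(buckets)

# downsample(raw_points, 300) keeps one aggregate row per 5-minute window,
# so you can still chart averages/extremes long after the raw samples are gone.
```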
Compression for correlated measurements can be spectacular. ClickHouse can often get 99.9% reduction in data size with the right combination of codecs (delta, double-delta, gorilla, etc.) and compression (e.g., ZSTD). Sort order within tables is also important.
And yet another company that builds its own (time-series) database. Why does this keep happening for such a narrowly defined use-case at similar scales?
At 1.5 petabytes, they could just dump this in Google's BigQuery for the cost of an engineer and have full SQL power.
BigQuery does not have anything close to the real-time query support you want/need for observability data. Plus, the cost is more like one engineer per query. That shit is crazy expensive.
What do you mean by real-time? BQ can scan a petabyte in less than a minute, but partitioning and clustering will get most simple queries back in a few seconds. If you mean data freshness, then BQ supports streaming ingest and will show those rows in the next query.
SQL also lets you do a lot of aggregations in the database instead of pulling back raw metrics. And BQ has flat-rate pricing and discounts for large clients.
Less than a minute is not real time. Something like < 100ms for most queries would be real time.
And then do that 10s of 1000s of times per minute, every minute, 24 hours a day.
You can see how the query costs add up...
I know it's easy to come on HN and play armchair architect but this is one area where you are just incorrect in your assumptions. Did Twitter need to build a(nother) in-house database? I don't know for sure. But I do know that BQ is not well suited to observability use cases.
As I said, that's for scanning a petabyte. Selective queries would be much smaller and faster, costs can be handled with flat-rate pricing, and there's the BI Engine (in-memory cache) feature too.
The details in the article (1.5PB logical data, 3x replication factor, only 2% of metrics ever read) and the fact that Twitter already runs infra in Google Cloud seem to align well with BQ in my experience (10+ years in adtech building even more complex backends).
Please tell me what assumptions are incorrect so I can reconsider.
My experience is ~2.5 years as a senior engineer on the Observability team at Twitter (from 2013-2015). I was part of the migration to Manhattan (mentioned in the post) from Cassandra, new alerting infra, query language design, among many other things. Your adtech experience should make you particularly sensitive to query latencies, so I find it interesting that you're glossing over that.
Yes, only 2% of metrics are ever read. That was the same back then too. The kicker is that you can't be sure which will be read, so the systems are built to be able to read any of the metrics with the same SLA. This is especially critical during an outage, when engineers need to quickly read metrics that in many cases were written only a few seconds ago, that they wouldn't otherwise read, and that are not configured as part of an ongoing alert.
Additionally, the alerting infrastructure that runs on top of the TSDB is configured with 10k+ queries that run every minute. So even at 2% read, you're querying them over and over and over again because you always need the latest data plus whatever trailing data is needed to fulfill the alerting needs (trailing 10m, hour, day, month, etc). This also makes it a particularly hard caching problem.
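To make that load pattern concrete, here's a toy sketch of what such an alerting tier looks like from the storage system's point of view (rule shapes and names are invented for illustration): every rule re-reads the newest point plus its own trailing window each minute, so the hot end of every referenced series is fetched again and again, which is why it caches so poorly.

```python
import time

ALERT_RULES = [
    {"series": "api/latency_p99",     "window_s": 600,  "threshold": 0.5},
    {"series": "tweets/write_errors", "window_s": 3600, "threshold": 100},
    # ...thousands more rules in the scenario described above
]

def evaluate(query, rules, now=None):
    # query(series, start, end) -> list of (timestamp, value); run every minute.
    now = now if now is not None else time.time()
    firing = []
    for rule in rules:
        # Each evaluation needs data written only seconds ago, plus its
        # trailing window -- fresh reads every cycle, so caching helps little.
        points = query(rule["series"], now - rule["window_s"], now)
        if points and max(v for _, v in points) > rule["threshold"]:
            firing.append(rule["series"])
    return firing
```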
I can't speak to the state of GCP at Twitter. I know they had started to migrate some things but when I was there it was all colo.
Could they use BQ? Sure, it could probably be tooled with partitions/caching/etc. to work, I guess, but at the end of the day it's not what BQ is designed for, which is data warehousing and BI. Could they have used something that wasn't a custom-built in-house database? Yeah, almost certainly (it's no secret that there was some NIH going on at Twitter well before this).
This happens all the time on HN, so I'm not here to call you out specifically... but it's really easy to read a company's blog post about their infrastructure decisions and immediately scoff and say "psh, why didn't they just use X?"... as if you now have the same context the engineering team has, from reading a single blog post.
Try to run a few petabyte-scale queries and you will be surprised when you see the invoice. If you switch to flat-rate pricing, don't expect the latency that you mentioned.
You don't need to scan the whole dataset. Partitioning + clustering, especially for time-series, should be very efficient. BQ flat-rate pricing starts at 10k/month for a fixed number of processing slots. You can buy more slots to fit your query load.
Anyways the point is that it's well within the financial means of Twitter, especially when compared to developing and operating this proprietary system.
If I'm not mistaken Twitter is mostly on GCP so they could just go with BigQuery instead of developing an in-house solution. We probably need to ask them in order to get a proper answer. :)
Certainly depends on how you use it. If you plan accordingly and the data fits a compatible partitioning strategy, things work out well and costs can be manageable/predictable.
5 billion events per second? That seems like too much for monitoring data even at Twitter's scale. The article mentions that they store the raw data and use rollup tables for aggregate queries; I wonder how much the ELT process costs in total.
You write to S3 in chunks for long-term storage and keep hot data live. You don’t write every metric directly. Along with compression, it shouldn’t come to 5 billion writes/s.
Even with chunks it's prohibitively expensive at this scale, and you then need some intermediate system batching the inbound data either to disk (then why use S3 instead of a database) or in memory (more expensive, more risk of data loss).
Can someone offer up how this is different from InfluxDB? Also, a bunch of machine log folks are jumping on the observability 'marketing' bandwagon - i.e. Elasticsearch, Splunk, Devo, SumoLogic, etc. Interestingly, I don't see them here.