Still waiting for someone to open source an observability db that uses distributions as the core storage abstraction instead of just single values.
I thought Circonus’ IronDB was open source, but they’ve either taken it closed source again or I was mistaken.
In any case it seems weird that everyone is just rehashing the same space with metrics dbs and not moving to where Google/Circonus have been for a while.
I'm not sure I understand. Google, Circonus, etc. all store distributions as scalar values. A histogram is merely a collection of scalar values, where the collection is a set of bounded buckets, where the boundaries are typically adjacent to one another (e.g., 0.1 <= x < 0.2, 0.2 <= x < 0.3, etc.). You can do this with any timeseries DB.
The aggregation functionality is best done in a separate layer so that you have the flexibility to change how you create your distributions without having to change your storage backend.
They store scalar values in the sense that they are binary-encoded data structures. But what they actually encode at the cell level are histogram bins, in a data structure optimized for histogram calculations (in the Circonus case, log-linear histograms encoded as 3 values [0]). These histogram structures are optimized for fast re-aggregation and quantization across time/dimensions. By optimizing for histograms first and plugging things like counts and gauges into them (rather than the reverse, as in something like Prometheus), they get around a lot of the bad statistics and expensive query calculations that other metrics systems have.
This is different from how, say, Prometheus stores histograms, where the bucket boundary is treated as a dimension of a fixed-bucket histogram and the cell holds the count of items in that bucket.
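To make the Circonus-style approach above concrete, here is a minimal Python sketch of the log-linear idea. It is illustrative only (the bucketing function, names, and epsilon handling are mine, not Circonus' actual encoding): each sample maps to a (mantissa, exponent) bucket with two significant decimal digits of precision, and a histogram is just a map of bucket to count, so re-aggregating across time windows or hosts never touches raw samples.

```python
import math

def log_linear_bucket(x):
    # Map a positive sample to a (mantissa, exponent) bucket with two
    # significant decimal digits of precision. A sketch of the log-linear
    # idea, not Circonus' exact on-disk encoding.
    if x == 0:
        return (0, 0)
    exponent = math.floor(math.log10(abs(x))) - 1        # put the mantissa in [10, 100)
    mantissa = int(abs(x) / 10 ** exponent + 1e-9)        # small epsilon absorbs float rounding at bucket edges
    return (mantissa, exponent)

def merge(h1, h2):
    # A histogram is just {bucket: count}, so re-aggregation across time
    # or dimensions is plain counter addition -- no raw samples needed.
    out = dict(h1)
    for bucket, count in h2.items():
        out[bucket] = out.get(bucket, 0) + count
    return out

# log_linear_bucket(1234) -> (12, 2), i.e. the bucket [1200, 1300)
# log_linear_bucket(87)   -> (87, 0), i.e. the bucket [87, 88)
```

Contrast that with the Prometheus-style layout, where each fixed bucket boundary becomes its own series/dimension and the stored value is simply the count of observations in that bucket.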
Veneur [1] can compute approximate global percentiles (among other things) during metric collection, and the percentiles can be stored and queried even in a datastore that doesn't know anything about distributions.
As far as I have seen, 99% of in-house solutions are tailored for exactly one use case: the company that built them. Existing solutions like InfluxDB are made to be far more generic.
Companies like Twitter, Uber, etc. can spend many millions without noticing, so they can build tailored databases for themselves.
Will someone else use it? Probably not (unless it's something genuinely new that differs a lot from existing solutions and provides a lot of benefits).
I think it has (sadly) become common for people/companies to adopt software just/mostly because it came from Twitter/FAANG or whatever company is currently high on hype...
Also, TimescaleDB is a general purpose time-series database, which means that it not only works for metrics data, but also many other datatypes (strings, arrays, JSON, etc). You can also create indexes on any of these fields.
Because TimescaleDB is built on Postgres, it is also a relational time-series database. Ie you can store metadata / business data alongside your time-series data.
While HDFS is indeed used for exporting old data and storing some partition mapping metadata, it's clear from the blog post that MetricsDB is much more reliant on BlobStore as well as MetricsDB-specific services.
> The servers checkpoint in-memory data every two hours to durable storage, Blobstore. We are using Blobstore as durable storage so that our process can be run on our shared compute platform with lower management overhead.
This applies to all time series databases... What is special about time series that makes unique databases attractive? Is it the assumption that only one main index is created (time)? Is it that the data is append-only?
In particular, I'd love to know if there's anything major that generic RDBMSs could do better here.
Modern column-oriented relational data warehouses will handle time-series data fine in 95% of cases.
What "time-series" databases offer is more functionality around time being a primary component of the data. For example, automatic partitoning/sharding on time, various date handling techniques, better bucketing/gap filling/smoothing functions, data retention policies based on time, automatic rollups and aggregations, etc.
Some have custom data stores, some use key/value stores like Hbase or Cassandra, and others use relational databases. Using relational foundations offers more flexibility (like Timescale on top of Postgres) than the others like InfluxDB or OpenTSDB.
Yes, time-dependence and numeric-only data allow a lot of tricks that cannot be done with generic data.
For example, double-delta compression, or Gorilla compression for floating-point numbers. For more, take a look at our open source VictoriaMetrics database, which uses all of these tricks.
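As a rough illustration of the double-delta idea (a sketch of the general technique, not VictoriaMetrics' or Gorilla's actual bit-level format): regularly scraped timestamps collapse into long runs of zeros, which then compress extremely well.

```python
def double_delta_encode(timestamps):
    # Store the first value verbatim, the second as a delta, and every later
    # value as the delta between consecutive deltas. For a fixed scrape
    # interval the output is almost all zeros, which any entropy coder squeezes down.
    out, prev, prev_delta = [], None, None
    for t in timestamps:
        if prev is None:
            out.append(t)
        elif prev_delta is None:
            prev_delta = t - prev
            out.append(prev_delta)
        else:
            delta = t - prev
            out.append(delta - prev_delta)
            prev_delta = delta
        prev = t
    return out

# A 15s scrape interval with one sample arriving a second late:
# double_delta_encode([1000, 1015, 1030, 1046, 1062]) -> [1000, 15, 0, 1, 0]
```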
Yep. The nature of the time series data does allow for storage optimizations like that. But there is the usage pattern aspect as well, that is more interesting in my opinion.
- Availability preferred over consistency (you want your metrics when bad things are happening, even if they may not be 100% accurate)
- Extremely write heavy, with most data never being read (they mentioned that only 2% of data is ever read; in my experience at another large company, it was way less than 2%)
- For the data that is read, most of reads are for most recent points (mostly alarms, but some dashboards as well)
- Different SLAs for queries based on usage - ie. queries for alarms must be fast. Dashboards, and trend analysis - not so much.
ClickHouse allows combining delta, double-delta, and Gorilla codecs with LZ4 or ZSTD for column compression. But it is not often used as a DB for monitoring metrics, so something else is probably expected from a time series DB.
It would work, but Clickhouse is a Russian (Yandex) thing, and SQL isn't really needed for most monitoring and alerting use cases.
If I didn't want to use MySQL or Postgres, I'd rank Prometheus #1 and Clickhouse #2.
The killer thing for Clickhouse is that Percona supports it, so if you want to outsource the installation, mgmt. and support, you can just write a check and get good results.
Also, Clickhouse is a column store with SQL, so you could use an instance for monitoring and another to replace Vertica or Greenplum or whatever so long as it has the client libraries you need.
ClickHouse is licensed under Apache License 2.0 and Yandex is incorporated in the Netherlands. What are your concerns with it being developed by Russians (other than xenophobia)?
It comes down to performance, usability, and cost.
Time-series databases make specific architectural decisions and introduce advanced capabilities that enable orders of magnitude better insert/query performance, while also reducing storage cost via compression.
For example, TimescaleDB (where I work) is a relational time-series database built on top of Postgres, which means it includes all of the goodness of Postgres, but also achieves:
* 96%+ compression (ie only uses ~4% of the storage of Postgres) [1]
* 100x-1000x faster queries than Mongo [2]
* 10x higher inserts and 50-1000x faster queries than Cassandra [3]
If your RDBMS is good enough - then please keep using it. :-) But if query latency, insert performance, or cost are becoming concerns, then I'd suggest looking at a relational time-series database.
While those numbers are good, TimescaleDB does not even attempt to compete with the performance of purpose-built time series DBs (e.g. shakti, clickhouse).
Time series DBs (and OLAP dbs in general) have very different trade-offs/needs than transactional DBs.
> TimescaleDB does not even attempt to compete with the performance of purpose-built time series DBs
Not sure what you mean by this. TimescaleDB outperforms InfluxDB, another purpose-built time-series DB, on most workloads [0], especially on ones with high-cardinality [1].
In particular our key insight, which some may still find heretical and hard-to-believe, is that it is quite possible to produce best-in-class performance characteristics for time-series using a relational database. In particular, we have been able to add columnar compression to our row-oriented format resulting in 96%+ compression rates [2], and multi-node scale-out resulting in 10M+ inserts per second [3].
TimescaleDB is not designed for workloads where most queries touch all data points (ie full table scans). But that's not what time-series workloads look like.
> In particular, I'd love to know if theres anything major that generic RDBMS's could do better here.
Well, everybody with experience outsources monitoring now since it's a non-core cost center, unless there's a compelling scale or secrecy issue.
If RAM and CPU were free, I'd use MySQL or Postgres w/partitions because of their mgmt. features, tested replication and SQL.
But Prometheus or Clickhouse are 10-25x more efficient in terms of space, and often have much faster queries. The tradeoffs are bizarre HA gaps, lack of trained people, and ops groups are stuck supporting it.
I would never recommend monitoring with anything based on HDFS (OpenTSDB), written in Java (Cassandra), or in-memory for large clusters (InfluxDB.)
For monitoring under 200 nodes, anything will work.
If you only have a day to do something, just install Nagios and you'll get 99% of what you really need.
> Well, everybody with experience outsources monitoring now
That has not been my experience. Quite the opposite: every place I’ve been that outsourced monitoring ended up bringing significant portions of their observability stack back in-house, either because they needed more control or different feature sets, or because the outsourced solution was cost prohibitive.
I think it is true that lots of teams continue to outsource storage of metrics data but the outsourced vendors are not incentivized to make it easy to do retention/filtering/aggregation well.
Source: Have run observability stacks across a variety of domains.
> Well, everybody with experience outsources monitoring now since it's a non-core cost center, unless there's a compelling scale or secrecy issue.
In this scenario I find it odd that there are so many open source projects with traction and success (one name above all, Prometheus) if they are aimed only at specialized companies. What you say applies to small companies or very big ones with lots of money to spend. Mid-sized companies, in my experience, prefer to spend money on in-house solutions because at their volumes outsourcing is really costly. But that's just my small experience.
Monitoring how your platform behaves/performs is a cost center? I think it's a quite important feature in every tech company, and depending on the size of the company it may be better to outsource it or do it internally.
I'd be curious to see a couple of organizations that have taken different approaches in this area discuss how they made decisions around investing in observability systems. When metrics gets to be a meaningful fraction of your total data size, and when it gets to be an operational burden on its own, I think it's tempting to ask "Do we really need to emit and store all this?" especially if you acknowledge that only ~2% of writes ever get read.
In some sense, the organization is making an expected-value-of-information estimate, where there's a complex interaction between the data you have and actions you can take, especially to preempt or resolve issues.
I've seen two approaches. (We have a lot of customers who deal with this exact problem.)
1.) Down-sample data to aggregates (see the sketch after this list). Keep the aggregates but drop the source data after a certain period of time. You can still see historical trends down to the sampling interval but cannot drill down to individual observations.
2.) Use tiered storage. Keep recent data on NVMe SSD and migrate to high density HDD or object storage over time. This allows you to keep more data. Not all DBMS support this out of the box.
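A minimal sketch of approach 1, assuming a simple (timestamp, value) stream: the rollup keeps min/max/sum/count per fixed window so averages and extremes survive after the raw points are expired (names here are illustrative, not any particular product's API).

```python
from collections import defaultdict

def downsample(points, interval_s):
    # Roll raw (unix_ts, value) points up into fixed windows. Raw rows can be
    # dropped afterwards; trends stay queryable at the window resolution.
    buckets = defaultdict(lambda: {"min": float("inf"), "max": float("-inf"),
                                   "sum": 0.0, "count": 0})
    for ts, value in points:
        b = buckets[ts - ts % interval_s]
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
        b["sum"] += value
        b["count"] += 1
    return dict(buckets)

# downsample(raw_points, 300) keeps one aggregate row per 5-minute window,
# so you can still chart averages/extremes long after the raw samples are gone.
```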
Compression for correlated measurements can be spectacular. ClickHouse can often get 99.9% reduction in data size with the right combination of codecs (delta, double-delta, gorilla, etc.) and compression (e.g., ZSTD). Sort order within tables is also important.
And yet another company that builds its own (time-series) database. Why does this keep happening for such a narrowly defined use-case at similar scales?
At 1.5 petabytes, they could just dump this in Google's BigQuery for the cost of an engineer and have full SQL power.
BigQuery does not have anything close to the real-time query support you want/need for observability data. Plus, the cost is more like one engineer per query. That shit is crazy expensive.
What do you mean by real-time? BQ can scan a petabyte in less than a minute, but partitioning and clustering will get most simple queries back in a few seconds. If you mean data freshness, then BQ supports streaming ingest and will show those rows in the next query.
SQL also lets you do a lot of aggregations in the database instead of pulling back raw metrics. And BQ has flat-rate pricing and discounts for large clients.
Less than a minute is not real time. Something like < 100ms for most queries would be real time.
And then do that 10s of 1000s of times per minute, every minute, 24 hours a day.
You can see how the query costs add up...
I know it's easy to come on HN and play armchair architect but this is one area where you are just incorrect in your assumptions. Did Twitter need to build a(nother) in-house database? I don't know for sure. But I do know that BQ is not well suited to observability use cases.
As I said, that's for scanning a petabyte. Selective queries would be much smaller and faster, costs can be handled with flat-rate pricing, and there's the BI Engine (in-memory cache) feature too.
The details in the article (1.5PB logical data, 3x replication factor, only 2% of metrics ever read) and the fact that Twitter already runs infra in Google Cloud seem to align well with BQ in my experience (10+ years in adtech building even more complex backends).
Please tell me what assumptions are incorrect so I can reconsider.
My experience is ~2.5 years as a senior engineer on the Observability team at Twitter (from 2013-2015). I was part of the migration to Manhattan (mentioned in the post) from Cassandra, new alerting infra, query language design, among many other things. Your adtech experience should make you particularly sensitive to query latencies, so I find it interesting that you're glossing over that.
Yes, only 2% of metrics are ever read. That was the same back then too. The kicker is that you can't be sure which will be read, so the systems are built to be able to read any of the metrics with the same SLA. This is especially critical during an outage, when engineers need to quickly read metrics that in many cases were written only a few seconds ago, that they wouldn't otherwise read, and that are not configured as part of an ongoing alert.
Additionally, the alerting infrastructure that runs on top of the TSDB is configured with 10k+ queries that run every minute. So even at 2% read, you're querying them over and over and over again because you always need the latest data plus whatever trailing data is needed to fulfill the alerting needs (trailing 10m, hour, day, month, etc). This also makes it a particularly hard caching problem.
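To make that load pattern concrete, here's a toy sketch of what such an alerting tier looks like from the storage system's point of view (rule shapes and names are invented for illustration): every rule re-reads the newest point plus its own trailing window each minute, so the hot end of every referenced series is fetched again and again, which is why it caches so poorly.

```python
import time

ALERT_RULES = [
    {"series": "api/latency_p99",     "window_s": 600,  "threshold": 0.5},
    {"series": "tweets/write_errors", "window_s": 3600, "threshold": 100},
    # ...thousands more rules in the scenario described above
]

def evaluate(query, rules, now=None):
    # query(series, start, end) -> list of (timestamp, value); run every minute.
    now = now if now is not None else time.time()
    firing = []
    for rule in rules:
        # Each evaluation needs data written only seconds ago, plus its
        # trailing window -- fresh reads every cycle, so caching helps little.
        points = query(rule["series"], now - rule["window_s"], now)
        if points and max(v for _, v in points) > rule["threshold"]:
            firing.append(rule["series"])
    return firing
```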
I can't speak to the state of GCP at Twitter. I know they had started to migrate some things but when I was there it was all colo.
Could they use BQ? Sure, it could probably be tooled with partitions/caching/etc. to work, I guess, but at the end of the day it's not what BQ is designed for, which is data warehousing and BI. Could they have used something that wasn't a custom-built in-house database? Yeah, almost certainly (it's no secret that there was some NIH going on at Twitter well before this).
This happens all the time on HN, so I'm not here to call you out specifically... but it's really easy to read a company's blog post about their infrastructure decisions and immediately scoff and say "psh, why didn't they just use X?"... as if you now have the same context the engineering team has, from reading a single blog post.
Try to run a few petabyte-scale queries and you will be surprised when you see the invoice. If you switch to flat-rate pricing, don't expect the latency that you mentioned.
You don't need to scan the whole dataset. Partitioning + clustering, especially for time-series, should be very efficient. BQ flat-rate pricing starts at 10k/month for a fixed number of processing slots. You can buy more slots to fit your query load.
Anyways the point is that it's well within the financial means of Twitter, especially when compared to developing and operating this proprietary system.
If I'm not mistaken Twitter is mostly on GCP so they could just go with BigQuery instead of developing an in-house solution. We probably need to ask them in order to get a proper answer. :)
Certainly depends on how you use it. If you plan accordingly and the data fits a compatible partitioning strategy, things work out well and costs can be manageable/predictable.
5 billion events per second? That seems like too much for monitoring data even at Twitter's scale. The article mentions that they store the raw data and use rollup tables for aggregate queries; I wonder how much the ELT process costs in total.
You write to S3 in chunks for long-term storage and keep hot data live. You don’t write every metric directly. Along with compression, it shouldn’t come to 5 billion writes/s.
Even with chunks it's prohibitively expensive at this scale, and you then need some intermediate system batching the inbound data either to disk (then why use S3 instead of a database) or in memory (more expensive, more risk of data loss).
Can someone offer up how this is different from InfluxDB? Also, a bunch of machine log folks are jumping on the observability 'marketing' bandwagon - i.e. Elasticsearch, Splunk, Devo, SumoLogic, etc. Interestingly, I don't see them here.