Is it possible that data isn't getting bigger - but that the people who work with it just want to process larger data sets than before?
I mean, before they'd train a model on 1,000 inputs, test it against another 50, and call it a day. Now they want to train it on 1,000,000 inputs.
Am I completely off base? It's not my area, though I work with databases. My observation is that developers always want to use the most data possible, even when it doesn't really provide any benefit.
Sorry, I was being a bit playful with language. What I mean is that, if you roughly define small, medium, and large data in terms of the strategies required to process it, then the absolute size of the data that can be processed using the simpler methods grows over time.
And whether or not more data is needed or collectible varies by discipline. Astrophysicists collect way more data than they used to because (1) they need it and (2) instrumentation allows it.
Some kinds of data collection haven't scaled up, however. Surveying humans is expensive and labor intensive. And for many things that you might want to study about humans, you can't simply affix a sensor to them. So what might have been possible only through big data or medium data methods a few years ago can now be loaded into memory (i.e. small data strategies).
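To make the "defined by strategy" idea concrete, here is a rough sketch in Python. The thresholds are my own assumptions rather than any standard definition: "small" means it fits in RAM with some headroom, "medium" means it fits on one machine's disk, and "large" means it needs more than one machine. The function name `classify` and the `headroom` parameter are made up for illustration.

```python
GB = 10**9

def classify(dataset_bytes, ram_bytes, disk_bytes, headroom=0.5):
    """Label a dataset by the simplest strategy that can process it.

    Assumed (not standard) thresholds:
      small  -> fits in a fraction (`headroom`) of RAM, so load it all
      medium -> fits on one machine's disk, so stream or process out of core
      large  -> exceeds one machine, so distribute
    """
    if dataset_bytes <= ram_bytes * headroom:
        return "small"
    if dataset_bytes <= disk_bytes:
        return "medium"
    return "large"

# The same 100 GB dataset changes category as the machine grows.
older_machine = classify(100 * GB, ram_bytes=16 * GB, disk_bytes=2000 * GB)
newer_machine = classify(100 * GB, ram_bytes=512 * GB, disk_bytes=2000 * GB)
too_big = classify(5000 * GB, ram_bytes=512 * GB, disk_bytes=2000 * GB)
```

The data didn't change between the first two calls, only the hardware did, which is the point: the label tracks the strategy you need, not the byte count.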
That is my experience recently. Developers storing 500GB in a database (pre-launch), with < 1GB of meaningful data. A bunch of JSON logs that they knew data science would want eventually, but couldn't be bothered to either pare down or put in a more sensible place.
The thing is, it didn't really matter; Postgres still had a ton of performance left over even after the product went live. If you can still fit it in RAM, why waste $$$ of dev time over the $$ cost of a bigger instance?