It's been getting quite a bit of traction, and we're currently working on integration with the Scrapinghub platform (disclaimer: I work there) for those who prefer a hosted version.
We are not simply complaining about MongoDB, nor saying it's useless. We are just explaining why it's a poor choice for one specific use case: storing scraped data.
FWIW, we still use Mongo in other internal applications; it's just not the right choice for our crawl data storage backend.
One issue is that many of these points are design characteristics of MongoDB and should have been known beforehand. I am not criticising, but it's almost like you did zero research beforehand.
Transactions, for example, have never existed in MongoDB, and joins don't really make much sense there.
Perhaps they did their research on MongoDB and knew all the limitations, but thought to themselves "meh, I can solve all that in the application code", and eventually found out it wasn't so easy to handle transactions and joins in code?
After all, developers are rather susceptible to the "don't tell me I can't do that" behavior.
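For what it's worth, the "I'll just do joins in the application" approach typically ends up looking something like the sketch below. It's a made-up example (plain dicts standing in for Mongo documents, and hypothetical field names like `page_id`); with a real driver you'd be iterating cursors over two collections instead of lists:

```python
# Hypothetical sketch of an application-side join, the kind of code you
# write when the datastore won't join for you. Plain dicts stand in for
# MongoDB documents here; field names are illustrative, not from any schema.

def app_side_join(pages, links, key="page_id"):
    """Attach each page's scraped links to it, client-side."""
    # First pass: index the "foreign" collection by the join key.
    by_page = {}
    for link in links:
        by_page.setdefault(link[key], []).append(link)
    # Second pass: attach matching links to every "primary" document.
    return [{**page, "links": by_page.get(page["_id"], [])} for page in pages]

pages = [{"_id": 1, "url": "http://example.com"}]
links = [{"page_id": 1, "href": "/about"}, {"page_id": 1, "href": "/contact"}]
joined = app_side_join(pages, links)
```

It works, but you now own the indexing, the memory footprint, and the consistency between the two passes, which is exactly the kind of logic that turns out to be less easy than "meh".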
We went with HBase. Cassandra would have been suitable too, but we already use Hadoop for data processing so it was a natural choice within the infrastructure ecosystem. We will write a followup about that.
Clouderan here! Glad to hear you guys went with HBase; I'm looking forward to your follow-up post. Will you detail your key design / architectural setup?
Did you guys roll your own HBase environment or did you go with the CDH? If you're using the CDH version and have any questions, feel free to shoot an email to cdh-user.
We are using CDH4.2 and have had a very positive experience so far.
Cloudera has in fact been an inspiration for us to follow, you guys have really struck the right balance between open source and commercial support. We follow the same philosophy with Scrapy (an open source web crawling framework), as you do with Hadoop and its ecosystem.
That's really awesome to hear, thanks for your kind words. I'm looking forward to the follow-up post; depending on your key design, you may be able to take advantage of Impala for ad-hoc queries using SQL.
The title of this post is misleading: it makes you think GitHub is actually broken, when the article is really a complaint about how their fork feature works.
I now regret it, since this one got much more attention. I was under the impression that linking to the original post was more welcome here on HN, but it seems that's not always the case.