pablohoffman's comments | Hacker News

Hey folks, I'm the one interviewed, happy to answer any follow-up questions here.


Not surprised, I'm pretty low profile :)


Portia from Scrapinghub is 100% open source: http://scrapinghub.com/portia/


Earlier this year we released a similar open source tool for visual scraping called Portia: https://github.com/scrapinghub/portia

It's been getting quite a bit of traction and we're currently working on the integration with the Scrapinghub platform (disclaimer: I work there) for those who prefer a hosted version.


Not yet.


Yes, you just need to select a different field type ("text", instead of "html").
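Roughly speaking, the "text" type keeps only the visible text while "html" keeps the raw fragment. A quick illustration of the difference using w3lib (the helper library Scrapy ships with); this is just to show the effect, not how Portia implements it:

    from w3lib.html import remove_tags

    fragment = '<p>Price: <b>$9.99</b></p>'   # sample scraped fragment
    print(fragment)               # what an "html" field would keep
    print(remove_tags(fragment))  # roughly what a "text" field yields: 'Price: $9.99'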


We are not just complaining about MongoDB, nor saying it's useless. We are simply explaining why it's a poor choice for one specific use case: storing scraped data.

FWIW, we still use Mongo in other internal applications; it's just not the right choice for our crawl data storage backend.


One issue is that many of these points are design characteristics of MongoDB and should have been known beforehand. I am not criticising, but it's almost like you did zero research.

Transactions, for example, have never existed in MongoDB, and joins don't really make much sense there.


Perhaps they did their research on MongoDB and knew all the limitations, but thought to themselves "meh, I can solve all that in the application code", and eventually found out it wasn't so easy to handle transactions and joins in the code?
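For illustration, a pymongo sketch (database, collection and field names are made up) of what "solving it in the application code" tends to look like: every missing join becomes an extra round trip, and every missing transaction becomes consistency logic you maintain yourself.

    from pymongo import MongoClient

    client = MongoClient()    # assumes a local mongod
    db = client["crawl"]      # hypothetical database

    # Application-level "join": one extra query per page, with no
    # transactional guarantee that jobs and pages stay consistent.
    for page in db.pages.find({"status": 200}):
        job = db.jobs.find_one({"_id": page["job_id"]})
        print(job["spider"], page["url"])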

After all, developers are rather susceptible to the "don't tell me I can't do that" behavior.


What was the evaluation process that led to choosing MongoDB in the first place?

At some point you must have compared it to, say, Postgres, which is what the section before the summary hints at.


We went with HBase. Cassandra would have been suitable too, but we already use Hadoop for data processing, so it was a natural choice within our infrastructure ecosystem. We will write a follow-up post about that.
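As a rough sketch of what the write path can look like from Python, using the happybase client (the Thrift host, table, column family and row-key layout below are made up for illustration, not our actual schema):

    import happybase

    connection = happybase.Connection('hbase-thrift-host')   # Thrift gateway
    table = connection.table('crawled_pages')

    # Row key and columns are illustrative; HBase stores plain bytes.
    table.put(b'example.com/products/1', {
        b'page:url': b'http://example.com/products/1',
        b'page:body': b'<html>...</html>',
    })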


Clouderan here! Glad to hear you guys went with HBase, I'm looking forward to your follow-up post. Will you detail your key design and architectural setup?

Did you guys roll your own HBase environment or did you go with the CDH? If you're using the CDH version and have any questions, feel free to shoot an email to cdh-user.


We are using CDH4.2 and have had a very positive experience so far.

Cloudera has in fact been an inspiration for us to follow; you guys have really struck the right balance between open source and commercial support. We follow the same philosophy with Scrapy (an open source web crawling framework) as you do with Hadoop and its ecosystem.
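For anyone who hasn't seen it, a minimal Scrapy spider is only a few lines; this one just grabs page titles (the URL is a placeholder):

    import scrapy

    class TitleSpider(scrapy.Spider):
        name = 'titles'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # Yield one item per page with the extracted <title> text
            yield {'title': response.css('title::text').get()}

You can run a standalone spider like this with "scrapy runspider title_spider.py -o titles.json" and the items land in a JSON feed.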


That's really awesome to hear, thanks for your kind words. I'm looking forward to the follow-up blog post; depending on your key design, you may be able to take advantage of Impala for ad-hoc queries using SQL.
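If that pans out, a rough sketch of what an ad-hoc query from Python could look like using the impyla client (host, port and table name here are placeholders):

    from impala.dbapi import connect

    conn = connect(host='impalad-host', port=21050)   # any impalad daemon
    cur = conn.cursor()
    cur.execute('SELECT domain, COUNT(*) FROM crawled_pages GROUP BY domain LIMIT 10')
    for row in cur.fetchall():
        print(row)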


The title of this post is misleading: it makes you think GitHub is actually broken, when the article is really a complaint about how their fork feature works.


I initially submitted this post, but then deleted it and submitted the original post on the Common Crawl blog instead: http://news.ycombinator.com/item?id=3208853

I now regret it, since this one got much more attention. I was under the impression that linking to the original post was more welcome here on HN, but it seems that's not always the case.

