LibreOffice has a pretty powerful document conversion, which you can run headless. I'm guessing they are converting to HTML and perhaps other formats -- do they offer anything like that?
Edit: You can invoke it something like this:
soffice --headless --convert-to html file.doc
I'm just speculating, but it seems reasonable that it would open the document just like the regular LibreOffice, fetch external resources and so on.
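If you wanted to script that conversion, here is a minimal sketch in Python. It assumes `soffice` is on the PATH; the function names and the `out` directory are made up for illustration, and nothing here is claimed to be what Dropbox actually runs.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, fmt="pdf", outdir="out"):
    """Return the argv list for a headless LibreOffice conversion."""
    return [
        "soffice",
        "--headless",
        "--convert-to", fmt,
        "--outdir", str(outdir),
        str(src),
    ]


def convert(src, fmt="pdf", outdir="out", timeout=120):
    """Run the conversion (assumes soffice is installed) and return the expected output path."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(src, fmt, outdir), check=True, timeout=timeout)
    # For a plain format name such as "pdf" or "html" the output keeps the input's stem.
    return Path(outdir) / (Path(src).stem + "." + fmt)
```

Something like `convert("file.doc", "pdf")` would then be the batch-job equivalent of opening the document by hand, including whatever external fetching LibreOffice does while loading it.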
It seems like they're converting it to PDF. When you click the .doc file in Dropbox, they open a PDF preview. It fits with the fact that the buzzes I've seen are only for the .doc file, not the HTML or the XLS.
you've already determined that it's running on an ec2 instance, but it's somehow "suspicious" that the user-agent is libreoffice? and you're a "security researcher" but "curious if this is an automated process"? please.
sure, dropbox might owe an explanation (even though you certainly gave them permission to do this in their TOS), and you can call me cynical and jaded, but this seems like pretty shameless FUD that appears to be tied to an effort to shill a new product.
EDIT: first i thought this was written by the HoneyDocs founder. now i'm actually unsure who the author is.
i'm not attacking him as a person, i'm attacking his actions, specifically in this case his shamelessly disingenuous linkbait guerrilla marketing masquerading as a public service announcement
It really does not matter if there is something more serious like a vulnerability in the question. And so what if the link was a covert advertisement, the community is still benefitted if a popular product is put under the scanner. It is a win-win.
except for the part where there is no actual vulnerability, it was actually a design decision[1], and the author of the article was raising ridiculous scaremongering questions that he knew the answers to in order to attract more attention.
> Further digging into the HoneyDocs data reveals a suspicious User Agent, LibreOffice. Now I’m curious if this is still an automated process or one that involves human interaction?
Yes, because humans use LibreOffice over SSH/X11 from an EC2 instance. Probably LibreOffice is being used for the parsing/rendering on a server. Probably for something innocent like generating thumbnails or text-only previews.
They are generating PDF for online viewing. Go to your files on dropbox site and click on a .doc file. A preview popup will appear.
Open/LibreOffice with Python bridge is quite handy in converting documents to PDF format and can be run in headless mode (using virtual frame buffer like xvfb) on a server.
Dropbox uses (used?) Crocodoc to do its document previews, which would be interesting now that Crocodoc has been acquired by Box (a Dropbox competitor). Crocodoc actually ran full Windows VMs to have Word interpret Word, unlike what was speculated elsewhere here (using LibreOffice) - it turns out pretty much everything else sucks pretty badly at rendering Word docs, largely because the format is a bloody nightmare of binary encoded blobs including OLE embeds, etc. My understanding was that these VMs were run on AWS Windows instances, which explains why the document was seen opened on an AWS cluster. I know they had a fun nightmare of a time getting the right licenses from Microsoft to do this.
I disagree. If you must not have your cloud storage provider reading the data you give them, then encrypt it before you upload it. However, if you merely don't want them to, but it's not a big deal if they do, there's nothing wrong with expecting them not to go trawling through your data just because they can.
Am I reading this right? A third-party service that protects you from third-party services? And you have to install it everywhere? And it's not FLOSS? Please tell me I'm reading this wrong.
Edit: Okay I see it's based on FLOSS and that's great, but as far as I can tell they're still asking you to install binary blobs, which makes the whole thing pointless.
I had a little trouble getting it to run on one of my older Mac OSX machines, but I'm pretty sure that was because I had the remains of a previous installation of MacFUSE messing things up -
There's also a MacOSX/iOS/Android/Windows commercial "wrapper" around EncFS which is fully compatible with the compiled-from-source versions of EncFS I've got running on Mac OSX and Linux (ARM and x86) - it's a "binary blob", but if your security/convenience tradeoff lets you consider that, have a look at BoxCryptor Classic:
For me, the tradeoff of having secured/encrypted files available on iOS is worth the decrease in security by relying on Secomba GmbH not backdooring me at the request of the NSA or ASIO (my local security agency) – or anybody further down the security agency or law enforcement foodchain. I'm not actually trying to protect myself against targeted surveillance by any sufficiently powerful nation-state, but I feel good about knowing I'm not quite so readily caught up in "dragnet" surveillance…
Hmmm, I wonder if TrueCrypt adequately secures a hidden volume's existence from an attacker (Dropbox) who can watch the patterns of your block level writes?
From what I understand, and again, I'm not with the company, it's not a cloud service, but rather downloadable / installable software that encrypts prior to storage on the disk.
And yeah, I see what you mean, but if you don't have access to the source, you don't know what they're making you install. I'm a huge FLOSS advocate, but in this specific instance it's more my paranoia talking. I believe I can trust them now, but how many clients will they need to have before the NSA blackmails them?
It's still a step forward, somewhat, but I find it hard to believe that there could be a successful product based on putting the user in full control (which is needed for real security).
The account is needed so you can grant authorization to others to access your files. EncFS is great if you are the only user. If you want to grant others access to your encrypted files, you need an authorization / authentication mechanism, which is why there's an account needed - for permission control.
But if you're the only user / potential accessor of the files, single-user strong file-system based encryption works.
I hate it whenever an article mentions a service or drops an affiliate link and someone's verdict is that the article looks like advertising. Do you prefer your reading content to be devoid of mentioning any products or brands? Should bloggers never make a dime off affiliate links?
Be concerned with the content and only the content. If the article has it, it's legit.
Eu odio iso, cada vez que un artigo menciona un servizo ou cae dun
enlace de afiliado e veredicto de alguén é que o artigo parece
publicidade. Prefire o seu contido de lectura a ser desprovisto de
mencionar os produtos ou marcas? Se bloggers nunca facer un centavo
off ligazóns afiliados?
Estar preocupado co contido e só o contido. O artigo ten iso, é
lexítimo.
If you are curious about other languages, I encourage you to test it out on your own! You can sign-up for a demo account at https://translate-o-rama.io/
That's spam, not useful data. No one upvoted that hypothetical offering of data. People did find this article interesting.
And hey, maybe I am interested in learning that language. Assuming your forum was appropriate (say, at a Galician convention), that could be a useful thing.
If the utility is high enough, then I might not mind. The problem is when the content's utility is insufficiently advanced and it is followed by a plug for a very related product/service that solves the problem identified in the content. That scenario casts suspicion on the content: "Did the author write this only to plug the product/service?". In this instance, the Dropbox info was sufficiently useful (it appeared to be an original discovery) that I don't see it as a huge problem.
> Be concerned with the content and only the content. If the article has it, it's legit.
Wrong. Context is everything. You cannot look at data in a vacuum. You need to look at where, why, when and how - especially when it's sensational; i.e. something that may cause someone to take action.
I never said you could. We're not discussing the philosophy of objective statements that don't require context, we're discussing whether true facts are tainted by subjective elements around them. By definition, they can't be.
If the content is good, it doesn't really matter why it's there. But if you believe the context implies that the data is incomplete (aka, biased) or actually wrong, that's obviously relevant.
Especially if something was an ad, who cares? The data speaks for itself. You just need to verify the ad is correct and isn't mis-representing itself. In this case, the data is fairly objective, they aren't comparing their product to someone else's, just pointing out an action that a product was able to help perform, and that action produced interesting data about another service.
The article didn't come off as an ad, either. My main gripe was that some people complain over any mention of affiliated services. (People also gripe about non-labeled affiliate links by independent-millionaire, popular, respected bloggers.)
I'm not a big fan of the content either. If you're going to imply Dropbox is doing something sneaky then I think you owe them the basic courtesy of a chance to comment or explain before you hit publish.
"Sneaky" or not – it's definitely unexpected behaviour. I certainly didn't expect .doc files I store in a Dropbox folder to magically go and fetch remote resources. Whether it's "explainable" or not (and I guess the "they're generating PDF thumbnails" is a plausible explanation), Dropbox haven't taken any steps to inform users that this happens.
There's little doubt that somebody will think up a way to take advantage of this to "leak" information.
I wonder if this happens before or after Dropbox's dedupe step? I wonder if that provides an avenue to extract useful data?
Content is not modified by the context; a fact is either true or it is not. Everyone has a motive. It reminds me of how people call into question research sponsored by corporations, as if people who work in government-sponsored research are somehow automatically saints with no ulterior motive.
To trust someone based on affiliate links is a quite silly line of deductive reasoning.
From the information provided it seems simple enough to verify, embed an image via URL into a doc file, upload to dropbox, see if the URL is accessed. No need to argue about motive.
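For the "see if the URL is accessed" half, a rough sketch of a request logger, assuming you control a publicly reachable host. The names `pixel_app`, `HITS` and `serve`, and port 8000, are my own placeholders; this is not how HoneyDocs is implemented.

```python
from wsgiref.simple_server import make_server

# A 1x1 GIF returned for every request, so the embedding document renders normally.
PIXEL = (
    b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\xff\xff\xff"
    b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00"
    b"\x00\x02\x02D\x01\x00;"
)

HITS = []  # one (remote address, path, User-Agent) tuple per request


def pixel_app(environ, start_response):
    """WSGI app that records who asked for the image, then serves the pixel."""
    HITS.append((
        environ.get("REMOTE_ADDR", ""),
        environ.get("PATH_INFO", ""),
        environ.get("HTTP_USER_AGENT", ""),
    ))
    start_response("200 OK", [
        ("Content-Type", "image/gif"),
        ("Content-Length", str(len(PIXEL))),
    ])
    return [PIXEL]


def serve(host="0.0.0.0", port=8000):
    """Serve forever; inspect HITS (or add logging) to see what fetched the image."""
    with make_server(host, port, pixel_app) as httpd:
        httpd.serve_forever()
```

A hit arriving shortly after upload, with a server-side User-Agent, would reproduce what the article reports without needing anyone's word for it.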
Certainly context is relevant. Understanding who is saying something and what their motives are helps you to judge how likely facts are to be true, and how much weight you should attach to opinions.
Calling some work into question because of the authors' motives isn't a claim that some other group has no ulterior motives at all. Certainly everyone has some motives, otherwise we would never get out of bed. But some of those motives will change the discussion more than others. E.g. when HP sponsors a study that finds that their own ink works out cheaper than buying remanufactured cartridges, it's perfectly sensible to be more suspicious of that than if a study by a consumer organisation found the same thing.
It does read like an ad for HoneyDocs, but this was actually the first time I've ever heard of it, and upon checking out their website it actually seems like an incredibly useful service. So if it is an ad, they appear to have succeeded.
LibreOffice is commonly used as part of a system to convert and generate previews for MS Office files. I would assume it has something to do with thumbnail generation or preview generation. However, I don't seem to see thumbnails or previews of .doc files (I do for images, for example) on the Dropbox webapp, so maybe it's something they're testing?
That would make sense to me. Especially since he only sees it on .doc files. Probably a thumbnail generator utility that uses LibreOffice plugin. Very interested to find out...
There's a saying that likely applies here:
"When you hear hoofbeats, think of horses not zebras"
I hope that the servers running LibreOffice only have that job. LibreOffice has a pretty massive attack surface and it's not the kind of thing I'd like to leave running on a server with another purpose while accepting documents from pretty much anyone.
The only thing to see here is that Dropbox is potentially opening themselves up to a vulnerability; it would be interesting to see if GET file:///etc/passwd worked...
On the one hand, it seems unlikely that an automated process would trigger external resource retrieval. In the same way, most processes that scan webpages for content or similarities don't run JavaScript, unless they are very sophisticated (this used to be a good way to protect against spam bots, for instance).
On the other hand, given how many files are uploaded to Dropbox every hour, it's inconceivable that a human, whether through deliberate management direction or mischief, is opening all these documents. I would be more concerned about human intervention if, occasionally, a document triggered a buzz some days after it had been uploaded.
If all documents are showing as opened within 10 minutes, then surely it is just an anti-duplication automated agent at work.
The HTML, DOC, and XLS files all have identical structure (though different content). They are all HTML, and Honey Docs is relying on Word/excel's parsing the HTML in those files to fetch the image (a 1px gif).
I downloaded the credit card Honeydoc. The content looks like:
<html>Nicole Davis 4556062729618215<br />
Brian Baker 4556767839126624<br />
Patrick Jones 4916615717158539<br />
....
<br>
<br>
<br>
....
<img src="https://honeydocs.herokuapp.com/img/html/202719bb5717d5621068780180abc593b0fedda692bd63727a510911d21fdcbf.gif">
</html>
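Going by the structure above, a file like this is easy to build yourself. A minimal sketch, with an obviously fake row and a placeholder pixel URL; `build_tracked_doc` is a name I made up, not anything from HoneyDocs.

```python
import html


def build_tracked_doc(rows, pixel_url):
    """Return HTML text that can be saved with a .doc, .xls or .html extension."""
    lines = ["<html>"]
    for name, number in rows:
        lines.append("%s %s<br />" % (html.escape(name), html.escape(number)))
    # The remote image is the only "active" part: whatever renders the file fetches it.
    lines.append('<img src="%s">' % html.escape(pixel_url, quote=True))
    lines.append("</html>")
    return "\n".join(lines)


if __name__ == "__main__":
    text = build_tracked_doc(
        [("Jane Example", "0000000000000000")],   # placeholder data, not a real card
        "https://example.invalid/pixel.gif",        # hypothetical tracking URL
    )
    with open("sample.doc", "w", encoding="utf-8") as fh:
        fh.write(text)
```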
> If all documents are showing as opened within 10 minutes, then surely it is just an anti-duplication automated agent at work.
It could perhaps be from generating a thumbnail... But dedup wouldn't work like that.. I would be very concerned about a dedup algo that requires interpreting the contents of a file, and dedup'ing based on that.
As I read it, the whole point of the "HoneyDoc" concept is that any access to the file generates a GET request. In other words it is specially crafted to ensure external resource retrieval.
Understanding the nature of the DropBox access would start with understanding how a "HoneyDoc" does what it claims it does.
Most dedup algos do compare the data. If you just use the hash, there's a very small chance of a hash collision causing file corruption / data loss. And depending on how common that block is, the corruption could affect a large number of files.
It's a byte comparison, so you still wouldn't use LibreOffice to compare files.
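To make the hash-then-compare point concrete, a toy sketch. `DedupStore` is an in-memory illustration of the general technique only; I have no knowledge of how Dropbox's dedup is built.

```python
import hashlib


class DedupStore:
    """Toy content store: look up by SHA-256, confirm with a byte comparison."""

    def __init__(self):
        self._by_hash = {}  # digest -> list of distinct blobs sharing that digest

    def put(self, data):
        """Store data and return (digest, index) identifying the stored copy."""
        digest = hashlib.sha256(data).hexdigest()
        bucket = self._by_hash.setdefault(digest, [])
        for i, existing in enumerate(bucket):
            if existing == data:  # byte-for-byte check guards against hash collisions
                return digest, i
        bucket.append(data)
        return digest, len(bucket) - 1

    def unique_blobs(self):
        """Number of distinct blobs actually stored."""
        return sum(len(bucket) for bucket in self._by_hash.values())
```

Nothing in this path parses the file, which is the point: dedup never needs to render a document, so it would not explain a fetched image.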
LibreOffice is not necessarily a sign of human involvement in the process, as it comes with a command-line interface to convert documents between various formats. So it could be thumbnail generation as Guillaume86 suggested.
The tracking behavior depends on a tracking pixel which may not always be processed by the client.
For example, with the credit_cards sample, the xls file is actually an HTML file with an img at the end (url linking to https://honeydocs.herokuapp.com/img/xls/...) and a client that only reads the plaintext (there are a boatload of command line utilities that fit the bill) won't fetch the image.
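As a sketch of what such a plaintext-only reader looks like, using Python's standard `html.parser`. The class and function names are mine; the point is just that the `img` URL is noted but never requested.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text; record image URLs without ever fetching them."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.skipped_images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.skipped_images.append(dict(attrs).get("src"))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def read_plaintext(markup):
    """Return (list of text chunks, list of image URLs that were ignored)."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return parser.text, parser.skipped_images
```

A client built this way would never trip the tracker, so the absence of a buzz does not prove the file went unread.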
Dropbox uses crocodoc for MS Office file previews in the browser as html and my guess is crocodoc's tech is based on a custom print driver for LibreOffice that converts it into html.
Coincidentally (or not), I just received the invite to beta test Sync.com [1] today. Seems a Dropbox-clone, for the privacy conscious user. They claim that all files are encrypted, and they don't have access to the keys. The encryption algorithm is still private, but they say they'll open source it soon.
While I like the approach a lot more than Dropbox (that fights to obfuscate its own algorithm), I still don't feel safe. Anyone with access to the server could intercept your keys, and thus have access to your data.
TrueCrypt over some cloud-based solution is still the ideal option, but the lack of support for sparse images makes me hesitant.
EDIT: no affiliation with Sync.com (or Dropbox, for the matter). Just trying to find a decent cloud-based storage solution that fixes the exact problem exposed by the OP.
For further analysis, I would suggest embedding something nasty into a .doc. [1] Seriously, why would Dropbox execute code in arbitrary files? The only reason I can see is some virus scanner heuristic. So then they could spin up a new VM, load the file and diff the VM with a clean one. Or, as others suggested, generate thumbnails; that, together with the 10 minute delay, would imply that they are running remote code on some batch processing machine (where a lot of other files are up for grabs). Either way, it does smell somewhat.
[1] I am not sure how LibreOffice does handle active content and furthermore I am not sure if there is a way to generate a ping back from LibreOffice without some kind of active content embedded. But to me at least, it somewhat implies that Dropbox, or whoever, runs LibreOffice in a not maximally locked down configuration.
When you click on a .doc file in the Dropbox web interface, you get a preview of the file in PDF format. To do this, Dropbox must open and convert the file. LibreOffice is popular for this, as it can be run in a headless API mode, reads a wide range of files and can output PDF format. So this is what happens here.
The wisdom of executing "active" content embedded in such files is of course doubtful and something Dropbox should investigate. But if you want your files to be safe, you should instead use a service that encrypts them client side, which has the downside of losing the web interface that Dropbox offers (as this requires it to be able to access the decrypted files in order to serve them to you).
Posted this reply elsewhere, but SafeMonk encrypts your files before they hit your harddrive and keeps them encrypted in the Dropbox cloud. It's free for personal use: http://www.safemonk.com. Note: this is not my product, just using it after I saw it demo'd at a TechBreakfast.
In retrospect this was a very well done ad for HoneyDocs... I checked out the service and thought it was novel... wouldn't have looked if not for this.
The article is written in such a way that they are saying a lot by playing dumb... so hard to say it's misleading... but I know few security people who'd write something up with this tone.
I would wager that they're opening it in order to generate a thumb or preview, or maybe for search indexing, and libreoffice is a good way to achieve this on linux - particularly if they're only opening it once, as they probably use the hash of the file.
We do exactly this on our eCommerce platform, before wanging stuff into s3 or glacier and just keeping a reference kicking around.
On the other hand, you have just discovered an information disclosure (host IPs) vulnerability in dropbox.
This seems unsafe; if I understand what this person has done, he'd essentially be coercing Dropbox's backend services to open arbitrary links on his behalf. That's a very dangerous capability to expose to adversaries.
to be fair, it's possible that dropbox understands this and has taken steps to sandbox and isolate the process that does this fetching from the rest of their internal infrastructure. if this is done for the purposes of generating thumbnails/online previews, and the .doc includes external resources, what other choice do they have but to fetch it?
The machine isn't the only thing at risk. Given this setup, it seems possible to use dropbox nodes to ddos an external target, just by uploading lots of documents, each containing lots of these links. It doesn't seem like they should be fetching external resources at all.
There are lots of services that generate traffic on your behalf. A very general rule is that you should have to send at least as many bytes as the service does, lest you become a DDOS multiplier.
I don't see a .doc file getting small enough to outsize a HTTP request inside of it, even if you used some funky compression, but I'm willing to hear otherwise.
One question would be if you could upload the document once and then somehow trigger a very tiny edit that causes them to rescan it.
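The multiplier rule is just arithmetic. A back-of-the-envelope sketch; every number in the usage note is hypothetical, and `bytes_per_fetch` is whatever traffic you decide to count per embedded link.

```python
def amplification_factor(upload_bytes, links, bytes_per_fetch, scans=1):
    """Bytes of traffic the service generates per byte the uploader had to send.

    upload_bytes    -- size of the uploaded document
    links           -- number of external resources embedded in it
    bytes_per_fetch -- traffic counted per fetched link (assumption, pick your own)
    scans           -- how many times the service re-opens the same upload
    """
    if upload_bytes <= 0:
        raise ValueError("upload_bytes must be positive")
    return (links * bytes_per_fetch * scans) / upload_bytes
```

With made-up figures, `amplification_factor(20000, 100, 500)` gives 2.5, and anything above 1.0 means the service sends more than the uploader did. A cheap way to trigger rescans would raise `scans` and multiply the result directly.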
We do use LibreOffice to render previews of Office documents for viewing in a browser, and have permitted external resource loading to make those previews as accurate as possible. While this could theoretically be used for DDoS, we haven’t seen any such behavior. However, just to be extra cautious we’ve temporarily disabled external resource loading while we explore alternatives.
As one part of your solution, I recommend restricting the machines that can make outbound requests to a certain pool, and then limit that pool's total bandwidth, throwing an alarm whenever the limit is hit.
It may be that you are big enough that even the limited bandwidth you need for normal operations is enough to take out smaller hosts, so you'd need to measure and monitor to see how well this works.
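One common way to express that kind of pool-wide cap is a token bucket. A minimal single-process sketch; the class name, the `on_limit` alarm hook and the injectable clock are my own choices, and a real pool would need shared state across machines.

```python
import time


class BandwidthBudget:
    """Token bucket: refills at bytes_per_second, holds at most burst_bytes."""

    def __init__(self, bytes_per_second, burst_bytes, on_limit=None, clock=time.monotonic):
        self.rate = bytes_per_second
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.on_limit = on_limit or (lambda needed: None)
        self.clock = clock
        self.last = clock()

    def allow(self, nbytes):
        """Return True and spend budget if nbytes fits; otherwise raise the alarm."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        self.on_limit(nbytes)  # alarm hook: the pool-wide limit was hit
        return False
```

Each outbound fetch would call `allow(expected_bytes)` first, and the alarm callback is where the paging or logging would go.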
Could Dropbox perhaps let me disable this feature? I almost never use the web interface so I wouldn't miss it and I prefer that my documents are not opened after being synched.
again, it's not inconceivable that they understand this as well, and have some sort of rate limiting system in place. do you have a problem with google docs converting your office files?
Wow that is an interesting blog post, with a happy ending to boot (Amazon refunded the $1000+ in bandwidth fees, due to the accidental nature of the usage).
They could not fetch it and have a little blank bit in the thumbnail.
Chances are they're using a library they didn't develop and did not think of the possibility of external resources being loaded.
Edit: The most secure way I can think to handle preview generation is to have a virtual machine firewalled from the internet that previews a single document and is then reverted.
I agree -- It's the same concept but much more efficient.
I'd set up a docker to accept a single HTTP post with the document, and to return the thumbnail. The docker can then be shut down and a new instance spun up to wait for the next document to process.
It might be wasteful to spin up a new docker for each instance, but it's the only way to prevent some exploit in LibreOffice[1] that might leak information somehow. A leak could be as terrible as embedding an entire document in the next thumbnail, or as simple as returning the wrong thumbnail (like from a previous request).
[1] LibreOffice was the user-agent that phoned home in the article.
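A rough sketch of the one-container-per-document idea. Note the swap: instead of an HTTP POST into the container, this uses bind mounts and gives the container no network at all, so external resources simply fail to load. The image name `example/libreoffice-headless` is hypothetical; `--rm` and `--network none` are standard `docker run` options.

```python
import subprocess


def build_sandbox_command(in_dir, out_dir, filename, image="example/libreoffice-headless"):
    """Return argv for a throwaway, network-less container converting one file to PDF."""
    return [
        "docker", "run",
        "--rm",                      # container is deleted as soon as it exits
        "--network", "none",         # no outbound requests, no internal network access
        "-v", "%s:/in:ro" % in_dir,  # input directory mounted read-only
        "-v", "%s:/out" % out_dir,   # only place the container can leave results
        image,
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", "/out", "/in/" + filename,
    ]


def convert_in_sandbox(in_dir, out_dir, filename, timeout=120):
    """Run one conversion; assumes Docker and a suitable image are available."""
    subprocess.run(build_sandbox_command(in_dir, out_dir, filename), check=True, timeout=timeout)
```

Because every document gets a fresh container, state left behind by one conversion cannot show up in the next one's output.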
Brilliant! I wonder what kind of error messages LibreOffice embeds when it can't fetch a resource... could map out the internal network pretty quickly if it has distinct error messages.
Also docx files are zip files which opens the possibility of a zipbomb. I wonder if LibreOffice has protection for zipbombs.
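I don't know what LibreOffice does here, but a pre-check before handing the file over is cheap. A minimal sketch using Python's `zipfile`; the thresholds are arbitrary assumptions, it trusts the sizes declared in the archive headers, and it does not look inside nested archives.

```python
import io
import zipfile


def looks_like_zip_bomb(data, max_total=200 * 1024 * 1024, max_ratio=200):
    """Flag archives whose declared expanded size or compression ratio is extreme."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
    total = sum(info.file_size for info in infos)             # declared uncompressed bytes
    packed = sum(info.compress_size for info in infos) or 1   # avoid division by zero
    return total > max_total or total / packed > max_ratio
```

A defensive converter would still cap the bytes it actually decompresses, since headers can lie.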
That's what I was thinking too. The only thing is that this process is running on an AWS instance, so it would have to be on Dropbox's VPN or something to have any such access. Either serendipitously or intentionally, I hope these boxes don't have any connection to anything sensitive.
Or worse, if as other threads are speculating, libreoffice is being used to generate previews of the docs then an exploit in libreoffice could be used to get access to dropbox's backend
I wonder if you know which storage provider does not do this. As far as I know, each of these storage providers offers previews (including embedded images), so they do need to open arbitrary links.