Gravatars: why publishing your email's hash is not a good idea
The guys at gravatar.com offer a nice service: website owners can automatically associate an avatar with each of their users through the user's email address. Users who register at gravatar.com can change their gravatar, and the change will be visible on all gravatar-enabled websites where they registered with the same email.
The association email -> avatar is done through an MD5 hash function. If you register on a website with firstname.lastname@example.org, the website will compute the hash of your email address (in this case 476c8a979eed603fb855dca149c7af6b) and associate the avatar URL
http://www.gravatar.com/avatar/476c8a979eed603fb855dca149c7af6b?d=identicon
to your profile. All other websites using gravatars will associate the same URL to your profile, because the computation of
md5sum ( email@example.com )
will always yield the same result.
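To make the mapping concrete, here is a minimal Python sketch of how a website might derive the avatar URL. The function names are mine, not part of any library, and the address is a placeholder. It follows Gravatar's documented convention of trimming and lowercasing the address before hashing, which a plain md5sum on the command line does not do for you.

```python
import hashlib


def gravatar_hash(email: str) -> str:
    """Return the hex MD5 digest of the normalised email address."""
    # Gravatar's convention: strip surrounding whitespace, then lowercase.
    normalised = email.strip().lower()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def gravatar_url(email: str, default: str = "identicon") -> str:
    """Build an avatar URL of the same shape as the one shown above."""
    return f"http://www.gravatar.com/avatar/{gravatar_hash(email)}?d={default}"


if __name__ == "__main__":
    # Placeholder address, used only to show the shape of the result.
    print(gravatar_url("someone@example.com"))
```

Because the function is deterministic, any two websites calling it with the same address end up with the same URL, which is exactly what makes the service work and what makes the token worth a closer look.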
If you register at gravatar.com with the same email, you can change the image associated with your md5, so the various online communities you take part in will show the same picture next to your brilliant posts or comments, and you'll present a consistent face to your online fans.
There is a piece of information which must be made public, though: the 32-character string which serves as a token for your web browser to retrieve the right image. How much information are we leaking to the bad people inhabiting the internet? Can that key be used to retrieve our email?
The email's hash gives a quick way to check whether a certain email is the one associated to your profile. Given a list of emails, you can check whether the user has registered with one of them.
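As a rough illustration of that check, the following Python sketch tests a list of candidate addresses against a published hash. The names md5_hex and find_email are hypothetical helpers of mine, and the lowercasing step assumes the website normalised the address the way Gravatar recommends.

```python
import hashlib
from typing import Iterable, Optional


def md5_hex(text: str) -> str:
    """Hex MD5 digest of a string, like `echo -n text | md5sum`."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_email(target_hash: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate whose MD5 equals target_hash, else None."""
    target = target_hash.lower()
    for email in candidates:
        # Assumption: the site hashed the trimmed, lowercased address.
        if md5_hex(email.strip().lower()) == target:
            return email
    return None
```

Each test costs a single MD5 computation, so even a long list of guesses is checked almost instantly.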
Say that you would like to play a trick on user Michael Smith, who got on your nerves because of an Emacs vs. Vi thread gone too far. You see that his gravatar is
http://www.gravatar.com/avatar/e57f4aa121ea7a10d5fcfb492dbcf0de?s=32&d=identicon&r=PG
so he must have registered with an email having md5 equal to e57f4aa121ea7a10d5fcfb492dbcf0de.
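Reading the hash off the URL by eye is fine for one user. For many users, a small helper like the Python sketch below could do it. The function name is hypothetical, and I am assuming the token is always the last path segment, possibly followed by a file extension.

```python
import re
from urllib.parse import urlparse

# An MD5 digest written in hex is exactly 32 characters from 0-9 and a-f.
HASH_RE = re.compile(r"^[0-9a-f]{32}$")


def hash_from_avatar_url(url: str):
    """Return the 32-character hex token from an avatar URL, or None."""
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    # Drop an optional extension such as ".jpg" and normalise the case.
    token = last_segment.split(".", 1)[0].lower()
    return token if HASH_RE.match(token) else None


if __name__ == "__main__":
    url = ("http://www.gravatar.com/avatar/"
           "e57f4aa121ea7a10d5fcfb492dbcf0de?s=32&d=identicon&r=PG")
    print(hash_from_avatar_url(url))
```

The query string (size, default image, rating) plays no role in identifying the user, so it is simply ignored.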
What could his email be?
Let's start with famousprovider.com addresses, since we know that provider has a 40% share of the email namespace at the moment. Let's check whether he registered as firstname.lastname@example.org
echo -n email@example.com | md5sum
fbd4372942c7844add4b2372ada95ec0  -
No luck. The md5 is different, so the email must be different.
Maybe firstname.lastname@example.org?
echo -n email@example.com | md5sum
bdc76f7d9c4c50de1426fd3465313d30  -
No luck again. Let's go on with other combinations, and let's also check lessfamousprovider.net and alternativeprovider.org, which are the other big players. And let's try all combinations of full name and reasonable separators: msmith, michael_smith, michael.smith, michael-smith, michaelsmith, with some help from a for loop and grep.
After some attempts, we find that the md5 of firstname.lastname@example.org is the one we were looking for. Bingo! Expect a bunch of emails with vi fan testimonials, Mr. Emacs-fanboy!!!
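The for loop and grep can be replaced by a few lines of Python. This is only a sketch of the idea: the three domains are the placeholder providers used above, the separator list is the one from the text, and the function names are my own.

```python
import hashlib
from itertools import product

# Placeholder providers taken from the text above, not real services.
DOMAINS = ("famousprovider.com", "lessfamousprovider.net", "alternativeprovider.org")


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def name_candidates(first, last, domains=DOMAINS, separators=("", ".", "_", "-")):
    """Yield plausible addresses for a first name and a last name."""
    first, last = first.lower(), last.lower()
    # michaelsmith, michael.smith, michael_smith, michael-smith
    local_parts = [first + sep + last for sep in separators]
    # msmith: initial plus last name
    local_parts.append(first[0] + last)
    for local, domain in product(local_parts, domains):
        yield f"{local}@{domain}"


def guess_email(target_hash, first, last, domains=DOMAINS):
    """Return the first generated address matching target_hash, else None."""
    target = target_hash.lower()
    return next(
        (e for e in name_candidates(first, last, domains) if md5_hex(e) == target),
        None,
    )
```

With five local parts and three domains there are only fifteen hashes to compute per user, which is why this kind of guessing is so cheap.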
Was this just a hypothetical lucky shot, or can the leak be exploited in a real-world scenario?
The real world: stackoverflow.com
The approach I have just described is easy to automate, with a little programming-fu. What we need is some test data.
stackoverflow.com is a nice forum for programming tips and nicely enough it uses gravatars and has an easily scrapable list of users from which you can extract the associated data. It's also a place where the problem of gravatar security was previously discussed, with some insightful posts for instance in this thread. It's going to be an interesting testbed.
From the usernames we can build a list of possible user-parts in the email address, as done before. For names made of a few parts, we try possible combinations of upper/lower case and so on. For the domain name, we choose a handful of the most common email providers.
Running my program on a list of 80871 users, I was able to extract 8597 email addresses, each associated to its user. This means that for a bit more than 10% of the users, the username and the gravatar URL are enough to deduce the email address they used to register on the website. This requires about one hour of running time with a Haskell program generating email combinations and computing the md5sums. There is still room for a few additional combinations, but I suspect I am already well into diminishing returns: any additional complexity is going to detect only a few more email addresses, if any.
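For readers who want to see the shape of such a program, here is a small Python sketch of the username-driven search. It is not the Haskell program mentioned above. The splitting rule, the case variants, the limit of three name parts and all function names are assumptions of mine, chosen only to illustrate the idea.

```python
import hashlib
import re
from itertools import product

SEPARATORS = ("", ".", "_", "-")


def local_parts(username, max_parts=3):
    """Build candidate local parts from a display name such as 'Jane Doe'."""
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", username) if p]
    if not parts or len(parts) > max_parts:
        return set()
    results = set()
    for sep in SEPARATORS:
        joined = sep.join(parts)
        results.add(joined)                                     # as written
        results.add(joined.lower())                             # all lowercase
        results.add(sep.join(p.capitalize() for p in parts))    # Capitalised
    return results


def recover_emails(users, domains):
    """users maps username -> gravatar hash; return username -> matching email."""
    found = {}
    for username, target in users.items():
        for local, domain in product(sorted(local_parts(username)), domains):
            email = f"{local}@{domain}"
            if hashlib.md5(email.encode("utf-8")).hexdigest() == target:
                found[username] = email
                break
    return found
```

The work grows linearly with the number of users times the number of variants times the number of domains, so adding more variants quickly costs more than it returns.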
This attack is effective if you can deduce a limited set of emails which could belong to the users, but useless otherwise. If a user named paul registers as email@example.com then this approach doesn't work.
Another option is the use of rainbow tables, or more precisely of a precomputed lookup table. If you have or can generate a list of email addresses and compute the corresponding md5 hashes, you can look for matches in your list of gravatars. For example, you can generate lots of realistic-looking email addresses from a list of names, family names and email domains. Any match will associate one email address from our list to a username.
I have tested this with a list of first name + family name combinations built from 18,000 frequent US family names. Out of the 80871 stackoverflow users, there were 869 matches, a bit above 1%. Using the information from the username itself, as done previously, yields much better results.
The rainbow table approach is interesting if you have the email addresses of people who are particularly likely to have subscribed to a site, or if you want to check whether any users belong to a certain limited group of emails (for instance, the people in your address book).
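The address book case can be sketched in a few lines of Python. Strictly speaking this is a plain precomputed lookup table (a dictionary from hash to email) rather than a true rainbow table, which trades time for memory using hash chains. The function names and the sample data are hypothetical.

```python
import hashlib


def build_table(emails):
    """Map the hex MD5 of each (trimmed, lowercased) address to the address."""
    table = {}
    for email in emails:
        digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
        table[digest] = email
    return table


def match_users(table, user_hashes):
    """user_hashes maps username -> hash taken from the avatar URL.

    Returns username -> email for every hash present in the table.
    """
    return {user: table[h] for user, h in user_hashes.items() if h in table}


if __name__ == "__main__":
    # Made-up address book and scraped hashes, for illustration only.
    address_book = ["alice@example.com", "bob@example.com"]
    scraped = {"wonderland": hashlib.md5(b"alice@example.com").hexdigest()}
    print(match_users(build_table(address_book), scraped))
```

The table is built once and each lookup is a constant-time dictionary access, so the same table can be reused against any number of gravatar-enabled sites.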
A line of defense
A possible line of defense against this type of attack is storing the images locally (which was suggested in a thread on stackoverflow). As a website owner wanting to use gravatars, you download them once when the user subscribes and serve them from a URL that does not give away additional information (e.g. /avatar/username.png). To keep avatars up to date, you should refresh your images every now and then, in case some of your users change their associated images on gravatar.com.
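A website could implement this roughly as in the Python sketch below. Everything here is an assumption for illustration: the function names, the directory layout, and the idea of passing the download step as a parameter so it can be swapped out. The image bytes are stored as returned, and the .png name simply follows the example path above.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def fetch_from_gravatar(email: str) -> bytes:
    """Download the current avatar bytes for an email address."""
    digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
    url = f"https://www.gravatar.com/avatar/{digest}?d=identicon"
    with urlopen(url, timeout=10) as response:
        return response.read()


def cache_avatar(username, email, avatar_dir, fetch=fetch_from_gravatar):
    """Store the avatar locally and return a URL that does not contain the hash."""
    avatar_dir = Path(avatar_dir)
    avatar_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: username has already been validated as a safe file name.
    (avatar_dir / f"{username}.png").write_bytes(fetch(email))
    # Only the username appears in the public URL, never the email hash.
    return f"/avatar/{username}.png"
```

Calling cache_avatar again for each user from a periodic job would cover the refresh step. The hash is then only ever sent from your server to gravatar.com, not shown to your visitors.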
Even this defense is not bullet-proof, because it would be possible to replicate the previous attacks by checking for matches not on the md5 values but on the images. This would require downloading an image from gravatar.com for each md5 computation, which would be much slower and might raise alerts.
Overall, gravatars seem to be one of the many web services where one gives up some information and privacy for being part of a community. At least, be aware of what you are giving up.