Password Recovery – Part 1 – Theory

Have you ever thought what happens when you type your password in a login form? What happens to this top secret data? How is it protected?

Note: if you are reading this to get a decent understanding of how you should store passwords, read additional resources at the end. I am not a cryptography expert and the aim of this post is to give you a general high-level awareness of several issues and introduce you to practical part. If you can’t tell what is wrong with wacky solutions like: md5(sha1(md5(md5(password) + sha1(password)) + md5(password)))
you should read everything I link here and consult a security expert.

Problem

Let’s break the issue down to see what are the dangers and how can we mitigate those threats.

The user gives us some identifier only they know (in theory) that proves that they are who they claim to be. We take it and compare to what the user has previously told us.

This secret is something extremely valuable. So valuable that we tell the user not to write it down anywhere. If they can’t do it, we shouldn’t either, since all the hassle would be pointless if our infrastructure got attacked. Especially that it could get attacked from the inside: by a worker who has access to the database. What can we do to store it in a way that no one is able to retrieve it?

Hashing

Thank God we have magicians mathematics who invented something called hashing. Hashing is a process of converting data into some nearly unique representation. The trick is that this process is one-way. Once you get a hash, you can never tell what was the input (cases when this is not true will be discussed later and practical attack examples will be shown in next part).

Picture showing one-way hashing process

Some extremely simple hashing algorithm could assign a number to each letter and return a modulo division result. For example assign “a” value 1, “b” value 2 and so on and then give modulo 10. So password abc123 hash would be counted as:

a + b + c + 1 + 2 + 3 mod 10 = 1 + 2 + 3 + 1 + 2 + 3 mod 10 = 12 mod 10 = 2

Cool. We can take input from user, save it in a secure way and reapply this hashing every time we get the password in order to compare it with what we were provided. There are some limitations to this:

  • the representation needs to be fixed-length in order to make this process irreversible.
  • This poses a threat of having one representation for 2 different data sets. This exists and is called collision.
  • Two very similar data sets need to have totally different representation – the attacker shouldn’t be able to tell how close they are to guessing initial data. String “password1” and “password2” need to produce output that looks like some random data (BTW this means that a good hashing function could be used as a good random data generator).

If you consider the algorithm proposed above you’ll notice that while it delivers results that cannot be reversed (you can’t tell whether 2 was obtained by summing letters up to 2, 12, 22 or something else), and the output has a fixed-length, collisions are extremely easy to obtain. In real life more complex algorithms are used.

Let’s take a look at how this can be done. We apply hashing function called sha1 to strings “password1” and “password2”.

$ echo -n "password1" | shasum -a 1 e38ad214943daad1d64c102faec29de4afe9da3d  -
$ echo -n "password2" | shasum -a 1
2aa60a8ff7fcd473d321e0146afd9e26df395147  -

Looks like we have solved the problem. Similar entries give much different output, length is quite big, which means that we have a lot of different outputs (sha1 can be 160 bits long which gives you 1461501637330902918203684832716283019655932542976) different hashes, so collision chances are low etc. Great improvement when you compare this solution to the initially proposed solution.

Hashing issues

As you can guess there are other problems that did not show up, but that exist. Simple data can be turned into hashes in a simple way. How simply? Got to crackstation and enter one of the hashes above. You’ll get the input in less than a second.

It’s not because SHA-1 algorithm is faulty (it is but its faultiness is beyond this post and my knowledge). It’s because the “password1” string has been guessed so many times someone has already saved the input and is able to repeat it quickly every time they come across it. Moreover, counting the output is fast. It’s OK when we want to ensure that a file we download from the internet has not been changed and we want to check a huge file quickly. But when it comes to password cracking, it’s a major drawback.

Slow hashes

In order to slow down the process of counting the hash we use slow hashes.

openwall.info

Slow hashes, on the other hand, have different design goals. They are expected to be copied and subsequently attacked by crackers. Thus, they are designed to be inefficient and more difficult to calculate. Some examples of these slow hashes are bcrypt, PBKDF2 and scrypt. Also, slow hashes are not as widely available and not as simple to implement as fast hashes.

The idea is simple: make calculating so inefficient that the attacker can check a finite, low amount of hashes in a given time. So inefficient that going from “password1” to “password117648” will take so long that the data will already become useless.

In next part I will show you how much difference you can get by using different types of hashes by simulating a real attack. For now just remember this: for storing passwords use slow hashes.

Salting

Even with slow hashes we still have a problem of user giving same, weak password for different services. Let’s imagine that someone has “password1” representation for scrypt/bcrypt/whatever. Now every time they see the same function has been used (which can be guessed based on several aspects) and see output, they can still tell what was the input. That’s where salting comes in.

Salting is a process of adding a random data to user input which makes same passwords have different hashes. If two users set their password to “password1” and in first case we add “123” and in second case “321” to this input, we get different hashes and therefore knowing representation of “password1” is useless. The attacker will have to calculate results on their own for each input separately using slow hash function which will make this process even more inefficient.

Cool, but what about practice?

Seems like we have arrived at 99,9% bulletproof way of storing passwords. Slow hashing + process of salting should give us proper protection against cracking passwords. There is of course more to be done to tackle the remaining 0,1% of issues, but for this I recommend going through additional resources section. In next part I will simulate an attack on various ways of storing passwords.

Additional resources:

Since I am not a cryptographer, I wouldn’t feel this article would be complete without links to articles and resources created by people smarter than me: