Introduction to Hash Functions

Introduction to Hash Functions

July 23, 2018 Off By kex

Hashing algorithms are at the core of many technologies, password security (use a salt too!), blockchain, peer to peer networking (torrents), verifying downloads and comparing files on a filesystem.

Hashes

‘Hashing’ in extremely simple terms means we take an input, run some computation against it and return an output that is representative of the input. Whilest this sounds like any function type logic in any system out there hashes have some important characteristics that make them useful in many ways, before we dive in let’s run a simple MD5 hash against a value using this web tool:

Input: China

Output: ae54a5c026f31ada088992587d92cb3a

Ok great, so we’ve fed in the value ‘China’ and it’s returned what looks like total junk, let’s try running the same hash using a different web tool:

Input: China

Output: ae54a5c026f31ada088992587d92cb3a

Both tools returned the same output value. This is because both tools use the MD5 hashing algorithm, which will always return the same output provided the input is the same. Let’s try running a bigger, different string through either one of the web tools mentioned above:

Input: Pluto is not a planet

Output: 47194ebfbfd656d58e09310c45b75063

The input is longer but the output is the same length as ‘China’ and is also totally different, this is because hash outputs always have a fixed length depending upon the aloryhtm used. In this specific example our hash output will always be 32 characters in length, this is because we are using MD5 which is a 128 bit algorithm – each character represents 4 bits which leads to 32 total characters. Let’s try changing the input very slightly and running it through the same tool:

Input: pluto is not a planet

Output: 5301278b1c2977caf9b5e02da78298ed

As you can see the output is totally changed by any kind of change to the input. Let’s take these characteristics and give an example as to how they could be useful in a more practical way:

I go to website A to download a file, website A uses an external provider – website B to host their files which they do not have control over. When I download the file from website A it will present me with the hash value of the file, and forward me to website B to download the file. After I’ve downloaded the file from website B I can run a hash function against the file to see if the hash is what website A told us it should be, if it is not the same then I know that the content hosted on website B is not the content I was expecting.

So in summary, hashing functions:

  1. Are always fixed in length according to the bit value of the algorithm
  2. Are deterministic representations of input
  3. Any change in input content results in a totally different output value

Hash Colisions

As you saw in the previous example – a hash value output fixes some fixed characteristics such as length. This can be a problem, as if we are representing an input value of a length that can be larger than the hash output we may run into a situation where two values share the same hash, this is called a hash collision.

To show this example in action, here are two hex strings I stole from a stack overflow post that will produce a collision:
4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa200a8284bf36e8e4b55b35f427593d849676da0d1555d8360fb5f07fea2
4dc968ff0ee35c209572d4777b721587d36fa7b21bdc56b74a3dc0783e7b9518afbfa202a8284bf36e8e4b55b35f427593d849676da0d1d55d8360fb5f07fea2

When we paste this into one of our MD5 generators we will get:

4ef3d1e2b3356aadd0170d604b32865c

So two different inputs produced the same output, the likelihood of a collision occuring is very, very low but it is something that we have to take into consideration when using hashing algorithms. Generally the tradeoff is that the higher the bit value, the more resource intensive the algorithm will be. Many hashing algorithms are insecure, context determines how important this is. For example, if you are downloading a file and just need to compare hashes it is usually fine to use something like MD5, however you would never use something like MD5 to hash a password or an SSL certificate.

Security

To contextualise security we will take the use case of storing passwords in a database. Let’s say we have a table called ‘passwords’, you wouldn’t want to store plaintext passwords in there as anyone who had access to the database would be able to get access to the user data (people who do this should be shot). Instead we could look at hashing the value, then when the user logs in the value from the database is passed to an in memory process which determines if the hash is representative of the users password and authenticates them, anyone logging into the database cannot see the users password, only the hash of the password.

On the topic of passwords and hashes, it is not enough to use a secure algorithm. The most common way to crack password hashes is to use a ‘rainbow tables’ attack. This is easily explained by a use case, let’s say someone hacks into your database and steals the data contained in your hashed ‘password’ table, the attacker would then use a table of pre-computed hash values for common passwords (there’s always some people using something dumb like ‘password123!’) and run a script that will cycle through both databases until his hash value matches one of the hash values in your database. If the attacker finds a collision then he can just find the hash in your database, he then has your users password.

The issue with securing passwords this way is that you’re relying on your users to make strong passwords that are not vulnerable to a rainbow tables attack. This is a terrible idea as no password policy will protection you from the laziness of users. To get around this vulnerability we can use a ‘salt’, this is a value that is added to the password value before it is hashed, and is referenced at another point in the authentication process. It doesn’t matter if the attacker knows the salt, provided that the salt is different for each users. This means that in order to check your passwords against his ranbow table, the hacker will have to also correctly assign the correct salt to each users password hash. This is exponentially more challenging to handle and (should be) the industry standard way of securing user passwords.

This doesn’t mean you shouldn’t use a strong algorithm to hash values. You should always check Google to find out which is currently the best option for you. It is hard to say if there is specifically a ‘best’ algorithm as each have their own strengths and weaknesses. For example, many organisations now use SHA-256 which is near impossible to crack, however it is very cheap to attack SHA-256 as it is not computationally intensive.