Introduction to Hash Functions
Hashing algorithms are at the core of many technologies, password security (use a salt too!), blockchain, peer to peer networking (torrents), verifying downloads and comparing files on a filesystem.
‘Hashing’ in extremely simple terms means we take an input, run some computation against it and return an output that is representative of the input. Whilest this sounds like any function type logic in any system out there hashes have some important characteristics that make them useful in many ways, before we dive in let’s run a simple MD5 hash against a value using this web tool:
Ok great, so we’ve fed in the value ‘China’ and it’s returned what looks like total junk, let’s try running the same hash using a different web tool:
Both tools returned the same output value. This is because both tools use the MD5 hashing algorithm, which will always return the same output provided the input is the same. Let’s try running a bigger, different string through either one of the web tools mentioned above:
Input: Pluto is not a planet
The input is longer but the output is the same length as ‘China’ and is also totally different, this is because hash outputs always have a fixed length depending upon the aloryhtm used. In this specific example our hash output will always be 32 characters in length, this is because we are using MD5 which is a 128 bit algorithm – each character represents 4 bits which leads to 32 total characters. Let’s try changing the input very slightly and running it through the same tool:
Input: pluto is not a planet
As you can see the output is totally changed by any kind of change to the input. Let’s take these characteristics and give an example as to how they could be useful in a more practical way:
I go to website A to download a file, website A uses an external provider – website B to host their files which they do not have control over. When I download the file from website A it will present me with the hash value of the file, and forward me to website B to download the file. After I’ve downloaded the file from website B I can run a hash function against the file to see if the hash is what website A told us it should be, if it is not the same then I know that the content hosted on website B is not the content I was expecting.
So in summary, hashing functions:
- Are always fixed in length according to the bit value of the algorithm
- Are deterministic representations of input
- Any change in input content results in a totally different output value
As you saw in the previous example – a hash value output fixes some fixed characteristics such as length. This can be a problem, as if we are representing an input value of a length that can be larger than the hash output we may run into a situation where two values share the same hash, this is called a hash collision.
To show this example in action, here are two hex strings I stole from a stack overflow post that will produce a collision:
When we paste this into one of our MD5 generators we will get:
So two different inputs produced the same output, the likelihood of a collision occuring is very, very low but it is something that we have to take into consideration when using hashing algorithms. Generally the tradeoff is that the higher the bit value, the more resource intensive the algorithm will be. Many hashing algorithms are insecure, context determines how important this is. For example, if you are downloading a file and just need to compare hashes it is usually fine to use something like MD5, however you would never use something like MD5 to hash a password or an SSL certificate.
To contextualise security we will take the use case of storing passwords in a database. Let’s say we have a table called ‘passwords’, you wouldn’t want to store plaintext passwords in there as anyone who had access to the database would be able to get access to the user data (people who do this should be shot). Instead we could look at hashing the value, then when the user logs in the value from the database is passed to an in memory process which determines if the hash is representative of the users password and authenticates them, anyone logging into the database cannot see the users password, only the hash of the password.
On the topic of passwords and hashes, it is not enough to use a secure algorithm. The most common way to crack password hashes is to use a ‘rainbow tables’ attack. This is easily explained by a use case, let’s say someone hacks into your database and steals the data contained in your hashed ‘password’ table, the attacker would then use a table of pre-computed hash values for common passwords (there’s always some people using something dumb like ‘password123!’) and run a script that will cycle through both databases until his hash value matches one of the hash values in your database. If the attacker finds a collision then he can just find the hash in your database, he then has your users password.
The issue with securing passwords this way is that you’re relying on your users to make strong passwords that are not vulnerable to a rainbow tables attack. This is a terrible idea as no password policy will protection you from the laziness of users. To get around this vulnerability we can use a ‘salt’, this is a value that is added to the password value before it is hashed, and is referenced at another point in the authentication process. It doesn’t matter if the attacker knows the salt, provided that the salt is different for each users. This means that in order to check your passwords against his ranbow table, the hacker will have to also correctly assign the correct salt to each users password hash. This is exponentially more challenging to handle and (should be) the industry standard way of securing user passwords.
This doesn’t mean you shouldn’t use a strong algorithm to hash values. You should always check Google to find out which is currently the best option for you. It is hard to say if there is specifically a ‘best’ algorithm as each have their own strengths and weaknesses. For example, many organisations now use SHA-256 which is near impossible to crack, however it is very cheap to attack SHA-256 as it is not computationally intensive.