# HMAC Timing Attacks

Alex Egg,

In order to test for the vulnerability, I will implemented a simple test server using Ruby. It verifies an HMAC and sends back “yes” or “no” if the value is correct. I will then write a client that connects to the server and tries various values for the HMAC. It tries each value multiple times and records a set of time differences for each. These can be fed to a program for statistical analysis to decide when a significant difference has been confirmed (based on mean and standard deviation).

The industry standard method to establish a user as “logged in” across sessions and requests is to use a “session cookie”. This is a file stored on the client (web browser). The server will usually store a user id in the session cookie. Then, when that session cookie is present – that means the user is effectively ”logged in”. The cookie is not stored in plain text, otherwise, one could simply type a different user name and be signed in as that person. For this reason, they are HMAC’d. Examples of session cookies from various websites:

• https://blackboard.sdsu.edu: C81D88E7DC7D77C06469DCD6CAC512F4
• http://github.com: BAh7CzoJdXNlcmkCYgQ6DGNvbnRleHQiBi86EF9jc3JmX3Rva2VuIjFlNUZsU09pbXhSVEYxaGZxbFRuRUNhRXV PcGR0S3Vvck3d2Z2Vm1JQW4wPToQZmluZ2VycHJpbnQiJWYzMTFlYmJjZGZmYzIyNzdlYjkxMjUzZWZkNTU3N2E2 IgpmbGFzaElDOidBY3Rpb25Db250cm9sbGVyOjpGbGFzaDo6Rmxhc2hIYXNoewAGOgpAdXNlZHsAOg9zZXNzaW9u X2lkIiU5NDMwZjRlODlmZWExNGVlZWZmZDZkNzMzNmJkMDYyZQ
• http://webportal.sdsu.edu: 1578565192|116813
• most PHP apps: kvl4tv9agg53p0p08ohhsd9h53
• most rails apps: BAh7CkkiD3Nlc3Npb25faWQGOgZFRkkiJTMyMGU1MWIwZWFiNTAzYTc0NGRkNjNlNDA2OTI4ZjkzBjsAVEkiEF9jc3Jm X3Rva2VuBjsARkkiMW1jczhoTjZGMWo5Skg2UDI4Y1RGbzB5ZUlFdFZwYTdxRndVaUx3dGdTYVk9BjsARkkiCm9hdXRo BjsARnsGSSIMdHdpdHRlcgY7AEZ7BkkiF2NhbGxiYWNrX2NvbmZpcm1lZAY7AEZUSSIMdXNlcl9pZAY7AEZpBkkiCmZs YXNoBjsARm86JUFjdGlvbkRpc3BhdGNoOjpGbGFzaDo6Rmxhc2hIYXNoCToKQHVzZWRvOghTZXQGOgpAaGFzaHsGOgtu b3RpY2VUOgxAY2xvc2VkRjoNQGZsYXNoZXN7BjsKSSIPU2lnbmVkIGluIQY7AEY6CUBub3cw
• Ing Direct Bank: 2A635FB6E6307E058498894EBA0B2AE4

Using the timing attack described above it is possible to spoof a valid HMAC, such as one from the above websites.

### Background/Experiment Setup

My test server HMAC is 40 characters (similar to BlackBoard or ing Direct Bank):

f5acdffbf0bb39b2cdf59ccc19625015b33f55fe


I have also setup a client app that can query the web app. When the client sends a challenge to the server (logs in) it will compare that character-by-character with the server’s version. This of course leaks timing information.

Starting with a challenge of:

0000000000000000000000000000000000000000


and iterating that to:

1000000000000000000000000000000000000000
2000000000000000000000000000000000000000
.
.
.
e000000000000000000000000000000000000000
f0000000000000000000000000000000000000000


and recording the amount of time (milliseconds) each request takes to return. If we repeat this n amount of times (say 50) we can look at the mean or median time to see which character takes the longest to return. If the mean/median response time for ‘f000….0” is the largest, it can be deducted with degree of certainty that, ‘f’, is the first character of the HMAC.

Then just repeat this process for each character, the next being “fX000….0”. Assume this turns out to be ‘5’, then repeat the process for “f5X000….0”. Until you have all 40 characters.
Experiment

My client app will start guessing all possible values for the first character: 0-f (hex) or 16 possible values. Assuming the server leaks timing information, theoretically one of the 16 requests should take longer than the others. See timing sample:

0 0.005450913
1 0.005829198
2 0.004905407
3 0.005286876
4 0.005597611
5 0.004814430
6 0.004969118
7 0.005335884
8 0.004433182
9 0.004440246
a 0.004860263
b 0.004561121
c 0.004463188
d 0.004406799
e 0.004978907
f 0.004887240


Figure 1: Timing results (in seconds) for first pass of test – notice there is no large delta between the actual value ‘f’ and the rest of the sample.

There seems to not be any apparent winner, all the results look very close. In other words there is a lot of noise hiding the signal. For this reason it is necessary to take multiple samples (scale the test) and use statistical tools to filter out the signal from the noise. We must scale the test by an arbitrary constant in order to separate the signal from the noise. I’ve found, through experimentation, that 500 is a good number. In other words: run the test in figure 1, 500 times, recording the results for each of the 500 trials.

Below in figure 2 is an example of the response time distribution over 500 samples. The x axis is the first character of the hash 0 - F (in decimal) and the y axis is the response times in seconds. Each column in the scatterplot has 500 ‘x’.s You’ll notice the obvious shift up (slower response) time for the 15 (F) column.

Figure 2: Shows the response times of 500 samples for each character 0-F representing the first character of the hash. Notice the highlighted portion has a significant shift up (slower response time).

I would then take the average response time of each character which would determine the winner. This is the intuitive/obvious solution, however, sometimes this would not show a significant shift (as in figure 2) unless I ran it at very high scale constant. This method worked fine for the human eye because at some point the scale constant would produce the above shift but it is hard to automate. After all, the goal is to have my client app compute the whole hash automaticly character-by-character.

In summary, this method was too deterministic and hard to automate .I found a better approach to filter the signal from the noise was a mean of means approach.

* For a technical discussion on why the webserver leaks timing info, see Appendix A.*

## Statistical Analysis/Algorithm

My algorithm makes 16 requests to the server and records the response time of each request and ranks them 1-16 where 1 is the longest (slowest) request and 16 is the shortest (fastest request). Theoretically, the first place winner is the correct character because it took longer to respond. So, in summary: we make 16 requests, 0-f, of the first character of the hash, 500 times and record the rank of each test. This can be visualized by figure 3:

{
"0"=>[7, 1, 3, 3, 15, 5, 4, 9, 15, 10, 13, 2, 14, 9, 4, 14, 7, 9, 15, 2, 14, 9, 14, 6, 11...],
"1"=>[13, 4, 7, 11, 0, 4, 0, 2, 14, 11, 6, 7, 2, 2, 14, 11, 8, 10, 5, 13, 11, 7, 4, 9, 3...],
"2"=>[14, 5, 15, 5, 1, 0, 3, 1, 9, 12, 4, 4, 1, 1, 8, 6, 9, 4, 9, 5, 8, 3, 12, 8, 5...],
"3"=>[15, 2, 9, 7, 2, 1, 14, 11, 7, 8, 8, 1, 4, 7, 12, 15, 13, 0, 4, 1, 7, 0, 3, 0, 0...],
"4"=>[12, 10, 14, 15, 8, 9, 10, 12, 10, 4, 1, 13, 15, 15, 3, 1, 6, 8, 2, 6, 15, 4, 0, 3, 2...],
"5"=>[5, 13, 13, 12, 7, 8, 13, 14, 3, 13, 2, 12, 7, 14, 2, 10, 12, 5, 8, 0, 4, 10, 5, 10, 12...]
"6"=>[0, 15, 11, 13, 5, 15, 8, 8, 4, 7, 12, 9, 10, 11, 11, 7, 0, 6, 0, 9, 2, 6, 15, 13, 14...]
"7"=>[1, 9, 0, 10, 6, 6, 2, 4, 12, 9, 5, 10, 5, 10, 7, 2, 4, 14, 6, 7, 13, 11, 6, 12, 4...],
"8"=>[4, 0, 2, 1, 9, 11, 12, 13, 11, 14, 0, 15, 9, 0, 0, 13, 11, 13, 1, 8, 6, 5, 11, 15, 7...],
"9"=>[11, 11, 10, 4, 13, 7, 6, 3, 2, 2, 14, 5, 3, 3, 15, 9, 14, 7, 10, 3, 0, 14, 1, 5, 15...],
"a"=>[8, 3, 6, 14, 10, 2, 7, 5, 1, 3, 3, 0, 0, 6, 10, 12, 15, 12, 12, 15, 9, 13, 13, 11, 9...],
"b"=>[9, 12, 5, 8, 3, 3, 5, 15, 0, 6, 11, 11, 12, 8, 1, 3, 1, 11, 11, 14, 5, 1, 2, 1, 6...],
"c"=>[6, 7, 8, 2, 12, 10, 9, 10, 6, 1, 10, 8, 6, 4, 6, 4, 3, 2, 7, 11, 1, 8, 7, 2, 13...],
"d"=>[2, 14, 4, 0, 14, 12, 11, 0, 8, 0, 15, 3, 8, 12, 5, 0, 10, 1, 3, 4, 12, 12, 8, 14, 8...],
"e"=>[10, 8, 12, 6, 11, 13, 1, 6, 13, 5, 7, 14, 11, 5, 9, 5, 2, 15, 14, 10, 10, 2, 10, 4, 1...],
"f"=>[3, 6, 1, 9, 4, 14, 15, 7, 5, 15, 9, 6, 13, 13, 13, 8, 5, 3, 13, 12, 3, 15, 9, 7, 10...]
}


Figure 3: 500 rankings for first character (only 25 samples shown). For example the character “0” was first ranked 7, 1, 3 then 3 again and so on 500 times.

The 500 rankings of each character are then averaged giving this sample output:

"f", 5.302
"0", 7.17
"6", 7.396
"3", 7.472
"5", 7.562
"a", 7.602
"2", 7.608
"8", 7.626
"9", 7.688
"b", 7.698
"1", 7.704
"e", 7.812
"4", 7.82
"d", 7.826
"7", 7.854
"c", 7.86


Figure 4: Average rankings of first character (smallest to largest order) smaller is better (winner). Notice the standard deviation of the samples besides “f”, the deviation is very small compared to that of “f”. This agrees w/ the timing information leak hypothesis. One of the responses took longer than the others. The others should be constant. it is also interesting to note that “f”’s average is so high and not closer to 0, b/c theoretically it should be 0 every time. This shows the importance of the distributed sampling, because there is obviously a large ammount noise in the samples.

It is apparent from figure 2, that “f” is the first character of the hash since it’s average ranking is the highest.

So, we repeat this algorithm for the remaining 39 characters. For each character of the hash (out of 40) we take the rank (1 out of 16) where 1 is the slowest. Then for each character ranks (500) we take the average and rank those. This scheme is a mean of the means. This is one statistical technique that allows us to filter the signal from the noise. So: 16 * 500 * 40 = 320,000 requests as opposed to a brute force technique that would cost 16^40 requests.
Analysis

On the first iteration there are HMAC possible combinations. After the algorithm deduces that ‘f’ is the first character the total number of combinations is reduced by down to . After it guesses the second character, the possible combinations is reduced down to and so on until you are at and the final character where there are only 16 possibilities. As the algorithm works down the tree it eliminates possibilities, where n is the character in the hash you are solving for or where n is the level in the tree in Figure 5. Each time you find a character you reduce the probability set down by an order of magnitude. You can imagine this tree structure:

<img src=”https://s3.amazonaws.com/eggie5_production/photo_blog/image10.png” width=700/>

Figure 5: A tree structure 40 levels deep representing the total possible HMACs. If we, for example, take the first branch that eliminates 15^40-n branches leaving 16^39 other HMACs, then 16^38 and so on until you have 16 and the final hash.

## Experiment Results

Now, with my statistical and testing framework established, I set out to prove my hypothesis correct. I have my test server with a private hash set up and a client app to compute the hash statistically character-by-character and to send the challenge to the server.
Below is the log output of an actual sample run where against the hash:

f5acdffbf0bb39b2cdf59ccc19625015b33f55fe
7.396715 - HMAC[0] = 'f'        f000000000000000000000000000000000000000
14.96593 - HMAC[1] = '5'        f500000000000000000000000000000000000000
22.778012 - HMAC[2] = 'a'        f5a0000000000000000000000000000000000000
31.018916 - HMAC[3] = 'c'        f5ac000000000000000000000000000000000000
39.440451 - HMAC[4] = 'd'        f5acd00000000000000000000000000000000000
48.152296 - HMAC[5] = 'f'        f5acdf0000000000000000000000000000000000
57.112024 - HMAC[6] = 'f'        f5acdff000000000000000000000000000000000
66.214377 - HMAC[7] = 'b'        f5acdffb00000000000000000000000000000000
75.636221 - HMAC[8] = 'f'        f5acdffbf0000000000000000000000000000000
85.688623 - HMAC[9] = '0'        f5acdffbf0000000000000000000000000000000
95.669175 - HMAC[10] = 'b'        f5acdffbf0b00000000000000000000000000000
105.84066 - HMAC[11] = 'b'        f5acdffbf0bb0000000000000000000000000000
116.357238 - HMAC[12] = '3'        f5acdffbf0bb3000000000000000000000000000
126.884667 - HMAC[13] = '9'        f5acdffbf0bb3900000000000000000000000000
137.762953 - HMAC[14] = 'b'        f5acdffbf0bb39b0000000000000000000000000
148.972121 - HMAC[15] = '2'        f5acdffbf0bb39b2000000000000000000000000
160.292312 - HMAC[16] = 'c'        f5acdffbf0bb39b2c00000000000000000000000
171.824551 - HMAC[17] = 'd'        f5acdffbf0bb39b2cd0000000000000000000000
183.688379 - HMAC[18] = 'f'        f5acdffbf0bb39b2cdf000000000000000000000
195.699373 - HMAC[19] = '5'        f5acdffbf0bb39b2cdf500000000000000000000
207.910364 - HMAC[20] = '9'        f5acdffbf0bb39b2cdf590000000000000000000
220.6681 - HMAC[21] = 'c'        f5acdffbf0bb39b2cdf59c000000000000000000
233.4213 - HMAC[22] = 'c'        f5acdffbf0bb39b2cdf59cc00000000000000000
246.318068 - HMAC[23] = 'c'        f5acdffbf0bb39b2cdf59ccc0000000000000000
259.586379 - HMAC[24] = '1'        f5acdffbf0bb39b2cdf59ccc1000000000000000
272.918426 - HMAC[25] = '9'        f5acdffbf0bb39b2cdf59ccc1900000000000000
286.477769 - HMAC[26] = '6'        f5acdffbf0bb39b2cdf59ccc1960000000000000
300.242127 - HMAC[27] = '2'        f5acdffbf0bb39b2cdf59ccc1962000000000000
314.256857 - HMAC[28] = '5'        f5acdffbf0bb39b2cdf59ccc1962500000000000
328.467831 - HMAC[29] = '0'        f5acdffbf0bb39b2cdf59ccc1962500000000000
342.970653 - HMAC[30] = '1'        f5acdffbf0bb39b2cdf59ccc1962501000000000
357.880405 - HMAC[31] = '5'        f5acdffbf0bb39b2cdf59ccc1962501500000000
372.784722 - HMAC[32] = 'b'        f5acdffbf0bb39b2cdf59ccc19625015b0000000
387.901408 - HMAC[33] = '3'        f5acdffbf0bb39b2cdf59ccc19625015b3000000
403.216801 - HMAC[34] = '3'        f5acdffbf0bb39b2cdf59ccc19625015b3300000
418.981904 - HMAC[35] = 'f'        f5acdffbf0bb39b2cdf59ccc19625015b33f0000
434.940265 - HMAC[36] = '5'        f5acdffbf0bb39b2cdf59ccc19625015b33f5000
451.143052 - HMAC[37] = '5'        f5acdffbf0bb39b2cdf59ccc19625015b33f5500
467.537535 - HMAC[38] = 'f'        f5acdffbf0bb39b2cdf59ccc19625015b33f55f0
484.221603 - HMAC[39] = 'e'        f5acdffbf0bb39b2cdf59ccc19625015b33f55fe
"Final computed HMAC=f5acdffbf0bb39b2cdf59ccc19625015b33f55fe"
Program exited with code #0 after 534.90 seconds.


Figure 6: Actual log output. First column is execution time in seconds. Second column is the character of the hash we are solving, the third column is the answer found and the last column is the computed hash so far.

It took about 8.9 minutes to spoof the 40 character hash on my server. Each character took about 7 seconds statistically deduce – roughly linear growth.

Imagine the implications here! Take for example blackboard or the ing Direct Bank with it’s similar 40 char HMAC. If I were to set loose this script on blackboard or the bank, within 9-20 minutes (taking into account more network overhead/jitter), I would be able to login as the student/client who was assigned the numerically smallest HMAC hash value – as my algorithm starts from 0 and increments up to FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.

What makes this vulnerability so dangerous is two things: it is not simple brute force and its difficult to defend against:

A simple brute force attack would take requests and even on a local connection that would take seconds, or longer than the known age of the universe! Using statistical analysis we are able to reduce the possible combinations by an order of magnitude () for every character index n, and spoof the hash in linear time.

There are only a few ways to prevent this exploit and for the large part is ignored by mainstream sites as evidenced by the list of HMACs I provide at the beginning of this article. One solution, however, used by the largest sites on the internet is ‘rate-limiting’. If one client makes too many connections past a threshold the client will be banned for a period. This would render this timing-attack exploit useless. Rate limiting is employed by Google, Facebook, Twitter and a few other internet giants. Another solution is to use a compare algorithm that doesn’t leak timing information in the first place. The ruby languages build-in String compare algorithm doesn’t short-circuit when a match isn’t found and thus doesn’t leak, unlike python or java.

In summary, replace time-variant compare algorithms with constant time algorithms. This vulnerability is well documented and easily remedied but for some reason is largely ignored.

# Appendix A

## Leaky Code

You may be wondering why, technically, does the web server leak this information. The reason for this time shift phenomenon is that, as described above, the code leaks timing information by “short-circuiting” when a match does not occur. In other words if byte 5 in a 10 byte array doesn’t match it returns 5 compares later. Or if byte 8 does not match it returns 8 compares later, each with a small time delta.

A snippet of the culprit code in my test server is below:

# check each character

  _s1.each_index do |i|
if (str1[i] != str2[i])
#this return on a non-match, is the “leaky” code
return false
end
end
return true


The same vulnerability is found in java and python:

return array1 == array2


and in Java:

return Arrays.equals(array1, array2);


Companion Source Code: