Recently Félix Saparelli posted ("
Bruteforcing The Devil") about using brute force to find a preimage for a simple hash function (associated Hacker News thread
here). In pseudocode, the hash function computes a 32-bit hash as follows:
function hash(str)
h = 0
for c in str
h = (h << 5) - h + c
return h
The goal was to find an acceptable input string which would produce a hash value of exactly 666 (the "devil" of the post's title). The string had to have at most 30 characters, and consist only of ASCII values in a certain set (letters, numbers, and some punctuation characters; there was some slight corruption in his post due to a caret ('^') character, but a later code fragment made the proper set clear). For the rest of this post I limit myself to considering strings of lowercase characters only, but the adjustments to larger character ranges are not overly difficult.
One approach is to simply generate lots of strings and hope to find a matching hash. This is actually more feasible than it may sound: The number of possible hash values is a little over four billion, so we would expect to find a match every four billion hashes or so. At a handwaved guess, the amortised cost of generating and hashing a ten character string in C should be well under a hundred instructions, so one would hope to find matches sooner than every couple of minutes even without being very clever. (Indeed, the first ten matches from the simple program (
Program 1, below) that I wrote to test this claim occurred at 143, 166, 235, 303, 348, 417, 440, 554, 577, and 645 seconds, so the average was close to a minute.)
This was the approach that Félix used, although he wrote in Rust instead of C and understandably took a performance hit from doing so. (Quite a severe one, based on his figures, but I suspect a large part of that was the name generation since he aimed to generate readable strings.) He ended up computing around 35 billion hashes and finding four matches, so he seemed to be somewhat unlucky.
Once can do much better than full brute force by exploiting the structure of the specific hash function. We could instead write the iteration step as
h = 31*h + c, which makes that structure clearer. Indeed, a little thought shows that the hash function simply treats the string as a polynomial (with coefficients equal to the ASCII value of the characters in the string) and evaluates this polynomial at 31. So our problem becomes to find polynomials which have the required evaluation, under the appropriate restriction on the coefficients.
Note that if we split our string
S into two parts
S1 and
S2 then hash(
S) = hash(
S1)*31^length(
S2) + hash(
S2). (Here '^' represents exponentiation, not the C xor operator.) We can thus adopt the standard splitting technique: Suppose that we precompute hash(
S2) for a large set of strings
S2 of length
L and store these in some structure easily indexed by hash(
S2). Then for a given
S1 we can search for an
S2 such that hash(
S1)*31^
L + hash(
S2) = 666, which is to say hash(
S2) = 666 - hash(
S1)*31^
L. This is just a lookup on the precomputed hash value, and if successful results in a matching string
S =
S1 S2.
Using
Program 1, searching for all matching strings of eight lowercase letters took around 2600 seconds. With the splitting approach (
Program 2), computing hashes for all four-character strings, the same search took just 0.04 seconds — around 65000 times faster. Applying the same process to length 10 (hashing five-character strings of lowercase letters) took 1.2 seconds to find all 32755 such strings.
There is another option, though. We want to find polynomials
S (with coefficients in the required ranges) so that
S(31) ≡ 666 mod 2^32. This is a parametrised family of equations
S(31) = 666 +
k*2^32 for non-negative integers
k. If we fix a particular value of
k, we can write the right hand side in base 31 and more-or-less read off the coefficients of
S. There is a slight wrinkle in that we want coefficients larger than 31, but it is not too hard to compensate for this from the base-31 representation.
Note that if we are after values of a specific length then we can computer appropriate bounds on
k ahead of time, and the search space is bounded. For instance, for strings of 8 lowercase letters
k must lie in the range [643 .. 807], so we only need consider 165 possible values. My simple version written in
Magma (
Program 3) takes less than a tenth of the time that the C version above did. For the strings of ten lowercase letters, the same approach found all suitable values in under a second (0.94 seconds). Obviously there is slowdown here due to Magma being interpreted; I would expect a C version of the same algorithm to be rather faster again.
The search space is, of course, still exponential, so there's still a good deal of brute force required. Nonetheless, we can see from the above that we can avoid a great deal of that brute force with a bit of analysis.
Program 1 (C, searches all strings of seven lowercase letters, naïve brute force)
#include <stdio.h>
#include <stdint.h>
uint32_t hash(char *str)
{
uint32_t h;
h = 0;
for (; *str; str++)
h = (h << 5) - h + (uint32_t)(unsigned char)(*str);
return h;
}
/* searches through all lower-case options only */
void search(int len, uint32_t target)
{
char str[31]; /* large enough for now */
int i;
for (i = 0; i < len; i++)
str[i] = 'a';
str[len] = '\0';
do
{
if (hash(str) == target)
printf("%s\n", str);
for (i = len - 1; i >= 0; i--)
{
if (str[i] != 'z')
{
str[i]++;
break;
}
str[i] = 'a';
}
} while (i >= 0);
}
int main()
{
search(7, 666);
return 0;
}
Program 2 (C, searches all strings of ten lowercase letters, meet-in-the-middle approach)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
typedef struct
{
uint32_t hash;
uint32_t ident;
} entry;
int entry_cmp(const void *ap, const void *bp)
{
const entry *a, *b;
a = ap;
b = bp;
if (a->hash < b->hash)
return -1;
if (a->hash > b->hash)
return 1;
return 0;
}
void print_ident(uint32_t ident, int len)
{
char buf[10]; /* big kludge, but will be large enough */
int i;
for (i = 0; i < len; i++)
{
buf[len - 1 - i] = 'a' + (ident % 26);
ident /= 26;
}
buf[len] = '\0';
printf("%s", buf);
}
int binsearch(entry *entries, uint32_t n, uint32_t target, uint32_t *first, uint32_t *last)
{
int low, high, mid, midhash;
low = 0;
high = n - 1;
while (low <= high)
{
mid = low + ((high - low)/2);
midhash = entries[mid].hash;
if (midhash < target)
low = mid + 1;
else if (midhash > target)
high = mid - 1;
else
{
for (low = mid - 1; low >= 0 && entries[low].hash == target; low--)
;
*first = low + 1;
for (high = mid + 1; high < n && entries[high].hash == target; high++)
;
*last = high - 1;
return 1;
}
}
return 0;
}
uint32_t hash(char *str)
{
uint32_t h;
h = 0;
for (; *str; str++)
h = (h << 5) - h + (uint32_t)(unsigned char)(*str);
return h;
}
/* searches through all lower-case options only */
void search(int len, uint32_t target)
{
char str[31];
int i, halflen;
entry *entries;
uint32_t nentries, pos, pow31;
/* For now, assume an even length */
halflen = len / 2;
nentries = 1;
for (i = 0; i < halflen; i++)
nentries *= 26;
entries = malloc(nentries * sizeof(entry));
pow31 = 1;
for (i = 0; i < halflen; i++)
pow31 *= 31;
for (i = 0; i < halflen; i++)
str[i] = 'a';
str[halflen] = '\0';
str[halflen-1]--; /* simplify loop iteration */
for (pos = 0; pos < nentries; pos++)
{
for (i = halflen - 1; str[i] == 'z'; i--)
str[i] = 'a';
str[i]++;
entries[pos].hash = hash(str);
entries[pos].ident = pos;
}
qsort(entries, nentries, sizeof(entry), entry_cmp);
for (pos = 0; pos < nentries; pos++)
{
uint32_t htarget, first, last, j;
htarget = target - pow31*entries[pos].hash;
if (binsearch(entries, nentries, htarget, &first, &last))
{
for (j = first; j <= last; j++)
{
print_ident(entries[pos].ident, halflen);
print_ident(entries[j].ident, halflen);
printf("\n");
}
}
}
free(entries);
}
int main()
{
search(10, 666);
return 0;
}
Program 3 (Magma, searches all strings of ten lowercase letters, search by k)
tostr := func<S | &cat [ CodeToString(x) : x in Reverse(S) ] >;
search := function(n, target)
allones := (31^n - 1) div (31 - 1);
minv := 97*allones - target;
maxv := 122*allones - target;
mink := Ceiling(minv / 2^32);
maxk := Floor(maxv / 2^32);
vals := [];
for k in [mink..maxk] do
S := Intseq(2^32*k - minv, 31, n);
if &or [ x gt 25 : x in S ] then continue; end if;
Append(~vals, tostr([ x + 97 : x in S ]));
end for;
return vals;
end function;
search(10, 666);