Running on Java 22-ea+27-2262 (Preview)
Home of The JavaSpecialists' Newsletter

262How Java Maps Protect Themselves from DOS Attacks

Author: Dr. Heinz M. KabutzDate: 2018-10-09Java Version: 8Category: Tips and Tricks
 

Abstract: A well known exploit with maps is to create a lot of Strings as keys that all have the same hash code. In this newsletter we see how easily that can be done and what Java 8+ HashMap does to protect itself.

 

Welcome to the 262nd edition of The Java(tm) Specialists' Newsletter. When I announced my 15000 push-ups challenge last month, I expected a rolling of the eyes and perhaps a dozen Java geeks to join me. I also thought that bronze status would be a piece of cake for me, having done a fair number of push-ups over the years, though not many this year. I was wrong on both counts. First off, we now have over 700 programmers hitting the floor every day, trying to push the world away from them. The biggest challenge will be for us to keep our energy up for a bone crunching five months! Secondly, I got a spasm in my quadratus lumborum muscle whilst walking in the kitchen getting ready for the school run, with such resulting agony that I could only barely hobble along with crutches. I am way behind in the challenge, but hope to catch up soon. I've also had to retire my current running streak after 672 consecutive days. Starting my next streak as soon as I get the green light from my physio.

javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.

How Java Maps Protect Themselves from DOS Attacks

Back in the days of Java 1.2, a book was published called "Java 2 Performance and Idiom Guide". It was a combination of performance tips and a style guide. A bit like what Effective Java eventually became. A large part of its argument was that we should never use String directly as a key in a HashMap, but rather wrap it with a class and then cache the hash code. I remember reading this and discussing it with a colleague. After much debate, we concluded that caching a hash code for a key would not achieve all that much, since it is cached inside the hash map anyway. When we call put() and get() we usually create new key objects to do so and discard them afterwards. If we did hold onto the keys, we could just as well hold onto the values. I'm sure that there are specific use cases where caching a hash code can give you performance gains, but I don't think that in general it is useful.

Let me take us two steps back to Java 1.0. In those days, the hashCode() function of String was O(1). It required at most 16 calculations. If the String was larger than 16 characters, they would pick 8 of the characters and use those to generate a hash code. As you can imagine, the hash codes were not necessarily distributed very well. In Java 1.2, they changed the algorithm to use the entire String for the calculation and thus it became O(n). The longer the String, the more we had to wait for hashCode() to complete. This is also where the debate came from that one should cache hash codes, even though it was probably a false economy. Incidentally, changing the hash code in String going from Java 1.1 to 1.2 caused a lot of tests to break. Also, persistent hash tables had a challenge moving over between the versions as the hash of Strings now pointed to a different entry. It was the last time that Sun Microsystems / Oracle dared change the hash code of String.

Enter Java 1.3. Whether the Sun engineers had been convinced by the book mentioned earlier or some microbenchmarks, they decided to cache the hash code inside String itself. One unfortunate side effect of Java's popularity is that they were too scared to add an obvious protection against nasty attacks. If the hash code was 0, it would calculate it, otherwise simply return it. It would have been better to use a different number as a default though, because 0 is a particularly bad default, as we will see. Anything else is better. [ed - Unfortunately 0 is the only reasonable default for hash. Anything else could produce a data race and would add to the initialization cost of all Strings, not just those where hashCode() was called.] The hash code method structure looks something like this at the moment:

public int hashCode() {
  int h = hash;
  if (h == 0 && value.length > 0) {
    // calculate hash code and assign it to hash and h
  }
  return h;
}

The actual calculation for the hash code works like this:

for (int i = 0; i < value.length; i++) {
  h = 31 * h + val[i];
}

Since Java 9 it is a bit more complicated due to compact strings, but the end result will be the same. For example, the String "ABC" will have hash code of (('A' * 31) + 'B') * 31 + 'C'. We can see that easily in jshell:

jshell> (('A'*31)+'B')*31+'C'
$1 ==> 64578

jshell> "ABC".hashCode()
$2 ==> 64578

jshell> (((((('A'*31)+'B')*31+'C')*31+'D')*31+'E')*31+'F')*31+'G'
$3 ==> -488308668

jshell> "ABCDEFG".hashCode()
$4 ==> -488308668

jshell> "AAAAAAA".hashCode()
$5 ==> -518878239

jshell> "zzzzzzz".hashCode()
$6 ==> 215481018

Note how "AAAAAAA"'s hashCode is negative, but "zzzzzzz"'s is positive. If we explore all the 7-character Strings in alphabet [A-Za-z], we are likely to find some where this cancels out and the hash code equals 0.

The significance of String with hash code of 0 is two-fold. The first is minor, the second more serious. If we find a String with hash 0, we can combine it with the same String and it will still be zero. We can make it as long as we want and it will always have a hash code of 0. For example, here is such a String: "ASDZguv". Here is how we can combine it to make an arbitrary long String:

jshell> "ASDZguv".hashCode()
$1 ==> 0

jshell> "ASDZguvASDZguv".hashCode()
$2 ==> 0

jshell> "ASDZguvASDZguvASDZguvASDZguv".hashCode()
$3 ==> 0

Taken to the extreme, we could create a String of almost 2 billion characters long:

public class VeryLongString {
  public static void main(String... args) {
    StringBuilder sb = new StringBuilder(1_999_999_995);
    for (int i = 0; i < 285_714_285; i++) {
      sb.append("ASDZguv");
    }
    String s = sb.toString();
    for (int i = 0; i < 10; i++) {
      long time = System.nanoTime();
      try {
        System.out.println(s.hashCode());
      } finally {
        time = System.nanoTime() - time;
        System.out.printf("time = %dms%n", (time / 1_000_000));
      }
    }
  }
}

On my machine, each call to hashCode() takes about two seconds! However, creating that long String takes even longer and would not be a very effective approach to launch an attack against a hash map. The hash map itself would protect itself by caching the hash codes internally with the entries. Thus hashCode() is called once on put() and then again on get().

Far more serious would be to take several Strings with hash code of 0 and to then combine them. Since the resulting hash code is also 0, we can make an arbitrary number of these. It's the quantity of Strings with the same hash code that produces the biggest danger, not the length of one String.

Joe Darcy showed a clever algorithm to create Strings with any target hash code in his article Unhashing a String, by running the algorithm in reverse. It gets a bit more tricky if we want to create a String with a particular alphabet, for example with only characters in the range [A-Za-z]. I have seen some approaches that create billions of String objects and then call their hashCode(), hoping to find one by brute force that has the value of 0. Here is my recursive approach that works well and searches through the Strings of length 7 with the alphabet [A-Za-z]. If you're wondering why I fill the byte[] with '!', this is as a check to make sure that all the array elements were filled in as expected. The parameter "target" is the hash code we want to find, in this case 0:

public abstract class BruteForceBase {
  protected static final byte[] alphabet = new byte[26 * 2];

  static {
    for (int i = 0; i < 26; i++) {
      alphabet[i] = (byte) ('A' + i);
      alphabet[i + 26] = (byte) ('a' + i);
    }
    System.out.println("alphabet = " + new String(alphabet));
  }
}
import java.util.*;

public class BruteForceRecursive extends BruteForceBase {
  public static void main(String... args) {
    byte[] is = new byte[7];
    Arrays.fill(is, (byte) '!');
    check(0, 0, is, 0);
  }

  static void check(int depth, int h, byte[] is, int target) {
    if (depth == 7) {
      if (h == target) {
        System.out.println(new String(is));
      }
      return;
    }

    for (byte i : alphabet) {
      is[depth] = i;
      check(depth + 1, h * 31 + i, is, target);
    }
  }
}

This is fairly quick and runs through about 10 billion hash codes in 40 seconds on my machine. Using nested loops instead of recursion is uglier, but I get through the same work in just 26 seconds. Here is the nested loop version:

public class BruteForceNestedLoops extends BruteForceBase {
  public static void main(String... args) {
    int target = 0; // our target hash code
    for (byte i0 : alphabet) {
      int h0 = i0;
      for (byte i1 : alphabet) {
        int h1 = h0 * 31 + i1;
        for (byte i2 : alphabet) {
          int h2 = h1 * 31 + i2;
          for (byte i3 : alphabet) {
            int h3 = h2 * 31 + i3;
            for (byte i4 : alphabet) {
              int h4 = h3 * 31 + i4;
              for (byte i5 : alphabet) {
                int h5 = h4 * 31 + i5;
                for (byte i6 : alphabet) {
                  int h6 = h5 * 31 + i6;
                  if (h6 == target) {
                    byte[] is = {i0, i1, i2, i3, i4, i5, i6};
                    System.out.println(new String(is));
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

We can of course parallelize either of these algorithms. To parallelize the recursive version, use RecursiveTask. For the nested loop we can simply spread the outer loop over a bunch of tasks. Unless we have more than 52 processors, that should give us good utilization. For example:

import java.util.concurrent.*;

public class BruteForceNestedLoopsParallel extends BruteForceBase {
  public static void main(String... args) {
    ThreadFactory factory = Executors.defaultThreadFactory();
    ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors(),
        r -> {
          // always give threads descriptive names
          Thread t = factory.newThread(r);
          t.setName(t.getName() + "#bruteforce");
          return t;
        }
    );
    int target = 0; // our target hash code
    for (byte i0 : alphabet) {
      pool.submit(() -> {
        int h0 = i0;
        for (byte i1 : alphabet) {
          int h1 = h0 * 31 + i1;
          for (byte i2 : alphabet) {
            int h2 = h1 * 31 + i2;
            for (byte i3 : alphabet) {
              int h3 = h2 * 31 + i3;
              for (byte i4 : alphabet) {
                int h4 = h3 * 31 + i4;
                for (byte i5 : alphabet) {
                  int h5 = h4 * 31 + i5;
                  for (byte i6 : alphabet) {
                    int h6 = h5 * 31 + i6;
                    if (h6 == target) {
                      byte[] is = {i0, i1, i2, i3, i4, i5, i6};
                      System.out.println(new String(is));
                    }
                  }
                }
              }
            }
          }
        }
      });
    }
    pool.shutdown();
    // once all threads are done, we will naturally exit
  }
}

If we run these for a while, we should find several dozen Strings all with the hash code of 0. We can also use the same mechanism to find Strings with any hash code that we want. Just change the value of "target" to whatever you are looking for.

As mentioned before, if s1.hashCode() == 0 and s2.hashCode() == 0, then (s1 + s2).hashCode() == 0. This is not true for other hash codes. For example:

jshell> var s1 = "ARcZgxC"
s1 ==> "ARcZgxC"

jshell> var s2 = "ARcZhXb"
s2 ==> "ARcZhXb"

jshell> s1.hashCode()
$1 ==> 42

jshell> s2.hashCode()
$2 ==> 42

jshell> (s1 + s2).hashCode()
$3 ==> 183590080

How can we abuse this feature of 0 hash codes? Here is a StringDOS class that takes a bunch of Strings and combines them, adding them all to one map. I have kept the source Java 6 compatible, so we can run it with that older JVM:

import java.util.*;

public class StringDOS {
  private static final String[] zeroHashCodes = {
      "ARbyguv", "ARbygvW", "ARbyhVv", "ARbyhWW",
      "ARbzHuv", "ARbzHvW", "ARbzIVv", "ARbzIWW",
      "ARcZguv", "ARcZgvW", "ARcZhVv", "ARcZhWW",
      "ASCyguv", "ASCygvW", "ASCyhVv", "ASCyhWW",
      "ASCzHuv", "ASCzHvW", "ASCzIVv", "ASCzIWW",
      "ASDZguv", "ASDZgvW", "ASDZhVv", "ASDZhWW",
  };

  public static void main(String... args) {
    for (int i = 5; i < 24; i++) {
      System.out.println("Testing with size " + (1 << i));
      test(1 << i);
    }
  }

  private static void test(int size) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    long time = System.nanoTime();
    try {
      for (int i = 0; i < size; i++) {
        StringBuilder sb = new StringBuilder();
        for (int j = 1, index = 0; j <= i; j <<= 1, index++) {
          if ((i & j) != 0)
            sb.append(zeroHashCodes[index % zeroHashCodes.length]);
        }
        map.put(sb.toString(), i);
      }
    } finally {
      time = System.nanoTime() - time;
      System.out.printf("creating time = %dms%n", (time / 1000000));
    }

    System.out.println("map.size() = " + map.size());
    for (String s : map.keySet()) {
      if (s.hashCode() != 0)
        throw new AssertionError("hashCode() of " + s + " is not 0");
    }
    System.out.println("All hashCode() were 0");
    String notInMap = zeroHashCodes[1] + zeroHashCodes[0];
    for (int repeats = 0; repeats < 10; repeats++) {
      testLookup(map, notInMap);
    }
  }

  private static void testLookup(Map<String, Integer> map, String notInMap) {
    long time = System.nanoTime();
    try {
      for (int i = 0; i < 1000 * 1000; i++) {
        map.get(notInMap);
      }
    } finally {
      time = System.nanoTime() - time;
      System.out.printf("time = %dms%n", (time / 1000000));
    }
  }
}

What is happening here? All of the Strings that we generate have the exact same hash code of 0. They thus all end up in the same bucket. In Java 6 this was a big issue. Since the bucket collisions ended up in a linked list, get() and put() both were linear in cost. If you run the code with Java 6, you will see figures like this:

java version "1.6.0_65"
Testing with size 128
time = 338ms
Testing with size 256
time = 1371ms
Testing with size 512
time = 3013ms
Testing with size 1024
time = 11896ms
Testing with size 2048
time = 26362ms
Testing with size 4096
time = 61376ms

How did I know this was linear? When the number of entries doubles, the time roughly doubles. For example, 4096 takes 2.32 times as long as 2048. Some other factors also come into play, such as the garbage collector and the memory layout. We would experience worse than 2x slowdown when we move from L2 to L3 cache for example.

Enter Java 7. Here they recognized that most HashMap keys in the world were Strings. They also saw that the biggest danger was when we could reliably create Strings with hash code of 0. So they created a second hash algorithm that could be used inside HashMap, called hash32(), which was randomized to avoid exploits like our class above. You would typically configure it with a threshold of 512 like so: -Djdk.map.althashing.threshold=512. It was not enabled by default, as the Strings would also be arranged in a different order inside the HashMap and this might have confused some test code that incorrecty expected it to remain the same. Since we could not deliberately cause bucket collisions, the Strings would be spread nicely throughout the table, giving us hopefully constant time lookup, although linear cost was still possible, but harder to force from outside the map. Here is the output with Java 7 with the althashing threshold set to 512:

java version "1.7.0_80" with -Djdk.map.althashing.threshold=512
Testing with size 64
time = 206ms
Testing with size 128
time = 412ms
Testing with size 256
time = 3ms
Testing with size 512
time = 3ms
Testing with size 1024
time = 3ms
    *snip*
Testing with size 65536
time = 4ms
Testing with size 1048576
time = 3ms
Testing with size 4194304
time = 3ms

In Java 8, they did away with the hash32() method in String and decided to rather solve the issue in a general way. If we have more than a certain number of collisions in one bucket, HashMap creates a binary tree of entries. This is the reason why I suggest to always implement Comparable when writing hashCode() and equals(). Here is the run with Java 8 (Java versions 9, 10 and 11 look much the same):

java version "1.8.0_181"
Testing with size 64
time = 80ms
Testing with size 128
time = 94ms
Testing with size 256
time = 109ms
Testing with size 512
time = 122ms
Testing with size 1024
time = 222ms
Testing with size 65536
time = 220ms
Testing with size 1048576
time = 277ms

We can see that the performance since Java 8 is much better than previous versions when we have a large number of bucket collisions. In the past it was O(n), now it is O(log n). It still takes a bit of time to call get() in Java 8, but at least we are safe from someone sending us loads of Strings with the exact same hash code.

Kind regards from Crete

Heinz

 

Comments

We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)

When you load these comments, you'll be connected to Disqus. Privacy Statement.

Related Articles

Browse the Newsletter Archive

About the Author

Heinz Kabutz Java Conference Speaker

Java Champion, author of the Javaspecialists Newsletter, conference speaking regular... About Heinz

Superpack '23

Superpack '23 Our entire Java Specialists Training in one huge bundle more...

Free Java Book

Dynamic Proxies in Java Book
Java Training

We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.

Java Consulting

We can help make your Java application run faster and trouble-shoot concurrency and performance bugs...