Java Specialists' Java Training Europehome of the java specialists' newsletter

The Java Specialists' Newsletter
Issue 1912011-04-27 Category: Performance Java version: Java 6

GitHub Subscribe Free RSS Feed

Delaying Garbage Collection Costs

by Dr. Heinz M. Kabutz
Abstract:
The garbage collector can cause our program to slow down at inappropriate times. In this newsletter we look at how we can delay the GC to never run and thus produce reliable results. Then, instead of actually collecting the dead objects, we simply shut down and start again.

Welcome to the 191st issue of The Java(tm) Specialists' Newsletter, sent to you from the stunning Island of Crete. A Greek friend remarked on Sunday how surprised he was at my recent progress with the Greek language. Michaelis had been encouraging me for the last 4 years to knuckle down and put in a bit of an effort.

Learning a new language requires a good memory, which has never been my strong point. I discovered in grade 8 history that I hardly had any short-term memory at all. That, coupled with an illegible scrawl, were serious handicaps in school. As a result, my grades were never particularly good. I was also overconfident and seldom read questions properly before writing my answer. The only time I ever got a distinction in school was for my final exam. It is funny that the rest of my 12 years of school do not count for anything. Nobody was ever interested in how many PT classes I bunked or that I failed "hand writing" in grade 5, all they wanted to know was my final score.

Greek is not the first language I learned. In high school, I decided to learn an African language called isiXhosa, spoken by our former president Nelson Mandela. This was in the years of apartheid, when it was extremely unfashionable for a melanin deprived individual to learn an African language. In my class, I was the most fluent speaker. My secret? I would walk home from school and speak to anybody who cared to listen. I made a lot of mistakes and most of the time I spoke about the same things. "Where are you from? Where are you going? It is hot today". This loosened my tongue and gave me confidence. I do the same with Greek.

My second secret was that I wrote a computer program in GWBasic to help me remember the vocabulary. The program would show me a word and then I would have to write it in isiXhosa. If I made a mistake, I would get punished by having to type it three times. I wrote a similar Java program for Greek, which was challenging because of the strange character set. What surprises me is how this program flashes the words into my medium term memory, without first going into the short-term memory. Even though I don't actively remember the word, it is there in my brain and I can recall it, as long as I do not think too hard. Another surprise is that I am finding it as difficult (or as easy) to learn Greek as I found it to learn isiXhosa, when I was 13 years old. Some expats tell me that they are too old to learn. But it is just as hard for me to remember anything as when I was 13. It is a major effort for me to recall even a single word.

So, in conclusion of this very long opening, if you are finding it hard learning new languages, computer or natural, you are in good company. It is hard for me too. The trick is to use the language, then eventually something will stick :-) (Oh and just for interest, "use" in Greek is xrhsimopoiw :-))

Delaying Garbage Collection Costs

A few weeks ago, our French Java Specialist Master Course instructor sent us an interesting puzzle from his brilliant The Coder's Breakfast blog. I love the name, as it evokes good French coffee and a croissant whilst puzzling over the latest Java quiz.

I will rephrase his question a bit, because I think something might have gotten lost in the translation from French into English. With minimal Java code changes, how can you make the Quiz44 class run at least 10x faster?

public class Quiz44 {
  private static final int NB_VALUES = 10000000;
  private static final int MAX_VALUE = 1000;
  private static final int NB_RUNS = 10;

  public static void main(String[] args) {
    Integer[] boxedValues = new Integer[NB_VALUES];
    int[] values = initValues();

    System.out.println("Benchmarking...");
    for (int run = 1; run <= NB_RUNS; run++) {
      long t1 = System.currentTimeMillis();
      for (int i = 0; i < NB_VALUES; i++) {
        boxedValues[i] = values[i];
      }
      long t2 = System.currentTimeMillis();
      System.out.printf("Run %2d : %4dms%n", run, t2 - t1);
    }
  }

  /**
   * Nothing important here, just values init.
   */
  private static int[] initValues() {
    System.out.println("Generating values...");
    int[] values = new int[NB_VALUES];
    Random random = new Random();
    for (int i = 0; i < values.length; i++) {
      values[i] = random.nextInt(MAX_VALUE);
    }
    return values;
  }
}
  

When I run this on my MacBook Pro, with -Xmx500m, I see the following results:

    Generating values...
    Benchmarking...
    Run  1 : 1657ms
    Run  2 : 2879ms
    Run  3 : 2265ms
    Run  4 : 2217ms
    Run  5 : 2211ms
    Run  6 : 2216ms
    Run  7 :  930ms
    Run  8 : 2216ms
    Run  9 : 2241ms
    Run 10 : 2216ms    
  

The average is 2105, with a variance of 254,000, caused by Run 7.

There are some obvious changes to the code that would make it complete faster. For example, we could reduce the number of NB_VALUES or MAX_VALUE fields, but I think Olivier would consider that as cheating. However, there is also a non-obvious change we could make.

For example, we could add a System.gc() before the first call to System.currentTimeMillis(). In that case, the test runs with the following values:

    Generating values...
    Benchmarking...
    Run  1 : 1720ms
    Run  2 : 2645ms
    Run  3 : 2240ms
    Run  4 :  927ms
    Run  5 :  918ms
    Run  6 :  930ms
    Run  7 :  965ms
    Run  8 :  914ms
    Run  9 :  903ms
    Run 10 :  913ms
  

This time the average is 1308 with a variance of 430,000. Not a bad speedup at all. After all, it is now 38% faster than before. It should be obvious that a large proportion of time is wasted with collecting all these unnecessary objects. If we run the code again with -verbose:gc on, we can see the costs involved:

    Generating values...
    [GC 40783K->39473K(83008K), 0.0041315 secs]
    [Full GC 39473K->39465K(83008K), 0.0696936 secs]
    [GC 78528K(123980K), 0.0002081 secs]
    [GC 79209K(123980K), 0.0002467 secs]
    Benchmarking...
    [Full GC 79197K->78518K(150000K), 0.0828044 secs]
    [GC 78518K(150000K), 0.0001254 secs]
    [GC 78995K(150000K), 0.0006032 secs]
    [GC 95542K->103000K(150000K), 0.3685923 secs]
    [GC 103000K(150000K), 0.0021857 secs]
    [GC 120024K->129617K(150000K), 0.1627059 secs]
    [GC 146641K->155227K(172464K), 0.1826291 secs]
    [GC 172251K->180831K(198000K), 0.1499428 secs]
    [GC 197855K->206365K(223536K), 0.1794985 secs]
    [GC 223389K->231900K(249072K), 0.1751786 secs]
    [GC 248924K->257435K(274800K), 0.1594760 secs]
    [GC 274459K->282969K(300144K), 0.2015765 secs]
    Run  1 : 1774ms
    [Full GC 283309K->282255K(300144K), 0.8866413 secs]
    [GC 315119K->330066K(506684K), 0.3946753 secs]
    [GC 364178K->398754K(506684K), 0.3282639 secs]
    [GC 398754K(506684K), 0.0043726 secs]
    [GC 432866K->449971K(506684K), 0.3649566 secs]
    [Full GC 484083K->282879K(506684K), 1.1812640 secs]
    Run  2 : 2708ms
    [Full GC 284935K->282881K(507776K), 1.0651874 secs]
    [GC 316993K->345922K(507776K), 0.2532635 secs]
    [GC 380034K->399611K(507776K), 0.2922708 secs]
    [GC 400297K(507776K), 0.0042360 secs]
    [GC 433723K->450807K(507776K), 0.3415709 secs]
    [Full GC 484919K->282884K(507776K), 1.0057979 secs]
    Run  3 : 2324ms
    [Full GC 283571K->282884K(507776K), 0.8885789 secs]
    [GC 316996K->347295K(507776K), 0.2598463 secs]
    [GC 381407K->400631K(507776K), 0.3051485 secs]
    [GC 401318K(507776K), 0.0042195 secs]
    [GC 434743K->451850K(507776K), 0.3479674 secs]
    Run  4 : 1024ms
    [Full GC 485962K->282884K(507776K), 1.0040985 secs]
    [GC 316996K->347295K(507776K), 0.2591832 secs]
    [GC 381407K->400631K(507776K), 0.2888177 secs]
    [GC 401318K(507776K), 0.0042504 secs]
    [GC 434743K->451850K(507776K), 0.3348886 secs]
    Run  5 :  994ms
    [Full GC 485962K->282884K(507776K), 1.0128758 secs]
    [GC 316996K->347295K(507776K), 0.2580010 secs]
    [GC 381407K->400631K(507776K), 0.2884526 secs]
    [GC 402692K(507776K), 0.0060617 secs]
    [GC 434743K->451848K(507776K), 0.3290486 secs]
    Run  6 : 1004ms
    [Full GC 485960K->282884K(507776K), 1.0040235 secs]
    [GC 316996K->347295K(507776K), 0.2596790 secs]
    [GC 381407K->400631K(507776K), 0.2851338 secs]
    [GC 401318K(507776K), 0.0042191 secs]
    [GC 434743K->451840K(507776K), 0.3340752 secs]
    Run  7 :  989ms
    [Full GC 485952K->282884K(507776K), 1.0022637 secs]
    [GC 316996K->347295K(507776K), 0.2612456 secs]
    [GC 381407K->400631K(507776K), 0.2933666 secs]
    [GC 401318K(507776K), 0.0043201 secs]
    [GC 434743K->451842K(507776K), 0.3280430 secs]
    Run  8 :  997ms
    [Full GC 485954K->282884K(507776K), 1.0126833 secs]
    [GC 316996K->347295K(507776K), 0.2569432 secs]
    [GC 381407K->400631K(507776K), 0.2866691 secs]
    [GC 401318K(507776K), 0.0042206 secs]
    [GC 434743K->451842K(507776K), 0.3418867 secs]
    Run  9 : 1000ms
    [Full GC 485954K->282884K(507776K), 1.0074827 secs]
    [GC 316996K->347295K(507776K), 0.2594386 secs]
    [GC 381407K->400631K(507776K), 0.2966164 secs]
    [GC 401318K(507776K), 0.0042397 secs]
    [GC 434743K->451840K(507776K), 0.3391696 secs]
    Run 10 : 1009ms
    [Full GC 485952K->400K(507776K), 0.3041337 secs]
  

The Full GC's in italics are the artificial GCs that we introduced to make our test run faster. If we exclude them from the timing, we can work out how much of each run is spent collecting garbage:

    Run  GC    Program  Total
     1   1583  191      1774
     2   2274  434      2708
     3   1897  427      2324
     4    917  107      1024
     5    887  107       994
     6    882  122      1004
     7    883  106       989
     8    887  110       997
     9    890  110      1000
    10    899  101      1009
  

We can get even better figures from our program if instead of elapsed time (System.currentTimeMillis()), we use user CPU time (ThreadMXBean.getCurrentThreadUserTime()). Here are the run times:

    Generating values...
    Benchmarking...
    Run  1 : 1667ms  106ms
    Run  2 : 2576ms  114ms
    Run  3 : 2199ms  109ms
    Run  4 :  897ms  109ms
    Run  5 :  899ms  109ms
    Run  6 :  899ms  109ms
    Run  7 :  903ms  109ms
    Run  8 :  889ms  109ms
    Run  9 :  892ms  109ms
    Run 10 :  896ms  108ms    
  

Our average user CPU time is now 109ms with a variance of 4.

Let's get back to the problem. How do we make the test run 10 times faster with minimal code changes? My first approach was to simply delay the cost of GC until after the test has run through. Since I have 8GB of memory in my laptop, I can beef up the initial and maximum size of the heap to a huge amount. I also set the NewSize to a crazy amount, so that the GC never needs to run whilst the test is running.

    $ java -showversion -Xmx10g -Xms10g -XX:NewSize=9g Quiz44

    java version "1.6.0_24"
    Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326)
    Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode)

    Generating values...
    Benchmarking...
    Run  1 :  198ms
    Run  2 :  203ms
    Run  3 :  179ms
    Run  4 :  192ms
    Run  5 :  179ms
    Run  6 :  191ms
    Run  7 :  183ms
    Run  8 :  192ms
    Run  9 :  180ms
    Run 10 :  197ms
  

This gave me an average of 189ms with a slightly high variance of 75. Since my initial results had an average of 2105, I achieved Olivier's target of being at least 10x faster. In case you think what I've done is really stupid, I know of companies who use this trick to avoid all GC pauses. They try to never construct objects. They make the young generation very large. After running for several hours, they quit the application and start again. This gives them very predictable run times. So it's not as harebrained an idea as it appears at first.

In our case, things start to go wrong when we increase the number of runs, for example, let's change NB_RUNS to 100:

    java version "1.6.0_24"
    Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326)
    Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode)

    Generating values...
    Benchmarking...
    Run  1 :  188ms
    Run  2 :  214ms
    Run  3 :  234ms
    Run  4 :  191ms
    Run  5 :  203ms
    Run  6 :  193ms
    Run  7 :  207ms
    Run  8 :  188ms
    Run  9 :  208ms
    Run 10 :  183ms
    Run 11 :  199ms
    Run 12 :  183ms
    Run 13 :  200ms
    Run 14 :  186ms
    Run 15 :  202ms
    Run 16 :  192ms
    Run 17 :  216ms
    Run 18 :  191ms
    Run 19 :  211ms
    Run 20 :  188ms
    Run 21 :  210ms
    Run 22 :  190ms
    Run 23 :  210ms
    Run 24 :  187ms
    Run 25 :  211ms
    Run 26 :  186ms
    Run 27 :  208ms
    Run 28 :  188ms
    Run 29 :  205ms
    Run 30 :  188ms
    Run 31 :  216ms
    Run 32 :  189ms
    Run 33 :  210ms
    Run 34 :  186ms
    Run 35 :  215ms
    Run 36 :  190ms
    Run 37 :  222ms
    Run 38 :  191ms
    Run 39 :  209ms
    Run 40 :  186ms
    Run 41 : 1068ms
    Run 42 : 7281ms
    Run 43 : 8705ms
    Run 44 : 8583ms
    Run 45 : 7675ms
    Run 46 : 6346ms
    Run 47 : 1625ms
    Run 48 :  896ms
    Run 49 :  834ms
    Run 50 :  948ms
    Run 51 :  554ms
    Run 52 :  901ms
    Run 53 :  904ms
    Run 54 : 4253ms
    Run 55 :  815ms
    Run 56 :  111ms
    Run 57 :  127ms
    Run 58 :  143ms
    Run 59 :  159ms
    Run 60 :  117ms
    Run 61 :  138ms
    Run 62 :  136ms
    Run 63 :  140ms
    Run 64 :  128ms
    Run 65 :  131ms
    Run 66 :  136ms
    Run 67 :  147ms
    Run 68 :  122ms
    Run 69 :  127ms
    Run 70 :  135ms
    Run 71 :  135ms
    Run 72 :  135ms
    Run 73 :  134ms
    Run 74 :  126ms
    Run 75 :  125ms
    Run 76 :  217ms
    Run 77 :  765ms
    Run 78 : 2973ms
    Run 79 : 2459ms
    Run 80 : 2180ms
    Run 81 : 4733ms
    Run 82 : 7484ms
    Run 83 : 3177ms
    Run 84 : 6209ms
    Run 85 : 5129ms
    Run 86 : 1051ms
    Run 87 : 5177ms
    Run 88 : 5515ms
    Run 89 : 6217ms
    Run 90 : 8865ms
    Run 91 : 7981ms
    Run 92 : 3578ms
    Run 93 : 3472ms
    Run 94 : 3645ms
    Run 95 : 4006ms
    Run 96 : 3933ms
    Run 97 : 4211ms
    Run 98 : 3127ms
    Run 99 : 3714ms
    Run 100 : 3811ms
  

This gives us an average of 1656 and variance of 6,000,000. There are also times when the system is almost unusable because of the time wasted in GC.

One of our Java Specialist Club members, Marek Bickos, quickly discovered a flag that made this code run much faster. By simply invoking the code with -XX:+AggressiveOpts, it caused no GC at all:

    Generating values...
    Benchmarking...
    Run  1 :   56ms
    Run  2 :   45ms
    Run  3 :   39ms
    Run  4 :   50ms
    Run  5 :   66ms
    Run  6 :   39ms
    Run  7 :   39ms
    Run  8 :   40ms
    Run  9 :   46ms
    Run 10 :   46ms
  

A bit of experimentation explained why this was so much faster. In modern JVMs, the autoboxing cache size is configurable. It used to be that all values from -128 to 127 were cached, but nowadays we can specify a different upper bound. When we run the code with -XX:+AggressiveOpts, it simply increases this value. We can try it out with this little program:

public class FindAutoboxingUpperBound {
  public static void main(String[] args) {
    int i=0;
    while(isSame(i,i)) {
      i++;
    }
    System.out.println("Upper bound is " + (i - 1));
    i = 0;
    while(isSame(i,i)) {
      i--;
    }
    System.out.println("Lower bound is " + (i + 1));
  }

  private static boolean isSame(Integer i0, Integer i1) {
    return i0 == i1;
  }
}    
  

If we run this without any flags, it comes back as:

    Upper bound is 127
    Lower bound is -128
  

With -XX:+AggressiveOpts, we get these results:

    Upper bound is 20000
    Lower bound is -128
  

It is thus pretty obvious why it is so much faster. Instead of constructing new objects, it is simply getting them from the Integer cache.

If we increase the MAX_VALUE to 100000, and run the program with -XX:+AggressiveOpts and -Xmx500m, we get these results:

    Generating values...
    Benchmarking...
    Run  1 : 1622ms
    Run  2 : 1351ms
    Run  3 : 2675ms
    Run  4 : 2037ms
    Run  5 : 2322ms
    Run  6 : 2325ms
    Run  7 : 2195ms
    Run  8 : 2394ms
    Run  9 : 2264ms
    Run 10 : 2109ms
  

However, with my flags of "-Xmx10g -Xms10g -XX:NewSize=9g", it is still as fast as before.

The question is thus, which is the better solution? I think it depends very much on how this is going to be used in the real world. My solution would work with any objects that are being made, not just Integers. However, we also saw the serious downside to my approach. Once we do fill up our young space, the JVM will get very slow, at least for a while.

The solution Olivier was looking for was to set the maximum number of values in the integer cache using the system parameter java.lang.Integer.IntegerCache.high, for example:

    java -Djava.lang.Integer.IntegerCache.high=1000 Quiz44

    Generating values...
    Benchmarking...
    Run  1 :   40ms
    Run  2 :   54ms
    Run  3 :   48ms
    Run  4 :   49ms
    Run  5 :   66ms
    Run  6 :   48ms
    Run  7 :   48ms
    Run  8 :   49ms
    Run  9 :   48ms
    Run 10 :   66ms
  

The -XX:AutoBoxCacheMax=1000 flag would do the same thing. Whilst it is possible to specify the maximum Integer in the cache, no doubt to satisfy some silly microbenchmark to prove that Java is at least as fast as Scala, it is currently not possible to set the minimum Integer.

We talk about this and many other tricks of the trade in our Java Specialists Master Course on the Island of Crete on the 7th of June 2011...

Heinz

Performance Articles Related Java Course

Java Master
Java Concurrency
Design Patterns
In-House Courses



© 2010-2014 Heinz Kabutz - All Rights Reserved Sitemap
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. JavaSpecialists.eu is not connected to Oracle, Inc. and is not sponsored by Oracle, Inc.
@CORE_THE_BAND #RBBJGR