Abstract: The garbage collector can cause our program to slow down at inappropriate times. In this newsletter we look at how we can delay the GC so that it never runs during our measurements, thus producing reliable results. Then, instead of ever collecting the dead objects, we simply shut down the JVM and start again.
Welcome to the 191st issue of The Java(tm) Specialists' Newsletter, sent to you from the stunning Island of Crete. A Greek friend remarked on Sunday how surprised he was at my recent progress with the Greek language. Michaelis had been encouraging me for the last 4 years to knuckle down and put in a bit of an effort.
Learning a new language requires a good memory, which has never been my strong point. I discovered in grade 8 history that I hardly had any short-term memory at all. That, coupled with an illegible scrawl, was a serious handicap in school. As a result, my grades were never particularly good. I was also overconfident and seldom read questions properly before writing my answer. The only time I ever got a distinction in school was for my final exam. It is funny that the rest of my 12 years of school do not count for anything. Nobody was ever interested in how many PT classes I bunked or that I failed "hand writing" in grade 5; all they wanted to know was my final score.
Greek is not the first language I learned. In high school, I decided to learn an African language called isiXhosa, spoken by our former president Nelson Mandela. This was in the years of apartheid, when it was extremely unfashionable for a melanin deprived individual to learn an African language. In my class, I was the most fluent speaker. My secret? I would walk home from school and speak to anybody who cared to listen. I made a lot of mistakes and most of the time I spoke about the same things. "Where are you from? Where are you going? It is hot today". This loosened my tongue and gave me confidence. I do the same with Greek.
My second secret was that I wrote a computer program in GWBasic to help me remember the vocabulary. The program would show me a word and then I would have to write it in isiXhosa. If I made a mistake, I would get punished by having to type it three times. I wrote a similar Java program for Greek, which was challenging because of the strange character set. What surprises me is how this program flashes the words into my medium term memory, without first going into the short-term memory. Even though I don't actively remember the word, it is there in my brain and I can recall it, as long as I do not think too hard. Another surprise is that I am finding it as difficult (or as easy) to learn Greek as I found it to learn isiXhosa, when I was 13 years old. Some expats tell me that they are too old to learn. But it is just as hard for me to remember anything as when I was 13. It is a major effort for me to recall even a single word.
So, in conclusion of this very long opening, if you are finding it hard learning new languages, computer or natural, you are in good company. It is hard for me too. The trick is to use the language, then eventually something will stick :-) (Oh and just for interest, "use" in Greek is xrhsimopoiw :-))
javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.
A few weeks ago, our French Java Specialist Master Course instructor sent us an interesting puzzle from his brilliant The Coder's Breakfast blog. I love the name, as it evokes good French coffee and a croissant whilst puzzling over the latest Java quiz.
I will rephrase his question a bit, because I think something might have gotten lost in the translation from French into English. With minimal Java code changes, how can you make the Quiz44 class run at least 10x faster?
import java.util.Random;

public class Quiz44 {
  private static final int NB_VALUES = 10000000;
  private static final int MAX_VALUE = 1000;
  private static final int NB_RUNS = 10;

  public static void main(String... args) {
    Integer[] boxedValues = new Integer[NB_VALUES];
    int[] values = initValues();
    System.out.println("Benchmarking...");
    for (int run = 1; run <= NB_RUNS; run++) {
      long t1 = System.currentTimeMillis();
      for (int i = 0; i < NB_VALUES; i++) {
        boxedValues[i] = values[i];
      }
      long t2 = System.currentTimeMillis();
      System.out.printf("Run %2d : %4dms%n", run, t2 - t1);
    }
  }

  /**
   * Nothing important here, just values init.
   */
  private static int[] initValues() {
    System.out.println("Generating values...");
    int[] values = new int[NB_VALUES];
    Random random = new Random();
    for (int i = 0; i < values.length; i++) {
      values[i] = random.nextInt(MAX_VALUE);
    }
    return values;
  }
}
When I run this on my MacBook Pro, with -Xmx500m, I see the following results:
Generating values...
Benchmarking...
Run 1 : 1657ms
Run 2 : 2879ms
Run 3 : 2265ms
Run 4 : 2217ms
Run 5 : 2211ms
Run 6 : 2216ms
Run 7 : 930ms
Run 8 : 2216ms
Run 9 : 2241ms
Run 10 : 2216ms
The average is 2105ms, with a variance of 254,000, caused mostly by the outlier in Run 7.
There are some obvious changes to the code that would make it complete faster. For example, we could reduce the values of NB_VALUES or MAX_VALUE, but I think Olivier would consider that cheating. However, there is also a non-obvious change we could make.
For example, we could add a System.gc() before the first call to System.currentTimeMillis().
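In Quiz44, that is the t1 measurement at the top of the benchmark loop, so the change would look something like this (my own sketch, not Olivier's exact code):

    for (int run = 1; run <= NB_RUNS; run++) {
      System.gc(); // collect the garbage from the previous run before we start timing
      long t1 = System.currentTimeMillis();
      for (int i = 0; i < NB_VALUES; i++) {
        boxedValues[i] = values[i];
      }
      long t2 = System.currentTimeMillis();
      System.out.printf("Run %2d : %4dms%n", run, t2 - t1);
    }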
In that case, the test runs with the following values:
Generating values...
Benchmarking...
Run 1 : 1720ms
Run 2 : 2645ms
Run 3 : 2240ms
Run 4 : 927ms
Run 5 : 918ms
Run 6 : 930ms
Run 7 : 965ms
Run 8 : 914ms
Run 9 : 903ms
Run 10 : 913ms
This time the average is 1308 with a variance of 430,000. Not a bad speedup at all. After all, it is now 38% faster than before. It should be obvious that a large proportion of time is wasted with collecting all these unnecessary objects. If we run the code again with -verbose:gc on, we can see the costs involved:
Generating values...
[GC 40783K->39473K(83008K), 0.0041315 secs]
[Full GC 39473K->39465K(83008K), 0.0696936 secs]
[GC 78528K(123980K), 0.0002081 secs]
[GC 79209K(123980K), 0.0002467 secs]
Benchmarking...
[Full GC 79197K->78518K(150000K), 0.0828044 secs]
[GC 78518K(150000K), 0.0001254 secs]
[GC 78995K(150000K), 0.0006032 secs]
[GC 95542K->103000K(150000K), 0.3685923 secs]
[GC 103000K(150000K), 0.0021857 secs]
[GC 120024K->129617K(150000K), 0.1627059 secs]
[GC 146641K->155227K(172464K), 0.1826291 secs]
[GC 172251K->180831K(198000K), 0.1499428 secs]
[GC 197855K->206365K(223536K), 0.1794985 secs]
[GC 223389K->231900K(249072K), 0.1751786 secs]
[GC 248924K->257435K(274800K), 0.1594760 secs]
[GC 274459K->282969K(300144K), 0.2015765 secs]
Run 1 : 1774ms
[Full GC 283309K->282255K(300144K), 0.8866413 secs]
[GC 315119K->330066K(506684K), 0.3946753 secs]
[GC 364178K->398754K(506684K), 0.3282639 secs]
[GC 398754K(506684K), 0.0043726 secs]
[GC 432866K->449971K(506684K), 0.3649566 secs]
[Full GC 484083K->282879K(506684K), 1.1812640 secs]
Run 2 : 2708ms
[Full GC 284935K->282881K(507776K), 1.0651874 secs]
[GC 316993K->345922K(507776K), 0.2532635 secs]
[GC 380034K->399611K(507776K), 0.2922708 secs]
[GC 400297K(507776K), 0.0042360 secs]
[GC 433723K->450807K(507776K), 0.3415709 secs]
[Full GC 484919K->282884K(507776K), 1.0057979 secs]
Run 3 : 2324ms
[Full GC 283571K->282884K(507776K), 0.8885789 secs]
[GC 316996K->347295K(507776K), 0.2598463 secs]
[GC 381407K->400631K(507776K), 0.3051485 secs]
[GC 401318K(507776K), 0.0042195 secs]
[GC 434743K->451850K(507776K), 0.3479674 secs]
Run 4 : 1024ms
[Full GC 485962K->282884K(507776K), 1.0040985 secs]
[GC 316996K->347295K(507776K), 0.2591832 secs]
[GC 381407K->400631K(507776K), 0.2888177 secs]
[GC 401318K(507776K), 0.0042504 secs]
[GC 434743K->451850K(507776K), 0.3348886 secs]
Run 5 : 994ms
[Full GC 485962K->282884K(507776K), 1.0128758 secs]
[GC 316996K->347295K(507776K), 0.2580010 secs]
[GC 381407K->400631K(507776K), 0.2884526 secs]
[GC 402692K(507776K), 0.0060617 secs]
[GC 434743K->451848K(507776K), 0.3290486 secs]
Run 6 : 1004ms
[Full GC 485960K->282884K(507776K), 1.0040235 secs]
[GC 316996K->347295K(507776K), 0.2596790 secs]
[GC 381407K->400631K(507776K), 0.2851338 secs]
[GC 401318K(507776K), 0.0042191 secs]
[GC 434743K->451840K(507776K), 0.3340752 secs]
Run 7 : 989ms
[Full GC 485952K->282884K(507776K), 1.0022637 secs]
[GC 316996K->347295K(507776K), 0.2612456 secs]
[GC 381407K->400631K(507776K), 0.2933666 secs]
[GC 401318K(507776K), 0.0043201 secs]
[GC 434743K->451842K(507776K), 0.3280430 secs]
Run 8 : 997ms
[Full GC 485954K->282884K(507776K), 1.0126833 secs]
[GC 316996K->347295K(507776K), 0.2569432 secs]
[GC 381407K->400631K(507776K), 0.2866691 secs]
[GC 401318K(507776K), 0.0042206 secs]
[GC 434743K->451842K(507776K), 0.3418867 secs]
Run 9 : 1000ms
[Full GC 485954K->282884K(507776K), 1.0074827 secs]
[GC 316996K->347295K(507776K), 0.2594386 secs]
[GC 381407K->400631K(507776K), 0.2966164 secs]
[GC 401318K(507776K), 0.0042397 secs]
[GC 434743K->451840K(507776K), 0.3391696 secs]
Run 10 : 1009ms
[Full GC 485952K->400K(507776K), 0.3041337 secs]
The full GCs that appear at the start of each run are the artificial GCs that we introduced to make our test run faster. If we exclude them from the timing, we can work out how much of each run is spent collecting garbage:
Run   GC (ms)   Program (ms)   Total (ms)
  1      1583            191         1774
  2      2274            434         2708
  3      1897            427         2324
  4       917            107         1024
  5       887            107          994
  6       882            122         1004
  7       883            106          989
  8       887            110          997
  9       890            110         1000
 10       899            110         1009
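For run 1, for example, the GC column is the sum of the collection times logged between "Benchmarking..." and the "Run 1" line, excluding the artificial full GC of 0.0828 seconds: the remaining pauses add up to roughly 1.58 seconds, leaving 1774 - 1583 = 191ms for the actual copying of the values.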
We can get even better figures from our program if, instead of just elapsed time (System.currentTimeMillis()), we also measure user CPU time (ThreadMXBean.getCurrentThreadUserTime()). Here are the run times, with the elapsed time followed by the user CPU time for each run:
Generating values...
Benchmarking...
Run 1 : 1667ms 106ms
Run 2 : 2576ms 114ms
Run 3 : 2199ms 109ms
Run 4 : 897ms 109ms
Run 5 : 899ms 109ms
Run 6 : 899ms 109ms
Run 7 : 903ms 109ms
Run 8 : 889ms 109ms
Run 9 : 892ms 109ms
Run 10 : 896ms 108ms
Our average user CPU time is now 109ms with a variance of 4.
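In case you would like to reproduce these figures, here is roughly how the timing can be changed inside Quiz44; treat it as a sketch, since my exact output formatting may have differed:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    // ...inside main(), replacing the benchmark loop of Quiz44
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    for (int run = 1; run <= NB_RUNS; run++) {
      System.gc();
      long t1 = System.currentTimeMillis();
      long cpu1 = mx.getCurrentThreadUserTime(); // user CPU time in nanoseconds
      for (int i = 0; i < NB_VALUES; i++) {
        boxedValues[i] = values[i];
      }
      long t2 = System.currentTimeMillis();
      long cpu2 = mx.getCurrentThreadUserTime();
      System.out.printf("Run %2d : %4dms %4dms%n",
          run, t2 - t1, (cpu2 - cpu1) / 1000000);
    }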
Let's get back to the problem. How do we make the test run 10 times faster with minimal code changes? My first approach was to simply delay the cost of GC until after the test has run through. Since I have 8GB of memory in my laptop, I can beef up the initial and maximum size of the heap to a huge amount. I also set the NewSize to a crazy amount, so that the GC never needs to run whilst the test is running.
$ java -showversion -Xmx10g -Xms10g -XX:NewSize=9g Quiz44
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode)
Generating values...
Benchmarking...
Run 1 : 198ms
Run 2 : 203ms
Run 3 : 179ms
Run 4 : 192ms
Run 5 : 179ms
Run 6 : 191ms
Run 7 : 183ms
Run 8 : 192ms
Run 9 : 180ms
Run 10 : 197ms
This gave me an average of 189ms with a slightly high variance of 75. Since my initial results had an average of 2105, I achieved Olivier's target of being at least 10x faster. In case you think what I've done is really stupid, I know of companies who use this trick to avoid all GC pauses. They try to never construct objects. They make the young generation very large. After running for several hours, they quit the application and start again. This gives them very predictable run times. So it's not as harebrained an idea as it appears at first.
In our case, things start to go wrong when we increase the number of runs. For example, let's change NB_RUNS to 100:
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07-334-10M3326)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02-334, mixed mode)
Generating values...
Benchmarking...
Run 1 : 188ms
Run 2 : 214ms
Run 3 : 234ms
Run 4 : 191ms
Run 5 : 203ms
Run 6 : 193ms
Run 7 : 207ms
Run 8 : 188ms
Run 9 : 208ms
Run 10 : 183ms
Run 11 : 199ms
Run 12 : 183ms
Run 13 : 200ms
Run 14 : 186ms
Run 15 : 202ms
Run 16 : 192ms
Run 17 : 216ms
Run 18 : 191ms
Run 19 : 211ms
Run 20 : 188ms
Run 21 : 210ms
Run 22 : 190ms
Run 23 : 210ms
Run 24 : 187ms
Run 25 : 211ms
Run 26 : 186ms
Run 27 : 208ms
Run 28 : 188ms
Run 29 : 205ms
Run 30 : 188ms
Run 31 : 216ms
Run 32 : 189ms
Run 33 : 210ms
Run 34 : 186ms
Run 35 : 215ms
Run 36 : 190ms
Run 37 : 222ms
Run 38 : 191ms
Run 39 : 209ms
Run 40 : 186ms
Run 41 : 1068ms
Run 42 : 7281ms
Run 43 : 8705ms
Run 44 : 8583ms
Run 45 : 7675ms
Run 46 : 6346ms
Run 47 : 1625ms
Run 48 : 896ms
Run 49 : 834ms
Run 50 : 948ms
Run 51 : 554ms
Run 52 : 901ms
Run 53 : 904ms
Run 54 : 4253ms
Run 55 : 815ms
Run 56 : 111ms
Run 57 : 127ms
Run 58 : 143ms
Run 59 : 159ms
Run 60 : 117ms
Run 61 : 138ms
Run 62 : 136ms
Run 63 : 140ms
Run 64 : 128ms
Run 65 : 131ms
Run 66 : 136ms
Run 67 : 147ms
Run 68 : 122ms
Run 69 : 127ms
Run 70 : 135ms
Run 71 : 135ms
Run 72 : 135ms
Run 73 : 134ms
Run 74 : 126ms
Run 75 : 125ms
Run 76 : 217ms
Run 77 : 765ms
Run 78 : 2973ms
Run 79 : 2459ms
Run 80 : 2180ms
Run 81 : 4733ms
Run 82 : 7484ms
Run 83 : 3177ms
Run 84 : 6209ms
Run 85 : 5129ms
Run 86 : 1051ms
Run 87 : 5177ms
Run 88 : 5515ms
Run 89 : 6217ms
Run 90 : 8865ms
Run 91 : 7981ms
Run 92 : 3578ms
Run 93 : 3472ms
Run 94 : 3645ms
Run 95 : 4006ms
Run 96 : 3933ms
Run 97 : 4211ms
Run 98 : 3127ms
Run 99 : 3714ms
Run 100 : 3811ms
This gives us an average of 1656 and variance of 6,000,000. There are also times when the system is almost unusable because of the time wasted in GC.
One of our Java Specialist Club members, Marek Bickos, quickly discovered a flag that made this code run much faster. Simply running the code with -XX:+AggressiveOpts caused no GC at all:
Generating values...
Benchmarking...
Run 1 : 56ms
Run 2 : 45ms
Run 3 : 39ms
Run 4 : 50ms
Run 5 : 66ms
Run 6 : 39ms
Run 7 : 39ms
Run 8 : 40ms
Run 9 : 46ms
Run 10 : 46ms
A bit of experimentation explained why this was so much faster. In modern JVMs, the autoboxing cache size is configurable. It used to be that all values from -128 to 127 were cached, but nowadays we can specify a different upper bound. When we run the code with -XX:+AggressiveOpts, it simply increases this value. We can try it out with this little program:
public class FindAutoboxingUpperBound {
  public static void main(String... args) {
    int i = 0;
    while (isSame(i, i)) {
      i++;
    }
    System.out.println("Upper bound is " + (i - 1));
    i = 0;
    while (isSame(i, i)) {
      i--;
    }
    System.out.println("Lower bound is " + (i + 1));
  }

  private static boolean isSame(Integer i0, Integer i1) {
    return i0 == i1;
  }
}
If we run this without any flags, it comes back as:
Upper bound is 127
Lower bound is -128
With -XX:+AggressiveOpts, we get these results:
Upper bound is 20000
Lower bound is -128
It is thus pretty obvious why it is so much faster. Instead of constructing new objects, it is simply getting them from the Integer cache.
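Remember that autoboxing compiles boxedValues[i] = values[i] into boxedValues[i] = Integer.valueOf(values[i]), and valueOf() consults the cache. Simplified, the OpenJDK implementation does roughly the following:

    // Simplified sketch of java.lang.Integer.valueOf(int) in OpenJDK:
    // values inside the cached range are shared, everything else is
    // allocated on the heap.
    public static Integer valueOf(int i) {
      if (i >= -128 && i <= IntegerCache.high)  // high is 127 by default
        return IntegerCache.cache[i + 128];     // reuse the cached instance
      return new Integer(i);                    // construct a new object
    }

As long as all our values fall within the cached range, no new Integer objects are constructed and the GC has nothing to do.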
If we increase the MAX_VALUE to 100000, and run the program with -XX:+AggressiveOpts and -Xmx500m, we get these results:
Generating values...
Benchmarking...
Run 1 : 1622ms
Run 2 : 1351ms
Run 3 : 2675ms
Run 4 : 2037ms
Run 5 : 2322ms
Run 6 : 2325ms
Run 7 : 2195ms
Run 8 : 2394ms
Run 9 : 2264ms
Run 10 : 2109ms
However, with my flags of "-Xmx10g -Xms10g -XX:NewSize=9g", it is still as fast as before.
The question is thus, which is the better solution? I think it depends very much on how this is going to be used in the real world. My solution would work with any objects that are being made, not just Integers. However, we also saw the serious downside to my approach. Once we do fill up our young space, the JVM will get very slow, at least for a while.
The solution Olivier was looking for was to set the highest value kept in the Integer cache using the system property java.lang.Integer.IntegerCache.high, for example:
java -Djava.lang.Integer.IntegerCache.high=1000 Quiz44
Generating values...
Benchmarking...
Run 1 : 40ms
Run 2 : 54ms
Run 3 : 48ms
Run 4 : 49ms
Run 5 : 66ms
Run 6 : 48ms
Run 7 : 48ms
Run 8 : 49ms
Run 9 : 48ms
Run 10 : 66ms
The -XX:AutoBoxCacheMax=1000 flag would do the same thing. Whilst it is possible to specify the maximum Integer in the cache, no doubt to satisfy some silly microbenchmark to prove that Java is at least as fast as Scala, it is currently not possible to set the minimum Integer.
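In case you want to try the flag yourself, the invocation would simply be:

java -XX:AutoBoxCacheMax=1000 Quiz44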
Heinz
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, by top Java developers, to produce more resourceful and efficient programmers within their organisations.