Abstract: In our previous newsletter we posted a puzzle about ArrayList creation in ZGC. Lots of Java programmers sent in their ideas, and we are still looking for the perfect answer. Here are some more thoughts on the idea, whilst we wait for The Z Garbage Collector Book to arrive.
Welcome to the 335th edition of The Java(tm) Specialists' Newsletter. The best season on Crete, spring, is sadly coming to an end tomorrow and we are entering the hot summer months. Yes, I know that the astronomical summer starts on the summer solstice, but the metereological summer begins tomorrow. *sighs*
In 2017, we did a 15k push-up challenge, where we had to do 15000 in a period of 5 months. It was a lot of fun running into buff Java programmers all over the world. We did another less extreme push-up challenge in 2021, and I kept it up until March this year, doing them every day and accumulating 173793. I took a break for ten weeks to let a spasm in my psoas major muscle recover, and will start again tomorrow morning. Or the day after. Or maybe in July ...
I've also been toying with the idea of running an intermittent fasting challenge, starting with 13 hours and then after a month ramping it up to 18 hours. It would be tied to a charity drive, so that we can also raise funds for a worthy cause. However, Helene pointed out that the average Java programmer does not need to lose weight. Build muscle, yes. Plus diet is very personal and is best advised by a good professional dietician. So as fun as that idea sounded to me, it's been mothballed.
javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.
Back in January, our wonderful Sharat Chander mentioned a new book The Z Garbage Collector by Erik Österlund [ISBN 1032977728] . I immediately pre-ordered the paperback version, and also offered to do a technical review of the book. Unfortunately, the book was already in its final stages, so Erik graciously declined my attempt to score a sneak preview. Besides, my knowledge of Z is basically coloured pointers bla bla, extremely low latency, use as default. It's been my collector of choice for JavaSpecialists.eu for years.
Last month's puzzle produced some excellent ideas and discussion. Here are the people who responded, in alphabetical order: Aayush Mishra, Anli Shundi, Antoine Dessaigne, Artur Wojciechowski, Changsheng Yang, Erik Osterlund, Eugene Rabii, Fabrice Bibonne, Franklin Nwankwo, Jaromir Hamala, Kirk Pepperdine, Krzysztof Koziol, Lenny Primak, Robert Crida, Stefanos Karasavvidis, Ulrich Grepel, Vijay Lakshminarayanan, and Vivek Ravi. Thank you so much for taking the time to investigate this weird phenomenon.
I don't have a perfect solution to the puzzle, but I do have some ideas that will hopefully add some perspective. First off, as I tried to hint in the previous newsletter, the benchmark was fundamentally broken. If we run an experiment just once, we are unlikely to get meaningful results in Java. We need to let the code cache warm up by running the experiment several times, and then run it for a while. This is what JMH is good at.
The first obvious problem with the code, and which almost everyone found, is that ArrayList starts with a minimally sized array, and then expands that by 50% whenever it runs out of space. When we create a new ArrayList, it makes it with a zero length shared array. Once we add a single element, it replaces that with an array of size 10. That would be the first array creation. It then increases that by 50% to 15, then to 22, 33, 49, etc. We can see how many times it increases to get to 100m, and how many arrays it creates, here:
public class ArrayResizingEvents {
public void main() {
var targetSize = 100_000_000;
var size = 10;
var resizes = 0;
var junk = 0;
while (size < targetSize) {
junk += size;
size *= 1.5;
resizes++;
}
System.out.println("final size = " + size);
System.out.println("resizes = " + resizes);
System.out.println("junk = " + junk);
}
}
Output is the following:
final size = 105136605 resizes = 40 junk = 210273213
In this case, we make 40 arrays in increasing size, fill them, then throw them away and copy the elements to the new arrays. If we pre-size the ArrayList to 100m, then it beats the pants off LinkedList in all tests.
However, there are a couple of things that no one mentioned. First off, it appears from my experiments that for ZGC, a reference will always take 8 bytes. With all the other GCs, as long as the heap size is less than 32GB, the references are by default compressed to 4 byte pointers.
We can verify this with -XX:+PrintFlagsFinal:
java -XX:+PrintFlagsFinal -XX:+UseCompressedOops -XX:+UseG1GC -version | grep Oops
### bool UseCompressedOops = true {product lp64_product} {command line}
java -XX:+PrintFlagsFinal -XX:+UseCompressedOops -XX:+UseZGC -version | grep Oops
### bool UseCompressedOops = false {product lp64_product} {command line}
LinkedList also suffers from each node needing more memory, but in ArrayList's case, we are allocating much larger arrays.
Secondly, on my machine, I have 96GB of memory. By default, the JVM would get 1/4 of that, thus 24GB. To add 100m elements to an unsized ArrayList would use 1.2GB with compressed oops and 2.3GB without. Also, most of the arrays in terms of total bytes would qualify as humongous objects under G1 and large objects under Z.
Thirdly, I mentioned before that we didn't warmup the JVM properly. I thus introduced a new BetterListPuzzle, which by default does 10 warmup runs and then 30 actual runs. We need to determine whether ArrayList really is in fact slower than LinkedList, once we've warmed up the JVM.
import java.lang.management.ManagementFactory;
import java.util.*;
import java.util.function.*;
public abstract class BetterListPuzzle extends ListPuzzle {
public int warmups() { return 10; }
public int repeats() { return 30; }
public void run(
Supplier<? extends List<Integer>> supplier) {
System.out.println("Warmup: 10x");
for (var w = 0; w < warmups(); w++) {
super.run(supplier);
}
System.out.println("Actual runs: 30x");
var tmb = ManagementFactory.getThreadMXBean();
var tim = System.nanoTime();
var cpu = tmb.getCurrentThreadCpuTime();
var usr = tmb.getCurrentThreadUserTime();
try {
for (var r = 0; r < repeats(); r++) {
super.run(supplier);
}
} finally {
tim = System.nanoTime() - tim;
usr = tmb.getCurrentThreadUserTime() - usr;
cpu = tmb.getCurrentThreadCpuTime() - cpu;
System.out.println("Total times:");
System.out.printf("tim=%d,", (tim / 1000000));
System.out.printf("cpu=%d,", (cpu / 1000000));
System.out.printf("usr=%d%n", (usr / 1000000));
}
}
}
Since I want to also run this in older versions of ZGC, such as Java 17 and 21,
I'm going to change the test classes to use the traditional public
static void main(String... args). I've also included a
test for a pre-sized ArrayList.
public class BetterArrayListPuzzle extends BetterListPuzzle {
public static void main(String... args) {
new BetterArrayListPuzzle()
.run(java.util.ArrayList::new);
}
}
public class BetterLinkedListPuzzle extends BetterListPuzzle {
public static void main(String... args) {
new BetterLinkedListPuzzle()
.run(java.util.LinkedList::new);
}
}
public class BetterArrayListPreSizedPuzzle extends
BetterListPuzzle {
public static void main(String... args) {
new BetterArrayListPreSizedPuzzle()
.run(() -> new java.util.ArrayList<>(
100_000_000));
}
}
Now we can run this in Java 17, 21 and 25, using all the various garbage collectors. Here is the total time taken by 30 iterations on ZGC in Java 25:
class java.util.LinkedList: tim=64172,cpu=5209,usr=4961 class java.util.ArrayList: tim=14047,cpu=14035,usr=13356
We see that in terms of user and cpu time, LinkedList is 2.7x faster than an unsized ArrayList, but in elapsed time, it is 4.6x slower. LinkedList used approximately 160ms of cpu/user time per run, but the elapsed time twice shot past 11 seconds. This is something we have to be aware of with ZGC. If it cannot keep up, we can occasionally see very long pauses. Unsized ArrayList had a worst case elapsed time of 1 second, but what was curious about that is that the cpu and user time were about the same. Thus it wasn't a long GC pause like with the LinkedList. For the other runs, it took about 400ms. That is consistent with the overall total result.
If we also add the pre-sized ArrayList into the experiment, we see the following output:
class java.util.LinkedList:
tim=64172,cpu=5209,usr=4961
class java.util.ArrayList:
tim=14047,cpu=14035,usr=13356
class java.util.ArrayList: (pre-sized)
tim=8688,cpu=8663,usr=8450
Interestingly, the ArrayList, even pre-sized, had higher cpu time than LinkedList, but is a lot cheaper than the unsized ArrayList.
For comparison, here are the total run results for G1, Parallel, and Serial for Java 25:
G1 class java.util.LinkedList: tim=131791,cpu=5964,usr=5694 class java.util.ArrayList: tim=49509,cpu=32747,usr=30783 class java.util.ArrayList: (pre-sized) tim=18437,cpu=18419,usr=17574 Parallel class java.util.LinkedList: tim=32176,cpu=4239,usr=4212 class java.util.ArrayList: tim=8901,cpu=7953,usr=7611 class java.util.ArrayList: (pre-sized) tim=7705,cpu=7659,usr=6943 Serial class java.util.LinkedList: tim=22051,cpu=4683,usr=4537 class java.util.ArrayList: tim=10963,cpu=8050,usr=7702 class java.util.ArrayList: (pre-sized) tim=6842,cpu=6827,usr=6800
I read some excellent explanations from my readers as to why ArrayList could be faster, based on profilers. Adding a new node to a LinkedList is cheap, but we pay the price later when we have to collect the nodes again. Plus, if a LinkedList lives in memory for a while, the nodes will no longer occupy a contiguous space in memory, and thus would also not be fast to navigate.
If we roll back to Java 21, we see something very interesting. Here are two results with ZGC, first the default and then generational:
Z class java.util.LinkedList: tim=12453,cpu=5922,usr=5555 class java.util.ArrayList: tim=12170,cpu=12161,usr=11277 class java.util.ArrayList: (pre-sized) tim=7333,cpu=7325,usr=6541 Z Generational class java.util.LinkedList: tim=37908,cpu=5579,usr=5306 class java.util.ArrayList: tim=13328,cpu=13317,usr=12802 class java.util.ArrayList: (pre-sized) tim=7676,cpu=7667,usr=7085
For LinkedList, the risk of long STW pauses has gone up dramatically for Z Generational. In my experiment, the elapsed time was 3x slower. We also see that the ArrayList was a bit slower, but not by much.
The most dramatic issue is if we go from Java 21 to Java 25 and we happen to have very large and complex object graphs. LinkedList in this case went from 12453ms in non-generational Java 21 to a whopping 64172ms in generational Java 25. Be careful of "Allocation Stall". We see here that ZGC immediately starts a Major Collection in the background, but if we keep on allocating new objects, it eventually does an "Allocation Stall", where all our threads are paused with stop-the-world (STW) until it has finished the collection. For my 24 GB heap of tiny node objects, this takes several seconds.
Z
[0.005s][info][gc] Using The Z Garbage Collector
Warmup: 10x
class java.util.LinkedList:
[0.349s][info][gc] GC(0) Major Collection (Warmup)
tim=555,cpu=554,usr=374
tim=555,cpu=555,usr=365
tim=524,cpu=523,usr=331
tim=527,cpu=527,usr=330
tim=532,cpu=529,usr=331
tim=522,cpu=521,usr=330
[5.171s][info][gc] GC(1) Minor Collection (Allocation Stall)
[9.123s][info][gc] Allocation Stall (main) 5642.429ms
[9.130s][info][gc] GC(1) Minor Collection (Allocation Stall)
24576M(100%)->1638M(7%) 3.959s
[9.130s][info][gc] GC(0) Major Collection (Warmup)
2264M(9%)->1642M(7%) 8.781s
[9.134s][info][gc] GC(2) Minor Collection (Allocation Rate)
tim=5976,cpu=322,usr=235
tim=270,cpu=263,usr=212
tim=276,cpu=262,usr=213
tim=277,cpu=264,usr=213
Actual runs: 30x
tim=268,cpu=266,usr=216
tim=267,cpu=267,usr=218
[11.666s][info][gc] Allocation Stall (main) 922.321ms
[11.950s][info][gc] GC(2) Minor Collection (Allocation Rate)
1736M(7%)->24576M(100%) 2.816s
[11.950s][info][gc] GC(3) Minor Collection (Allocation Stall)
[16.925s][info][gc] Allocation Stall (main) 5259.733ms
[16.929s][info][gc] GC(3) Minor Collection (Allocation Stall)
24576M(100%)->5468M(22%) 4.980s
[16.929s][info][gc] GC(4) Major Collection (Warmup)
tim=6424,cpu=239,usr=200
tim=162,cpu=162,usr=158
tim=164,cpu=164,usr=158
tim=162,cpu=162,usr=159
tim=236,cpu=236,usr=201
[22.277s][info][gc] Allocation Stall (main) 4345.753ms
[22.278s][info][gc] Allocation Stall (main) 0.135ms
[22.868s][info][gc] GC(5) Minor Collection (Allocation Stall)
[28.839s][info][gc] Allocation Stall (main) 6561.645ms
[28.842s][info][gc] GC(5) Minor Collection (Allocation Stall)
24576M(100%)->9280M(38%) 5.973s
[28.845s][info][gc] GC(6) Minor Collection (Allocation Rate)
tim=11187,cpu=278,usr=224
When creating ArrayLists, we never encounter the "Allocation Stall". However, we do see something interesting. When a "Minor Collection" occurs, the current thread's cpu and user times also go up. See "GC(7)" in this output. I would have expected the elapsed time to be a lot higher than cpu and user. Could it be that the current thread is somehow used for the large Z page harvesting? I guess I will have to wait for The Z Garbage Collector Book [ISBN 1032977728] to arrive.
tim=393,cpu=393,usr=387
class java.util.ArrayList:
tim=390,cpu=390,usr=384
class java.util.ArrayList:
tim=440,cpu=440,usr=398
class java.util.ArrayList:
[15.150s][info][gc] GC(7) Minor Collection (Allocation Rate)
[15.408s][info][gc] GC(7) Minor Collection (Allocation Rate)
15964M(65%)->1710M(7%) 0.258s
tim=1263,cpu=1263,usr=1162
class java.util.ArrayList:
tim=388,cpu=388,usr=383
class java.util.ArrayList:
tim=388,cpu=388,usr=382
Running the ArrayList bench in Async Profiler for two minutes shows the following top culprits:
ns percent samples top ---------- ------- ------- --- 69440000000 53.78% 6944 java.util.ArrayList.add 29490000000 22.84% 2949 copy_oop_uninit_f 14560000000 11.28% 1456 ZUtils::fill 3090000000 2.39% 309 ZMark::follow_array_elements_small 2690000000 2.08% 269 ZBarrierSetRuntime::load_barrier_on_oop_field_preloaded 2670000000 2.07% 267 ZBarrier::mark_from_young_slow_path 2230000000 1.73% 223 zero_blocks 1270000000 0.98% 127 ZBarrier::relocate_or_remap
And this is what we see with a pre-sized ArrayList:
61490000000 43.17% 6149 java.util.ArrayList.add 21920000000 15.39% 2192 ZBarrierSetRuntime::load_barrier_on_oop_array 11490000000 8.07% 1149 forward_copy_longs 11070000000 7.77% 1107 void ZMark::mark_object<true, true, false, false> 9610000000 6.75% 961 __bzero 5300000000 3.72% 530 void ZMark::mark_object<false, true, false, true> 5130000000 3.60% 513 ZRelocate::relocate_object 4800000000 3.37% 480 ZBarrier::mark_barrier_on_oop_slow_path 4490000000 3.15% 449 ZBarrier::mark_barrier_on_oop_array 3740000000 2.63% 374 ZBarrier::relocate_or_mark 1540000000 1.08% 154 zero_blocks
Kind regards
Heinz
We are always happy to receive comments from our readers. Feel free to send me a comment via email or discuss the newsletter in our JavaSpecialists Slack Channel (Get an invite here)
We deliver relevant courses, by top Java developers to produce more resourceful and efficient programmers within their organisations.