Java Specialists' Java Training Europehome of the java specialists' newsletter

The Java Specialists' Newsletter
Issue 217b2014-02-11 Category: Performance Java version: Java 8

GitHub Subscribe Free RSS Feed

Parallel Parking (Follow-up)

by Dr. Heinz M. Kabutz
Abstract:
Heavily contended concurrent constructs may benefit from a small pause after a CAS failure, but this may lead to worse performance if we do more work in between each update attempt. In this follow-up newsletter, we show how adding CPU consumption can change our performance characteristics.

I do not usually send a follow-up newsletter, but this one is too important to keep waiting for a long time. In my last newsletter, I showed how parking on CAS failure could improve our performance on heavily contended CAS code. I also pointed out twice how one would need to carefully tune and choose the delay according to specific performance characteristics of one's system. But the message still did not come through. As a result, I decided to send a follow-up email to my previous newsletter.

NEW: We have revised our "Advanced Topics" course, covering Reflection, Java NIO, Data Structures, Memory Management and several other useful topics for Java experts to master. 2 days of extreme fun and learning. Extreme Java - Advanced Topics.

Parallel Parking (Follow-up)

Martin Thompson kindly pointed out that instead of using the park function, we really should use the "86 PAUSE instruction inside spin loops to avoid speculative execution that cause latency in such tests and reduce the waste of CPU resource when spinning." Unfortunately this function is not available in Java. Maybe in future we can have a Thread.pause(). Thread.yield() is not the same, as it is OS dependent. Here is the C++ code from the file jvm.cpp:

JVM_ENTRY(void, JVM_Yield(JNIEnv *env, jclass threadClass))
  JVMWrapper("JVM_Yield");
  if (os::dont_yield()) return;
#ifndef USDT2
  HS_DTRACE_PROBE0(hotspot, thread__yield);
#else /* USDT2 */
  HOTSPOT_THREAD_YIELD();
#endif /* USDT2 */
  // When ConvertYieldToSleep is off (default), this matches the
  // classic VM use of yield.
  // Critical for similar threading behaviour
  if (ConvertYieldToSleep) {
    os::sleep(thread, MinSleepInterval, false);
  } else {
    os::yield();
  }
JVM_END
  

The flag -XX:+DontYieldALot effectively turns the Thread.yield() into a no-op on all the platforms I looked at.

In order to demonstrate the danger of having a delay that is too long in a less contended lock, I modified the benchmark to add some busy CPU work in between each of the update methods using the JMH BlackHole.consumeCPU() method. Here is my new Benchmarking class. The larger the TOKENS environment variable, the more CPU work is done:

package eu.javaspecialists.microbenchmarks.numberrange;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.logic.*;

import static java.util.concurrent.TimeUnit.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = SECONDS)
@Fork(5)
@State(Scope.Benchmark)
@Threads(8)
public class Microbenchmark0 {
  public static final int TOKENS = Integer.getInteger("TOKENS");
  private final NumberRangeSynchronized nrSync =
      new NumberRangeSynchronized();
  private final NumberRangeAtomic nrAtomic =
      new NumberRangeAtomic();
  private final NumberRangeAtomicWithPark nrAtomicPark =
      new NumberRangeAtomicWithPark();

  @GenerateMicroBenchmark
  public void test_0_1_synchronized() {
    nrSync.setUpper(100);
    BlackHole.consumeCPU(TOKENS);
    nrSync.setLower(10);
    BlackHole.consumeCPU(TOKENS);
    nrSync.setLower(50);
    BlackHole.consumeCPU(TOKENS);
    nrSync.setUpper(200);
    BlackHole.consumeCPU(TOKENS);
    nrSync.setLower(30);
    BlackHole.consumeCPU(TOKENS);
  }

  @GenerateMicroBenchmark
  public void test_0_2_atomic() {
    nrAtomic.setUpper(100);
    BlackHole.consumeCPU(TOKENS);
    nrAtomic.setLower(10);
    BlackHole.consumeCPU(TOKENS);
    nrAtomic.setLower(50);
    BlackHole.consumeCPU(TOKENS);
    nrAtomic.setUpper(200);
    BlackHole.consumeCPU(TOKENS);
    nrAtomic.setLower(30);
    BlackHole.consumeCPU(TOKENS);
  }

  @GenerateMicroBenchmark
  public void test_0_3_atomic_park() {
    nrAtomicPark.setUpper(100);
    BlackHole.consumeCPU(TOKENS);
    nrAtomicPark.setLower(10);
    BlackHole.consumeCPU(TOKENS);
    nrAtomicPark.setLower(50);
    BlackHole.consumeCPU(TOKENS);
    nrAtomicPark.setUpper(200);
    BlackHole.consumeCPU(TOKENS);
    nrAtomicPark.setLower(30);
    BlackHole.consumeCPU(TOKENS);
  }
}
  

In order to show the data without too much clutter, I've left out the "mean error" values, but all the raw data is available on demand. These results are when we use 180 microseconds as a backoff delay.

   -DTOKEN=     0   10   20   40   80  160   320   640  1280  2560
synchronized 1623 2626 3444 5278 5924 6182  5429  8146 15067 29407
atomic       4229 3412 3509 3295 2733 2053  3663  7160 13947 27854
atomic_park   485  844 1498 3507 6504 9917 14317 18814 27844 38242
  

The atomic parking version is faster when the update method is extremely highly contended. However, as the contention lessens, the parking starts to take over as a main denominator. We see that in some cases, the parking version is about 4x slower than the plain CAS version.

I also ran the same experiment with less threads than 8. Here are the results with 4 threads on my i7 machine:

   -DTOKEN=     0   10   20   40   80  160  320  640  1280  2560
synchronized  804 1227 1737 2593 2706 2333 3517 6775 13233 26155
atomic       1604 1515 1484 1190 1013 1749 3314 6490 12878 25539
atomic_park   282  476  753 1857 2260 2060 5457 7307 13847 27159
  

And here are the results with 2 threads on my i7 machine:

  -DTOKEN=     0  10  20   40   80  160  320  640  1280  2560
synchronized 390 605 555 1027 1011 1797 3346 6479 12617 24929
atomic       615 583 692  472  870 1651 3162 6268 12353 24670
atomic_park  126 197 338  697 1093 2202 3531 7303 12723 25001
  

Running the same experiment with 8 threads on my 2-4-1 server (total of 8 cores) gave the following results:

   -DTOKEN=     0   10   20   40   80  160  320   640  1280  2560
synchronized 6346 8019 9027 7760 7202 8517 6562 12632 24941 49250
atomic       1430 1129 1147 1237 1589 2956 5758 11238 22410 44647
atomic_park   795 2007 3760 6229 7140 7015 7565 13423 25214 49557
  

Thus the technique for pausing for a short while after a CAS failure would only help you if the update code is very heavily contended, otherwise it would be counterproductive.

As with most benchmarks like this, we are experiencing lots of interesting effects as a result of cache miss costs. Please don't add parks to all your classes and expect your system to magically become 10x faster. As always, we want to tune our identified and proven bottlenecks. If you'd like to know more on how to do that, please have a look at our Java Specialist Master Course.

Kind regards from Crete

Heinz

Performance Articles Related Java Course

Extreme Java - Concurrency and Performance for Java 8
Extreme Java - Advanced Topics for Java 8
Design Patterns
In-House Courses

© 2010-2016 Heinz Kabutz - All Rights Reserved Sitemap
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. JavaSpecialists.eu is not connected to Oracle, Inc. and is not sponsored by Oracle, Inc.