Java Specialists' Java Training Europehome of the java specialists' newsletter

The Java Specialists' Newsletter
Issue 2172014-02-07 Category: Performance Java version: Java 8

GitHub Subscribe Free RSS Feed

Parallel Parking

by Dr. Heinz M. Kabutz
Abstract:
What is faster? Synchronized or Atomic compare-and-set? In this newsletter we look at some surprising results when we run Java on the new i7 chip.

Welcome to the 217th issue of The Java(tm) Specialists' Newsletter. In my last newsletter, I mentioned we'd be heading off in search of snow on the Island of Crete. It took a bit of effort, but we were successful! Have a look at a short clip of our adventure (make sure to watch in HD).

From the 25-29 August 2014, we are running the 4th Java Specialists' Symposium aka JCrete 2014. [Note that the venue has changed to the Kolymbari area of Chania - more information coming soon]. Here are some comments from previous attendees:

"In conferences with huge numbers of attendees you'll get great canned talks and lots of merchandise from the sponsors. You might even get a good meal at the hotel restaurant. If that's what you like don't go to JCrete. In JCrete it's extremely, and sunny, you only get 3 talks a day and wifi is poor. You talk with other attendees, get to know them while having a salad at a small taverna by the beach (sometimes a taverna _in_ the beach). You get to meet the great people like I met Maurice Naftalin, Heinz Kabutz or Martin Thompson while drinking home-made Raki. You get to spend all-nighters discussing last Vimeo hit by Bret Victor with people worried about our future as developer/computer scientists. That and great small-group sessions on garbage collecting, language abusing, concurrency, java byte code and many other subjects is what you get at JCrete. But if you are the type that enjoys the Martinis at the Oracle Evening Party at QCon then JCrete is not for you." - Ignasi Marimon-Clos i Sunyol (Spain)

And another one by a regular attendee:

"JCrete is the best combination of multiple worlds. In the morning you are engaged in interesting topics with highly intelligent professionals, often learning new stuff or getting interesting views on existing technology. In the afternoon you spend time with your family or fellow geeks, snorkeling, hiking or just relaxing. Later you eventually gather with the rest of the hard-cores in the foo-bar discussing java, politics, railways or whatever. And at all times you are supported by beautiful surroundings, Greek hospitality and great local food!" - Jesper Udby (Denmark)

Oh and last, but not least, a short video to demonstrate the types of things we do at our Unconference. It is very different to other more traditional conferences. We might only have three sessions per day, but the discussions carry on whilst we are swimming in the sea or enjoying a delicious meal. Places are limited to 70 attendees, so please put your name down on the list ASAP. There are no conference fees, but you need to cover your own travel expenses (flight, food, hotel, entertainment).

NEW: We have revised our "Advanced Topics" course, covering Reflection, Java NIO, Data Structures, Memory Management and several other useful topics for Java experts to master. 2 days of extreme fun and learning. Extreme Java - Advanced Topics.

Parallel Parking

After waiting far too long, I finally upgraded to a quad-core i7. Proud to show off my shiny new hardware and hack programming skills to some Concurrency Specialist Course students, I fired it up and waited for them to be dazzled by the awesome speed. But let me go back a few steps. We ended day one of the concurrency course with various versions of NumberRange. The idea was inspired by Brian Goetz's excellent book Java Concurrency in Practice. We start with an interface:

package eu.javaspecialists.microbenchmarks.numberrange;

public interface NumberRange {
  boolean setLower(int newLower);
  boolean setUpper(int newUpper);
  boolean isInRange(int number);
  boolean checkCorrectRange();
}
  

In JCiP, Brian explains that this version of NumberRange is broken, because even though the lower and upper objects are in themselves threadsafe, we have an invariant across both fields that is not protected through some form of synchronization:

package eu.javaspecialists.microbenchmarks.numberrange;

import java.util.concurrent.atomic.*;

// does not work correctly - we could get upper < lower!
public class NumberRangeAtomicBroken implements NumberRange {
  private final AtomicInteger lower = new AtomicInteger(0);
  private final AtomicInteger upper = new AtomicInteger(0);

  public boolean setLower(int newLower) {
    if (newLower > upper.get()) return false;
    lower.set(newLower);
    return true;
  }

  public boolean setUpper(int newUpper) {
    if (newUpper < lower.get()) return false;
    upper.set(newUpper);
    return true;
  }

  public boolean isInRange(int number) {
    return number >= lower.get() && number <= upper.get();
  }

  public boolean checkCorrectRange() {
    int lower = this.lower.get();
    int upper = this.upper.get();
    return lower <= upper;
  }
}
  

It is easy to reason why this does not work, but here is a unit test that demonstrates the problem (if you still have questions, read the book :-)):

private void testConcurrently(NumberRange range)
    throws InterruptedException {
  range.setLower(0);
  range.setUpper(200);
  ExecutorService pool = Executors.newCachedThreadPool();
  pool.submit(() -> {
    for (int i = 0; i < 100_000_000; i++) {
      range.setLower(100);
      range.setLower(0);
    }
  });
  pool.submit(() -> {
    for (int i = 0; i < 100_000_000; i++) {
      range.setUpper(40);
      range.setUpper(200);
    }
  });
  Future<Boolean> checker = pool.submit(() -> {
    for (int i = 0; i < 100_000_000; i++) {
      if (!range.checkCorrectRange()) return false;
    }
    return true;
  });
  try {
    assertTrue(checker.get());
  } catch (ExecutionException e) {
    fail(e.getCause().toString());
  }
  pool.shutdown();
}
  

There is a very easy fix and that is to use synchronized instead of AtomicInteger fields. For example:

package eu.javaspecialists.microbenchmarks.numberrange;

public class NumberRangeSynchronized implements NumberRange {
  // INVARIANT: lower <= upper
  private int lower = 0;
  private int upper = 0;

  public synchronized boolean setLower(int newLower) {
    if (newLower > upper) return false;
    lower = newLower;
    return true;
  }

  public synchronized boolean setUpper(int newUpper) {
    if (newUpper < lower) return false;
    upper = newUpper;
    return true;
  }

  public synchronized boolean isInRange(int number) {
    return number >= lower && number <= upper;
  }

  public synchronized boolean checkCorrectRange() {
    return lower <= upper;
  }
}
  

However, this means that we have synchronized methods and so the threads will only be able to call these one at a time.

I then proceeded to show another much cleverer solution that used long and bit masking, together with a compare-and-swap (CAS) to set upper and lower in a single operation. Here it is:

package eu.javaspecialists.microbenchmarks.numberrange;

import java.util.concurrent.atomic.*;
import java.util.concurrent.locks.*;

public class NumberRangeAtomic implements NumberRange {
  private final AtomicLong range = new AtomicLong(0);

  public boolean setLower(int newLower) {
    while (true) {
      long current = range.get();
      int upper = getUpper(current);
      int lower = getLower(current);
      if (newLower > upper) return false;
      long next = combine(newLower, upper);
      if (range.compareAndSet(current, next)) return true;
    }
  }

  public boolean setUpper(int newUpper) {
    while (true) {
      long current = range.get();
      int upper = getUpper(current);
      int lower = getLower(current);
      if (newUpper < lower) return false;
      long next = combine(lower, newUpper);
      if (range.compareAndSet(current, next)) return true;
    }
  }

  public boolean isInRange(int number) {
    long current = range.get();
    int upper = getUpper(current);
    int lower = getLower(current);
    return number >= lower && number <= upper;
  }

  private long combine(long lower, long upper) {
    return upper | (lower << 32);
  }

  private int getLower(long range) {
    return (int) (range >>> 32);
  }

  private int getUpper(long range) {
    return (int) (range);
  }

  public boolean checkCorrectRange() {
    long range = this.range.get();
    int lower = getLower(range);
    int upper = getUpper(range);
    return lower <= upper;
  }
}
  

We finished for the day and said our virtual goodbyes. My students were in Brazil(1), Luxembourg(1), USA(2), England(1), Austria(1), Romania(1), Sweden(1) and I was at home in Greece(1). So for some of them, the day had just begun and for others, it was beer time. These virtual classes are fun and we all learn a lot. We keep numbers small so that everyone gets a chance to ask their questions. Have a look at our Concurrency, Master and Patterns Courses.

Each morning, I like to start by giving my students an opportunity to ask questions that they had come up with in the night. Our brain needs a while to reset itself and so taking a complete break often brings doubts to the surface that they did not realize they had. Sometimes this Q&A session goes on for over an hour as we go freestyle off the script and answer the questions that students have.

In our particular Q&A session, James S. asked what the point of that bit shifting code was. It seemed so much more complicated to him and he could not justify writing code that had such a high chance of bugs. I explained that this type of programming style would be typically used for identified bottlenecks. It would be much faster though, I confidently asserted. To prove it, I whipped up a quick microbenchmark. Now I've written hundreds of microbenchmarks and they usually give me the results I want pretty quickly. However, in this case, it showed me that the NumberRangeSynchronized class was actually several times faster than the far more complicated NumberRangeAtomic. I had seen similar results on other fast i7 laptops, but since I had only had my machine for a couple of days, was not able to investigate exactly why. I muttered something about "cache synchronization" and "data moving between cores" and "CAS being a known weakness", but was not satisfied with my answer. I also told them that there was something funny going on and that I would investigate this further.

Instead of showing you the slapped-together benchmark that I coded live in class, I'd rather show you how to do it with OpenJDK's excellent Java Microbenchmarking Harness (JMH) tool. JMH takes care of all the benching things, such as statistics, starting and stopping threads, forking the processes, etc. One thing I really like is that JMH typically measures the amount of work done within a time interval, rather than measuring how long a certain number of steps takes. This is also my preferred way of measuring code. Of course, even JMH can sometimes be subject to weird Hotspot optimizations, so please check whatever results you get and see whether they make sense. If unsure, inspect the generated assembler code.

package eu.javaspecialists.microbenchmarks.numberrange;

import org.openjdk.jmh.annotations.*;
import static java.util.concurrent.TimeUnit.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(NANOSECONDS)
@Warmup(iterations = 10, time = 1, timeUnit = SECONDS)
@Measurement(iterations = 20, time = 1, timeUnit = SECONDS)
@Fork(5)
@State(Scope.Benchmark)
@Threads(8)
public class Microbenchmark1 {
  private final NumberRangeSynchronized nrSync =
      new NumberRangeSynchronized();
  private final NumberRangeAtomic nrAtomic =
      new NumberRangeAtomic();

  @GenerateMicroBenchmark
  public void test_1_1_synchronized() {
    nrSync.setUpper(100);
    nrSync.setLower(10);
    nrSync.setLower(50);
    nrSync.setUpper(200);
    nrSync.setLower(30);
  }

  @GenerateMicroBenchmark
  public void test_1_2_atomic() {
    nrAtomic.setUpper(100);
    nrAtomic.setLower(10);
    nrAtomic.setLower(50);
    nrAtomic.setUpper(200);
    nrAtomic.setLower(30);
  }
}
  

I will leave the exercise of getting this to run on your machine as an exercise to the reader. It's really simple though - just follow the instructions on the JMH website. JMH could not have made it easier and I am embarassed that I have so far been too lazy to bother learning it. Anyway, here are the results:

Benchmark                               Mean   Mean error  Units
Microbenchmark1.test_1_1_synchronized 1552.081  15.408     ns/op
Microbenchmark1.test_1_2_atomic       3855.470  77.010     ns/op
  

As we can see in this run, our fancy NumberRangeAtomic was more than twice as slow as the NumberRangeSynchronized! All that effort for nothing!

However, the answer came in the form of Jack Shirazi's Java Performance Tuning Newsletter. In his latest bulletin, he explains that locking can outperform compare-and-swap (CAS), especially if there are a lot of repeats. The trick is that if we have a CAS failure to let the losing thread sleep a little bit in order to give the other threads a chance to go through more often. Even a constant time sleep could improve the situation.

After trying different settings, I settled on a figure of 0.18 ms as working well. But this could be completely different on your machine so it would be best to make this figure tunable or maybe auto-adjust at runtime. Thus we only need to add LockSupport.parkNanos(180_000) after the compareAndSet method has returned false, like so:

public boolean setLower(int newLower) {
  while (true) {
    long current = range.get();
    int upper = getUpper(current);
    int lower = getLower(current);
    if (newLower > upper) return false;
    long next = combine(newLower, upper);
    if (range.compareAndSet(current, next)) return true;
    // now we sleep a constant time
    LockSupport.parkNanos(180_000);
  }
}

public boolean setUpper(int newUpper) {
  while (true) {
    long current = range.get();
    int upper = getUpper(current);
    int lower = getLower(current);
    if (newUpper < lower) return false;
    long next = combine(lower, newUpper);
    if (range.compareAndSet(current, next)) return true;
    // now we sleep a constant time
    LockSupport.parkNanos(180_000);
  }
}
  

Let's run another benchmark, this time with the new NumberRangeAtomicWithPark class that parks after the failure:

package eu.javaspecialists.microbenchmarks.numberrange;

import org.openjdk.jmh.annotations.*;
import static java.util.concurrent.TimeUnit.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(NANOSECONDS)
@Warmup(iterations = 10, time = 1, timeUnit = SECONDS)
@Measurement(iterations = 20, time = 1, timeUnit = SECONDS)
@Fork(5)
@State(Scope.Benchmark)
@Threads(8)
public class Microbenchmark2 {
  private final NumberRangeAtomicWithPark nrAtomicPark =
      new NumberRangeAtomicWithPark();

  @GenerateMicroBenchmark
  public void test_2_1_atomic_park() {
    nrAtomicPark.setUpper(100);
    nrAtomicPark.setLower(10);
    nrAtomicPark.setLower(50);
    nrAtomicPark.setUpper(200);
    nrAtomicPark.setLower(30);
  }
}
  

Here are the results with the parking atomic included:

Benchmark                               Mean   Mean error  Units
Microbenchmark1.test_1_1_synchronized 1552.081  15.408     ns/op
Microbenchmark1.test_1_2_atomic       3855.470  77.010     ns/op
Microbenchmark2.test_2_1_atomic_park   430.649   3.125     ns/op
  

We see how the performance has increase almost 10x, simply by parking when we have a CAS failure due to parallel threads trying to write at the same time. So sleeping can actually make your program run faster?

We also see from those results that the NumberRangeAtomicWithPark class is now about 4x faster than the NumberRangeSynchronized. Since I have 4 cores, these results are believable. The CAS code should be faster than the exclusive pessimistic code.

A word of warning though. The 180 microseconds seemed to give good results on my machine, but may be quite bad on yours. Making it configurable would allow you to fine tune it for your target machine.

Please note that I've also published a follow-up to this article, which shows what happens when we do some CPU work in between calls to the update methods.

Kind regards from Crete

Heinz

Performance Articles Related Java Course

Extreme Java - Concurrency and Performance for Java 8
Extreme Java - Advanced Topics for Java 8
Design Patterns
In-House Courses

© 2010-2016 Heinz Kabutz - All Rights Reserved Sitemap
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. JavaSpecialists.eu is not connected to Oracle, Inc. and is not sponsored by Oracle, Inc.