
Issue 237: String Compaction

Posted: 2016-04-16 | Category: Performance | Java Version: 1.6, 1.9 | Author: Dr. Heinz M. Kabutz
 

Abstract: Java 6 introduced a mechanism to store ASCII characters inside a byte[] instead of a char[]. This feature was removed again in Java 7. However, it is coming back in Java 9, except that this time compaction is enabled by default and a byte[] is always used.

 

Welcome to the 237th edition of The Java(tm) Specialists' Newsletter, written at 30'000 feet en route back to Greece from Düsseldorf. I think this was the first time that I visited that great city and did not go to eat dinner at the Füchschen. They make an incredible pork knuckle, which is best washed down with never-ending glasses of Füchschen Alt. Great food, incredible atmosphere, drinkable Alt. Like many German breweries, they use a variation of Mark-Sweep-Compact, so you might end up being inserted in between two burly Germans at their table. I've not seen that in other countries outside of Greater Germania. Very efficient use of space and they will always find a space for a lonely traveler. But not this time. I'm working hard at improving my health and sacrifices need to be made. I didn't regret that sacrifice this morning! In fact in my 44 years of life, I have never ever woken up and regretted not drinking more beer the night before :-) Have you?

Birthday and Anniversary Special: Since the 30th November 2017 is the 17th anniversary of our newsletter and my birthday is coming up on the 4th December, we are giving away a 30% discount on our new Data Structures for Java 9 Course (Late 2017 Edition). Whether you are a seasoned Java programmer or you just want to get ready for your next job interview, this course will help you. Besides detailed lectures, the course has over 130 questions that will help you discover what you missed.

String Compaction

I spent this week teaching a class of clever Java programmers from Provinzial Rheinland Versicherung AG the subtleties of Java Design Patterns. You might be surprised to learn that my Patterns Course is overall my most successful course. Not concurrency (although that sells well too). Not introduction to Java (haven't sold one in years). Not even Advanced Topics in Java. Nope, my humble Design Patterns Course that I wrote in 2001 is still selling strong today.

Usually when I teach my patterns course, we look at all sorts of related topics in Java. So the students learn far more than they would find in any single book. They see where the patterns are used in the JDK. They learn about good design principles. They see the latest Java 8 features, even if they might still be stuck on JDK 6 or 7. We even touch a bit on threading. It is the one course that once a company has sent some programmers on it, they usually just keep on sending more and more. This is why, 15 years after writing the first edition, it is still so popular with companies.

Yesterday we were looking at the Flyweight Pattern, which is a rather strange arrangement of classes. It isn't really a design pattern. Rather, like the Facade, it is a necessary evil because of design patterns. Let me explain. A good object oriented design will result in systems that are highly configurable and that minimize duplicated code. This is good, but it also means that you sometimes have to do a lot of work to just use the system. Facade helps to make a complex subsystem easier to use. Why is it complex? Usually because we have too many options of how to use it, thanks to a liberal application of design patterns. Flyweight has a similar reason for being. Usually good OO designs have more objects than bad monolithic designs, where everything is a Singleton. Flyweight tries to reduce the number of objects through sharing, which works well once we have turned intrinsic state into extrinsic state, so that the shared objects no longer carry it themselves.
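
To make the sharing idea concrete, here is a minimal sketch of a flyweight factory. The names FlyweightSketch, Glyph and glyphFor are hypothetical and are not taken from the course material. The column passed to render() plays the role of the extrinsic state, and the sketch makes no attempt at thread safety.

```java
import java.util.HashMap;
import java.util.Map;

class FlyweightSketch {
  // Hypothetical flyweight: holds only the intrinsic (shareable) state.
  static final class Glyph {
    private final char symbol;

    Glyph(char symbol) {
      this.symbol = symbol;
    }

    // The extrinsic state (the column) is supplied by the caller on each use.
    String render(int column) {
      return symbol + "@" + column;
    }
  }

  private static final Map<Character, Glyph> CACHE =
      new HashMap<Character, Glyph>();

  // Hands out one shared Glyph per distinct character (not thread-safe).
  static Glyph glyphFor(char symbol) {
    Glyph glyph = CACHE.get(symbol);
    if (glyph == null) {
      glyph = new Glyph(symbol);
      CACHE.put(symbol, glyph);
    }
    return glyph;
  }

  static int sharedGlyphCount() {
    return CACHE.size();
  }

  public static void main(String... args) {
    String text = "hello world";
    for (int i = 0; i < text.length(); i++) {
      System.out.println(glyphFor(text.charAt(i)).render(i));
    }
    System.out.println("characters:    " + text.length());
    System.out.println("shared glyphs: " + sharedGlyphCount());
  }
}
```

The number of Glyph objects depends only on how many distinct characters occur, not on the length of the text.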

I was demonstrating String deduplication to my class yesterday, where the char[] inside the String is replaced with a shared char[] when we have multiple Strings containing the same values. To use it you need to use the G1 collector (-XX:+UseG1GC) and also turn String deduplication on (-XX:+UseStringDeduplication). It worked beautifully in Java 8. I then wanted to see whether this was enabled by default in Java 9, knowing that the G1 collector was now the default collector. I was a bit surprised when my code threw a ClassCastException when I tried to cast the value field in String to a char[].
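
One thing worth keeping in mind is that deduplication only swaps the array inside the String. The String objects themselves stay distinct, unlike with intern(). The following sketch uses only public API to show that difference; the class name DeduplicationSketch and the helper fresh() are made up for this illustration. Whether the internal arrays actually end up shared depends on the collector and its timing, and cannot be observed this way.

```java
class DeduplicationSketch {
  // Builds a String at run time so that it is not the interned literal.
  static String fresh(String s) {
    return new String(s.toCharArray());
  }

  public static void main(String... args) {
    String a = fresh("hello world");
    String b = fresh("hello world");
    // Equal contents, but two separate String objects. Deduplication may
    // later share their internal arrays, yet a == b stays false.
    System.out.println("equal contents:     " + a.equals(b));
    System.out.println("same String object: " + (a == b));
    // Interning is the mechanism that shares the String objects themselves.
    System.out.println("same after intern:  " + (a.intern() == b.intern()));
  }
}
```

It could be run with the flags mentioned above, for example java -XX:+UseG1GC -XX:+UseStringDeduplication DeduplicationSketch.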

At some point in Java 6, we got compressed Strings. These were off by default and you could turn them on with -XX:+UseCompressedStrings. When they were on, Strings containing only ASCII (7-bit) characters would automatically be changed to contain a byte[]. If you had one character that was more than 7 bits in size, it used char[] again. Things got interesting when you had UTF-16 characters, such as with Devanagari Hindi, because then additional objects were created and we actually had higher object creation than without compressed strings. But for US ASCII, life was good. For some reason, this feature from Java 6 was deprecated in Java 7 and the flag was completely removed in Java 8.

However, in Java 9, a new flag, -XX:+CompactStrings, was introduced and this is now enabled by default. If you look inside the String class, you will notice that it always stores the characters of a String inside a byte[]. It also has a new byte field (named coder) that stores the encoding. This is currently either Latin1 (0) or UTF16 (1). Potentially it could be other values too in future. So if your characters are all in the Latin1 encoding, then your String will use less memory.
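
The choice between the two encodings can be sketched in a few lines. This is only an illustration of the rule described above, not the JDK's actual implementation: the class CoderSketch and its methods are hypothetical, and the sketch assumes that a String qualifies as Latin1 when every char fits into a single byte (values up to 0xFF).

```java
class CoderSketch {
  static final byte LATIN1 = 0;
  static final byte UTF16 = 1;

  // Returns LATIN1 if every char fits into one byte, otherwise UTF16.
  static byte coderFor(String s) {
    for (int i = 0; i < s.length(); i++) {
      if (s.charAt(i) > 0xFF) {
        return UTF16;
      }
    }
    return LATIN1;
  }

  // Number of payload bytes needed under the chosen encoding.
  static int payloadLength(String s) {
    return coderFor(s) == LATIN1 ? s.length() : 2 * s.length();
  }

  public static void main(String... args) {
    String[] samples = {
        "hello world",
        "hello w\u00f8rld",            // Scandinavian o, still Latin1
        "he\u03bb\u03bbo wor\u03bbd",  // Greek l, needs UTF16
    };
    for (String s : samples) {
      System.out.println(s.length() + " chars, coder " + coderFor(s)
          + ", " + payloadLength(s) + " bytes");
    }
  }
}
```

A single character outside Latin1 is enough to switch the whole String to two bytes per character.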

To try this out, I have written a small Java program that we can run in Java 6, 7 and 9 to spot the differences:

import java.lang.reflect.*;

public class StringCompactionTest {
  private static Field valueField;

  static {
    try {
      valueField = String.class.getDeclaredField("value");
      valueField.setAccessible(true);
    } catch (NoSuchFieldException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  public static void main(String... args)
      throws IllegalAccessException {
    showGoryDetails("hello world");
    showGoryDetails("hello w\u00f8rld"); // Scandinavian o
    showGoryDetails("he\u03bb\u03bbo wor\u03bbd"); // Greek l
  }

  private static void showGoryDetails(String s)
      throws IllegalAccessException {
    s = "" + s;
    System.out.printf("Details of String \"%s\"%n", s);
    System.out.printf("Identity Hash of String: 0x%x%n",
        System.identityHashCode(s));
    Object value = valueField.get(s);
    System.out.println("Type of value field: " +
        value.getClass().getSimpleName());
    System.out.println("Length of value field: " +
        Array.getLength(value));
    System.out.printf("Identity Hash of value: 0x%x%n",
        System.identityHashCode(value));
    System.out.println();
  }
}
  

The first run is with Java 6 and -XX:-UseCompressedStrings (the default). Notice how each of the Strings contains a char[] internally.

Java6 no compaction
java version "1.6.0_65"

Details of String "hello world"
Identity Hash of String: 0x7b1ddcde
Type of value field: char[]
Length of value field: 11
Identity Hash of value: 0x6c6e70c7

Details of String "hello wørld"
Identity Hash of String: 0x46ae506e
Type of value field: char[]
Length of value field: 11
Identity Hash of value: 0x5e228a02

Details of String "heλλo worλd"
Identity Hash of String: 0x2d92b996
Type of value field: char[]
Length of value field: 11
Identity Hash of value: 0x7bd63e39
  

The second run is with Java 6 and -XX:+UseCompressedStrings. The "hello world" String contains a byte[] and the other two a char[]. Only Strings consisting entirely of US-ASCII (7-bit) characters are compressed.

Java6 compaction
java version "1.6.0_65"

Details of String "hello world"
Identity Hash of String: 0x46ae506e
Type of value field: byte[]
Length of value field: 11
Identity Hash of value: 0x7bd63e39

Details of String "hello wørld"
Identity Hash of String: 0x42b988a6
Type of value field: char[]
Length of value field: 11
Identity Hash of value: 0x22ba6c83

Details of String "heλλo worλd"
Identity Hash of String: 0x7d2a1e44
Type of value field: char[]
Length of value field: 11
Identity Hash of value: 0x5829428e
  

In Java 7 the flag was ignored. In Java 8 it was removed, so a JVM started with -XX:+UseCompressedStrings would fail. Of course all Strings just contained a char[].

Java7 compaction
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
    UseCompressedStrings; support was removed in 7.0
java version "1.7.0_80"

Details of String "hello world"
Identity Hash of String: 0xa89848d
Type of value field: char[]
Length of value field: 11
Identity Hash of value: 0x57fd54c4

Details of String "hello wørld"
Identity Hash of String: 0x38c83cfd
Type of value field: char[]
Length of value field: 11
Identity Hash of value: 0x621c232a

Details of String "heλλo worλd"
Identity Hash of String: 0x2548ccb8
Type of value field: char[]
Length of value field: 11
Identity Hash of value: 0x4e785727
  

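In Java 9 we have a new flag, -XX:+CompactStrings, which is on by default. Strings now always store their payload as a byte[], regardless of the encoding. You can see that for Latin1 the bytes are packed, one per character: the first two Strings need 11 bytes each, whereas the Greek String has to use UTF16 and takes 22 bytes.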

Java9 compaction
java version "9-ea"

Details of String "hello world"
Identity Hash of String: 0x77f03bb1
Type of value field: byte[]
Length of value field: 11
Identity Hash of value: 0x7a92922

Details of String "hello wørld"
Identity Hash of String: 0x71f2a7d5
Type of value field: byte[]
Length of value field: 11
Identity Hash of value: 0x2cfb4a64

Details of String "heλλo worλd"
Identity Hash of String: 0x5474c6c
Type of value field: byte[]
Length of value field: 22
Identity Hash of value: 0x4b6995df
  

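Of course you can turn off this new feature in Java 9 with -XX:-CompactStrings. However, the code within String has changed, so regardless of what you do, value is still a byte[]. With compaction off, every String is stored as UTF16, so each of our 11-character examples takes 22 bytes.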

Java9 no compaction
java version "9-ea"

Details of String "hello world"
Identity Hash of String: 0x21a06946
Type of value field: byte[]
Length of value field: 22
Identity Hash of value: 0x25618e91

Details of String "hello wørld"
Identity Hash of String: 0x7a92922
Type of value field: byte[]
Length of value field: 22
Identity Hash of value: 0x71f2a7d5

Details of String "heλλo worλd"
Identity Hash of String: 0x2cfb4a64
Type of value field: byte[]
Length of value field: 22
Identity Hash of value: 0x5474c6c
  

Anybody using reflection to access the gory details inside String will now potentially get a ClassCastException. Hopefully the set of such programmers is infinitesimally small.
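
If you really must peek inside, a more defensive approach is to check the type of the field before casting. The sketch below is one way this could look; the class ValueFieldSketch and the method describeValue are hypothetical names. It also assumes that the runtime might refuse reflective access to String altogether, in which case it simply reports that instead of failing.

```java
import java.lang.reflect.Field;

class ValueFieldSketch {
  // Describes String's internal "value" field without assuming its type.
  static String describeValue(String s) {
    try {
      Field valueField = String.class.getDeclaredField("value");
      valueField.setAccessible(true); // may be refused by the runtime
      Object value = valueField.get(s);
      if (value instanceof char[]) {
        return "char[] of length " + ((char[]) value).length;
      } else if (value instanceof byte[]) {
        return "byte[] of length " + ((byte[]) value).length;
      }
      return "unexpected type " + value.getClass().getName();
    } catch (NoSuchFieldException e) {
      return "no value field";
    } catch (IllegalAccessException e) {
      return "access denied";
    } catch (RuntimeException e) {
      // For example a SecurityException, or a runtime that blocks the access.
      return "access denied";
    }
  }

  public static void main(String... args) {
    System.out.println(describeValue("hello world"));
  }
}
```

The instanceof checks replace the blind cast, so a change of representation shows up as a different description rather than as an exception.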

A bigger worry is performance. Methods like String.charAt(int) used to be fast like lightning. I could detect a slowdown in Java 9. If you're doing a lot of String walking with charAt(), you might want to explore alternatives. Not sure what they are though! Or perhaps they will fix this in the final release of Java 9. After all, the version I'm looking at is an Early Access (EA) build.
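
One candidate that could be measured is to copy the characters out once with toCharArray() and then walk the plain char[]. The sketch below only shows that both loops compute the same result. It makes no claim about which is faster, since that would need a proper benchmark on your own JVM, and the class and method names are made up for this example.

```java
class StringWalkSketch {
  // Walks the String with charAt(), one call per character.
  static int sumWithCharAt(String s) {
    int sum = 0;
    for (int i = 0; i < s.length(); i++) {
      sum += s.charAt(i);
    }
    return sum;
  }

  // Copies the characters out once, then walks the plain char[].
  static int sumWithToCharArray(String s) {
    int sum = 0;
    for (char c : s.toCharArray()) {
      sum += c;
    }
    return sum;
  }

  public static void main(String... args) {
    String s = "hello world";
    System.out.println("charAt():      " + sumWithCharAt(s));
    System.out.println("toCharArray(): " + sumWithToCharArray(s));
  }
}
```

The trade-off is an extra array allocation per walk, so this would only pay off if the per-character cost dominates.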

I heard about a trick by Peter Lawrey at one of our JCrete Unconferences. String has a constructor that takes a char[] and a boolean as parameters. The boolean is never used and you are supposed to pass in true, meaning that the char[] will be used directly within the String and not copied. Here is the code:

String(char[] value, boolean share) {
    // assert share : "unshared not supported";
    this.value = value;
}
  

Lawrey's trick was to create Strings very quickly from a char[] by calling that constructor directly. I am not sure of the details, but it was most probably done with JavaLangAccess, which we could get from the SharedSecrets class. Prior to Java 9, this was located in the sun.misc package. Since Java 9, it is in jdk.internal.misc. I hope you are not using this method directly, because you will have to change your code for Java 9. But that's not all. Since String in Java 9 does not have a char[] as value anymore, the trick does not work. String will still make a new byte[] every time you call it, making it about 2.5 times slower on my machine in Java 9.

Here's the code. You will have to correct the imports depending on what version of Java you are using.

//import sun.misc.*; // prior to Java 9, use this
import jdk.internal.misc.*; // since Java 9, use this instead

public class StringUnsafeTest {
  private static String s;

  public static void main(String... args) {
    char[] chars = "hello world".toCharArray();
    JavaLangAccess javaLang = SharedSecrets.getJavaLangAccess();
    long time = System.currentTimeMillis();
    for (int i = 0; i < 100 * 1000 * 1000; i++) {
      s = javaLang.newStringUnsafe(chars);
    }
    time = System.currentTimeMillis() - time;
    System.out.println("time = " + time);
  }
}
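
For comparison, the public String(char[]) constructor always copies the array, which is the cost that the shared constructor avoided before Java 9. A small sketch of that copying behaviour, using only public API (the class name DefensiveCopySketch is made up for this example):

```java
class DefensiveCopySketch {
  // Returns true if the String is unaffected by a later change to the array.
  static boolean isIsolatedFromArray() {
    char[] chars = "hello world".toCharArray();
    String s = new String(chars); // the public constructor copies the array
    chars[0] = 'j';               // mutate the original array afterwards
    return s.equals("hello world");
  }

  public static void main(String... args) {
    System.out.println("String unaffected by array change: "
        + isIsolatedFromArray());
  }
}
```

The copy is what keeps String immutable when the caller still holds a reference to the array, which is why the non-copying constructor was never part of the public API.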
  

To summarize, if you speak English or German or French or Spanish, your Strings have just become a whole lot lighter. For Greek and Chinese speakers, they use about the same memory as before. For all, Strings are probably going to be a little bit slower.

Kind regards from Thessaloniki Airport

Heinz

 

