The Java Specialists' Newsletter
Issue 230 2015-06-30
Category: Tips and Tricks
Java version: Java 1-9

String Substring
by Dr. Heinz M. Kabutz

Abstract: Java 7 quietly changed the structure of String. Instead of an offset and a count, the String now only contained a char[]. This had some harmful effects for those expecting substring() would always share the underlying char[].

Welcome to the 230th edition of The Java(tm) Specialists' Newsletter, written on the
Island of Crete in GREECE. By now, you would have heard
about the ATMs drying up. It is surprisingly calm here.
Some petrol stations ran out of gas as my fellow Cretans decided
for the first time in years to completely fill their tanks.
But that was resolved within a day. ATMs are issuing only 60
EURO to us locals per day (no limit for visitors), but there
seems to be no limit inside the supermarkets and restaurants.
It's also easy to find those ATMs with money - just check for
a bunch of people standing in a queue. As my one friend said
- thanks to these lines of people, he's now discovering ATMs
that he didn't even know existed ;-) Chania seems as deserted
as it is during winter, with a noticeable reduction in
tourists. This is the best possible time for you to come
visit Crete! The Cretan hospitality is shining through
even more than usual. If you can speak Greek and you take
the time to sit with a grandfather in his 70s, you will hear
the reality of what he's been going through. As a tourist,
you won't see any of that. You'll just be invited to drink a
glass of tsikoudia with a hearty smack on your back. You can
probably find excellent deals at this time with flights and
hotels. The beaches are exactly the same as last year,
albeit with fewer drunken louts. The food is still delicious.
The weather is fine and warm. Come. You won't regret it. And
by the way, Chania is the best part of Crete, especially the
Akrotiri :-)
String Substring
String is ubiquitous in Java programs. It has changed in
quite a few ways over the last generations of Java. For
example, in very early versions, the generated code of
appending several non-constant Strings together would either
be a call to concat() or a StringBuffer. In Java 1.0 and
1.1, the hashCode() function would check the size of the
String and if it was too long, would add up every 8th
character instead of every one. Of course, considering
memory layout, that optimization would not have been all that
effective anyway. In Java 2, they changed that to every
character always and in Java 3, they cached the hash code.
Whilst this sounded sensible, it wasn't. There are almost no
cases where it helps in real code and it introduces an
assumption that the hash code is unlikely to be zero. That
assumption is unsafe. Once you find one combination of
characters that has a zero hash code, you can produce an
arbitrarily long series of them by concatenation. Since a
cached value of zero also means "not yet computed", a constant
time operation like hashCode() using the cached value now
potentially becomes O(n) on every call. They tried to fix this
in Java 7 with the hash32() calculation, which never would
allow a zero value. However, I see that is also gone again
in Java 8.
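To make the zero hash code concern concrete, here is a minimal sketch. The class and method names are mine and purely illustrative. It relies only on the documented String.hashCode() formula, s[0]*31^(n-1) + ... + s[n-1] in 32-bit arithmetic: it builds one five-character String whose weighted sum is exactly 2^32, which overflows to zero, and then shows that repeating such a String keeps the hash code at zero.

```java
class ZeroHash {
  // Builds a 5-char String with hashCode() == 0 by greedily writing
  // 2^32 as c0*31^4 + c1*31^3 + c2*31^2 + c3*31 + c4.
  // The resulting chars are mostly control characters, not readable text.
  static String zeroHashSeed() {
    long target = 1L << 32;          // wraps around to 0 in int arithmetic
    long power = 31L * 31 * 31 * 31; // weight of the first character
    char[] chars = new char[5];
    for (int i = 0; i < chars.length; i++) {
      long digit = target / power;
      chars[i] = (char) digit;
      target -= digit * power;
      power /= 31;
    }
    return new String(chars);
  }

  // hash(a + b) == hash(a) * 31^b.length() + hash(b), so joining
  // zero-hash Strings always gives another zero-hash String.
  static String repeat(String s, int times) {
    StringBuilder sb = new StringBuilder(s.length() * times);
    for (int i = 0; i < times; i++) {
      sb.append(s);
    }
    return sb.toString();
  }

  public static void main(String... args) {
    String seed = zeroHashSeed();
    System.out.println(seed.hashCode());
    System.out.println(repeat(seed, 1000).hashCode());
  }
}
```

On JDK versions where a cached value of zero doubles as the "not computed" marker, every hashCode() call on such a String would walk all of its characters again.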
Recently, my co-trainer Maurice Naftalin (author of Mastering
Lambdas) and I taught our Extreme Java 8
course together, which focuses on concurrency and
performance. I always spend a bit of time on String, as it
is used so much and does tend to appear near the top of many
a profile. From Java 1.0 up to 1.6, String tried to avoid
creating new char[]'s. The substring() method would share
the same underlying char[], with a different offset and
length. For example, in StringChars we have two Strings,
with hello ("Hello") a substring of hello_world ("Hello
world"). However, they share the same char[]:
import java.lang.reflect.*;

public class StringChars {
  public static void main(String... args)
      throws NoSuchFieldException, IllegalAccessException {
    // Peek at the private char[] inside String (Java 1 to 8 layout)
    Field value = String.class.getDeclaredField("value");
    value.setAccessible(true);

    String hello_world = "Hello world";
    String hello = hello_world.substring(0, 5);
    System.out.println(hello);
    // An array prints as its type plus identity hash code, so equal
    // lines mean that both Strings share one char[]
    System.out.println(value.get(hello_world));
    System.out.println(value.get(hello));
  }
}
In Java 1 through 6, we would see output like this:
Hello
[C@721cdeff
[C@721cdeff
However, in Java 7 and 8, it would instead produce output
with a different char[]:
Hello
[C@49476842
[C@78308db1
"Why this change?", you may ask. It turns out that too many
programmers used substring() as a memory saving method.
Let's say that you have a 1 MB String, but you actually only
need the first 5 KB. You could then create a substring,
expecting the rest of that 1 MB String to be thrown away.
Except that it wasn't. Since the new String would share the same
underlying char[], you would not save any memory at all.
The correct code idiom was therefore to append the substring
to an empty String, which would have the side effect of
always producing a new unshared char[] in the case
that the String length did not correspond to the char[]
length:
String hello = "" + hello_world.substring(0, 5);
During our course, the customer remarked that they had a real
issue with this new Java 7 and 8 approach to substrings. In
the past they assumed that a substring would generate a
minimum of garbage, whereas nowadays the cost can be quite
high. In order to measure how many bytes exactly are being
allocated, I wrote a little Memory class that uses a
little-known ThreadMXBean feature. The details will be the
subject of another newsletter:
import javax.management.*;
import java.lang.management.*;

public class Memory {
  // Returns the number of bytes allocated so far by the calling
  // thread. Calls getThreadAllocatedBytes(long) by name on the
  // Threading MXBean via the platform MBean server.
  public static long threadAllocatedBytes() {
    try {
      return (Long) ManagementFactory.getPlatformMBeanServer()
          .invoke(
              new ObjectName(
                  ManagementFactory.THREAD_MXBEAN_NAME),
              "getThreadAllocatedBytes",
              new Object[]{Thread.currentThread().getId()},
              new String[]{long.class.getName()}
          );
    } catch (Exception e) {
      throw new IllegalArgumentException(e);
    }
  }
}
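As an aside, the same counter can be read without going through the MBean server by casting to the HotSpot-specific com.sun.management.ThreadMXBean interface. This is only a sketch under the assumption that you are running on a HotSpot-based JVM where that interface is available; the class name AllocatedBytes and the helper measureArrayAllocation() are my own and purely illustrative.

```java
import java.lang.management.ManagementFactory;

class AllocatedBytes {
  // Bytes allocated so far by the calling thread. Assumes a JVM whose
  // ThreadMXBean also implements com.sun.management.ThreadMXBean.
  static long threadAllocatedBytes() {
    java.lang.management.ThreadMXBean bean =
        ManagementFactory.getThreadMXBean();
    if (bean instanceof com.sun.management.ThreadMXBean) {
      return ((com.sun.management.ThreadMXBean) bean)
          .getThreadAllocatedBytes(Thread.currentThread().getId());
    }
    throw new UnsupportedOperationException(
        "Thread allocation counters are not available on this JVM");
  }

  // Illustrative helper: bytes allocated while creating one byte[]
  static long measureArrayAllocation(int size) {
    long before = threadAllocatedBytes();
    byte[] data = new byte[size];
    long after = threadAllocatedBytes();
    if (data.length != size) {
      throw new AssertionError(); // keeps the array in use
    }
    return after - before;
  }

  public static void main(String... args) {
    System.out.printf("%,d%n", measureArrayAllocation(1024 * 1024));
  }
}
```

The reported number should be at least the size of the array, plus its header and whatever else the thread allocated in between.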
Let's say that I have a large string that I would like to
break up into smaller chunks:
import java.util.*;

public class LargeString {
  public static void main(String... args) {
    char[] largeText = new char[10 * 1000 * 1000];
    Arrays.fill(largeText, 'A');
    String superString = new String(largeText);

    long bytes = Memory.threadAllocatedBytes();
    String[] subStrings = new String[largeText.length / 1000];
    for (int i = 0; i < subStrings.length; i++) {
      subStrings[i] = superString.substring(
          i * 1000, i * 1000 + 1000);
    }
    bytes = Memory.threadAllocatedBytes() - bytes;
    System.out.printf("%,d%n", bytes);
  }
}
In Java 6, the LargeString class generates 360,984 bytes, but
in Java 7, it goes up to a whopping 20,441,536 bytes. That's
quite a jump! You can run this code yourself to try out on
your machine.
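Where do these numbers come from? Here is a back-of-the-envelope sketch. All object sizes in it are my assumptions for a 64-bit HotSpot JVM with compressed references (12-byte object header, 16-byte array header, 4-byte references, 8-byte alignment), and the String field layouts are what I would expect for those versions rather than measured values. The class SubstringCostEstimate is purely illustrative.

```java
class SubstringCostEstimate {
  // Assumed sizes: 64-bit HotSpot with compressed references
  static final int OBJECT_HEADER = 12;
  static final int ARRAY_HEADER = 16;
  static final int REFERENCE = 4;

  // Objects are assumed to be padded to a multiple of 8 bytes
  static long align(long bytes) {
    return (bytes + 7) / 8 * 8;
  }

  static long charArray(int length) {
    return align(ARRAY_HEADER + 2L * length);
  }

  static long referenceArray(int length) {
    return align(ARRAY_HEADER + (long) REFERENCE * length);
  }

  static long object(int fieldBytes) {
    return align(OBJECT_HEADER + fieldBytes);
  }

  // Java 6 String: value reference + offset + count + hash = 16 bytes
  // of fields, and the char[] is shared with the original String
  static long java6Estimate(int chunks) {
    return referenceArray(chunks) + chunks * object(16);
  }

  // Java 7 String: value reference + hash + hash32 = 12 bytes of
  // fields, plus a freshly copied char[] for every substring
  static long java7Estimate(int chunks, int chunkLength) {
    return referenceArray(chunks)
        + chunks * (object(12) + charArray(chunkLength));
  }

  public static void main(String... args) {
    System.out.printf("Java 6: %,d%n", java6Estimate(10000));
    System.out.printf("Java 7: %,d%n", java7Estimate(10000, 1000));
  }
}
```

With 10,000 chunks of 1,000 chars each, both estimates land within a couple of kilobytes of the measured values. Almost all of the Java 7 cost is the 10,000 copied char[]s at roughly 2 KB each.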
Unfortunately, if we want to have the memory allocation savings
of Java 6, we need to write our own String class.
Fortunately, that is not too hard with the CharSequence
interface. Please note that my SubbableString is not
thread safe, nor is it meant to be. I used Brian Goetz's
@NotThreadSafe annotation, albeit in a comment:
//@NotThreadSafe
public class SubbableString implements CharSequence {
  private final char[] value;
  private final int offset;
  private final int count;

  // Note: the char[] is used as is, without a defensive copy
  public SubbableString(char[] value) {
    this(value, 0, value.length);
  }

  private SubbableString(char[] value, int offset, int count) {
    this.value = value;
    this.offset = offset;
    this.count = count;
  }

  public int length() {
    return count;
  }

  public String toString() {
    return new String(value, offset, count);
  }

  public char charAt(int index) {
    if (index < 0 || index >= count)
      throw new StringIndexOutOfBoundsException(index);
    return value[index + offset];
  }

  // Returns a view onto the same char[], only offset and count differ
  public CharSequence subSequence(int start, int end) {
    if (start < 0) {
      throw new StringIndexOutOfBoundsException(start);
    }
    if (end > count) {
      throw new StringIndexOutOfBoundsException(end);
    }
    if (start > end) {
      throw new StringIndexOutOfBoundsException(end - start);
    }
    return (start == 0 && end == count) ? this :
        new SubbableString(value, offset + start, end - start);
  }
}
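If you would rather not maintain your own class, a similar sharing view can be sketched with java.nio.CharBuffer, which also implements CharSequence and whose subSequence() shares the wrapped array. This is an alternative I am suggesting here, not what SubbableString does internally, and the class CharBufferView with its view() helper is hypothetical.

```java
import java.nio.CharBuffer;

class CharBufferView {
  // Returns a CharSequence over chars[start, end) that shares the
  // given array instead of copying it
  static CharSequence view(char[] chars, int start, int end) {
    return CharBuffer.wrap(chars).subSequence(start, end);
  }

  public static void main(String... args) {
    char[] data = "Hello world".toCharArray();
    CharSequence hello = view(data, 0, 5);
    System.out.println(hello);

    // Changing the array is visible through the view, which shows
    // that no copy of the chars was made
    data[0] = 'J';
    System.out.println(hello);
  }
}
```

Just as with SubbableString, the view is only as immutable as the char[] behind it, so the array should not be modified once it has been handed over.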
If we now use CharSequence instead of String in the test, we
can avoid creating all those unnecessary char[]s. Here is
the revised test:
import java.util.*;

public class LargeSubbableString {
  public static void main(String... args) {
    char[] largeText = new char[10 * 1000 * 1000];
    Arrays.fill(largeText, 'A');
    CharSequence superString = new SubbableString(largeText);

    long bytes = Memory.threadAllocatedBytes();
    CharSequence[] subStrings = new CharSequence[
        largeText.length / 1000];
    for (int i = 0; i < subStrings.length; i++) {
      subStrings[i] = superString.subSequence(
          i * 1000, i * 1000 + 1000);
    }
    bytes = Memory.threadAllocatedBytes() - bytes;
    System.out.printf("%,d%n", bytes);
  }
}
With that improvement, we now allocate roughly 281,000 bytes on
Java 6, 7 and 8. For Java 7 and 8, that is roughly a 72x
improvement (20,441,536 / 281,000)!
Please keep this new "feature" in mind when you do your
migration from Java 6 to Java 8. I know, too many of my
customers are stuck on 6 and are finding it hard to find a
business case for funding the move. Besides the syntactic
advantages in Java 7 and 8, you will also want to move away
from the bugs still stuck in Java 6. The sooner the better!
Kind regards
Heinz