
211: Unicode Redux (2 of 2)

Author: Dr. Wolfgang Laun
Date: 2013-05-30
Java Version: 5
Category: Tips and Tricks
 

Abstract: We continue our discussion on Unicode by looking at how we can compare text that uses diacritical marks or special characters such as the German Umlaut.

 

Welcome to the 211th issue of The Java(tm) Specialists' Newsletter, written in Vienna by our friend Dr. Wolfgang Laun who works for Thales Austria GmbH. This is the second of two parts of a newsletter on the topic of Unicode. You can read the first part here. It was a bit difficult to send out the first part, since I discovered that my wonderful CRM providers were storing characters as ISO-8859-1, rather than a more sensible encoding such as UTF-8. It took me an inordinate amount of time to get most of the characters to display correctly, and then they still showed up wrong on some systems. Instead of Unicode Redux, it gave me Unicode Reflux. This edition will probably have worse issues.

Thank you also to those who wrote to congratulate us on the birth of the latest member of our family, Efigenia Gabriela Kabutz, born on the 13th of May 2013. Imagine what the world will look like when Efigenia is your age! I still remember seeing Skype-like devices on Star Trek when I was a kid and being sure that something like that would never exist. We had one of those rotating dials that used voltage pulses to dial the number. You could actually phone without dialing by tapping the off button quickly in a row for all the numbers. So our number 492019 was four taps, nine taps, two taps, ten taps, one tap and nine taps. It was the way you could use the phone if someone had locked the rotating dial. Oh and in our holiday house in Hermanus, which we ended up selling to F.W. de Klerk, we had a phone with a crank that would generate a signal for the exchange. We would then tell the exchange operator which number we wanted and they would connect us. I remember that. Maybe one day Efigenia will reminisce about how when she grew up, she used to plug her appliances into wall sockets for power!

javaspecialists.teachable.com: Please visit our new self-study course catalog to see how you can upskill your Java knowledge.

Unicode Redux (2 of 2)

In the first half of this article, we showed how you can write complete Java programs with just 18 keys. We also explained how character values worked and how character strings were made up. We ended off showing how complicated upper case can be when you have different languages that might not have characters for an upper case letter.

Quiz 2

  1. Can you explain how method words can be called to produce an output like the one shown below?
    private static void words(String w1, String w2) {
      String letPat = "[^\\p{Cntrl}]+";
      assert w1.matches(letPat) && w2.matches(letPat);
      System.out.println(w1 + " - " + w2 + ": " + w1.equals(w2));
    }
    
    Genève - Genève: false
    
  2. How would you sort a table of strings containing German words?

Combining Diacritical Marks

The codepoints in the Unicode block combining diacritical marks might be called the dark horses in the assembly of letters. They are nothing on their own, but when following the right kind of letter, they unwaveringly exercise their influence, literally "crossing the t's and dotting the i's". They occur in non-Latin alphabets, and they add an almost exotic flavour to the Latin-derived alphabet, with, e.g., the diaeresis decorating vowels (mostly) and the háček adding body to consonants (mostly).

The combining marks can be used in fanciful ways, for instance: O͜O - which should display as two O's joined by a double breve below, if your browser and operating system are rendering the diacritical mark correctly. While there are numerous Unicode codepoints for precombined letters with some diacritic, it is also permitted to represent them by their basic letter followed by the combining diacritical mark, and some applications might prefer to do it that way. You can guess that this means trouble if your software has to compare words. Method equals in String is certainly not prepared to deal with such subtleties, unless the strings have been subjected to a process called normalization. This can be done by applying the static method normalize of class java.text.Normalizer. Here is a short demonstration.

import java.text.Normalizer;

public class Normalize {
  public boolean normeq(String w1, String w2) {
    if (w1.length() != w2.length()) {
      w1 = Normalizer.normalize(w1, Normalizer.Form.NFD);
      w2 = Normalizer.normalize(w2, Normalizer.Form.NFD);
    }
    return w1.equals(w2);
  }

  public void testEquals(String w1, String w2) {
    System.out.println(w1 + " equals " + w2 + " " + w1.equals(w2));
    System.out.println(w1 + " normeq " + w2 + " " + normeq(w1, w2));
  }
}

The enum constant Normalizer.Form.NFD selects the kind of normalization to apply; here it is just the decomposition step that separates precombined letters into a Latin letter and the diacritical mark. Let's try it out:

public class NormalizeTest {
  public static void main(String[] args) {
    Normalize norm = new Normalize();
    norm.testEquals("Genève", "Gene\u0300ve");
    norm.testEquals("ha\u0301ček", "hác\u030cek");
  }
}

We can see this output. Warning: you might need to use the correct font to view this properly.

Genève equals Genève false
Genève normeq Genève true
háček equals háček false
háček normeq háček false

Do you see what went wrong? The blunder is in method normeq: you can't assume that equal lengths indicate the same normalization state. In the second pair of words, one argument was written with the first accented letter decomposed and the second one precomposed, and the other one vice versa. The string lengths are therefore equal, but the character arrays are not, even though the word is the same. There is no shortcut, but we can use this optimistic approach:

public boolean normeq(String w1, String w2) {
  if (w1.equals(w2)) {
    return true;
  } else {
    w1 = Normalizer.normalize(w1, Normalizer.Form.NFD);
    w2 = Normalizer.normalize(w2, Normalizer.Form.NFD);
    return w1.equals(w2);
  }
}

Collating Strings

Class java.lang.String implements java.lang.Comparable, but its method compareTo is just a rudimentary effort, with a resulting collating sequence that isn't good for anything except storing strings in an array where binary search is used. Consider, for instance, these four words, which are presented in the order Germans expect them in their dictionaries: "Abend", "aber", "morden", "Morgen". Applying Arrays.sort to this set yields "Abend", "Morgen", "aber", "morden", due to all upper case letters in the range 'A' to 'Z' preceding all lower case letters.
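You can verify this with a minimal sketch (the class name is mine):

import java.util.Arrays;

public class NaiveSort {
  public static void main(String[] args) {
    String[] words = {"Abend", "aber", "morden", "Morgen"};
    Arrays.sort(words); // compares char values: 'A'-'Z' all precede 'a'-'z'
    System.out.println(Arrays.toString(words));
    // prints [Abend, Morgen, aber, morden]
  }
}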

Treating an upper case and the corresponding lower case letter as (almost) equal is just one of the many deviations from the character order required in a useful collation algorithm. Also, note that there's a wide range of applied collations, varying by language and usage. German dictionaries, for instance, use a collation where vowels with a diaeresis are ranked immediately after the unadorned vowel, and the letter 'ß', originally resulting from a ligature of 'ſ' (long s) and 'z', is treated like 'ss'. But for lists of names, as in a telephone book, the German Standard establishes the equations 'ä' = 'ae', 'ö' = 'oe' and 'ü' = 'ue'. Book indices may require very detailed attention, e.g., when mathematical symbols have to be included.

The technical report Unicode Collation Algorithm (UCA) contains a highly detailed specification of a general collation algorithm, with all the bells and whistles required to cope with all nuances for ordering. For anyone planning a non-trivial application dealing with Unicode strings and requiring sorting and searching, this is a must-read, and it's highly informative for anybody with an interest in languages.

Even if not all intricacies outlined in the UCA report are implemented, a generally applicable collating algorithm must support the free definition of collating sequences, and it is evident that this requires more than just the possibility of defining an arbitrary ordering of the characters. The class RuleBasedCollator in java.text provides the most essential features for this. Here is a simple example for the use of RuleBasedCollator.

import java.text.*;
import java.util.*;

public class GermanSort implements Comparator<String> {
  private final RuleBasedCollator collator;

  public GermanSort() throws ParseException {
    collator = createCollator();
  }

  private RuleBasedCollator createCollator() throws ParseException {
    String german = "" +
        "= '-',''' " +
        "< A,a;ä,Ä< B,b< C,c< D,d< E,e< F,f< G,g< H,h< I,i< J,j" +
        "< K,k< L,l< M,m< N,n< O,o;Ö,ö< P,p< Q,q< R,r< S,s< T,t" +
        "< U,u;Ü,ü< V,v< W,w< X,x< Y,y< Z,z" +
        "& ss=ß";
    return new RuleBasedCollator(german);
  }

  public int compare(String s1, String s2) {
    return collator.compare(s1, s2);
  }

  public void sort(String[] strings) {
    Arrays.sort(strings, this);
  }
}

The string german contains the definition of the rules, ranking the 26 letters of the ISO basic Latin alphabet by using the primary relational operator '<'. A weaker ordering principle is indicated by a semicolon, which places an umlaut after its stem vowel, and even less significant is the case difference, indicated by a comma. The initial part defines the hyphen and the apostrophe as ignorable characters. The last relations reset the position to 's', and rank 'ß' as equal to 'ss'. (Note: The javadoc for this class is neither complete nor correct. Use the syntax illustrated in the preceding example for defining ignorables.)
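For illustration, the comparator above might be exercised like this (the test class is my own sketch):

import java.text.ParseException;
import java.util.Arrays;

public class GermanSortTest {
  public static void main(String[] args) throws ParseException {
    String[] words = {"Morgen", "aber", "Abend", "morden"};
    GermanSort sorter = new GermanSort();
    sorter.sort(words);
    System.out.println(Arrays.toString(words));
    // prints [Abend, aber, morden, Morgen], the dictionary order
  }
}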

There is, however, a much simpler way to obtain a Collator that is adequate for most collating tasks (or at least a good starting point): simply call method getInstance, preferably with a Locale parameter. This returns a prefabricated RuleBasedCollator object, according to the indicated locale. Make sure to select the locale not only according to language, since the country may affect the collating rules. Also, the Collator instances available in this way may not be up to date, as the following little story illustrates. There used to be a French collating rule requiring the words "cote", "côte", "coté" and "côté" to be in this order, which is in contrast to normal accent ordering, i.e., "cote", "coté", "côte" and "côté". Not too long ago, this fancy rule retreated to Canada. But, even with JDK 7, you may have to create a modified Collator by removing the trailing '@' from the string defining the sort rules:

Collator collator = Collator.getInstance(new Locale("fr", "FR"));
String rules = ((RuleBasedCollator)collator).getRules();
// '@' is last
rules = rules.substring(0, rules.length()-1);
collator = new RuleBasedCollator(rules);

(Making the preceding code robust is left as an exercise to the reader.)

A Closer Look: Sorting Strings

Comparing Unicode strings according to a rule based collation is bound to be a non-trivial process, since the collator rules must be taken into account. You can get an idea of what this means when you look at class CollationElementIterator. This iterator, obtainable for strings by calling the RuleBasedCollator method getCollationElementIterator, delivers sequences of integers that, when compared to each other, result in the correct relation according to the collator. These integers are quite artsy combinations of a character, or of a character and the next one; even two or more key integers may result from a single character. For a once-in-a-while invocation of a collator's compare method this isn't going to hurt, but sorting more than a fistful of strings is an entirely different matter.

This is where class CollationKey comes to the rescue. Objects are created by calling the (rule based) collator method getCollationKey for a string. Each object represents a value equivalent to the string's unique position in the set of all strings sorted according to this collator.

Putting this all together, an efficient sort of a collection of strings should create a collection of collation keys and sort it. Conveniently enough, the CollationKey method getSourceString delivers the corresponding string from which the key was created. This is shown in the sort method given below.

public String[] sort(String[] strings) {
  CollationKey[] keys = new CollationKey[strings.length];
  for (int i = 0; i < strings.length; i++) {
    keys[i] = collator.getCollationKey(strings[i]);
  }
  Arrays.sort( keys );
  String[] sorted = new String[strings.length];
  for (int i = 0; i < sorted.length; i++) {
    sorted[i] = keys[i].getSourceString();
  }
  return sorted;
}

Supplementary Characters

Supplementary characters, i.e., those that need to be expressed with surrogate pairs in UTF-16, are uncommon. However, it's important to know where they can turn up, since they may require precautions in your application code. They include:

  • Emoji symbols and emoticons, for inter-operating with Japanese mobile phones. While the BMP already contains quite a lot of emoticons, hundreds of Emoji characters were encoded in version 6.0 of the Unicode standard.
  • Uncommon (but not unused) CJK (i.e., Chinese, Japanese and Korean) characters, important for personal and place names.
  • Variation selectors for ideographic variation sequences.
  • Important symbols for mathematics.
  • Numerous minority scripts and historic scripts, important for some user communities.
  • Symbols for Domino and Mahjong tiles.

At the very least, Java applications for mobile devices will have to be aware of the growing number of Emoji symbols. Games could be another popular reason for the need to include supplementary characters. The sketch below shows the kind of precaution handling them requires.
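Here is a small sketch of my own (the domino tile codepoint U+1F030 is just one example from the ranges listed above). It shows why code that iterates over a string char by char breaks on supplementary characters:

public class Supplementary {
  public static void main(String[] args) {
    // U+1F030 DOMINO TILE HORIZONTAL BACK lies outside the BMP
    String s = new StringBuilder().appendCodePoint(0x1F030).toString();
    System.out.println(s.length());                      // 2 char code units: a surrogate pair
    System.out.println(s.codePointCount(0, s.length())); // but only 1 codepoint

    // iterate by codepoint, not by char
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      System.out.printf("U+%04X%n", cp);
      i += Character.charCount(cp);
    }
  }
}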

Property Files

The javadoc for java.util.Properties states that the load and store methods read and write a character stream encoded in ISO 8859-1. This is an 8-bit character set, containing a selection of letters with diacritical marks as used in several European languages in addition to the traditional US-ASCII characters. Any other character must be represented using the Unicode escape (\uHHHH). This is quite likely to trip you up when you trustingly edit your properties file with an editor that's been educated to understand UTF-8. Although all printable ISO 8859-1 characters with code units greater than 0x7F happen to map to Unicode code points that are numerically equal to these code points, their UTF-8 encoding requires two bytes. (The appearance of 'Â' or 'Ā' in front of some other character is the typical evidence of such a misunderstanding.) Moreover, it's easy to create a character not contained in the ISO 8859-1 set. On my Linux system, emacs lets me produce the trade mark sign (™) with a few keystrokes. For a remedy, the same javadoc explains that the tool native2ascii accompanying the JDK may be used to convert a file from any encoding to ISO 8859-1 with Unicode escapes.
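To see the escaping at work, here is a small sketch of my own, writing to a byte array instead of a file. The store method writes the trade mark sign as a Unicode escape, exactly as the javadoc promises:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Properties;

public class PropsDemo {
  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    props.setProperty("brand", "Java\u2122"); // trade mark sign, not in ISO 8859-1
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    props.store(out, null);
    System.out.println(out.toString("ISO-8859-1"));
    // prints (after a date comment line): brand=Java\u2122
  }
}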

The Properties methods loadFromXML(InputStream) and storeToXML(OutputStream, String, String) read and write XML data, which should indicate its encoding in the XML declaration. It may be more convenient to use these methods than the edit-and-convert rigmarole required for a simple character stream.
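A minimal sketch (property value and file name are my own choices):

import java.io.*;
import java.util.Properties;

public class XmlProps {
  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    props.setProperty("greeting", "Gr\u00fc\u00df Gott");
    // the XML declaration will carry the indicated encoding
    props.storeToXML(new FileOutputStream("props.xml"), null, "UTF-8");

    Properties read = new Properties();
    read.loadFromXML(new FileInputStream("props.xml"));
    System.out.println(read.getProperty("greeting"));
  }
}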

Writing Text Files

We call a file a "text file" if its data is meant to be a sequence of lines containing characters. While any programming language may have its individual concept of the set of characters it handles as a value of a character data type (and a singular way of representing a character in memory), things aren't quite as simple as soon as you have to entrust your data to a file system. Other programs, on the same or on another system, should be able to read that data and interpret it so that they come up with the same set of characters. Standards institutes and vendors have created an overly rich set of encodings: prescriptions for mapping byte sequences to character sequences. On top of that, there are the various escape mechanisms which let you represent characters not contained in the basic set as sequences of characters from that set. The latter is an issue of interpretation according to various text formats, such as XML or HTML, and we'll skip it here.

Writing a sequence of characters and line terminators to a file should be a simple exercise, and the API of java.io does indeed provide all the essentials, but there are two things to consider: first, what should become of a "character" when it is stored on the medium or sent over the wire; second, how lines are separated.

If the set of characters in the text goes beyond what can be represented with one of the legacy encodings that use one 8-bit code unit per character, one of the Unicode encoding schemes UTF-8, UTF-16 or UTF-32 must be chosen, and it should be set explicitly as it is risky to rely on the default stored in the system property file.encoding.
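For instance, the encoding can be fixed explicitly when the Writer is created; here is a minimal sketch (the file name is arbitrary):

import java.io.*;

public class WriteUtf8 {
  public static void main(String[] args) throws IOException {
    Writer w = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("out.txt"), "UTF-8")); // explicit, not file.encoding
    try {
      w.write("Gen\u00e8ve, h\u00e1\u010dek");
      w.write(System.getProperty("line.separator"));
    } finally {
      w.close();
    }
  }
}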

Which one should you choose, provided you have a choice at all? If size matters, consider that UTF-16 produces 2 bytes per character, whereas UTF-8 is a variable-width encoding, requiring 1, 2, 3 or more bytes for each codepoint. Thus, if your text uses characters from US-ASCII only, the ratio between UTF-8 and UTF-16 will be 1:2; if you are writing an Arabic text, the ratio is bound to be 1:1; and for CJK it will be 3:2. Compressing the files narrows the distance considerably. Anyway, UTF-8 has become the dominant character encoding for the World Wide Web, and it is increasingly used as the default in operating systems. It is therefore hard to put up a case against using this very flexible encoding.
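If you want to check these ratios yourself, a little sketch like the following will do (the sample strings are mine; UTF-16BE is used so that no byte order mark is counted):

import java.io.UnsupportedEncodingException;

public class EncodingSizes {
  public static void main(String[] args) throws UnsupportedEncodingException {
    // US-ASCII only, Latin with a diacritic, and CJK
    String[] samples = {"hello", "Gen\u00e8ve", "\u65e5\u672c\u8a9e"};
    for (String s : samples) {
      System.out.printf("%s: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
          s, s.getBytes("UTF-8").length, s.getBytes("UTF-16BE").length);
    }
  }
}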

Conclusion

Delving into Unicode's mysteries is a highly rewarding adventure. We have seen that Java provides some support for text processing according to the Unicode standard, but you should always keep in mind that this support may not be sufficient for more sophisticated applications. This has been one of the two motives for writing this article. And what was the other one? Ah, yes, having fun!

Kind regards

Wolfgang

 
