Abstract: Java's serialization mechanism is optimized for immutable objects. Writing objects without resetting the stream causes a memory leak. Writing a changed object twice results in only the first state being written. However, resetting the stream also loses the optimization stored in the stream.
Welcome to the 166th issue of The Java(tm) Specialists' Newsletter, sent to you from the beautiful island of Crete. We recently picked up tennis racquets again, a great sport to play when you have such an ample supply of fine weather. To make it more enjoyable, I've started taking private tennis lessons from a friend who used to play professionally. When it comes to sport, I'm your typical computer geek. Fortunately my coach is patient, even exaggerating her praise somewhat.
Last weekend we hopped onto the ferry and traveled to Athens for the day to, amongst other things, attend the Java Hellenic User Group. Considering how many people are facing tough times, with unemployment cutting through many companies, I decided to tell my own story. I became self-employed in November 1998. We went to a bank last week to apply for a home loan and were treated like two unemployed hobos. Business owners are regarded with suspicion by people who have jobs, even if our income is more secure than theirs. Have a look here for the talk Un^H^HSelf-Employed (10MB).
One of my favourite demonstrations is to show how Java is supposedly faster at writing data than any other language. We open an ObjectOutputStream, then write a large byte array to the stream millions of times. In our code, we write 1000 terabytes in under 4 seconds. We need to make sure we run the program with 256 megabytes of memory, using the -Xmx256m flag.
import java.io.*;
public class SuperFastWriter {
  private static final long TERA_BYTE =
      1024L * 1024 * 1024 * 1024;
  public static void main(String... args) throws IOException {
    long bytesWritten = 0;
    byte[] data = new byte[100 * 1024 * 1024];
    ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(
            new FileOutputStream("bigdata.bin")
        )
    );
    long time = System.currentTimeMillis();
    for (int i = 0; i < 10 * 1024 * 1024; i++) {
      out.writeObject(data);
      bytesWritten += data.length;
    }
    out.writeObject(null);
    out.close();
    time = System.currentTimeMillis() - time;
    System.out.printf("Wrote %d TB%n", bytesWritten / TERA_BYTE);
    System.out.println("time = " + time);
  }
}
The code completes in under four seconds on my MacBook Pro:
    heinz$ java -Xmx256m SuperFastWriter
    Wrote 1000 TB
    time = 3710
At this point you must be wondering what type of disk I have in my MacBook Pro, to have 1000 terabytes free and to be able to write so fast? After all, I am writing more than 250 TB per second! Try it on your machine and you should see similar results.
If we look at the file on our disk, we see that it only uses about 150 megabytes of space: 100 megabytes for the single copy of the array, plus roughly 5 bytes for each of the ten million repeated writes. Whenever we serialize an object with the ObjectOutputStream, it is cached in an identity hash table. When we write the object again, only a pointer to it is written. Something similar happens on reading. When we read an object, it is put in a local identity hash table, mapping the pointer to the object. Future reads of the pointer simply return the first object that was read. This minimizes the data that needs to be written and solves the circular dependency problem.
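To see the pointer mechanism on a small scale, here is a minimal sketch that writes the same array twice into a ByteArrayOutputStream and measures how much the stream grows. The class and method names (BackReferenceDemo, sizes, sameInstanceOnRead) are made up for this illustration, and the exact byte counts depend on the serialization format, so the program prints them rather than promising particular values.

```java
import java.io.*;

class BackReferenceDemo {
  /** Returns {stream size after the first write, stream size after the second write}. */
  static int[] sizes() throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    byte[] data = new byte[1000];
    out.writeObject(data);   // full array contents are written
    out.flush();             // push buffered data into the byte array
    int first = bytes.size();
    out.writeObject(data);   // same instance: only a pointer is written
    out.flush();
    int second = bytes.size();
    out.close();
    return new int[] {first, second};
  }

  /** Writes one array twice, reads it back twice, and checks whether both reads yield one instance. */
  static boolean sameInstanceOnRead() throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    byte[] data = new byte[1000];
    out.writeObject(data);
    out.writeObject(data);
    out.close();
    ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(bytes.toByteArray()));
    Object a = in.readObject();
    Object b = in.readObject();
    in.close();
    return a == b;
  }

  public static void main(String... args) throws Exception {
    int[] s = sizes();
    System.out.println("after first write:  " + s[0] + " bytes");
    System.out.println("after second write: " + s[1] + " bytes");
    System.out.println("same instance on read: " + sameInstanceOnRead());
  }
}
```

The second write should add only a handful of bytes, and the reader should hand back the very same array instance both times.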
import java.io.*;
public class SuperFastReader {
  private static final long TERA_BYTE =
    1024L * 1024 * 1024 * 1024;
  public static void main(String... args) throws Exception {
    long bytesRead = 0;
    ObjectInputStream in = new ObjectInputStream(
        new BufferedInputStream(
            new FileInputStream("bigdata.bin")
        )
    );
    long time = System.currentTimeMillis();
    byte[] data;
    while ((data = (byte[]) in.readObject()) != null) {
      bytesRead += data.length;
    }
    in.close();
    time = System.currentTimeMillis() - time;
    System.out.printf("Read %d TB%n", bytesRead / TERA_BYTE);
    System.out.println("time = " + time);
  }
}
This program appears to read 1000 terabytes of data in just a few seconds, which we know is impossible on my little laptop:
    Read 1000 TB
    time = 2033
Our next experiment is to fill the byte[] with data and then write it repeatedly. Since the Arrays.fill() method is quite slow, we just write 256 large arrays.
import java.io.*;
import java.util.Arrays;
public class ModifiedObjectWriter {
  public static void main(String... args) throws IOException {
    byte[] data = new byte[10 * 1024 * 1024];
    ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(
            new FileOutputStream("smalldata.bin")
        )
    );
    for (int i = -128; i < 128; i++) {
      Arrays.fill(data, (byte) i);
      out.writeObject(data);
    }
    out.writeObject(null);
    out.close();
  }
}
The ModifiedObjectWriter creates a file smalldata.bin, containing the byte arrays. Let's see what happens when we read the data back again:
import java.io.*;
public class ModifiedObjectReader {
  public static void main(String... args) throws Exception {
    ObjectInputStream in = new ObjectInputStream(
        new BufferedInputStream(
            new FileInputStream("smalldata.bin")
        )
    );
    byte[] data;
    while ((data = (byte[]) in.readObject()) != null) {
      System.out.println(data[0]);
    }
    in.close();
  }
}
Instead of seeing the numbers -128, -127, -126, etc., we see -128 printed 256 times. When we modify the contents of an object and then write that object again to the stream, the serialization mechanism sees that it is the same object again and just writes a pointer to the object. On the reading side, it reads the byte array once, containing -128, and stores that in its local identity hash table. When it reads the pointer to the object, it just returns the object from its local table. The serialization mechanism cannot know whether an object was changed.
Here is a rule to remember: Never serialize mutable objects. We could relax this rule a little bit to: Don't re-serialize modified objects without first resetting the stream. Another approach is to always copy mutable objects before serialization. None of these relaxations will necessarily give you the intended behaviour.
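As a rough illustration of the copying approach, the sketch below writes one array three times with different contents, either as the shared instance or as a fresh clone each time, and collects the first byte of every array read back. CopyBeforeWriteDemo and roundTrip are names invented for this example, and everything happens in memory to keep it short.

```java
import java.io.*;
import java.util.*;

class CopyBeforeWriteDemo {
  /** Writes one array three times with new contents; returns the first byte of each array read back. */
  static List<Byte> roundTrip(boolean copyBeforeWrite)
      throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    byte[] data = new byte[16];
    for (int i = 1; i <= 3; i++) {
      Arrays.fill(data, (byte) i);
      // clone() yields a distinct object, so the stream cannot substitute a pointer
      out.writeObject(copyBeforeWrite ? data.clone() : data);
    }
    out.writeObject(null);   // end marker, as in the listings above
    out.close();

    List<Byte> result = new ArrayList<>();
    ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(bytes.toByteArray()));
    byte[] read;
    while ((read = (byte[]) in.readObject()) != null) {
      result.add(read[0]);
    }
    in.close();
    return result;
  }

  public static void main(String... args) throws Exception {
    System.out.println("shared: " + roundTrip(false)); // prints shared: [1, 1, 1]
    System.out.println("copied: " + roundTrip(true));  // prints copied: [1, 2, 3]
  }
}
```

Copying fixes the stale data, but every copy is a new object as far as the stream is concerned, so each one is written in full and stays registered in the stream's table until the stream is reset.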
Let's see what happens if we create new objects every time. In our ModifiedObjectWriter2 we create a new byte array every time, fill it with data, then write it to the stream:
import java.io.*;
import java.util.Arrays;
public class ModifiedObjectWriter2 {
  public static void main(String... args) throws IOException {
    ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(
            new FileOutputStream("verylargedata.bin")
        )
    );
    for (int i = -128; i < 128; i++) {
      byte[] data = new byte[10 * 1024 * 1024];
      Arrays.fill(data, (byte) i);
      out.writeObject(data);
    }
    out.writeObject(null);
    out.close();
  }
}
Within a few seconds you should see an OutOfMemoryError. Every object we write is put into the stream's identity hash table, where it remains until the stream is reset. This leads to a memory leak if we continue writing (or reading) new objects. You can observe that the generated file is approximately as large as the memory available to your virtual machine.
If we reset the stream every time that we write an object, we avoid the OutOfMemoryError. In addition, as the table is flushed after every write, we avoid the problem where changes to an object are not written across the stream.
import java.io.*;
import java.util.Arrays;
public class ModifiedObjectWriter3 {
  public static void main(String... args) throws IOException {
    ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(
            new FileOutputStream("verylargedata.bin")
        )
    );
    byte[] data = new byte[10 * 1024 * 1024];
    for (int i = -128; i < 128; i++) {
      Arrays.fill(data, (byte) i);
      out.writeObject(data);
      out.reset();
    }
    out.writeObject(null);
    out.close();
  }
}
Unfortunately, resetting the stream will indiscriminately flush all the objects from the identity hash table, even immutable objects.
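To get a feel for what this costs, here is a small sketch that writes the same String one hundred times, once relying on the table and once resetting after every write, and compares the resulting stream sizes. ResetCostDemo and streamSize are hypothetical names for this illustration; the sizes are printed rather than predicted.

```java
import java.io.*;
import java.util.Arrays;

class ResetCostDemo {
  /** Writes the same String `count` times and returns the total stream size in bytes. */
  static int streamSize(String s, int count, boolean resetEachTime)
      throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    for (int i = 0; i < count; i++) {
      out.writeObject(s);
      if (resetEachTime) {
        out.reset();   // forgets every object written so far, immutable or not
      }
    }
    out.close();
    return bytes.size();
  }

  public static void main(String... args) throws IOException {
    char[] chars = new char[1000];
    Arrays.fill(chars, 'x');
    String text = new String(chars);
    System.out.println("without reset: " + streamSize(text, 100, false) + " bytes");
    System.out.println("with reset:    " + streamSize(text, 100, true) + " bytes");
  }
}
```

Without the reset, the String is written once and then referenced 99 times. With the reset, the full String goes out on every write, even though it could never have changed.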
The design of the ObjectOutputStream and ObjectInputStream is great in that it minimizes unnecessary transfer of objects. This is particularly useful when transmitting immutable objects, such as Strings. Unfortunately, in its raw state, it often causes an OutOfMemoryError or incomplete data transmission.
An approach used by RMI is to serialize the parameters into a byte array and then to send that across the socket. This way we avoid the problems of the memory leak and of changes not being sent across.
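The byte array idea can be sketched roughly as follows. This is not RMI's actual code, just an illustration under my own assumptions: every message is serialized with a short-lived ObjectOutputStream into a byte array, which is then sent with a length prefix. The names ByteArrayMessageDemo, toBytes, send and receive are invented for the example, and a ByteArrayOutputStream stands in for the socket.

```java
import java.io.*;
import java.util.Arrays;

class ByteArrayMessageDemo {
  /** Serializes one message with its own short-lived ObjectOutputStream. */
  static byte[] toBytes(Object message) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(message);
    out.close();   // the stream and its table become garbage after this call
    return bytes.toByteArray();
  }

  /** Sends one message as a 4-byte length followed by the serialized bytes. */
  static void send(DataOutputStream out, Object message) throws IOException {
    byte[] bytes = toBytes(message);
    out.writeInt(bytes.length);
    out.write(bytes);
    out.flush();
  }

  /** Reads one length-prefixed message and deserializes it with a fresh ObjectInputStream. */
  static Object receive(DataInputStream in)
      throws IOException, ClassNotFoundException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    ObjectInputStream objIn = new ObjectInputStream(
        new ByteArrayInputStream(bytes));
    try {
      return objIn.readObject();
    } finally {
      objIn.close();
    }
  }

  public static void main(String... args) throws Exception {
    ByteArrayOutputStream wire = new ByteArrayOutputStream(); // stands in for a socket
    DataOutputStream out = new DataOutputStream(wire);
    byte[] data = new byte[16];
    for (int i = 1; i <= 3; i++) {
      Arrays.fill(data, (byte) i);
      send(out, data);   // the same array instance, modified between sends
    }
    DataInputStream in = new DataInputStream(
        new ByteArrayInputStream(wire.toByteArray()));
    for (int i = 0; i < 3; i++) {
      byte[] read = (byte[]) receive(in);
      System.out.println(read[0]);   // prints 1, then 2, then 3
    }
  }
}
```

Since no table survives from one message to the next, modified objects are sent with their current state and nothing accumulates in memory. Objects that are shared within a single message are still written only once, but anything repeated across messages is sent again in full.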
Kind regards
Heinz