The Java Specialists' Newsletter
Issue 166 2008-11-30
Category: Tips and Tricks
Java version: Java 5 and 6
Serialization Cache
by Dr. Heinz M. Kabutz
Abstract:
Java's serialization mechanism is optimized for immutable
objects. Writing objects without resetting the stream
causes a memory leak. Writing a changed object twice
results in only the first state being written. However,
resetting the stream also loses the optimization stored in
the stream.
Welcome to the 166th issue of The Java(tm) Specialists' Newsletter, sent to you from
the beautiful island of Crete. We recently picked up tennis
racquets again, a great sport to play when you have
such an ample supply of fine weather. To make it more
enjoyable, I've started taking private tennis lessons from
a friend who used to play professionally. When it comes to
sport, I'm your typical computer geek. Fortunately my coach
is patient, even exaggerating her praise somewhat.
Last weekend we hopped onto the ferry and traveled to Athens
for the day to, amongst other things, attend the Java
Hellenic User Group. Considering how many are facing tough
times with unemployment cutting through many companies, I
decided to talk about my story. I became self-employed in
November 1998. We went to a bank last week to apply for a
home loan and were treated like two unemployed hobos.
Business owners are regarded with suspicion by people who
have jobs, even if our income is more secure than theirs.
Have a look here for the talk Un^H^HSelf-Employed (10MB).
Serialization Cache
One of my favourite demonstrations is to show how Java is
supposedly faster at writing data than any other language.
We open an ObjectOutputStream, then write a
large byte array to the stream millions of times. In our
code, we are writing 1000 terabytes in under 4 seconds. We
need to make sure we run the program with 256 megabytes of
memory, using the -Xmx256m flag.
import java.io.*;

public class SuperFastWriter {
  private static final long TERA_BYTE =
      1024L * 1024 * 1024 * 1024;

  public static void main(String[] args) throws IOException {
    long bytesWritten = 0;
    byte[] data = new byte[100 * 1024 * 1024];
    ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(
            new FileOutputStream("bigdata.bin")
        )
    );
    long time = System.currentTimeMillis();
    for (int i = 0; i < 10 * 1024 * 1024; i++) {
      out.writeObject(data);
      bytesWritten += data.length;
    }
    out.writeObject(null);
    out.close();
    time = System.currentTimeMillis() - time;
    System.out.printf("Wrote %d TB%n", bytesWritten / TERA_BYTE);
    System.out.println("time = " + time);
  }
}
The code completes in under four seconds on my MacBook Pro:
heinz$ java -Xmx256m SuperFastWriter
Wrote 1000 TB
time = 3710
At this point you must be wondering what type of disk I have
in my MacBook Pro, to have 1000 terabytes free and to be able
to write so fast? After all, I am writing over 250 TB per second!
Try it on your machine and you should see similar results.
If we look at the file on our disk, we see that it only uses
150 megabytes of space. Whenever we serialize an object with
the ObjectOutputStream, it is cached in an identity hash
table. When we write the object
again, only a pointer to it is written. Something similar
happens on reading. When we read an object, it is put in a
local identity hash table, mapping the pointer to the object.
Future reads of the pointer simply return the first object
that was read. This minimizes the data that needs to be
written and solves the circular dependency problem.
import java.io.*;

public class SuperFastReader {
  private static final long TERA_BYTE =
      1024L * 1024 * 1024 * 1024;

  public static void main(String[] args) throws Exception {
    long bytesRead = 0;
    ObjectInputStream in = new ObjectInputStream(
        new BufferedInputStream(
            new FileInputStream("bigdata.bin")
        )
    );
    long time = System.currentTimeMillis();
    byte[] data;
    while ((data = (byte[]) in.readObject()) != null) {
      bytesRead += data.length;
    }
    in.close();
    time = System.currentTimeMillis() - time;
    System.out.printf("Read %d TB%n", bytesRead / TERA_BYTE);
    System.out.println("time = " + time);
  }
}
This program appears to read 1000 terabytes of data in just a
few seconds, which we know is impossible on my little laptop:
Read 1000 TB
time = 2033
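The back-reference behaviour described above can be checked in memory,
without a large file. The following is a minimal sketch, not part of
the original demonstration: the class name BackReferenceDemo and the
helper writeRepeatedly() are made up for illustration. It writes the
same small array several times to a ByteArrayOutputStream, compares the
stream sizes, and confirms that reading returns one shared instance.

```java
import java.io.*;

public class BackReferenceDemo {
  // Serializes the same array 'copies' times and returns the stream bytes.
  // (Hypothetical helper, introduced only for this sketch.)
  static byte[] writeRepeatedly(byte[] data, int copies) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    for (int i = 0; i < copies; i++) {
      out.writeObject(data);
    }
    out.close();
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws Exception {
    byte[] data = new byte[1024];
    byte[] once = writeRepeatedly(data, 1);
    byte[] thrice = writeRepeatedly(data, 3);

    // The two repeats add only small back-references, not 2 * 1024 bytes.
    System.out.println("size written once   = " + once.length);
    System.out.println("size written thrice = " + thrice.length);

    ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(thrice));
    Object first = in.readObject();
    Object second = in.readObject();
    in.close();

    // Both reads hand back the very same array instance.
    System.out.println("same instance = " + (first == second));
  }
}
```

The difference between the two sizes should be a handful of bytes per
repeated write, which is why the 1000 TB file above fits in about 150
megabytes.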
Our next experiment is to fill the byte[] with
data and then write it repeatedly. Since the
Arrays.fill() method is quite slow, we just
write 256 large arrays.
import java.io.*;
import java.util.Arrays;

public class ModifiedObjectWriter {
  public static void main(String[] args) throws IOException {
    byte[] data = new byte[10 * 1024 * 1024];
    ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(
            new FileOutputStream("smalldata.bin")
        )
    );
    for (int i = -128; i < 128; i++) {
      Arrays.fill(data, (byte) i);
      out.writeObject(data);
    }
    out.writeObject(null);
    out.close();
  }
}
The ModifiedObjectWriter creates a file smalldata.bin,
containing the byte arrays. Let's see what happens when we
read the data back again:
import java.io.*;

public class ModifiedObjectReader {
  public static void main(String[] args) throws Exception {
    ObjectInputStream in = new ObjectInputStream(
        new BufferedInputStream(
            new FileInputStream("smalldata.bin")
        )
    );
    byte[] data;
    while ((data = (byte[]) in.readObject()) != null) {
      System.out.println(data[0]);
    }
    in.close();
  }
}
Instead of seeing the numbers -128, -127, -126, etc., we only
see the number -128. When we modify the contents of an
object and then write that object again to the stream, the
serialization mechanism sees that it is the same object again
and just writes a pointer to the object. On the reading
side, it reads the byte array once, containing -128, and
stores that in its local identity hash table. When it reads
the pointer to the object, it just returns the object from
its local table. The serialization mechanism cannot know
whether an object was changed.
Here is a rule to remember: Never serialize mutable
objects. We could relax this rule a little bit to:
Don't re-serialize modified objects without first
resetting the stream. Another approach is to always
copy mutable objects before serialization. None of these
relaxations will necessarily give you the intended behaviour.
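To compare these options side by side, here is a small in-memory
sketch. It is only an illustration: the class StaleStateDemo, its Mode
enum and its helper methods are invented names. It also tries
ObjectOutputStream.writeUnshared(), which the text above does not
mention but which is part of the standard API. Each mode writes the
same one-byte array twice, changing the byte from 1 to 2 in between,
and reports what the reader sees for the second write.

```java
import java.io.*;

public class StaleStateDemo {
  // Hypothetical names for the strategies being compared.
  enum Mode { PLAIN, RESET, COPY, UNSHARED }

  static void write(ObjectOutputStream out, byte[] data, Mode mode)
      throws IOException {
    switch (mode) {
      case PLAIN:    out.writeObject(data); break;
      case RESET:    out.writeObject(data); out.reset(); break;
      case COPY:     out.writeObject(data.clone()); break;
      case UNSHARED: out.writeUnshared(data); break;
    }
  }

  // Writes the array, modifies it, writes it again, then returns the
  // first byte of the second object read back from the stream.
  static int secondValueRead(Mode mode)
      throws IOException, ClassNotFoundException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    byte[] data = {1};
    write(out, data, mode);
    data[0] = 2;
    write(out, data, mode);
    out.close();

    ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(bytes.toByteArray()));
    in.readObject(); // skip the first copy
    byte[] second = (byte[]) in.readObject();
    in.close();
    return second[0];
  }

  public static void main(String[] args) throws Exception {
    for (Mode mode : Mode.values()) {
      System.out.println(mode + " -> " + secondValueRead(mode));
    }
  }
}
```

Only the PLAIN mode should report the stale value 1. Note that
clone() on an array is a shallow copy, which is enough for a byte[]
but not for an object graph, and that according to its Javadoc
writeUnshared() only applies to the top-level object, not to objects
it references.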
Let's see what happens if we create new objects every
time. In our ModifiedObjectWriter2 we create a new byte
array every time, fill it with data, then write it to the
stream:
import java.io.*;
import java.util.Arrays;

public class ModifiedObjectWriter2 {
  public static void main(String[] args) throws IOException {
    ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(
            new FileOutputStream("verylargedata.bin")
        )
    );
    for (int i = -128; i < 128; i++) {
      byte[] data = new byte[10 * 1024 * 1024];
      Arrays.fill(data, (byte) i);
      out.writeObject(data);
    }
    out.writeObject(null);
    out.close();
  }
}
Within a few seconds you should see an OutOfMemoryError. On
writing, each object is put into the stream's identity hash
table, where it remains until the stream is reset. This leads
to a memory leak if we continue writing (or reading) new objects.
You can observe that the generated file will be approximately
as large as you have memory available for your virtual
machine.
If we reset the stream every time that we write an object,
we avoid the OutOfMemoryError. In addition, as the table
is flushed after every write, we avoid the problem where
changes to an object are not written across the stream.
import java.io.*;
import java.util.Arrays;

public class ModifiedObjectWriter3 {
  public static void main(String[] args) throws IOException {
    ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(
            new FileOutputStream("verylargedata.bin")
        )
    );
    byte[] data = new byte[10 * 1024 * 1024];
    for (int i = -128; i < 128; i++) {
      Arrays.fill(data, (byte) i);
      out.writeObject(data);
      out.reset();
    }
    out.writeObject(null);
    out.close();
  }
}
Unfortunately, resetting the stream will indiscriminately
flush all the objects from the identity hash table,
even immutable objects.
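The cost of that indiscriminate flush can be made visible with a short
sketch. The class name ResetCostDemo and the method streamSize() are
assumptions made for this illustration. It writes one unchanging String
100 times, once relying on the cache and once resetting after every
write, and compares how many bytes end up in the stream.

```java
import java.io.*;

public class ResetCostDemo {
  // Writes the same immutable String 100 times and returns the number
  // of bytes produced. (Hypothetical helper for this sketch.)
  static int streamSize(boolean resetAfterEachWrite) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    String immutable =
        "a string that never changes, so caching it is always safe";
    for (int i = 0; i < 100; i++) {
      out.writeObject(immutable);
      if (resetAfterEachWrite) {
        out.reset(); // forgets the String, so it is written in full again
      }
    }
    out.close();
    return bytes.size();
  }

  public static void main(String[] args) throws IOException {
    System.out.println("without reset = " + streamSize(false) + " bytes");
    System.out.println("with reset    = " + streamSize(true) + " bytes");
  }
}
```

With the reset, every write carries the full String again, so the
second figure should be many times larger than the first.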
The design of the ObjectOutputStream and ObjectInputStream
is great in that it minimizes unnecessary transfer of
objects. This is particularly useful when transmitting
immutable objects, such as Strings. Unfortunately, in its
raw state, it often causes an OutOfMemoryError or incomplete
data transmission.
An approach used by RMI is to serialize the parameters into a
byte array and then to send that across the socket. This way
we avoid the problems of the memory leak and of changes not
being sent across.
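The idea of serializing into a byte array first can be sketched as
follows. This is not RMI's actual implementation, only an illustration
of the general approach. The class FramedSender, its methods and the
length-prefixed framing are all assumptions made for this example, and
a ByteArrayOutputStream stands in for the socket.

```java
import java.io.*;

public class FramedSender {
  // Serializes one message with a throwaway ObjectOutputStream, so no
  // identity table survives from one message to the next.
  static byte[] toBytes(Object message) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(message);
    out.close();
    return bytes.toByteArray();
  }

  // Sends the message as a length-prefixed frame (assumed wire format).
  static void send(DataOutputStream wire, Object message)
      throws IOException {
    byte[] frame = toBytes(message);
    wire.writeInt(frame.length);
    wire.write(frame);
    wire.flush();
  }

  // Reads one frame and deserializes it with a fresh ObjectInputStream.
  static Object receive(DataInputStream wire)
      throws IOException, ClassNotFoundException {
    byte[] frame = new byte[wire.readInt()];
    wire.readFully(frame);
    ObjectInputStream in = new ObjectInputStream(
        new ByteArrayInputStream(frame));
    try {
      return in.readObject();
    } finally {
      in.close();
    }
  }

  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream fakeSocket = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(fakeSocket);
    byte[] data = {1};
    send(out, data);
    data[0] = 2; // modify the same object and send it again
    send(out, data);

    DataInputStream in = new DataInputStream(
        new ByteArrayInputStream(fakeSocket.toByteArray()));
    System.out.println(((byte[]) receive(in))[0]);
    System.out.println(((byte[]) receive(in))[0]);
  }
}
```

Because each message gets its own stream, the second send carries the
modified value and nothing accumulates between messages. The price is
that nothing is cached across messages either, including immutable
objects.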
Kind regards
Heinz