Tuesday, January 16, 2018

Using Google's Protocol Buffers with Java

Effective Java, Third Edition was recently released and I have been interested in identifying the updates to this classic Java development book, whose last edition only covered through Java 6. There are obviously completely new items in this edition that are closely related to Java 7, Java 8, and Java 9, such as Items 42 through 48 in Chapter 7 ("Lambdas and Streams"), Item 9 ("Prefer try-with-resources to try-finally"), and Item 55 ("Return optionals judiciously"). I was (very slightly) surprised to realize that the third edition of Effective Java had a new item not specifically driven by the new versions of Java, but instead driven by developments in the software development world independent of the versions of Java. That item, Item 85 ("Prefer alternatives to Java Serialization"), is what motivated me to write this introductory post on using Google's Protocol Buffers with Java.

In Item 85 of Effective Java, Third Edition, Josh Bloch emphasizes in bold text the following two assertions related to Java serialization:

  1. "The best way to avoid serialization exploits is to never deserialize anything."
  2. "There is no reason to use Java serialization in any new system you write."

After outlining the dangers of Java deserialization and making these bold statements, Bloch recommends that Java developers employ what he calls (to avoid confusion associated with the term "serialization" when discussing Java) "cross-platform structured-data representations." Bloch states that the leading offerings in this category are JSON (JavaScript Object Notation) and Protocol Buffers (protobuf). I found this mention of Protocol Buffers interesting because I've been reading about and playing with Protocol Buffers a bit lately. The use of JSON (even with Java) is exhaustively covered online. I suspect that Java developers are less aware of Protocol Buffers than of JSON, so a post on using Protocol Buffers with Java seems warranted.

Google's Protocol Buffers is described on its project page as "a language-neutral, platform-neutral extensible mechanism for serializing structured data." That page adds, "think XML, but smaller, faster, and simpler." Although one of the advantages of Protocol Buffers is that they support representing data in a way that can be used by multiple programming languages, the focus of this post is exclusively on using Protocol Buffers with Java.

There are several useful online resources related to Protocol Buffers including the main project page, the GitHub protobuf project page, the proto3 Language Guide (proto2 Language Guide is also available), the Protocol Buffer Basics: Java tutorial, the Java Generated Code Guide, the Java API (Javadoc) Documentation, the Protocol Buffers release page, and the Maven Repository page. The examples in this post are based on Protocol Buffers 3.5.1.

The Protocol Buffer Basics: Java tutorial outlines the process for using Protocol Buffers with Java. It covers a lot more possibilities and things to consider when using Java than I will cover here. The first step is to define the language-independent Protocol Buffers format. This is done in a text file with the .proto extension. For my example, I've described my protocol format in the file album.proto, which is shown in the next code listing.

album.proto

syntax = "proto3";

option java_outer_classname = "AlbumProtos";
option java_package = "dustin.examples.protobuf";

message Album
{
  string title = 1;
  repeated string artist = 2;
  int32 release_year = 3;
  repeated string song_title = 4;
}

Although the above definition of a protocol format is simple, there's a lot covered. The first line explicitly states that I'm using proto3 instead of the assumed default proto2 that is currently used when this is not explicitly specified. The two lines beginning with option are only of interest when using this protocol format to generate Java code and they indicate the name of the outermost class and the package of that outermost class that will be generated for use by Java applications to work with this protocol format.

The "message" keyword indicates that this structure, named "Album" here, is what needs to be represented. There are four fields in this construct with three of them being string format and one being an integer (int32). Two of the four fields can exist more than once in a given message because they are annotated with the repeated reserved word. Note that I created this definition without considering Java except for the two options that specify details of generation of Java classes from this format specification.
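The numbers after each field name (1 through 4) are field numbers rather than default values: they identify each field in the binary encoding. As a rough illustration, the following sketch hand-encodes release_year = 1985 the way the publicly documented Protocol Buffers wire format describes it (a tag built from the field number and wire type, followed by a base-128 varint). The class and method names are hypothetical and are not part of the protobuf library; the generated code normally does this work, and the sketch only handles non-negative values.

```java
import java.io.ByteArrayOutputStream;

/**
 * Hypothetical illustration (NOT part of the protobuf library) of how a
 * field number from a .proto file shows up in the binary encoding.
 */
final class FieldTagSketch
{
   /** Wire type that the protobuf encoding uses for varint scalars such as int32. */
   static final int WIRE_TYPE_VARINT = 0;

   /** Writes a non-negative int as a base-128 varint, low-order 7-bit group first. */
   static void writeVarint(final ByteArrayOutputStream out, int value)
   {
      while ((value & ~0x7F) != 0)
      {
         out.write((value & 0x7F) | 0x80);  // more groups follow: set the high bit
         value >>>= 7;
      }
      out.write(value);                     // final group: high bit clear
   }

   /** Encodes one int32 field as tag + value; this sketch rejects negative values. */
   static byte[] encodeInt32Field(final int fieldNumber, final int value)
   {
      if (fieldNumber < 1 || value < 0)
      {
         throw new IllegalArgumentException("Sketch handles positive field numbers and non-negative values only.");
      }
      final ByteArrayOutputStream out = new ByteArrayOutputStream();
      writeVarint(out, (fieldNumber << 3) | WIRE_TYPE_VARINT);
      writeVarint(out, value);
      return out.toByteArray();
   }

   public static void main(final String[] arguments)
   {
      final StringBuilder hex = new StringBuilder();
      for (final byte b : encodeInt32Field(3, 1985))
      {
         hex.append(String.format("%02x ", b & 0xFF));
      }
      System.out.println(hex.toString().trim());  // prints "18 c1 0f"
   }
}
```

Because the field number (not the field name) is what gets written, renaming release_year in album.proto would not change these bytes, while changing its number would.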

The album.proto file shown above now needs to be "compiled" into the Java source file (AlbumProtos.java in the dustin.examples.protobuf package) that will allow for writing and reading Protocol Buffers's binary format that corresponds to the defined protocol format. This Java source file is generated with the protoc compiler, which is included in the archive file for the appropriate operating system. In my case, because I'm running this example in Windows 10, I downloaded and unzipped protoc-3.5.1-win32.zip to get access to this protoc tool. I then ran protoc against album.proto with the command protoc --proto_path=src --java_out=dist\generated album.proto.

For running the above, I had my album.proto file in the src directory pointed to by --proto_path, and I had already created the (empty) output directory named by the --java_out flag to receive the generated Java source code.

The generated class's Java source code file AlbumProtos.java in the specified package has more than 1000 lines and I won't list that generated class source code here, but it's available on GitHub. Among the several interesting things to note about this generated code is the lack of import statements (fully qualified names are used instead for all class references). More details regarding the Java source code generated by protoc are available in the Java Generated Code guide. It's important to note that this generated class AlbumProtos has still not been influenced by any of my own Java application code and is solely generated from the album.proto text file shown earlier in the post.

With the generated Java source code available for AlbumProtos, I now add the directory in which this class was generated to my IDE's source path because I'm treating it as a source code file now. I could have alternatively compiled it into a .class or .jar to use as a library. With this generated Java source code file now in my source path, I can build it alongside my own code.

Before going further in this example, we need a simple Java class to represent with Protocol Buffers. For this, I'll use the class Album that is defined in the next code listing (also available on GitHub).

Album.java

package dustin.examples.protobuf;

import java.util.ArrayList;
import java.util.List;

/**
 * Music album.
 */
public class Album
{
   private final String title;

   private final List<String> artists;

   private final int releaseYear;

   private final List<String> songsTitles;

   private Album(final String newTitle, final List<String> newArtists,
                 final int newYear, final List<String> newSongsTitles)
   {
      title = newTitle;
      artists = newArtists;
      releaseYear = newYear;
      songsTitles = newSongsTitles;
   }

   public String getTitle()
   {
      return title;
   }

   public List<String> getArtists()
   {
      return artists;
   }

   public int getReleaseYear()
   {
      return releaseYear;
   }

   public List<String> getSongsTitles()
   {
      return songsTitles;
   }

   @Override
   public String toString()
   {
      return "'" + title + "' (" + releaseYear + ") by " + artists + " features songs " + songsTitles;
   }

   /**
    * Builder class for instantiating an instance of
    * enclosing Album class.
    */
   public static class Builder
   {
      private String title;
      private ArrayList<String> artists = new ArrayList<>();
      private int releaseYear;
      private ArrayList<String> songsTitles = new ArrayList<>();

      public Builder(final String newTitle, final int newReleaseYear)
      {
         title = newTitle;
         releaseYear = newReleaseYear;
      }

      public Builder songTitle(final String newSongTitle)
      {
         songsTitles.add(newSongTitle);
         return this;
      }

      public Builder songsTitles(final List<String> newSongsTitles)
      {
         songsTitles.addAll(newSongsTitles);
         return this;
      }

      public Builder artist(final String newArtist)
      {
         artists.add(newArtist);
         return this;
      }

      public Builder artists(final List<String> newArtists)
      {
         artists.addAll(newArtists);
         return this;
      }

      public Album build()
      {
         return new Album(title, artists, releaseYear, songsTitles);
      }
   }
}

With a Java "data" class defined (Album) and with a Protocol Buffers-generated Java class available for representing this album (AlbumProtos.java), I'm ready to write Java application code to "serialize" the album information without using Java serialization. This application (demonstration) code resides in the AlbumDemo class, which is available on GitHub and from which I'll highlight relevant portions in this post.

We need to generate a sample instance of Album to use in this example and this is accomplished with the next hard-coded listing.

Generating Sample Instance of Album

/**
 * Generates instance of Album to be used in demonstration.
 *
 * @return Instance of Album to be used in demonstration.
 */
public Album generateAlbum()
{
   return new Album.Builder("Songs from the Big Chair", 1985)
      .artist("Tears For Fears")
      .songTitle("Shout")
      .songTitle("The Working Hour")
      .songTitle("Everybody Wants to Rule the World")
      .songTitle("Mothers Talk")
      .songTitle("I Believe")
      .songTitle("Broken")
      .songTitle("Head Over Heels")
      .songTitle("Listen")
      .build();
}

The Protocol Buffers generated class AlbumProtos includes a nested AlbumProtos.Album class that I'll be using to store the contents of my Album instance in binary form. The next code listing demonstrates how this is done.

Instantiating AlbumProtos.Album from Album

final Album album = instance.generateAlbum();
final AlbumProtos.Album albumMessage
   = AlbumProtos.Album.newBuilder()
      .setTitle(album.getTitle())
      .addAllArtist(album.getArtists())
      .setReleaseYear(album.getReleaseYear())
      .addAllSongTitle(album.getSongsTitles())
      .build();

As the previous code listing demonstrates, a "builder" is used to populate the immutable instance of the class generated by Protocol Buffers. With a reference to this instance, I can now easily write the contents of the instance out in Protocol Buffers's binary form using the method toByteArray() on that instance as shown in the next code listing.

Writing Binary Form of AlbumProtos.Album

final byte[] binaryAlbum = albumMessage.toByteArray();
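Once the album is a plain byte[], it can be stored or transmitted like any other binary data. The following is a minimal sketch, assuming the bytes should be saved to a file and loaded again later; the class name and the placeholder bytes are hypothetical stand-ins (in the demo, the array would come from albumMessage.toByteArray()), and only standard java.nio.file calls are used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical helper (not from the original demo) that saves and loads
 * the binary form of an album using only the standard library.
 */
final class BinaryAlbumFileSketch
{
   /** Writes the provided bytes to the given file, replacing existing content. */
   static void save(final Path file, final byte[] binaryAlbum) throws IOException
   {
      Files.write(file, binaryAlbum);
   }

   /** Reads the entire file back into a byte array. */
   static byte[] load(final Path file) throws IOException
   {
      return Files.readAllBytes(file);
   }

   public static void main(final String[] arguments) throws IOException
   {
      // Placeholder bytes; in the demo these would be albumMessage.toByteArray().
      final byte[] binaryAlbum = {0x18, (byte) 0xC1, 0x0F};
      final Path file = Files.createTempFile("album", ".bin");
      try
      {
         save(file, binaryAlbum);
         final byte[] restored = load(file);
         System.out.println("Bytes written: " + binaryAlbum.length + ", bytes read: " + restored.length);
      }
      finally
      {
         Files.deleteIfExists(file);
      }
   }
}
```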

Reading a byte[] array back into an instance of Album can be accomplished as shown in the next code listing.

Instantiating Album from Binary Form of AlbumProtos.Album

/**
 * Generates an instance of Album based on the provided
 * bytes array.
 *
 * @param binaryAlbum Bytes array that should represent an
 *    AlbumProtos.Album based on Google Protocol Buffers
 *    binary format.
 * @return Instance of Album based on the provided binary form
 *    of an Album; may be {@code null} if an error is encountered
 *    while trying to process the provided binary data.
 */
public Album instantiateAlbumFromBinary(final byte[] binaryAlbum)
{
   Album album = null;
   try
   {
      final AlbumProtos.Album copiedAlbumProtos = AlbumProtos.Album.parseFrom(binaryAlbum);
      final List<String> copiedArtists = copiedAlbumProtos.getArtistList();
      final List<String> copiedSongsTitles = copiedAlbumProtos.getSongTitleList();
      album = new Album.Builder(
         copiedAlbumProtos.getTitle(), copiedAlbumProtos.getReleaseYear())
         .artists(copiedArtists)
         .songsTitles(copiedSongsTitles)
         .build();
   }
   catch (InvalidProtocolBufferException ipbe)
   {
      out.println("ERROR: Unable to instantiate AlbumProtos.Album instance from provided binary data - "
         + ipbe);
   }
   return album;
}

As indicated in the last code listing, a checked exception InvalidProtocolBufferException can be thrown during the invocation of the static method parseFrom(byte[]) defined in the generated class. Obtaining a "deserialized" instance of the generated class is essentially a single line; the rest of the lines get data out of that instance of the generated class and set that data in a new instance of the original Album class.

The demonstration class includes two lines that print out the contents of the original Album instance and the instance ultimately retrieved from the binary representation. These two lines include invocations of System.identityHashCode() on the two instances to prove that they are not the same instance even though their contents match. When this code is executed with the hard-coded Album instance details shown earlier, the output looks like this:

BEFORE Album (1323165413): 'Songs from the Big Chair' (1985) by [Tears For Fears] features songs [Shout, The Working Hour, Everybody Wants to Rule the World, Mothers Talk, I Believe, Broken, Head Over Heels, Listen]
 AFTER Album (1880587981): 'Songs from the Big Chair' (1985) by [Tears For Fears] features songs [Shout, The Working Hour, Everybody Wants to Rule the World, Mothers Talk, I Believe, Broken, Head Over Heels, Listen]

From this output, we see that the relevant fields are the same in both instances and that the two instances truly are distinct objects. This is a bit more work than using Java's "nearly automatic" serialization mechanism (implementing the Serializable interface), but there are important advantages associated with this approach that can justify the cost. In Effective Java, Third Edition, Josh Bloch discusses the security vulnerabilities associated with deserialization in Java's default mechanism and asserts that "There is no reason to use Java serialization in any new system you write."
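As a footnote to the identity check described above, the two print statements are not listed in this post. The following sketch shows one way such lines might look, using two lists as stand-ins for the BEFORE and AFTER Album instances; the class name, the describe method, and the sample data are hypothetical. Note that System.identityHashCode() values are not guaranteed to be unique, so differing values show two distinct objects, but a reference comparison with == is the definitive check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical stand-in for the demo's BEFORE/AFTER output lines, using
 * lists in place of Album instances so the sketch runs on its own.
 */
final class IdentitySketch
{
   /** Formats a label, the object's identity hash code, and its toString(). */
   static String describe(final String label, final Object object)
   {
      return label + " (" + System.identityHashCode(object) + "): " + object;
   }

   public static void main(final String[] arguments)
   {
      final List<String> before = new ArrayList<>(Arrays.asList("Shout", "Listen"));
      final List<String> after = new ArrayList<>(before);  // same contents, new object

      System.out.println(describe("BEFORE", before));
      System.out.println(describe(" AFTER", after));
      System.out.println("Same contents? " + before.equals(after));  // true
      System.out.println("Same instance? " + (before == after));     // false
   }
}
```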
