Wednesday, July 20, 2016

JAXB and Log4j XML Configuration Files

Both Log4j 1.x and Log4j 2.x support use of XML files to specify logging configuration. This post looks into some of the nuances and subtleties associated with using JAXB to work with these XML configuration files via Java classes. The examples in this post are based on Apache Log4j 1.2.17, Apache Log4j 2.6.2, and Java 1.8.0_73 with JAXB xjc 2.2.8-b130911.1802.

Log4j 1.x : log4j.dtd

Log4j 1.x's XML grammar is defined by a DTD instead of a W3C XML Schema. Fortunately, the JAXB implementation that comes with the JDK provides an "experimental, unsupported" option for using DTDs as the input from which Java classes are generated. The following command can be used to run the xjc command-line tool against log4j.dtd.

    xjc -p dustin.examples.l4j1 -d src -dtd log4j.dtd

The next screen snapshot demonstrates this.

Running the command described above and demonstrated in the screen snapshot leads to Java classes being generated in a Java package in the src directory called dustin.examples.l4j1 that allow for unmarshalling from log4j.dtd-compliant XML and for marshalling to log4j.dtd-compliant XML.

Log4j 2.x : Log4j-config.xsd

Log4j 2.x's XML configuration can be either "concise" or "strict". I use "strict" in this post because that is the form with a grammar defined by the W3C XML Schema file Log4j-config.xsd, and I need a schema to generate Java classes with JAXB. The following command can be run against this XML Schema to generate Java classes representing Log4j2 strict XML.

    xjc -p dustin.examples.l4j2 -d src Log4j-config.xsd -b l4j2.jxb

Running the above command leads to Java classes being generated in a Java package in the src directory called dustin.examples.l4j2 that allow for unmarshalling from Log4j-config.xsd-compliant XML and for marshalling to Log4j-config.xsd-compliant XML.

In the previous example, I included a JAXB binding file with the option -b followed by the name of the binding file (-b l4j2.jxb). This binding was needed to avoid an error that prevented xjc from generating Log4j 2.x-compliant Java classes with the error message, "Property "Value" is already defined. Use <jaxb:property> to resolve this conflict." This issue and how to resolve it are discussed in A Brit in Bermuda's post Property "Value" is already defined. Use <jaxb:property> to resolve this conflict. The source for the JAXB binding file I used here is shown next.

l4j2.jxb

<jxb:bindings version="2.0"
              xmlns:jxb="http://java.sun.com/xml/ns/jaxb"
              xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <jxb:bindings schemaLocation="Log4j-config.xsd" node="/xsd:schema">
      <jxb:bindings node="//xsd:complexType[@name='KeyValuePairType']">
         <jxb:bindings node=".//xsd:attribute[@name='value']">
            <jxb:property name="pairValue"/>
         </jxb:bindings>
      </jxb:bindings>
   </jxb:bindings>
</jxb:bindings>

The JAXB binding file just shown allows xjc to successfully parse the XSD and generate the Java classes. The one small price to pay (besides writing and referencing the binding file) is that the "value" attribute of the KeyValuePairType will need to be accessed in the Java class as a field named pairValue instead of value.

Unmarshalling Log4j 1.x XML

A potential use case for working with JAXB-generated classes for Log4j 1.x's log4j.dtd and Log4j 2.x's Log4j-config.xsd is conversion of Log4j 1.x XML configuration files to Log4j 2.x "strict" XML configuration files. In this situation, one would need to unmarshall Log4j 1.x log4j.dtd-compliant XML and marshall Log4j 2.x Log4j-config.xsd-compliant XML.

The following code listing demonstrates how the Log4j 1.x XML might be unmarshalled using the previously generated JAXB classes.

   /**
    * Extract the contents of the Log4j 1.x XML configuration file
    * with the provided path/name.
    *
    * @param log4j1XmlFileName Path/name of Log4j 1.x XML config file.
    * @return Contents of Log4j 1.x configuration file.
    * @throws RuntimeException Thrown if exception occurs that prevents
    *    extracting contents from XML with provided name.
    */
   public Log4JConfiguration readLog4j1Config(final String log4j1XmlFileName)
      throws RuntimeException
   {
      Log4JConfiguration config;
      try
      {
         final File inputFile = new File(log4j1XmlFileName);
         if (!inputFile.isFile())
         {
            throw new RuntimeException(log4j1XmlFileName + " is NOT a parseable file.");
         }

         // SAXParserFactory is not namespace-aware by default; parsing through
         // this XMLReader sidesteps log4j.dtd's namespace treatment.
         final SAXParserFactory spf = SAXParserFactory.newInstance();
         final SAXParser sp = spf.newSAXParser();
         final XMLReader xr = sp.getXMLReader();

         final JAXBContext jaxbContext = JAXBContext.newInstance("dustin.examples.l4j1");
         final Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
         final UnmarshallerHandler unmarshallerHandler = unmarshaller.getUnmarshallerHandler();
         xr.setContentHandler(unmarshallerHandler);

         // Close the input stream once parsing has completed.
         try (final FileInputStream xmlStream = new FileInputStream(inputFile))
         {
            xr.parse(new InputSource(xmlStream));
         }

         final Object unmarshalledObject = unmarshallerHandler.getResult();
         config = (Log4JConfiguration) unmarshalledObject;
      }
      catch (JAXBException | ParserConfigurationException | SAXException | IOException exception)
      {
         throw new RuntimeException(
            "Unable to read from file " + log4j1XmlFileName + " - " + exception,
            exception);
      }
      return config;
   }

Unmarshalling this Log4j 1.x XML was a bit trickier than some XML unmarshalling because of the nature of log4j.dtd's namespace treatment. The approach for dealing with this wrinkle is described in Gik's Jaxb UnMarshall without namespace and in Deepa S's How to instruct JAXB to ignore Namespaces. Using this approach helped avoid the error message:

UnmarshalException: unexpected element (uri:"http://jakarta.apache.org/log4j/", local:"configuration"). Expected elements ...
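The difference that namespace awareness makes can be seen without JAXB at all. The following is a minimal sketch (the class name, method name, and the one-element sample XML are my own and not part of the original example) that parses a Log4j 1.x-style root element with both a default, namespace-unaware SAX parser and a namespace-aware one, and reports what each passes to its ContentHandler. Only the namespace-aware parser hands over the http://jakarta.apache.org/log4j/ URI that appears in the error message above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Shows what a ContentHandler is told about a Log4j 1.x-style root element. */
final class NamespaceAwarenessDemo
{
   /** Hypothetical stand-in for a Log4j 1.x configuration file (no DOCTYPE, no children). */
   static final String SAMPLE_XML =
      "<log4j:configuration xmlns:log4j=\"http://jakarta.apache.org/log4j/\"/>";

   /**
    * Parse SAMPLE_XML and describe its root element as "{uri}localName (qName)".
    *
    * @param namespaceAware Whether the SAX parser should be namespace-aware.
    * @return Description of the root element as reported to the handler.
    */
   static String describeRoot(final boolean namespaceAware) throws Exception
   {
      final SAXParserFactory spf = SAXParserFactory.newInstance();
      spf.setNamespaceAware(namespaceAware);  // factory default is false
      final XMLReader reader = spf.newSAXParser().getXMLReader();
      final StringBuilder seen = new StringBuilder();
      reader.setContentHandler(new DefaultHandler()
      {
         @Override
         public void startElement(
            final String uri, final String localName,
            final String qName, final Attributes attributes)
         {
            seen.append('{').append(uri).append('}')
                .append(localName).append(" (").append(qName).append(')');
         }
      });
      reader.parse(new InputSource(new StringReader(SAMPLE_XML)));
      return seen.toString();
   }

   public static void main(final String[] arguments) throws Exception
   {
      System.out.println("Namespace-unaware: " + describeRoot(false));
      System.out.println("Namespace-aware:   " + describeRoot(true));
   }
}
```

In both cases the qualified name log4j:configuration is reported, which is presumably what allows the DTD-derived classes to match the element when the namespace URI is not supplied.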

To unmarshall the Log4j 1.x XML that in my case references log4j.dtd on the filesystem, I needed to provide a special Java system property to the Java launcher when running this code with Java 8. Specifically, I needed to specify
     -Djavax.xml.accessExternalDTD=all
to avoid the error message, "Failed to read external DTD because 'file' access is not allowed due to restriction set by the accessExternalDTD property." Additional details on this can be found at NetBeans's FaqWSDLExternalSchema Wiki page.
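The same restriction can also be set on an individual parser through the JAXP property named by XMLConstants.ACCESS_EXTERNAL_DTD rather than JVM-wide with -D. The sketch below is an illustration under stated assumptions and is not the approach used for the listing above: the class, the method names, and the sample.dtd/sample.xml files are hypothetical, and I have not verified that setting the property this way is sufficient when JAXB's UnmarshallerHandler is attached. It simply shows a plain SAX parser rejecting a filesystem DTD when no protocols are allowed and accepting it when "file" is allowed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Demonstrates per-parser control over access to external DTDs. */
final class ExternalDtdAccessDemo
{
   /**
    * Try to parse the XML file while allowing external DTDs to be loaded
    * only via the listed protocols ("" allows none; "file" and "all" are
    * other accepted values).
    *
    * @return true if parsing succeeded; false if the parser rejected the input.
    */
   static boolean parsesWithDtdAccess(final Path xmlFile, final String allowedProtocols)
      throws ParserConfigurationException, IOException
   {
      try
      {
         final SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
         parser.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, allowedProtocols);
         parser.parse(xmlFile.toFile(), new DefaultHandler());
         return true;
      }
      catch (SAXException saxEx)
      {
         System.err.println("Rejected: " + saxEx.getMessage());
         return false;
      }
   }

   /** Write a tiny DTD and an XML file that references it (made-up names). */
   static Path writeSampleFiles() throws IOException
   {
      final Path directory = Files.createTempDirectory("dtd-access-demo");
      Files.write(
         directory.resolve("sample.dtd"),
         "<!ELEMENT sample EMPTY>".getBytes(StandardCharsets.UTF_8));
      final Path xmlFile = directory.resolve("sample.xml");
      Files.write(
         xmlFile,
         "<!DOCTYPE sample SYSTEM \"sample.dtd\"><sample/>".getBytes(StandardCharsets.UTF_8));
      return xmlFile;
   }

   public static void main(final String[] arguments) throws Exception
   {
      final Path xmlFile = writeSampleFiles();
      System.out.println("No protocols allowed: " + parsesWithDtdAccess(xmlFile, ""));
      System.out.println("'file' allowed:       " + parsesWithDtdAccess(xmlFile, "file"));
   }
}
```

Allowing only "file" is narrower than the "all" value used with the system property above, which may be preferable when the DTD is known to live on the local filesystem.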

Marshalling Log4j 2.x XML

Marshalling Log4j 2.x XML using the JAXB-generated Java classes is fairly straightforward as demonstrated in the following example code:

   /**
    * Write Log4j 2.x "strict" XML configuration to file with
    * provided name based on provided content.
    *
    * @param log4j2Configuration Content to be written to Log4j 2.x
    *    XML configuration file.
    * @param log4j2XmlFile File to which Log4j 2.x "strict" XML
    *    configuration should be written.
    */
   public void writeStrictLog4j2Config(
      final ConfigurationType log4j2Configuration,
      final String log4j2XmlFile)
   {
      try (final OutputStream os = new FileOutputStream(log4j2XmlFile))
      {
         final JAXBContext jc = JAXBContext.newInstance("dustin.examples.l4j2");
         final Marshaller marshaller = jc.createMarshaller();
         marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
         marshaller.marshal(new ObjectFactory().createConfiguration(log4j2Configuration), os);
      }
      catch (JAXBException | IOException exception)
      {
         throw new RuntimeException(
            "Unable to write Log4j 2.x XML configuration - " + exception,
            exception);
      }
   }

There is one subtlety in this marshalling case that may not be obvious in the just-shown code listing. The classes that JAXB's xjc generated from the Log4j-config.xsd lack any class with @XmlRootElement. The JAXB classes that were generated from the Log4j 1.x log4j.dtd did include classes with this @XmlRootElement annotation. Because the Log4j 2.x Log4j-config.xsd-based Java classes don't have this annotation, the following error occurs when trying to marshal the ConfigurationType instance directly:

MarshalException - with linked exception: [com.sun.istack.internal.SAXException2: unable to marshal type "dustin.examples.l4j2.ConfigurationType" as an element because it is missing an @XmlRootElement annotation]

To avoid this error, I instead marshalled (in the marshaller.marshal call of the above code listing) the result of invoking new ObjectFactory().createConfiguration(ConfigurationType) on the passed-in ConfigurationType instance, and it is now successfully marshalled.

Conclusion

JAXB can be used to generate Java classes from Log4j 1.x's log4j.dtd and from Log4j 2.x's Log4j-config.xsd, but there are some subtleties and nuances associated with this process to successfully generate these Java classes and to use the generated Java classes to marshal and unmarshal XML.

Friday, July 15, 2016

Apache PDFBox Command-line Tools: No Java Coding Required

In the blog post Apache PDFBox 2, I demonstrated use of Apache PDFBox 2 as a library called from within Java code to manipulate PDFs. It turns out that Apache PDFBox 2 also provides command-line tools that can be used directly from the command-line as-is with no additional Java coding required. There are several command-line tools available and I will demonstrate some of them in this post.

The PDFBox command-line tools are executed by taking advantage of PDFBox's executable JAR (java -jar with Main-Class: org.apache.pdfbox.tools.PDFBox). This is the JAR with "app" in its name and, for this particular blog post, is pdfbox-app-2.0.2.jar. The general format used to invoke these tools is java -jar pdfbox-app-2.0.2.jar <Command> [options] [files].

When the executable JAR is executed without arguments, a form of help is provided that lists the available commands. This is shown in the next screen snapshot.

This screen snapshot shows that this version of Apache PDFBox (2.0.2) advertises support for the "Possible commands" of ConvertColorspace, Decrypt, Encrypt, ExtractText, ExtractImages, OverlayPDF, PrintPDF, PDFDebugger, PDFMerger, PDFReader, PDFSplit, PDFToImage, TextToPDF, and WriteDecodedDoc.

Extracting Text: "ExtractText"

The first command-line tool I am looking at is extracting text from a PDF. I demonstrated using PDFBox to do this from Java code in my previous blog post. Here, I will use PDFBox to do the same thing directly from the command-line with no Java source code in sight. The following operation extracts the text from the PDF Scala by Example. In my previous post, the Java code accessed this PDF online and used PDFBox to extract text from it. In this case, I've downloaded Scala by Example and am running the PDFBox ExtractText command-line tool against that downloaded PDF stored on my hard drive at C:\pdf\ScalaByExample.pdf.

The command to extract text from the PDF from the command-line using PDFBox is: java -jar pdfbox-app-2.0.2.jar ExtractText C:\pdf\ScalaByExample.pdf. The next two screen snapshots demonstrate running this command and the file it generates. From these screen snapshots, we can see that the text file generated by this command by default has the same name as the source PDF but with a .txt extension. This command supports multiple options including the ability to specify the name of the text file by placing that name after the source PDF's file name and the ability to write the text to the console instead of to a file via the -console flag (from which the output could be redirected). Examples of how to specify a custom text file name and how to direct text to console instead of file are shown next.

  • Explicitly Specifying Text File Name:
    • java -jar pdfbox-app-2.0.2.jar ExtractText C:\pdf\ScalaByExample.pdf C:\pdf\dustin.txt
  • Rendering Text on Console
    • java -jar pdfbox-app-2.0.2.jar ExtractText -console C:\pdf\ScalaByExample.pdf
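Because these tools follow the general java -jar pdfbox-app-2.0.2.jar <Command> [options] [files] format, they can also be launched from other automation. The following is a small sketch of doing so with the JDK's ProcessBuilder; the class and method names are my own, and the JAR is assumed (not guaranteed) to be in the working directory, so the command is only launched if that file is actually present.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Builds and optionally runs a PDFBox command-line tool invocation. */
final class PdfBoxCommandLine
{
   /** Build: java -jar appJar command [options] [files] */
   static List<String> buildCommand(
      final String appJar, final String command, final String... optionsAndFiles)
   {
      final List<String> commandLine =
         new ArrayList<>(Arrays.asList("java", "-jar", appJar, command));
      commandLine.addAll(Arrays.asList(optionsAndFiles));
      return commandLine;
   }

   /** Run the command, sharing this process's console, and return its exit code. */
   static int run(final List<String> commandLine) throws IOException, InterruptedException
   {
      return new ProcessBuilder(commandLine).inheritIO().start().waitFor();
   }

   public static void main(final String[] arguments) throws Exception
   {
      final String appJar = "pdfbox-app-2.0.2.jar";  // assumed location
      final List<String> extractText =
         buildCommand(appJar, "ExtractText", "-console", "C:\\pdf\\ScalaByExample.pdf");
      System.out.println(String.join(" ", extractText));
      if (Files.exists(Paths.get(appJar)))
      {
         System.out.println("Exit code: " + run(extractText));
      }
   }
}
```

Passing the arguments as a list rather than as one concatenated string avoids quoting problems when a PDF's path contains spaces.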

PDF from Text: "TextToPDF"

When it is desirable to go the other way (start with text as the source and generate a PDF), the command TextToPDF is appropriate. To demonstrate this, I'm using a source text file called doi.txt that contains a portion of the United States Declaration of Independence:

The unanimous Declaration of the thirteen united States of America,

When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness

With a sample text file in place at C:\pdf\doi.txt, PDFBox's TextToPDF can be run against it. The command is java -jar pdfbox-app-2.0.2.jar TextToPDF C:\pdf\doi.pdf C:\pdf\doi.txt (note that the target PDF is listed as the first argument and the source text file is listed as the second argument). The next three screen snapshots demonstrate running this command and the successful generation of a PDF from the source text file.

Extracting Images from PDFs: "ExtractImages"

The PDFBox command-line tool ExtractImages makes it as easy to extract images from a PDF as the command-line tool "ExtractText" made it to extract text from a PDF. My demonstration of this capability will extract four images from a PDF I created with images from the Black Hills (and surrounding area) of South Dakota that is called BlackHillsSouthDakotaAndSurroundingSights.pdf. A screen snapshot of this PDF is shown next.

PDFBox can be used to extract the four photographs in this PDF with the command java -jar pdfbox-app-2.0.2.jar ExtractImages C:\pdf\BlackHillsSouthDakotaAndSurroundingSights.pdf as demonstrated in the next screen snapshot.

Running this command as shown in the last screen snapshot extracts the four images from the PDF. Each extracted image is named after the source PDF with a hyphen and counting integer appended to the end of the name. The generated images are also JPEG files with .jpg extensions. In this case, the names of the generated files are thus BlackHillsSouthDakotaAndSurroundingSights-1.jpg, BlackHillsSouthDakotaAndSurroundingSights-2.jpg, BlackHillsSouthDakotaAndSurroundingSights-3.jpg, and BlackHillsSouthDakotaAndSurroundingSights-4.jpg and each is displayed next in the form extracted directly from the PDF.
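A script that post-processes the extracted images needs to know which files to look for. The following sketch merely encodes the naming pattern observed in this example (source PDF name, hyphen, counting integer, .jpg extension). It is not PDFBox's own logic, the class and method names are hypothetical, and it assumes JPEG output as seen here, so images in other formats may well be named with other extensions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Predicts extracted image file names using the pattern observed in this post. */
final class ExtractedImageNames
{
   /**
    * @param pdfFileName Name of the source PDF (with or without .pdf extension).
    * @param imageCount Number of images expected to be extracted.
    * @return Names of the form base-1.jpg, base-2.jpg, and so on.
    */
   static List<String> expectedJpegNames(final String pdfFileName, final int imageCount)
   {
      final boolean hasPdfExtension =
         pdfFileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
      final String baseName = hasPdfExtension
         ? pdfFileName.substring(0, pdfFileName.length() - ".pdf".length())
         : pdfFileName;
      final List<String> names = new ArrayList<>();
      for (int index = 1; index <= imageCount; index++)
      {
         names.add(baseName + "-" + index + ".jpg");
      }
      return names;
   }

   public static void main(final String[] arguments)
   {
      expectedJpegNames("BlackHillsSouthDakotaAndSurroundingSights.pdf", 4)
         .forEach(System.out::println);
   }
}
```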

BlackHillsSouthDakotaAndSurroundingSights-1.jpg BlackHillsSouthDakotaAndSurroundingSights-2.jpg
BlackHillsSouthDakotaAndSurroundingSights-3.jpg BlackHillsSouthDakotaAndSurroundingSights-4.jpg

Encrypting PDF: "Encrypt"

Apache PDFBox makes it easy to encrypt a PDF. For example, I can encrypt the PDF used in the "ExtractImages" example with the following command: java -jar pdfbox-app-2.0.2.jar Encrypt -O DustinWasHere -U DustinWasHere C:\pdf\BlackHillsSouthDakotaAndSurroundingSights.pdf as shown in the next screen snapshot:

Once I've run the encrypt command, I need a password to open this PDF in Adobe Reader:

Decrypting PDF: "Decrypt"

It's just as easy to decrypt this PDF with the command java -jar pdfbox-app-2.0.2.jar Decrypt -password DustinWasHere C:\pdf\BlackHillsSouthDakotaAndSurroundingSights.pdf as shown in the next screen snapshot. The image demonstrates that an InvalidPasswordException is thrown when no password is provided (or the wrong password is provided) for decrypting the PDF. It then shows a successful decryption, after which I'm once again able to open the PDF in Adobe Reader without a password.

Merging PDFs: "PDFMerger"

PDFBox allows multiple PDFs to be merged into a single PDF with the "PDFMerger" command. This is demonstrated in the next screen snapshots by merging the two single-page PDFs mentioned earlier (doi.pdf and BlackHillsSouthDakotaAndSurroundingSights.pdf) into a new PDF called third.pdf with the command java -jar pdfbox-app-2.0.2.jar PDFMerger C:\pdf\doi.pdf C:\pdf\BlackHillsSouthDakotaAndSurroundingSights.pdf C:\pdf\third.pdf.

Splitting PDFs: "PDFSplit"

I can split the third.pdf PDF just created with PDFMerger with the command PDFSplit. This is a particularly simple case because the PDF being split is only two pages. The command is demonstrated with the next screen snapshots.

The snapshots demonstrate that the PDFs split out of third.pdf are called third-1.pdf and third-2.pdf.

Conclusion

In this post, I showed several of the command-line utilities that Apache PDFBox makes available out-of-the-box with no Java coding required. There are a few other command-line utilities available that were not demonstrated here. All of these commands are easily used by running the executable "app" JAR provided with a PDFBox distribution. As command-line utilities, these tools enjoy the advantages of command-line tools including being quick to run and able to be included within scripts and other automated tools. Another benefit of these tools is that, because they are implemented in open source, developers can use the source code for these tools to see how to use the PDFBox APIs in their own applications and tools. Apache PDFBox's command-line tools are freely available and easy-to-use PDF manipulation tools that can be used with no extra Java code being written.

Monday, July 4, 2016

Apache PDFBox 2

Apache PDFBox 2 was released earlier this year and Apache PDFBox 2.0.1 and Apache PDFBox 2.0.2 have since been released. Apache PDFBox is open source (Apache License Version 2) and Java-based (and so is easy to use with a wide variety of programming languages including Java, Groovy, Scala, Clojure, Kotlin, and Ceylon). Apache PDFBox can be used by any of these or other JVM-based languages to read, write, and work with PDF documents.

Apache PDFBox 2 introduces numerous bug fixes in addition to completed tasks and some new features. Apache PDFBox 2 now requires Java SE 6 (J2SE 5 was minimum for Apache PDFBox 1.x). There is a migration guide, Migration to PDFBox 2.0.0, that details many differences between PDFBox 1.8 and PDFBox 2.0, including updated dependencies (Bouncy Castle 1.53 and Apache Commons Logging 1.2) and "breaking changes to the library" in PDFBox 2.

PDFBox can be used to create PDFs. The next code listing is adapted from the Apache PDFBox 1.8 example "Create a blank PDF" in the Document Creation "Cookbook" examples. The referenced example explicitly closes the instantiated PDDocument and probably does so for the benefit of those using a version of Java before JDK 7. For users of Java 7 or later, however, try-with-resources is a better option for ensuring that the PDDocument instance is closed, and it is supported because PDDocument implements AutoCloseable.

Creating (Empty) PDF
/**
 * Demonstrate creation of an empty PDF.
 */
private void createEmptyDocument()
{
   try (final PDDocument document = new PDDocument())
   {
      final PDPage emptyPage = new PDPage();
      document.addPage(emptyPage);
      document.save("EmptyPage.pdf");
   }
   catch (IOException ioEx)
   {
      err.println(
         "Exception while trying to create blank document - " + ioEx);
   }
}

The next code listing is adapted from the Apache PDFBox 1.8 example "Hello World using a PDF base font" in the Document Creation "Cookbook" examples. The most significant change in this listing from that 1.8 Cookbook example is the replacement of deprecated methods PDPageContentStream.moveTextPositionByAmount(float, float) and PDPageContentStream.drawString(String) with PDPageContentStream.newLineAtOffset(float, float) and PDPageContentStream.showText(String) respectively.

Creating Simple PDF with Font
/**
 * Create simple, single-page PDF "Hello" document.
 */
private void createHelloDocument()
{
   final PDPage singlePage = new PDPage();
   final PDFont courierBoldFont = PDType1Font.COURIER_BOLD;
   final int fontSize = 12;
   try (final PDDocument document = new PDDocument())
   {
      document.addPage(singlePage);
      final PDPageContentStream contentStream = new PDPageContentStream(document, singlePage);
      contentStream.beginText();
      contentStream.setFont(courierBoldFont, fontSize);
      contentStream.newLineAtOffset(150, 750);
      contentStream.showText("Hello PDFBox");
      contentStream.endText();
      contentStream.close();  // Stream must be closed before saving document.

      document.save("HelloPDFBox.pdf");
   }
   catch (IOException ioEx)
   {
      err.println(
         "Exception while trying to create simple document - " + ioEx);
   }
}

The next code listing demonstrates parsing text from a PDF using Apache PDFBox. This extremely simple implementation parses all of the text into a single String using PDFTextStripper.getText(PDDocument). In most realistic situations, I'd not want all the text from the PDF in a single String and would likely use PDFTextStripper's ability to more narrowly specify which text to parse. It's also worth noting that while this code listing gets the PDF from online (Scala by Example PDF at http://www.scala-lang.org/docu/files/ScalaByExample.pdf), there are numerous constructors for PDDocument that allow one to access PDFs on file systems and via other types of streams.

Parsing Text from Online PDF

/**
 * Parse text from an online PDF.
 */
private void parseOnlinePdfText()
{
   final String address = "http://www.scala-lang.org/docu/files/ScalaByExample.pdf";
   // Both the stream and the document are closed automatically.
   try (final InputStream pdfStream = new URL(address).openStream();
        final PDDocument documentToBeParsed = PDDocument.load(pdfStream))
   {
      final PDFTextStripper stripper = new PDFTextStripper();
      final String pdfText = stripper.getText(documentToBeParsed);
      out.println("Parsed text size is " + pdfText.length() + " characters:");
      out.println(pdfText);
   }
   catch (IOException ioEx)
   {
      err.println(
         "Exception while trying to parse text from PDF at " + address + " - " + ioEx);
   }
}

The JDK 8 Issue

PDFBox 2 exposes an issue in JDK 8 that is filed under Bug JDK-8041125 ("ColorConvertOp filter much slower in JDK 8 compared to JDK7"). The Apache PDFBox "Getting Started" documentation describes the issue, "Due to the change of the java color management module towards 'LittleCMS', users can experience slow performance in color operations." This same "Getting Started" section provides the work-around: "disable LittleCMS in favour of the old KCMS (Kodak Color Management System)."

The bug appears to have been identified and filed by IDR Solutions in conjunction with their commercial Java PDF library JPedal. Their blog post Major change to Color performance in newer Java releases provides more details related to this issue.

The just-mentioned posts and documentation, including Apache PDFBox 2's "Getting Started" section, explicitly demonstrate use of Java system properties to work around the issue by specifying use of KCMS (which could be removed at any time) instead of the default LittleCMS. As these sources state, one can either provide the system property to the Java launcher [java] with the -D option [-Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider] or specify the property within the executable code itself [System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");].
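Given that KCMS could be removed at any time, one defensive variation on the in-code approach is to request KCMS only when its provider class can actually be found in the running JRE. The following is a sketch of that idea rather than anything prescribed by the referenced sources: the class and method names are my own, and the check can only tell whether the class exists, not whether a given JDK still honors the property.

```java
/** Requests the KCMS color management module only if it appears to be present. */
final class ColorManagementSetup
{
   static final String CMM_PROPERTY = "sun.java2d.cmm";
   static final String KCMS_PROVIDER = "sun.java2d.cmm.kcms.KcmsServiceProvider";

   /**
    * Set the sun.java2d.cmm system property to the KCMS provider if that
    * class can be located; otherwise leave the property untouched.
    *
    * @return true if the property was set; false if KCMS was not found.
    */
   static boolean preferKcmsIfAvailable()
   {
      try
      {
         // Look the class up without initializing it.
         Class.forName(KCMS_PROVIDER, false, ClassLoader.getSystemClassLoader());
      }
      catch (ClassNotFoundException | LinkageError unavailable)
      {
         return false;
      }
      System.setProperty(CMM_PROPERTY, KCMS_PROVIDER);
      return true;
   }

   public static void main(final String[] arguments)
   {
      // Intended to run before any Java 2D or PDFBox color operations.
      System.out.println("KCMS requested: " + preferKcmsIfAvailable());
      System.out.println(CMM_PROPERTY + " = " + System.getProperty(CMM_PROPERTY));
   }
}
```

When the provider class is absent, the default color management module stays in effect and the application keeps working, albeit with the slower color operations described above.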

It sounds like this issue is not exclusive to version 2 of Apache PDFBox, but is more commonly seen with Apache PDFBox 2 because version 2 uses dependent constructs more frequently and because it's more likely that someone using Java 8 is also using the newer PDFBox.

The change in JDK 8 of the default implementation associated with property sun.java2d.cmm demonstrates a point I tried to make in my recent blog post Observations From A History of Java Backwards Incompatibility. In that post, I concluded, "Beware of and use only with caution any APIs, classes, and tools advertised as experimental or subject to removal in future releases of Java." It turns out that the Java 2D system properties are in this class. The System Properties for Java 2D Technology page provides this background and warning information regarding use of these properties:

This document describes several unsupported properties that you can use to customize how the 2D painting system operates. You might use these properties to improve performance, fix incorrect rendering, or avoid system crashes under certain configurations. ... Warning: Take care when using these properties. Some of them are unsupported for very practical reasons. ... Since these properties have the sole purpose of enabling or disabling implementation-specific behaviors, they are subject to change or removal without notification. Some properties might work only on the exact product releases for which they are documented.

Conclusion

Apache PDFBox 2 is a relatively easy way to manipulate PDF documents in Java. Its liberal Apache 2 license makes it amenable to a very large audience and its open source nature allows developers to see how to use the libraries it uses underneath the covers and adapt it as needed.

Additional Resources