Saturday, April 18, 2015

The JDK 8 SummaryStatistics Classes

Three of the new classes introduced in JDK 8 are DoubleSummaryStatistics, IntSummaryStatistics, and LongSummaryStatistics of the java.util package. These classes make quick and easy work of calculating the count, minimum, maximum, average, and sum of the elements in a collection of doubles, integers, or longs. The class-level Javadoc for each class begins with the same single sentence, which succinctly describes the class as "A state object for collecting statistics such as count, min, max, sum, and average."

The class-level Javadoc for each of these three classes also states, "This class is designed to work with (though does not require) streams." The most obvious reason for including these three SummaryStatistics classes is so that they can be used with the streams that were also introduced with JDK 8.

Indeed, the class-level Javadoc for each of the three classes also provides an example of using the class in conjunction with a stream of the corresponding data type. These examples demonstrate invoking the respective stream's three-argument collect method (a mutable reduction terminal stream operation) and passing each SummaryStatistics class's constructor, accept method, and combine method (as method references) to this collect method as its "supplier", "accumulator", and "combiner" arguments respectively. On a general Stream the signature is collect(Supplier, BiConsumer, BiConsumer); the primitive streams use a specialized accumulator type instead (ObjIntConsumer, ObjLongConsumer, or ObjDoubleConsumer).
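
The following is a minimal sketch of that Javadoc-style usage applied to an IntStream. The class name CollectSummaryStatisticsSketch, the method name summarize, and the sample values are hypothetical and chosen only for illustration.

```java
import java.util.IntSummaryStatistics;
import java.util.stream.IntStream;

class CollectSummaryStatisticsSketch
{
   /**
    * Collects statistics for the provided ints using the
    * three-argument collect method of IntStream.
    */
   static IntSummaryStatistics summarize(final int... values)
   {
      return IntStream.of(values).collect(
         IntSummaryStatistics::new,       // supplier: creates the empty state object
         IntSummaryStatistics::accept,    // accumulator: records a single value
         IntSummaryStatistics::combine);  // combiner: merges two partial results
   }

   public static void main(final String[] arguments)
   {
      // Hypothetical sample values.
      System.out.println(summarize(3, 1, 4, 1, 5));
   }
}
```

The combiner only comes into play when partial results need to be merged (such as with a parallel stream), but the collect method requires it in either case.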

The rest of this post demonstrates use of IntSummaryStatistics, LongSummaryStatistics, and DoubleSummaryStatistics. Several of these examples reference a map of each season of The X-Files television series to the Nielsen rating for that season's premiere. This map is shown in the next code listing.

Declaring and Initializing xFilesSeasonPremierRatings
/**
 * Maps the number of each X-Files season to the Nielsen rating
 * (millions of viewers) for the premiere episode of that season.
 */
private final static Map<Integer, Double> xFilesSeasonPremierRatings;

static
{
   final Map<Integer, Double> temporary = new HashMap<>();
   temporary.put(1, 12.0);
   temporary.put(2, 16.1);
   temporary.put(3, 19.94);
   temporary.put(4, 21.11);
   temporary.put(5, 27.34);
   temporary.put(6, 20.24);
   temporary.put(7, 17.82);
   temporary.put(8, 15.87);
   temporary.put(9, 10.6);
   xFilesSeasonPremierRatings = Collections.unmodifiableMap(temporary);
}

The next code listing uses the map created in the previous code listing, demonstrates applying DoubleSummaryStatistics to a stream of the "values" portion of the map, and is very similar to the examples provided in the Javadoc for the three SummaryStatistics classes. The DoubleSummaryStatistics class, the IntSummaryStatistics class, and the LongSummaryStatistics class have essentially the same fields, methods, and APIs (the only differences being the supported data types). Therefore, even though this and many of this post's examples specifically use DoubleSummaryStatistics (because the X-Files's Nielsen ratings are doubles), the principles apply to the other two (integral) types of SummaryStatistics classes.

Using DoubleSummaryStatistics with a Collection-based Stream
/**
 * Demonstrate use of DoubleSummaryStatistics collected from a
 * Collection Stream via use of DoubleSummaryStatistics method
 * references "new", "accept", and "combine".
 */
private static void demonstrateDoubleSummaryStatisticsOnCollectionStream()
{
   final DoubleSummaryStatistics doubleSummaryStatistics =
      xFilesSeasonPremierRatings.values().stream().collect(
         DoubleSummaryStatistics::new,
         DoubleSummaryStatistics::accept,
         DoubleSummaryStatistics::combine);
   out.println("X-Files Season Premieres: " + doubleSummaryStatistics);
}

The output from running the above demonstration is shown next:

X-Files Season Premieres: DoubleSummaryStatistics{count=9, sum=161.020000, min=10.600000, average=17.891111, max=27.340000}

The previous example applied the SummaryStatistics class to a stream based directly on a collection (the "values" portion of a Map). The next code listing demonstrates a similar example, but uses an IntSummaryStatistics and uses a stream's intermediate map operation to specify which Function to invoke on the collection's objects for populating the SummaryStatistics object. In this case, the collection being acted upon is a Set<Movie> as returned by the Java8StreamsMoviesDemo.getMoviesSample() method and spelled out in my blog post Stream-Powered Collections Functionality in JDK 8.

Using IntSummaryStatistics with Stream's map(Function)
/**
 * Demonstrate collecting IntSummaryStatistics via mapping of
 * certain method calls on objects within a collection and using
 * lambda expressions (method references in particular).
 */
private static void demonstrateIntSummaryStatisticsWithMethodReference()
{
   final Set<Movie> movies = Java8StreamsMoviesDemo.getMoviesSample();
   IntSummaryStatistics intSummaryStatistics =
      movies.stream().map(Movie::getImdbTopRating).collect(
         IntSummaryStatistics::new, IntSummaryStatistics::accept, IntSummaryStatistics::combine);
   out.println("IntSummaryStatistics on IMDB Top Rated Movies: " + intSummaryStatistics);
}

When the demonstration above is executed, its output looks like this:

IntSummaryStatistics on IMDB Top Rated Movies: IntSummaryStatistics{count=5, sum=106, min=1, average=21.200000, max=49}
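
Because the Movie class is not reproduced in this post, here is a self-contained sketch of the same map-then-collect idea that uses String::length as the mapping Function instead. The class name, method names, and sample words are hypothetical. The second method shows what I believe is an equivalent, shorter route: mapToInt followed by IntStream's summaryStatistics() method.

```java
import java.util.Arrays;
import java.util.IntSummaryStatistics;
import java.util.List;

class MappedSummaryStatisticsSketch
{
   /** Maps each word to its length and collects the lengths' statistics. */
   static IntSummaryStatistics summarizeLengths(final List<String> words)
   {
      return words.stream().map(String::length).collect(
         IntSummaryStatistics::new,
         IntSummaryStatistics::accept,
         IntSummaryStatistics::combine);
   }

   /** Same statistics, but via a primitive IntStream to avoid boxing. */
   static IntSummaryStatistics summarizeLengthsViaIntStream(final List<String> words)
   {
      return words.stream().mapToInt(String::length).summaryStatistics();
   }

   public static void main(final String[] arguments)
   {
      // Hypothetical sample data.
      final List<String> words = Arrays.asList("Pilot", "Squeeze", "Ice");
      System.out.println(summarizeLengths(words));
      System.out.println(summarizeLengthsViaIntStream(words));
   }
}
```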

The examples so far have demonstrated using the SummaryStatistics classes in their most common use case (in conjunction with data from streams based on existing collections). The next example demonstrates how a DoubleStream can be instantiated from scratch via use of DoubleStream.Builder and then the DoubleStream's summaryStatistics() method can be called to get an instance of DoubleSummaryStatistics.

Obtaining Instance of DoubleSummaryStatistics from DoubleStream
/**
 * Uses DoubleStream.builder to build an arbitrary DoubleStream.
 *
 * @return DoubleStream constructed with hard-coded doubles using
 *    a DoubleStream.builder.
 */
private static DoubleStream createSampleOfArbitraryDoubles()
{
   return DoubleStream.builder().add(12.4).add(13.6).add(9.7).add(24.5).add(10.2).add(3.0).build();
}

/**
 * Demonstrate use of an instance of DoubleSummaryStatistics
 * provided by DoubleStream.summaryStatistics().
 */
private static void demonstrateDoubleSummaryStatisticsOnDoubleStream()
{
   final DoubleSummaryStatistics doubleSummaryStatistics =
      createSampleOfArbitraryDoubles().summaryStatistics();
   out.println("'Arbitrary' Double Statistics: " + doubleSummaryStatistics);
}

The just-listed code produces this output:

'Arbitrary' Double Statistics: DoubleSummaryStatistics{count=6, sum=73.400000, min=3.000000, average=12.233333, max=24.500000}

Of course, similarly to the example just shown, IntStream and IntStream.Builder can provide an instance of IntSummaryStatistics and LongStream and LongStream.Builder can provide an instance of LongSummaryStatistics.
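
A brief sketch of those two integral counterparts might look like the following. The class name, method names, and hard-coded values are hypothetical.

```java
import java.util.IntSummaryStatistics;
import java.util.LongSummaryStatistics;
import java.util.stream.IntStream;
import java.util.stream.LongStream;

class PrimitiveStreamStatisticsSketch
{
   /** Builds an arbitrary IntStream and requests its statistics. */
   static IntSummaryStatistics sampleIntStatistics()
   {
      return IntStream.builder().add(7).add(2).add(9).build().summaryStatistics();
   }

   /** Builds an arbitrary LongStream and requests its statistics. */
   static LongSummaryStatistics sampleLongStatistics()
   {
      return LongStream.builder().add(10L).add(30L).build().summaryStatistics();
   }

   public static void main(final String[] arguments)
   {
      System.out.println("Int Statistics: " + sampleIntStatistics());
      System.out.println("Long Statistics: " + sampleLongStatistics());
   }
}
```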

One doesn't need to have a collection stream or other instance of BaseStream to use the SummaryStatistics classes because they can be instantiated directly and used directly for the predefined numeric statistical operations. The next code listing demonstrates this by directly instantiating and then populating an instance of DoubleSummaryStatistics.

Directly Instantiating DoubleSummaryStatistics
/**
 * Demonstrate direct instantiation and population of an
 * instance of DoubleSummaryStatistics.
 */
private static void demonstrateDirectAccessToDoubleSummaryStatistics()
{
   final DoubleSummaryStatistics doubleSummaryStatistics =
      new DoubleSummaryStatistics();
   doubleSummaryStatistics.accept(5.0);
   doubleSummaryStatistics.accept(10.0);
   doubleSummaryStatistics.accept(15.0);
   doubleSummaryStatistics.accept(20.0);
   out.println("Direct DoubleSummaryStatistics Usage: " + doubleSummaryStatistics);
}

The output from running the previous code listing is shown next:

Direct DoubleSummaryStatistics Usage: DoubleSummaryStatistics{count=4, sum=50.000000, min=5.000000, average=12.500000, max=20.000000}
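
Directly instantiated instances can also be merged with the same combine method that served as the "combiner" method reference in the stream examples. The sketch below splits the four values used above across two instances and then combines them. The class and method names are hypothetical.

```java
import java.util.DoubleSummaryStatistics;

class CombineSummaryStatisticsSketch
{
   /** Populates two separate instances and merges the second into the first. */
   static DoubleSummaryStatistics combineTwoHalves()
   {
      final DoubleSummaryStatistics firstHalf = new DoubleSummaryStatistics();
      firstHalf.accept(5.0);
      firstHalf.accept(10.0);

      final DoubleSummaryStatistics secondHalf = new DoubleSummaryStatistics();
      secondHalf.accept(15.0);
      secondHalf.accept(20.0);

      // combine mutates firstHalf to include everything recorded in secondHalf.
      firstHalf.combine(secondHalf);
      return firstHalf;
   }

   public static void main(final String[] arguments)
   {
      System.out.println("Combined DoubleSummaryStatistics: " + combineTwoHalves());
   }
}
```

The combined instance should report the same count, sum, minimum, average, and maximum as the single instance populated with all four values.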

As done in the previous code listing for a DoubleSummaryStatistics, the next code listing instantiates a LongSummaryStatistics directly and populates it. This example also demonstrates how the SummaryStatistics classes provide individual methods for requesting individual statistics.

Directly Instantiating LongSummaryStatistics / Requesting Individual Statistics
/**
 * Demonstrate use of LongSummaryStatistics with this particular
 * example directly instantiating and populating an instance of
 * LongSummaryStatistics that represents hypothetical time
 * durations measured in milliseconds.
 */
private static void demonstrateLongSummaryStatistics()
{
   // This is a series of longs that might represent durations
   // of times such as might be calculated by subtracting the
   // value returned by System.currentTimeMillis() earlier in
   // code from the value returned by System.currentTimeMillis()
   // called later in the code.
   LongSummaryStatistics timeDurations = new LongSummaryStatistics();
   timeDurations.accept(5067054);
   timeDurations.accept(7064544);
   timeDurations.accept(5454544);
   timeDurations.accept(4455667);
   timeDurations.accept(9894450);
   timeDurations.accept(5555654);
   out.println("Test Results Analysis:");
   out.println("\tTotal Number of Tests: " + timeDurations.getCount());
   out.println("\tAverage Time Duration: " + timeDurations.getAverage());
   out.println("\tTotal Test Time: " + timeDurations.getSum());
   out.println("\tShortest Test Time: " + timeDurations.getMin());
   out.println("\tLongest Test Time: " + timeDurations.getMax());
}

The output from this example is now shown:

Test Results Analysis:
 Total Number of Tests: 6
 Average Time Duration: 6248652.166666667
 Total Test Time: 37491913
 Shortest Test Time: 4455667
 Longest Test Time: 9894450

In most examples in this post, I relied on the SummaryStatistics classes' readable toString() implementations to demonstrate the statistics available in each class. This last example, however, demonstrated that each individual type of statistic (number of values, maximum value, minimum value, sum of values, and average value) can be retrieved individually in numeric form.
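
One detail worth keeping in mind when calling the individual methods is what they return before any value has been recorded. According to the Javadoc, LongSummaryStatistics.getMin() returns Long.MAX_VALUE and getMax() returns Long.MIN_VALUE in that case, while getCount(), getSum(), and getAverage() return zero. The sketch below guards against the empty case by checking getCount() first; the class name, method names, and message text are hypothetical.

```java
import java.util.LongSummaryStatistics;

class EmptySummaryStatisticsSketch
{
   /** Indicates whether at least one value has been recorded. */
   static boolean hasValues(final LongSummaryStatistics statistics)
   {
      return statistics.getCount() > 0;
   }

   /** Describes the shortest duration, guarding against the empty case. */
   static String describeShortest(final LongSummaryStatistics statistics)
   {
      return hasValues(statistics)
         ? "Shortest Test Time: " + statistics.getMin()
         : "Shortest Test Time: (no values recorded)";
   }

   public static void main(final String[] arguments)
   {
      final LongSummaryStatistics timeDurations = new LongSummaryStatistics();
      System.out.println(describeShortest(timeDurations));
      timeDurations.accept(4455667L);
      System.out.println(describeShortest(timeDurations));
   }
}
```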

Conclusion

Whether the data being analyzed is directly provided as a numeric Stream, is provided indirectly via a collection's stream, or is manually placed in the appropriate SummaryStatistics class instance, the three SummaryStatistics classes can provide useful common statistical calculations on integers, longs, and doubles.
