Calculate Distribution From Collection In Java

Turning a collection of numbers (or objects whose fields you’d like to inspect) into a distribution of those numbers is a common statistical technique, used in reporting and data-driven applications.

Given a collection:

1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3

You can inspect their distribution as a count (frequency of each element), and store the results in a map:

{
"1": 5,
"2": 2,
"3": 2,
"4": 1,
"5": 1
}

Or, you can normalize the counts by the total number of values, expressing each one as a fraction of the whole (rounded to two decimals here):

{
"1": 0.45,
"2": 0.18,
"3": 0.18,
"4": 0.09,
"5": 0.09
}

Or even express these percentages in a 0..100 format instead of a 0..1 format.
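As a rough sketch of the arithmetic behind these three representations, the following plain-Java example (no streams yet) counts each value and derives the fraction and the percentage from that count. The class and method names are illustrative only and do not appear elsewhere in this guide.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DistributionByHand {

    // Counts how often each value occurs; a TreeMap keeps the keys sorted.
    static Map<Integer, Integer> counts(List<Integer> values) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Integer> values = List.of(1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3);
        for (Map.Entry<Integer, Integer> e : counts(values).entrySet()) {
            double fraction = (double) e.getValue() / values.size(); // 0..1 range
            double percent = fraction * 100.0;                       // 0..100 range
            System.out.println(e.getKey() + ": count=" + e.getValue()
                    + ", fraction=" + fraction + ", percent=" + percent);
        }
    }
}
```

For the value 1, the count is 5 out of 11 values, so the fraction is 5/11 ≈ 0.4545, which the normalized map above shows rounded to 0.45.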

In this guide, we’ll take a look at how you can calculate a distribution from a collection, both for primitive types and for objects whose fields you might want to report in your application.

With the addition of functional programming support in Java – calculating distributions is easier than ever. We’ll be working with a collection of numbers and a collection of Books:

public class Book {

    private String id;
    private String name;
    private String author;
    private long pageNumber;
    private long publishedYear;

    // Constructor, getters and setters omitted for brevity
}
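The later examples create books with a five-argument constructor and call getPublishedYear(), neither of which is shown in the class above. A minimal sketch of what the class might look like, assuming the constructor takes the fields in declaration order (this matches how the books are created later in this guide):

```java
class Book {

    private String id;
    private String name;
    private String author;
    private long pageNumber;
    private long publishedYear;

    // Assumed constructor: arguments follow the field declaration order.
    Book(String id, String name, String author, long pageNumber, long publishedYear) {
        this.id = id;
        this.name = name;
        this.author = author;
        this.pageNumber = pageNumber;
        this.publishedYear = publishedYear;
    }

    String getId() { return id; }
    String getName() { return name; }
    String getAuthor() { return author; }
    long getPageNumber() { return pageNumber; }
    long getPublishedYear() { return publishedYear; }
}
```

Because getPublishedYear() returns a long, grouping by it produces Long keys, which is why the later examples convert the keys to Integer.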

Calculate Distribution of Collection in Java

Let’s first take a look at how you can calculate a distribution for primitive types. Working with objects simply allows you to call custom methods from your domain classes to provide more flexibility in the calculations.

By default, we’ll represent the percentages as a double from 0.00 to 100.00.

Primitive Types

Let’s create a list of integers and print their distribution:

List<Integer> integerList = List.of(1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3);
System.out.println(calculateIntegerDistribution(integerList));

The distribution is calculated with:

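import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

public static Map<Integer, Double> calculateIntegerDistribution(List<Integer> list) {
    return list.stream()
            .collect(Collectors.groupingBy(Integer::intValue,
                    Collectors.collectingAndThen(Collectors.counting(),
                            // Locale.ROOT keeps "." as the decimal separator so parseDouble() can read it back
                            count -> (Double.parseDouble(String.format(Locale.ROOT, "%.2f", count * 100.00 / list.size()))))));
}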

This method accepts a list and streams it. The elements are grouped by their integer value, and the size of each group is counted using Collectors.counting(). The result is collected into a Map in which the keys are the input values and the doubles are their percentages in the distribution.

The key method here is collect(), which takes a single collector built by groupingBy(). groupingBy() in turn takes two arguments: a classifier function that produces the map key (here, the element itself) and a downstream collector that produces the map value. The downstream collector is collectingAndThen(), which first counts the elements of each group and then applies a finishing function to that count, such as count * 100.00 / list.size(), to express it as a percentage:

{1=45.45, 2=18.18, 3=18.18, 4=9.09, 5=9.09}
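To see what each collector contributes, it may help to split the pipeline in two. The sketch below (class and method names are illustrative, not part of the original code) first collects plain counts and then adds the finishing step. The percentages are left unrounded here to keep the focus on the collectors.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GroupingSteps {

    // Step 1: group equal values together and count the size of each group.
    static Map<Integer, Long> countPerValue(List<Integer> list) {
        return list.stream()
                .collect(Collectors.groupingBy(Integer::intValue, Collectors.counting()));
    }

    // Step 2: same grouping, but collectingAndThen() converts each count
    // into a percentage of the whole list.
    static Map<Integer, Double> percentPerValue(List<Integer> list) {
        return list.stream()
                .collect(Collectors.groupingBy(Integer::intValue,
                        Collectors.collectingAndThen(Collectors.counting(),
                                count -> count * 100.0 / list.size())));
    }

    public static void main(String[] args) {
        List<Integer> list = List.of(1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3);
        System.out.println(countPerValue(list));
        System.out.println(percentPerValue(list));
    }
}
```

The only difference between the two methods is the downstream collector, so any other finishing function (a fraction, a rounded value, a formatted string) can be swapped in at the same place.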

Sort Distribution by Value or Key

When creating distributions, you’ll typically want to sort the entries. More often than not, this’ll be by key. A HashMap makes no guarantee about iteration order, so we’ll collect the sorted entries into a LinkedHashMap, which preserves insertion order. The easiest way to do this is to re-stream the grouped map and re-collect it.

The grouping step can quickly collapse thousands of records into a small map, depending on the number of distinct keys you’re dealing with, so re-streaming isn’t expensive:

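import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

public static Map<Integer, Double> calculateIntegerDistribution(List<Integer> list) {
    return list.stream()
            .collect(Collectors.groupingBy(Integer::intValue,
                    Collectors.collectingAndThen(Collectors.counting(),
                            count -> (Double.parseDouble(String.format(Locale.ROOT, "%.2f", count * 100.00 / list.size()))))))
            // Re-stream the (much smaller) map, sort it by key and collect it into a LinkedHashMap
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            .collect(Collectors.toMap(Map.Entry::getKey,
                    Map.Entry::getValue,
                    // Keys are already unique, so a merge should never happen
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}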
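The method above sorts by key. Sorting by value follows the same pattern. Here is a minimal sketch, assuming you want the largest share first and ties ordered by key (the class and method names are hypothetical, and the percentages are left unrounded for brevity):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SortByValue {

    static Map<Integer, Double> distributionSortedByValue(List<Integer> list) {
        Map<Integer, Double> unsorted = list.stream()
                .collect(Collectors.groupingBy(Integer::intValue,
                        Collectors.collectingAndThen(Collectors.counting(),
                                count -> count * 100.0 / list.size())));

        return unsorted.entrySet().stream()
                // Largest percentage first; equal percentages fall back to key order
                .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<Integer, Double>comparingByKey()))
                .collect(Collectors.toMap(Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> {
                            throw new AssertionError();
                        },
                        // LinkedHashMap preserves the sorted order
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        System.out.println(distributionSortedByValue(List.of(3, 3, 3, 1, 2, 2)));
    }
}
```

The explicit type arguments on comparingByValue() are there because the compiler cannot infer them once reversed() is chained onto the call.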

Objects

How can this be done for objects? The same logic applies! Instead of an identity-like classifier (Integer::intValue), we’ll use the desired field, such as the published year of our books. Let’s create a few books, store them in a list, and then calculate the distribution of their publishing years:

Book book1 = new Book("001", "Our Mathematical Universe", "Max Tegmark", 432, 2014);
Book book2 = new Book("002", "Life 3.0", "Max Tegmark", 280, 2017);
Book book3 = new Book("003", "Sapiens", "Yuval Noah Harari", 443, 2011);
Book book4 = new Book("004", "Steve Jobs", "Walter Isaacson", 656, 2011);

List<Book> books = Arrays.asList(book1, book2, book3, book4);

Let’s calculate the distribution of the publishedYear field:

public static Map<Integer, Double> calculateDistribution(List<Book> books) {
    return books.stream()
            .collect(Collectors.groupingBy(Book::getPublishedYear,
                    Collectors.collectingAndThen(Collectors.counting(),
                            count -> (Double.parseDouble(String.format(Locale.ROOT, "%.2f", count * 100.00 / books.size()))))))
            // Re-stream the grouped map, sort it by key and collect it into a LinkedHashMap
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            // getPublishedYear() returns a long, so the Long keys are converted to Integer
            .collect(Collectors.toMap(e -> e.getKey().intValue(),
                    Map.Entry::getValue,
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}

Adjust the "%.2f" format to change the number of decimal places. This results in:

{2011=50.0, 2014=25.0, 2017=25.0}

50% of the given books (2/4) were published in 2011, 25% (1/4) were published in 2014 and 25% (1/4) in 2017. What if you want to format this result differently, and normalize it to the 0..1 range?

Calculate Normalized (Percentage) Distribution of Collection in Java

To normalize the percentages from a 0.0...100.0 range to a 0..1 range – we’ll simply adapt the collectingAndThen() call to not multiply the count by 100.0 before dividing by the size of the collection.

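Previously, the Long count returned by Collectors.counting() was implicitly converted to a double, because it was multiplied by a double value. Without that multiplication, count / books.size() would be an integer division that discards the fractional part, so this time we’ll explicitly call doubleValue() on the count: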

public static Map<Integer, Double> calculateDistributionNormalized(List<Book> books) {
    return books.stream()
            .collect(Collectors.groupingBy(Book::getPublishedYear,
                    Collectors.collectingAndThen(Collectors.counting(),
                            count -> (Double.parseDouble(String.format(Locale.ROOT, "%.4f", count.doubleValue() / books.size()))))))
            // Re-stream the grouped map, sort it by key and collect it into a LinkedHashMap
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            // getPublishedYear() returns a long, so the Long keys are converted to Integer
            .collect(Collectors.toMap(e -> e.getKey().intValue(),
                    Map.Entry::getValue,
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}

Adjust the "%.4f" format to change the number of decimal places. This results in:

{2011=0.5, 2014=0.25, 2017=0.25}
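The rounding in these methods goes through a String and back. String.format() without an explicit Locale uses the default locale, so in locales that use a comma as the decimal separator, Double.parseDouble() throws a NumberFormatException on the formatted text. As an alternative sketch (not the approach used in this guide), a small hypothetical helper based on BigDecimal rounds numerically and avoids the round trip:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class Rounding {

    // Rounds value to the given number of decimal places, rounding half up.
    static double round(double value, int places) {
        return BigDecimal.valueOf(value)
                .setScale(places, RoundingMode.HALF_UP)
                .doubleValue();
    }

    public static void main(String[] args) {
        System.out.println(round(5.0 / 11, 4));   // fraction in the 0..1 range
        System.out.println(round(500.0 / 11, 2)); // percentage in the 0..100 range
    }
}
```

With such a helper, the finishing function could be written as count -> round(count.doubleValue() / books.size(), 4), assuming the helper is visible from the calling class.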

Calculate Element Count (Frequency) of Collection

Finally – we can get the element count (frequency of all elements) in the collection by simply not dividing the count by the size of the collection! This is a fully non-normalized count:

public static Map<Integer, Integer> calculateDistributionCount(List<Book> books) {
    return books
            .stream()
            .collect(Collectors.groupingBy(Book::getPublishedYear,
                    // counting() yields a Long; convert it to an int-sized count
                    Collectors.collectingAndThen(Collectors.counting(), Long::intValue)))
            // Re-stream the grouped map, sort it by key and collect it into a LinkedHashMap
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            // getPublishedYear() returns a long, so the Long keys are converted to Integer
            .collect(Collectors.toMap(e -> e.getKey().intValue(),
                    Map.Entry::getValue,
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}

This results in:

{2011=2, 2014=1, 2017=1}

Indeed, there are two books from 2011, and one from 2014 and 2017 each.
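The three variants in this guide differ only in the classifier and in the finishing step. As a rough sketch of how they could be generalized, the hypothetical helper class below takes the classifier as a parameter and groups directly into a TreeMap, which keeps the keys sorted without re-streaming. The names Distributions, counts and shares are assumptions made for this example, and the values are left unrounded.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

final class Distributions {

    // Frequency of each key, sorted by key.
    static <T, K extends Comparable<? super K>> Map<K, Long> counts(
            Collection<T> items, Function<T, K> classifier) {
        return items.stream()
                .collect(Collectors.groupingBy(classifier, TreeMap::new, Collectors.counting()));
    }

    // Share of each key, multiplied by scale: 1.0 gives 0..1, 100.0 gives 0..100.
    static <T, K extends Comparable<? super K>> Map<K, Double> shares(
            Collection<T> items, Function<T, K> classifier, double scale) {
        int total = items.size();
        return items.stream()
                .collect(Collectors.groupingBy(classifier, TreeMap::new,
                        Collectors.collectingAndThen(Collectors.counting(),
                                count -> count * scale / total)));
    }

    public static void main(String[] args) {
        List<Integer> years = List.of(2014, 2017, 2011, 2011);
        System.out.println(counts(years, year -> year));
        System.out.println(shares(years, year -> year, 100.0));
    }
}
```

For the books in this guide, a call such as Distributions.shares(books, Book::getPublishedYear, 100.0) would produce the percentage distribution keyed by Long, assuming Book exposes getPublishedYear() as used above.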

Conclusion

Calculating distributions of data is a common task in data-rich applications, and doesn’t require the use of external libraries or complex code. With functional programming support, Java made working with collections a breeze!

In this short guide, we’ve taken a look at how you can calculate frequency counts of all elements in a collection, as well as distribution maps normalized to the 0..1 and 0..100 ranges, in Java.

Time Stamp: October 11, 2022; November 3, 2022