I briefly mentioned before that there were a number of factors that influenced whether parallel streams were faster or slower than sequential streams; let’s take a look at those factors now. Understanding what works well and what doesn’t will help you to make an informed decision about how and when to use parallel streams. There are five important factors that influence parallel streams performance that we’ll be looking at:
Data size
There is a difference in the efficiency of the parallel speedup due to the size of the input data. There’s an overhead to decomposing the problem to be executed in parallel and merging the results. This makes it worth doing only when there’s enough data that execution of a streams pipeline takes a while. We explored this back in “Parallel Stream Operations” on page 83.
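As a rough sketch of this trade-off (class and method names here are my own, and the exact crossover point varies by workload), a tiny sum is usually better off sequential, while a large one gives the parallel machinery enough work to amortize its overhead:

```java
import java.util.stream.IntStream;

public class DataSize {
    // For a tiny input, the overhead of splitting and merging can
    // outweigh the work itself, so a sequential stream is often faster.
    static long smallSum() {
        return IntStream.rangeClosed(1, 10).sum(); // sequential
    }

    // With enough elements, the pipeline runs long enough for the
    // parallel overhead to be amortized. The crossover point depends
    // on the workload, so measure rather than guess.
    static long largeSum() {
        return IntStream.rangeClosed(1, 10_000_000)
                        .parallel()
                        .mapToLong(i -> i) // sum as long to avoid int overflow
                        .sum();
    }

    public static void main(String[] args) {
        System.out.println(smallSum());  // 55
        System.out.println(largeSum());  // 50000005000000
    }
}
```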
Source data structure
Each pipeline of operations operates on some initial data source, usually a collection. Some data sources are easier to split into subsections than others, and the cost of splitting affects how much parallel speedup you can get when executing your pipeline.
Packing
Primitives are faster to operate on than boxed values.
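To see the effect of packing, compare a sum over boxed Integer values with one that switches to a primitive IntStream — a minimal sketch (class and method names are illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PackingExample {
    // Boxed: every addition unboxes two Integer values and boxes the result.
    static int boxedSum(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    // Primitive: mapToInt switches to an IntStream, so the sum
    // operates on plain ints with no boxing overhead.
    static int primitiveSum(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> values =
            IntStream.rangeClosed(1, 100).boxed().collect(Collectors.toList());
        System.out.println(boxedSum(values));     // 5050
        System.out.println(primitiveSum(values)); // 5050
    }
}
```

Both produce the same result; the difference shows up as allocation and garbage-collection pressure, not correctness.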
Number of cores
The extreme case here is that you have only a single core available to operate upon, so it’s not worth going parallel. Obviously, the more cores you have access to, the greater your potential speedup is. In practice, what counts isn’t just the number of cores allocated to your machine; it’s the number of cores that are available for your machine to use at runtime. This means factors such as other processes executing simultaneously or thread affinity (forcing threads to execute on certain cores or CPUs) will affect performance.
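You can ask the JVM how many cores it can actually use at runtime. The common fork/join pool that parallel streams run on sizes itself from this number (typically one less than the core count, since the submitting thread also does work):

```java
import java.util.concurrent.ForkJoinPool;

public class CoreCount {
    public static void main(String[] args) {
        // Cores visible to the JVM at runtime, not cores installed in
        // the machine — container limits and processor affinity can
        // reduce this number.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("Available cores: " + cores);

        // The common pool used by parallel streams sizes itself from it.
        System.out.println("Common pool parallelism: "
            + ForkJoinPool.commonPool().getParallelism());
    }
}
```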
Cost per element
Like data size, this is part of the battle between time spent executing in parallel and overhead of decomposition and merging. The more time spent operating on each element in the stream, the better performance you’ll get from going parallel.
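The contrast can be sketched as follows — primality testing here is just a stand-in for any expensive per-element operation, and the class name is my own:

```java
import java.math.BigInteger;
import java.util.stream.IntStream;

public class CostPerElement {
    // Cheap per-element work: decomposition and merging overhead can
    // easily dominate the additions themselves.
    static long cheapPerElement() {
        return IntStream.range(0, 1_000).parallel().mapToLong(i -> i + 1).sum();
    }

    // Expensive per-element work: more time is spent in leaf
    // computation, so the parallel speedup is more likely to pay off.
    static long expensivePerElement() {
        return IntStream.range(0, 1_000).parallel()
                        .filter(i -> BigInteger.valueOf(i).isProbablePrime(50))
                        .count();
    }

    public static void main(String[] args) {
        System.out.println(cheapPerElement());     // 500500
        System.out.println(expensivePerElement()); // 168 primes below 1,000
    }
}
```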
When using the parallel streams framework, it can be helpful to understand how problems are decomposed and merged. This gives us a good insight into what is going on under the hood without having to understand all the details of the framework.
Let’s take a look at how a concrete example is decomposed and merged. Example 6-6 shows some code that performs parallel integer addition.
Performance | 89
Example 6-6. Parallel integer addition
private int addIntegers(List<Integer> values) {
    return values.parallelStream()
                 .mapToInt(i -> i)
                 .sum();
}
Under the hood, parallel streams are built on the fork/join framework. The fork stage recursively splits up a problem. Then each chunk is operated upon in parallel. Finally, the join stage merges the results back together.
Figure 6-2 shows how this might apply to Example 6-6.
Figure 6-2. Decomposing and merging using fork/join
Let’s assume that the streams framework is splitting up our work to operate in parallel on a four-core machine:
1. Our data source is decomposed into four chunks of elements.
2. We perform leaf computation work in parallel on each thread in Example 6-6. This involves mapping each Integer to an int and also summing a quarter of the values in each thread. Ideally, we want to spend as much of our time as possible in leaf computation work because it’s the perfect case for parallelism.
3. We merge the results. In Example 6-6 this is just a sum operation, but it might involve any kind of reduce, collect, or terminal operation.
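The three steps above can be sketched by hand using the fork/join framework directly. This is a simplified illustration of what the streams framework does for us (the class name and threshold are my own choices, not what the library literally uses):

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Recursively fork the list in half, sum the leaves, join the results.
public class ForkJoinSum extends RecursiveTask<Integer> {
    private static final int THRESHOLD = 1_000; // leaf size is a tuning choice
    private final List<Integer> values;

    ForkJoinSum(List<Integer> values) { this.values = values; }

    @Override
    protected Integer compute() {
        if (values.size() <= THRESHOLD) {
            // Leaf computation: sum this chunk directly.
            int sum = 0;
            for (int v : values) sum += v;
            return sum;
        }
        int mid = values.size() / 2;
        ForkJoinSum left = new ForkJoinSum(values.subList(0, mid));
        ForkJoinSum right = new ForkJoinSum(values.subList(mid, values.size()));
        left.fork();                       // run the left half asynchronously
        int rightResult = right.compute(); // compute the right half here
        return rightResult + left.join();  // merge the two partial sums
    }

    public static void main(String[] args) {
        List<Integer> values =
            IntStream.rangeClosed(1, 10_000).boxed().collect(Collectors.toList());
        int result = ForkJoinPool.commonPool().invoke(new ForkJoinSum(values));
        System.out.println(result); // 50005000
    }
}
```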
Given the way problems are decomposed, the nature of the initial source is extremely important in influencing the performance of this decomposition. Intuitively, the more easily we can repeatedly split a data structure in half, the faster it can be operated upon. Splitting in half also means that the values to be operated upon need to be split up equally.
We can split up common data sources from the core library into three main groups by performance characteristics:
The good
An ArrayList, an array, or the IntStream.range constructor. These data sources all support random access, which means they can be split up arbitrarily with ease.
The okay
The HashSet and TreeSet. You can't easily decompose these with perfectly even balance, but most of the time it's possible to do so.
The bad
Some data structures just don't split well; for example, they may take O(N) time to decompose. Examples here include a LinkedList, which is computationally hard to split in half. Also, Stream.iterate and BufferedReader.lines have unknown length at the beginning, so it's pretty hard to estimate when to split these sources.
The influence of the initial data structure can be huge. To take an extreme example, benchmarking a parallel sum over 10,000 integers revealed an ArrayList to be 10 times faster than a LinkedList. This isn't to say that your business logic will exhibit the same performance characteristics, but it does demonstrate how influential these things can be. Data structures such as LinkedList that decompose poorly are also far more likely to be slower when run in parallel.
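The setup for such a comparison looks roughly like this (class name is illustrative; for meaningful numbers, use a benchmarking harness such as JMH rather than naive timing, which is easily misled by JIT warmup and garbage collection):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.stream.IntStream;

public class SourceComparison {
    static int parallelSum(List<Integer> values) {
        return values.parallelStream().mapToInt(i -> i).sum();
    }

    public static void main(String[] args) {
        List<Integer> arrayList = new ArrayList<>();
        List<Integer> linkedList = new LinkedList<>();
        IntStream.rangeClosed(1, 10_000).forEach(i -> {
            arrayList.add(i);
            linkedList.add(i);
        });

        // Same result, very different decomposition cost: the ArrayList
        // splits by index, while the LinkedList must be walked node by
        // node just to find a split point.
        System.out.println(parallelSum(arrayList));  // 50005000
        System.out.println(parallelSum(linkedList)); // 50005000
    }
}
```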
Ideally, once the streams framework has decomposed the problem into smaller chunks, we'll be able to operate on each chunk in its own thread, with no further communication or contention between threads. Unfortunately, reality can get in the way of the ideal at times!
When we're talking about the kinds of operations in our stream pipeline that let us operate on chunks individually, we can differentiate between two types of stream operations: stateless and stateful. Stateless operations need to maintain no concept of state over the whole operation; stateful operations have the overhead and constraint of maintaining state.

If you can get away with using stateless operations, then you will get better parallel performance. Examples of stateless operations include map, filter, and flatMap; sorted, distinct, and limit are stateful.
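The two kinds of pipeline can be contrasted side by side — a small sketch with illustrative names:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StatelessVsStateful {
    // Stateless: filter and map look at one element at a time, so each
    // chunk can be processed with no coordination between threads.
    static List<Integer> doubledEvens(List<Integer> values) {
        return values.parallelStream()
                     .filter(i -> i % 2 == 0)
                     .map(i -> i * 2)
                     .collect(Collectors.toList());
    }

    // Stateful: sorted and distinct need to see values from other
    // chunks, which constrains how well the pipeline parallelizes.
    static List<Integer> sortedDistinct(List<Integer> values) {
        return values.parallelStream()
                     .sorted()
                     .distinct()
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(3, 1, 4, 1, 5, 9, 2, 6);
        System.out.println(doubledEvens(values));   // [8, 4, 12]
        System.out.println(sortedDistinct(values)); // [1, 2, 3, 4, 5, 6, 9]
    }
}
```

Both pipelines give correct answers in parallel; the stateful one simply has less freedom to keep threads independent.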
Performance-test your own code. The advice in this section offers rules of thumb about what performance characteristics should be investigated, but nothing beats measuring and profiling.