Slow MATLAB parfor performance and possible Parallel Toolbox bug

After playing with the MATLAB Parallel Computing Toolbox (R2009a x64), I noticed what appears to be significant overhead when writing to an array in parallel, which makes a parfor loop about 7x slower than the equivalent serial loop. This effect isn’t observed in C with OpenMP, as demonstrated below. While testing this, I also came across repeatable behavior indicative of a bug in MATLAB (see last code block). These tests were performed on a 4-core Intel i5 with 8 GB of memory.

First is a loop that sums the numbers from 1 to n, applying a reduction to the variable sum. This case is embarrassingly parallel and shows a roughly linear speedup with the number of cores (about 3.3x on 4 cores).

<code>n = 10000000;
 
sum=0;
fprintf('serial: ')
tic
for i=1:n
    sum = sum + i;
end
toc
 
sum=0;
fprintf('parallel: ')
tic
parfor i=1:n
    sum = sum + i;
end
toc</code>


serial: Elapsed time is 8.241937 seconds.
parallel: Elapsed time is 2.531207 seconds.
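
For comparison with the C test later in this post, the same reduction can be sketched with OpenMP. This is an untimed illustration only: `sum_to_n` is a hypothetical helper name, and the sketch assumes a compiler with OpenMP enabled (for example `gcc -fopenmp`). Without that flag the pragma is ignored and the loop simply runs serially with the same result.

```c
/* Hypothetical helper: sums the integers 1..n.
   With OpenMP enabled, each thread accumulates a private partial sum
   and the partial sums are combined when the loop finishes. */
double sum_to_n(long n) {
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 1; i <= n; i++)
        sum += (double)i;
    return sum;
}
```

Calling `sum_to_n(10000000)` returns 50000005000000, which is n(n+1)/2 for n = 10 million. The `reduction(+:sum)` clause plays the same role as the reduction MATLAB applies to `sum` in the parfor loop above.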

Now, instead of using the loop counter to increment the variable sum, the elements of an array [1,2,…,n] are summed. This requires MATLAB to read from an array on 4 workers, one for each core. A roughly linear speedup (about 3.6x) is seen here as well, as expected.

<code>n = 10000000;
data = 1:n; % array with elements [1,2,...,n-1,n]
 
sum=0;
fprintf('serial: ')
tic
for i=1:n
    sum = sum + data(i);
end
toc
 
sum=0;
fprintf('parallel: ')
tic
parfor i=1:n
    sum = sum + data(i);
end
toc</code>


serial: Elapsed time is 9.642881 seconds.
parallel: Elapsed time is 2.705653 seconds.

Next, instead of reading from an array in parallel, an array is written to in parallel. Here parallelizing the loop causes a 7x slowdown.

<code>n = 10000000; % 10 million
data = zeros(1,n);
 
fprintf('serial: ')
tic
for i=1:n
    data(i) = i;
end
toc
 
data=zeros(1,n);
fprintf('parallel: ')
tic
parfor i=1:n
    data(i) = i;
end
toc</code>


serial: Elapsed time is 0.618991 seconds.
parallel: Elapsed time is 4.359711 seconds.

Interestingly, this slowdown doesn’t occur when using OpenMP in C; using multiple cores cuts the run time of the equivalent array initialization by 26% (1.583 s to 1.167 s). The gain is well short of 4x because the loop does essentially no computation, so most of the time is spent waiting on memory rather than on the CPU.

<code>#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
 
int main(int argc, char **argv) {
 
	int n = 200000000; // 200 million
	int i;
	double start, diff;
	int msec;
 
	// Serial fill. omp_get_wtime() returns wall-clock time; clock() can
	// report CPU time summed over all threads, which would hide any speedup.
	double* data = malloc(n*sizeof(double));
	if (data == NULL) return 1;
	start = omp_get_wtime();
 
	for (i=0; i<n; i++)
		data[i] = i;
 
	diff = omp_get_wtime() - start;
	free(data);
	msec = (int)(diff * 1000);
	printf("Serial: Time taken %d seconds %d milliseconds\n", msec/1000, msec%1000);
 
	// Parallel fill: the iterations are divided among the OpenMP threads.
	data = malloc(n*sizeof(double));
	if (data == NULL) return 1;
	start = omp_get_wtime();
	#pragma omp parallel for
	for (i=0; i<n; i++)
		data[i] = i;
 
	diff = omp_get_wtime() - start;
	free(data);
	msec = (int)(diff * 1000);
	printf("Parallel: Time taken %d seconds %d milliseconds\n", msec/1000, msec%1000);
 
	return 0;
}</code>


Serial: Time taken 1 seconds 583 milliseconds
Parallel: Time taken 1 seconds 167 milliseconds
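
A caveat when timing threaded C code: `clock()` measures processor time, and on some platforms (Linux with glibc, for example) that is CPU time summed over all threads, which can hide a parallel speedup. Below is a minimal sketch of wall-clock timing using C11 `timespec_get`, which does not depend on the OpenMP runtime. `wall_seconds` and `fill` are hypothetical helper names, and the pragma assumes compilation with OpenMP enabled (for example `gcc -fopenmp`).

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: current wall-clock time in seconds (C11 timespec_get). */
double wall_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: writes data[i] = i for i in [0, n), like the loops
   in the test program, and returns the elapsed wall-clock time in seconds. */
double fill(double *data, int n) {
    int i;
    double start = wall_seconds();
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        data[i] = i;
    return wall_seconds() - start;
}
```

Timing `fill` once with OpenMP enabled and once without gives a serial versus parallel comparison measured in elapsed time rather than CPU time.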

Here is one additional observation which, by itself, leads me to believe there is some form of bug in the MATLAB Parallel Computing Toolbox. Assigning an additional, unused variable before each loop slows the serial code by nearly 29x (0.62 s to 17.9 s). This doesn’t make any sense, as the variable isn’t used in the loop and should have no effect on computation time whatsoever. But let’s look at the speed difference:

<code>clear
n = 10000000;
data = zeros(1,n);
 
sum=0; % new, unused variable
fprintf('serial: ')
tic
for i=1:n
    data(i) = i;
end
toc
 
sum=0; % new, unused variable
data=zeros(1,n);
fprintf('parallel: ')
tic
parfor i=1:n
    data(i) = i;
end
toc</code>


serial: Elapsed time is 17.926449 seconds.
parallel: Elapsed time is 4.477197 seconds.

This behavior is consistent and reproducible. Simply commenting out the sum=0; lines radically changes the speed. There should be no difference, but try pasting the above code into a MATLAB editor and running it, then running the identical code without the sum=0; lines. Further, pasting the code directly into the MATLAB command window does not produce a difference in speed for the serial code; only running from a MATLAB editor window produces this anomaly.
