Intel® MPI Benchmarks 4.0
To reduce measurement errors caused by insufficient clock resolution, every benchmark is run repeatedly. The repetition count is as follows:
For IMB-MPI1, IMB-NBC, and aggregate flavors of IMB-EXT, IMB-IO, and IMB-RMA benchmarks, the repetition count is MSGSPERSAMPLE. This constant is defined in IMB_settings.h and IMB_settings_io.h, with values of 1000 and 50, respectively.
To avoid excessive run times for large transfer sizes X, the repetition count is capped at OVERALL_VOL/X. The OVERALL_VOL value is defined in IMB_settings.h and IMB_settings_io.h, with values of 4MB and 16MB, respectively.
Given transfer size X, the repetition count for all aggregate benchmarks is defined as follows:
n_sample = MSGSPERSAMPLE (X=0)
n_sample = max(1,min(MSGSPERSAMPLE,OVERALL_VOL/X)) (X>0)
The repetition count for non-aggregate benchmarks is defined analogously, with MSGSPERSAMPLE replaced by MSGS_NONAGGR. A reduced count is recommended because non-aggregate run times are usually much longer.
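The repetition-count rule above can be sketched as a small C function. This is only an illustration, not the IMB source: the helper name repetition_count is hypothetical, MB is assumed to mean 2^20 bytes, and OVERALL_VOL/X is assumed to be an integer division.

```c
#include <assert.h>

/* Values quoted in the text for IMB_settings.h.
   Assumption: 4MB means 4 * 2^20 bytes. */
#define MSGSPERSAMPLE 1000
#define OVERALL_VOL   (4L * 1024L * 1024L)

/* Hypothetical helper: repetition count for a transfer size x in bytes.
   max_reps is MSGSPERSAMPLE for aggregate benchmarks and MSGS_NONAGGR
   for non-aggregate ones. */
long repetition_count(long x, long max_reps, long overall_vol)
{
    assert(x >= 0 && max_reps >= 1 && overall_vol >= 0);

    if (x == 0)
        return max_reps;              /* X = 0: no volume cap applies */

    long capped = overall_vol / x;    /* assumed integer division */
    if (capped > max_reps)
        capped = max_reps;            /* min(max_reps, OVERALL_VOL/X) */

    return capped < 1 ? 1 : capped;   /* max(1, ...): run at least once */
}
```

Under these assumptions, a transfer size of 8192 bytes gives 512 repetitions, while any size above 4 MB is run exactly once.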
In the following examples, elementary transfer means a pure function call (MPI_[Send, ...], MPI_Put, MPI_Get, MPI_Accumulate, MPI_File_write_XX, MPI_File_read_XX), without any further function call. Assured completion of transfers means:
MPI_Win_fence for IMB-EXT benchmarks
a triplet MPI_File_sync/MPI_Barrier(file_communicator)/MPI_File_sync for IMB-IO Write benchmarks
MPI_Win_flush, MPI_Win_flush_all, MPI_Win_flush_local, or MPI_Win_flush_local_all for IMB-RMA benchmarks
empty for all other benchmarks
for ( i=0; i<N_BARR; i++ ) MPI_Barrier(MY_COMM)
time = MPI_Wtime()
for ( i=0; i<n_sample; i++ )
    execute MPI pattern
time = (MPI_Wtime()-time)/n_sample
For aggregate benchmarks, the kernel loop looks as follows:
for ( i=0; i<N_BARR; i++ ) MPI_Barrier(MY_COMM)
/* Negligible integer (offset) calculations ... */
time = MPI_Wtime()
for ( i=0; i<n_sample; i++ )
    execute elementary transfer
assure completion of all transfers
time = (MPI_Wtime()-time)/n_sample
For non-aggregate benchmarks, every single transfer is safely completed:
for ( i=0; i<N_BARR; i++ ) MPI_Barrier(MY_COMM)
/* Negligible integer (offset) calculations ... */
time = MPI_Wtime()
for ( i=0; i<n_sample; i++ )
{
    execute elementary transfer
    assure completion of transfer
}
time = (MPI_Wtime()-time)/n_sample
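The difference between the aggregate and non-aggregate kernel loops can be made concrete with an MPI-free sketch. All names here (step_fn, wall_seconds, time_kernel, the counters struct) are hypothetical. In a real benchmark the callbacks would wrap calls such as MPI_Put and MPI_Win_fence, the clock would be MPI_Wtime(), and the N_BARR barriers would precede the timed region; they are omitted because this sketch runs in a single process. It assumes a C11 library that provides timespec_get.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical callback type standing in for an MPI call. */
typedef void (*step_fn)(void *ctx);

/* Wall-clock seconds; stands in for MPI_Wtime() in this sketch. */
double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Returns the average time per sample.
   aggregate != 0: completion is assured once, after all transfers.
   aggregate == 0: completion is assured after every single transfer. */
double time_kernel(int n_sample, int aggregate,
                   step_fn transfer, step_fn complete, void *ctx)
{
    assert(n_sample >= 1);

    double t = wall_seconds();
    for (int i = 0; i < n_sample; i++) {
        transfer(ctx);
        if (!aggregate)
            complete(ctx);
    }
    if (aggregate)
        complete(ctx);

    return (wall_seconds() - t) / n_sample;
}

/* Example context that only counts how often each step ran. */
struct counters { int transfers; int completions; };

void count_transfer(void *ctx) { ((struct counters *)ctx)->transfers++; }
void count_complete(void *ctx) { ((struct counters *)ctx)->completions++; }
```

With n_sample = 8, the aggregate mode invokes the completion step once and the non-aggregate mode invokes it eight times, which is why non-aggregate runs tend to take longer per transfer.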
A nonblocking benchmark has to provide three timings:
t_pure - blocking pure I/O time
t_ovrl - nonblocking I/O time concurrent with CPU activity
t_CPU - pure CPU activity time
The actual benchmark consists of the following stages:
Calling the equivalent blocking benchmark, as defined in Actual Benchmarking, and taking the benchmark time as t_pure.
Closing and re-opening the particular file(s).
Re-synchronizing the processes.
Running the nonblocking case concurrently with CPU activity (which takes t_CPU when running undisturbed), and taking the effective time as t_ovrl.
The desired CPU time to be matched approximately by t_CPU is set in IMB_settings_io.h:
#define TARGET_CPU_SECS 0.1 /* unit seconds */
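One plausible way to match a target CPU time is to time a single pass of some CPU-bound kernel and then scale the number of passes accordingly. The sketch below illustrates only that arithmetic. The helper cpu_passes_for_target is hypothetical, and this is an assumption about how such a calibration could work, not a description of the IMB implementation.

```c
#include <assert.h>

#define TARGET_CPU_SECS 0.1 /* unit seconds, as quoted in the text */

/* Hypothetical helper: given the measured duration of one pass of a
   CPU kernel, return the number of passes whose total duration is
   closest to target_secs, with a minimum of one pass. */
long cpu_passes_for_target(double target_secs, double secs_per_pass)
{
    assert(target_secs > 0.0 && secs_per_pass > 0.0);

    /* Round to the nearest whole number of passes. */
    long passes = (long)(target_secs / secs_per_pass + 0.5);

    return passes < 1 ? 1 : passes;
}
```

For example, if one pass takes about 1 ms, roughly 100 passes would approximate the 0.1 s target. The match is only approximate, which is consistent with the text's wording that t_CPU matches the target "approximately".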