Phi Compiling Guide

This document was created to guide users through compiling and executing programs for Ruby's Phi coprocessors.  It is not intended to help determine which of the Phi usage models to use.  No special actions are needed for programs running exclusively on the host.  For more general information on Ruby and its Phi coprocessors, see our Ruby FAQ page.  Only Fortran, C, and C++ code can be compiled to run on the Phi coprocessors.  Code to be run on Ruby or the Xeon Phi coprocessors should be compiled on Ruby.

The Intel Xeon Phi accelerators are referred to as "Phis", and the Intel Xeon CPU as the "host", throughout this guide.

All Usage Models

Users compiling for Ruby should ensure they have the newest version of the Intel Compilers loaded.

Intel compiler suite version 15.0.0 can be loaded with the command:

module load intel/15.0.0

A list of the Intel compiler suite versions available can be seen with:

module spider intel

General Performance Considerations

  • Code should be parallelized.  Due to the simplified architecture of the Phi, serial code run on the Phi will usually be slower than the same serial code run on the host Xeon CPU.  Only through parallel computation can the Phi's power be fully utilized.  
  • Code should be vectorized.  Vectorization is the unrolling of a loop so that one operation can be performed on multiple pairs of operands at once.  The Phi has extra-wide vector units compared to a CPU, increasing the importance of vectorization for performance.
Figure: chart showing performance increases due to multi-threading and vectorization, illustrating the importance of both (image courtesy Intel).
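
As a hedged illustration of both points (the file name, array size, and loop body below are assumptions, not code from this guide), a minimal C sketch of a loop that is both threaded with OpenMP and simple enough for the compiler to vectorize:

#include <stdio.h>

#define N 1000000

static float a[N], b[N];

int main(void)
{
    int i;

    for (i = 0; i < N; i++)
        b[i] = (float) i;

    /* Threading: OpenMP spreads the iterations across the available cores */
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        /* Vectorization: a simple, unit-stride body the compiler can vectorize */
        a[i] = 2.0f * b[i] + 1.0f;
    }

    printf("a[42] = %f\n", a[42]);
    return 0;
}

Compiled natively with something like icc -O3 -openmp -mmic example.c, both levels of parallelism are available on the Phi.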

 

Native Mode

This is the simplest usage model for running code on the Xeon Phi coprocessors.  Code is compiled on the host to be run exclusively on the Phi coprocessor.

To compile an application for the native usage model, use the -mmic compiler flag:

icc -O3 -mmic helloWorld.c -o helloWorld.out

Home directories (rooted at /nfs) are mounted on the Phis, so as long as your application resides there you do not need to copy it over to the Phi.

You can start your application on the Phi remotely from the host using the following syntax:

ssh mic0-r0007 ~/helloWorld.out
Hello World

Make sure to replace the Phi hostname and application path and name with your own.

If your application requires any shared libraries, make sure they are in a location accessible from the Phi and that their location is specified on the Phi (see LD_LIBRARY_PATH below).  Shared locations include all home directories located on /nfs.  You can also copy any necessary libraries over to the Phi's /tmp folder.
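
For example, a natively built OpenMP program needs the MIC build of the Intel OpenMP runtime (libiomp5.so) available on the coprocessor.  A hedged sketch, where the library path and hostname are placeholders for your own installation and node:

# Copy the MIC build of the Intel OpenMP runtime to the Phi's /tmp folder
scp /path/to/intel/compiler/lib/mic/libiomp5.so mic0-r0007:/tmp/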

The Phis have a minimal environment to start with.  If you require an LD_LIBRARY_PATH (or any other environment variable) for your application, you will need to set it manually on the Phis.  If you copied your necessary library files to /tmp, you could do the following from the Phi:

export LD_LIBRARY_PATH=/tmp

To check what environment variables the Phi comes with, run the following from the host:

ssh mic0-r0007 env

MPI Usage

Intel MPI can be used with natively compiled code to spawn MPI tasks exclusively on the Phi.  The only additional steps required are setting the environment variable I_MPI_MIC to 1 at runtime and making sure your processes are launched on the Phi.

Setting I_MPI_MIC to 1 at runtime enables the MPI library on the host to recognize and work with the Phi:

export I_MPI_MIC=1 

Making sure your processes are executed on the Phi is as simple as telling mpiexec to launch them there.  Note the use of mpiexec.hydra, not mpiexec:

mpiexec.hydra -host mic0 -n 16 /tmp/MPI_prog.out

An alternative is to ssh to the Phi and launch mpiexec from there:

mpiexec.hydra -n 16 /tmp/MPI_prog.out
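
As a hedged example of how /tmp/MPI_prog.out could be produced in the first place (the source file name is an assumption), the program is compiled natively on the host and then staged on the coprocessor:

# Compile the MPI program for the Phi on the host
mpicc -mmic -O3 MPI_prog.c -o MPI_prog.out

# Stage the native executable in the Phi's /tmp folder
scp MPI_prog.out mic0-r0007:/tmp/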

Important performance considerations:

  • Data should be aligned to 64 Bytes (512 bits); see the sketch after this list
  • Due to the large SIMD width of 64 Bytes, vectorization is crucial
  • Use the -vec-report2 compiler flag to generate vectorization reports to see whether loops have been vectorized for the Phi architecture
    • If vectorized, messages will read "*MIC* Loop was vectorized" or similar
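
A minimal sketch of how 64-Byte alignment and the vectorization report fit together (the file name, array size, and use of __attribute__((aligned)) are assumptions for illustration):

#include <stdio.h>

#define N 1024

/* Statically align the arrays to 64 Bytes for the Phi's 512-bit vector units */
float a[N] __attribute__((aligned(64)));
float b[N] __attribute__((aligned(64)));

int main(void)
{
    int i;

    for (i = 0; i < N; i++)
        b[i] = (float) i;

    /* A unit-stride loop the compiler should report as vectorized */
    for (i = 0; i < N; i++)
        a[i] = 2.0f * b[i];

    printf("a[10] = %f\n", a[10]);
    return 0;
}

Compiled natively with something like icc -O3 -mmic -vec-report2 aligned.c -o aligned.out, the report should show the second loop as vectorized.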

 

Intel MKL Automatic Offload (AO)

Some Intel MKL functions are Automatic Offload capable; if the library call is made after automatic offloading has been enabled, MKL will automatically decide at runtime whether or not to offload some or all of the calls to the Phi.  This decision is based upon the problem size, load on the processors, and other metrics.  This offloading is completely transparent to the user, and no special compiler options are needed.  If the Phi is not available for any reason, MKL functions will fall back to executing on the host.

Automatic Offload enabled functions

The following Level-3 BLAS functions and LAPACK functions are AO-enabled as of the latest MKL version 11.1, available on Ruby:

  • *GEMM, *SYMM, *TRMM, and *TRSM
  • LU, QR, Cholesky factorizations

* (asterisk) is a wildcard specifying all data types (S, D, C, and Z).

Enabling and Disabling Automatic Offload

Automatic Offload can be both enabled and disabled through the setting of an environment variable or the call of a support function.  Compiler pragmas are not needed -- users can compile and link code the usual way.

To enable AO in FORTRAN or C code:

rc = mkl_mic_enable()

Alternatively, to enable AO through an environment variable:

export MKL_MIC_ENABLE=1
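
As a hedged C sketch of what an AO-eligible call looks like (the matrix size and the choice of cblas_dgemm are illustrative assumptions; either the support function shown above or the environment variable is sufficient to enable AO):

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void)
{
    int n = 4096;   /* AO only pays off for sufficiently large matrices */
    int i;
    double *a = (double *) malloc(n * n * sizeof(double));
    double *b = (double *) malloc(n * n * sizeof(double));
    double *c = (double *) malloc(n * n * sizeof(double));

    for (i = 0; i < n * n; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = 0.0;
    }

    /* Enable Automatic Offload from code instead of setting MKL_MIC_ENABLE=1 */
    mkl_mic_enable();

    /* MKL decides at runtime whether to offload some or all of this DGEMM to the Phi */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    printf("c[0] = %f\n", c[0]);

    free(a); free(b); free(c);
    return 0;
}

The program is compiled and linked against MKL the usual way, for example with icc -O3 -mkl.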

 

To disable AO in FORTRAN or C code:

rc = mkl_mic_disable()

Alternatively, to disable AO through an environment variable:

export MKL_MIC_ENABLE=0

Using Automatic Offload and Compiler Assisted Offload in the same program

The Intel MKL library supports the use of both Automatic Offload and Compiler Assisted Offload in the same program.  When doing so, users need to explicitly specify work division for AO aware functions using support functions or environment variables.  By default, if the work division is not specified, all work will be done on the host.  
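
As a hedged illustration (the 0.5 fraction and these particular MKL work-division controls are assumptions to verify against Intel's MKL documentation), work division can be set either through an environment variable:

export MKL_MIC_WORKDIVISION=0.5

or through a support function call in FORTRAN or C code, here sending half of the eligible work to Phi number 0:

rc = mkl_mic_set_workdivision(MKL_TARGET_MIC, 0, 0.5)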

Force execution failure if offload not available

Intel MKL will default to running computations on the host if the Phi is not available for any reason.  Whether or not computations were offloaded to the Phi will not be apparent to the user.  To force execution to fail if the offload fails, use the following command to set the proper environment variable:

export MKL_MIC_DISABLE_HOST_FALLBACK=1

Setting this will cause programs to exit with the error message "Could not enable Automatic Offload" if an offload attempt fails.

Generate offload report

By default, automatic offload operations are transparent to the user; whether or not work was offloaded and how much of that work was offloaded will not be apparent to the user.  To allow users to examine these details, MKL can generate an offload report at runtime.  The environment variable OFFLOAD_REPORT needs to be set to 1 or 2 before runtime to do this.

export OFFLOAD_REPORT=1

Setting OFFLOAD_REPORT to 0 (or not setting it) results in no offload report.

Setting OFFLOAD_REPORT to 1 results in a report including:

  • Name of function called
  • Effective Work Division
  • Time spent on Host during call
  • Time spent on each available Phi coprocessor during call

Setting OFFLOAD_REPORT to 2 results in a report including everything from 1, and in addition:

  • Amount of data transferred to and from each Phi during call

Important performance considerations:

  • Automatic offload performs best on large, square matrices

For more information on using the Intel MKL automatic offload feature, refer to Intel's guide on the subject.

 

Compiler Assisted Offload (CAO)

In Compiler Assisted Offload, pragmas (also known as directives) are added to the code to specify sections whose execution should be offloaded to the Phis.  These offload regions do not require any special coding considerations and can utilize the OpenMP and Intel Cilk Plus programming models.  When the compiler reaches an offload pragma, it generates code for both the host and the Phi, and the resulting executable contains code for both.

Currently, the Intel compiler supports Intel's Language Extensions for Offload (LEO) for this markup.  Version 4.0 of the OpenMP standard is expected to include offload directives covering the Phi coprocessors as well.

Adding offload directives

The primary step in preparing code for CAO is to add directives specifying when and how to offload code to the Phi.  Here is a basic example of what these offload directives look like in C:

int main(){
...
    //offload code
    #pragma offload target(mic)
    {
        //parallelism via OpenMP on the MIC
        #pragma omp parallel for
        for( i = 0; i < k; i++ ){
            for( j = 0; j < k; j++ ){
                a[i] = tan(b[j]) + cos(c[j]);
            }
        } //end OpenMP section
    } //end offload section
...
}

...and the same example in Fortran:

program main
...
!dir$ offload begin target(mic)
!$omp parallel do
do i = 1,K
    do j = 1,K
        a(i) = tan(b(j)) + cos(c(j))
    end do
end do
!dir$ end offload
...
end program
   

Specifiers can be added to select the target Phi (useful when multiple Phis are available) and to control the flow of data to and from the Phi.  An example of these specifiers in C:

#pragma offload target(mic:0) inout(a) in(b,c)

This directive specifies:

  • This section of code is offloaded to a specific Phi coprocessor, in this case number 0.
  • The inout specifier defines a variable that is copied to the Phi and copied back to the host.
  • The in specifier defines a variable as strictly input to the coprocessor; its value is not copied back to the host.
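
A hedged C sketch combining these specifiers (the array size and loop body are illustrative assumptions; statically sized local arrays are used so no explicit length clauses are needed):

#include <stdio.h>

#define K 10000

int main(void)
{
    float a[K], b[K], c[K];
    int i;

    for (i = 0; i < K; i++) {
        b[i] = 1.0f;
        c[i] = 2.0f;
    }

    /* Offload to Phi 0: copy b and c in only, copy a in and back out */
    #pragma offload target(mic:0) inout(a) in(b,c)
    {
        #pragma omp parallel for
        for (i = 0; i < K; i++)
            a[i] = b[i] + c[i];
    }

    printf("a[0] = %f\n", a[0]);
    return 0;
}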

For more information on directives and additional specifiers refer to Intel's Effective Use of the Intel Compiler's Offload Features.

Compilation

No additional steps are required at the compile or link stage.  
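
For example, a program containing offload pragmas can be built like any other OpenMP program (the file name is an assumption; the compiler detects the offload pragmas on its own):

icc -O3 -openmp offload_example.c -o offload_example.out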

Execution

No special steps are required at runtime; offload of specified sections of code and data transfers are automatically handled.

Controlling Offload with Environment Variables

Environment variables can be used to affect the way the offload runtime library operates.  These environment variables are prefixed with either "MIC_" or "OFFLOAD_".  Listed below are some commonly used environment variables:

MIC_LD_LIBRARY_PATH

Sets the path where shared libraries needed by the MIC offloaded code reside.

OFFLOAD_REPORT

When set to 1 or 2, offload details are printed to standard out, with 2 including details of data transfers.

OFFLOAD_DEVICES

Restricts the process to only use the specified Phis.  Multiple Phis can be specified using commas.

MIC_ENV_PREFIX

By default, all environment variables defined on the host are replicated to the coprocessor's execution environment when an offload occurs.  This behavior can be modified by defining this environment variable.  When it is defined, only host environment variables prefixed with MIC_ENV_PREFIX's value are passed on to the Phi, and they are set on the Phi with the prefix stripped.  This is particularly valuable for controlling OpenMP, MPI, and Intel Cilk Plus environment variables.

Setting MIC_ENV_PREFIX has no effect on the fixed MIC_* environment variables such as MIC_LD_LIBRARY_PATH.
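
For example (the prefix value and thread count below are assumptions), setting a prefix and then defining a prefixed variable on the host passes only that variable to the Phi, with the prefix stripped:

# Only host variables whose names start with PHI_ are forwarded to the coprocessor
export MIC_ENV_PREFIX=PHI

# Appears in the Phi's offload environment as OMP_NUM_THREADS=120
export PHI_OMP_NUM_THREADS=120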

MPI

While calling MPI functions within offload regions is not supported, offloading within an MPI program is supported by the Intel MPI library.  When offloading, however, no attempt is made to coordinate the Phi's resource usage amongst the MPI ranks.  If 12 MPI ranks running on the host all offload 8 threads to the Phi, all of those threads will be spawned on the first 8 cores of the Phi, which can quickly lead to resource conflicts.  A performance penalty is also incurred when multiple ranks offload simultaneously to a single Phi.

Mitigating these issues is beyond the scope of this guide; please refer to Using MPI and Xeon Phi Offload Together for more information.

For more detailed information on programming for the CAO model please refer to Intel's Effective Use of the Intel Compiler's Offload Features.

 

Symmetric/Heterogeneous Offload

Called both symmetric and heterogeneous offloading, this programming model treats the Phi as simply another node in a heterogeneous cluster.  MPI ranks are spawned on both the host and the Phi.  Because the Phi cannot run an executable compiled for the host, two separate executables need to be prepared.  Getting these separate executables to run from the same mpiexec.hydra call requires adding a prefix or postfix to the Phi executable's name and setting the corresponding environment variable.

Setup

Remember to set up the Intel compiler and MPI environments before starting, either by loading the modules described in the All Usage Models section above or by sourcing the compilervars and mpivars scripts from your Intel installation.

Make sure to have your desired MPI implementation loaded before compilation.  For the symmetric model this must be the Intel MPI implementation (see Compilation below), not the MVAPICH2 module that is loaded by default.

Compilation

Executables must be compiled for both the host and the Phi separately.  You must use the Intel compiler and the Intel MPI implementation.  To compile the Phi executable, include the -mmic flag at compilation.  No special considerations are required for the host executable.

# Create host executable
mpicc helloworld.c -o helloworld.out

# Create Phi executable
mpicc -mmic helloworld.c -o helloworld.out.mic

Execution

Make sure to have your chosen MPI implementation module loaded at runtime.

Once in a job, first create an MPI hosts file containing the hosts to run on, one per line:

-Bash-4.1$ cat mpi_hosts
r0001
mic0-r0001

Notice that mic# goes before the Xeon hostname, separated by a hyphen.  In this case we will target both one Xeon CPU and one Phi coprocessor.

Then set I_MPI_MIC to 1 so the MPI library on the host recognizes and works with the Phi:

export I_MPI_MIC=1 

Next, let MPI know how you identify your Phi executable in comparison to your host executable.  In our case we used the postfix .mic to identify our Phi executable, and thus we need to set I_MPI_MIC_POSTFIX.

export I_MPI_MIC_POSTFIX=.mic

Alternatively, a prefix can be used.  The prefix option enables Phi-specific executables to be stored in a separate directory.
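
A hedged sketch of the prefix approach (the directory layout is an assumption): if the Phi executables are kept in a ./mic/ subdirectory with the same names as the host executables, the prefix is simply prepended to the executable name for the Phi ranks:

# Phi ranks run ./mic/helloworld.out while host ranks run ./helloworld.out
export I_MPI_MIC_PREFIX=./mic/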

Finally, start the program up from the host.  Note the use of mpiexec.hydra, not mpiexec.

mpiexec.hydra -f mpi_hosts -perhost 1 -n 2 helloworld.out

 

MPI

Coming soon.

 

 
