# An Overview of Intel Xeon Phi – Tuning & Perf.

#### Lecture Outline

The following topics will be discussed:

Understanding of Intel Multi-Core Systems with Intel Xeon Phi Programming from Performance Point of View

# Intel Xeon Phi Coprocessor : Prog. Env & Tips for obtaining Performance (Part-I)

# Xeon Phi : Programming Environment

- Shared Address Space Programming (Offload, Native, Host)
  - OpenMP, Intel TBB, Cilk Plus, Pthreads
- Message Passing Programming
  - Offload (MIC Offload / Host Offload)
  - Symmetric and Coprocessor / Host
- Hybrid Programming
  - MPI with OpenMP, MPI with Cilk Plus, MPI with Intel TBB



## Intel Xeon-Phi : Shared Address Space Prog.

- Rule of thumb : An application must scale well past one hundred threads on Intel Xeon processors to profit from the possible higher parallel performance offered with e.g. the Intel Xeon Phi coprocessor.
- To check whether the application would profit from the highly parallel capabilities of the MIC architecture, start by creating a simple performance graph with a varying number of threads (from one up to the number of cores).

# Intel Xeon-Phi : Shared Address Prog.

- What we should know from a programming point of view: we treat the coprocessor as a 64-bit x86 SMP-on-a-chip with a high-speed bi-directional ring interconnect, (up to) four hardware threads per core and 512-bit SIMD instructions.
- With the available number of cores, we easily have 200 hardware threads at hand on a single Intel Xeon Phi coprocessor.
- Resource availability and memory access are important for threading on all 60 cores.

# Intel Xeon-Phi : Programming Env.

# **Keys to Productive Performance**

- Choose the right Multi-core centric or Manycore centric model for your application
- Vectorize your application (today)
  - Use the Intel vectorizing compiler
- Parallelize your application (today)
  - With MPI (or other multi-process model)
- Go asynchronous to overlap computation and communication

Source : References & Intel Xeon-Phi; http://www.intel.com/

# Intel Xeon-Phi : Performance-Tips

## Performance on Xeon Phi using different prog.


# Intel Xeon System & Xeon-Phi

# Performance on Xeon Phi using different prog.

# **About Hyper-Threading**

Hyper-Threading hardware threads can be switched off and can be ignored.

# About Threading on Xeon-Phi Coprocessor

- The multi-threading on each core is primarily used to hide latencies that come implicitly with an in-order microarchitecture. Unlike Hyper-Threading, these hardware threads cannot be switched off and should never be ignored.
- In general, a minimum of three or four active threads per core will be needed.

### Performance on Xeon Phi using different prog.

- Use asynchronous data transfer and double buffering offloads to overlap the communication with the computation
- Optimizing memory use on Intel MIC architecture target relies on understanding access patterns
- Loop Optimizations may benefit performance

# Intel Xeon Phi Coprocessor :Native Compilation

# To achieve good Performance - Following information should be kept in mind.

- Data should be aligned to 64 Bytes (512 Bits) for the MIC architecture, in contrast to 32 Bytes (256 Bits) for AVX and 16 Bytes (128 Bits) for SSE.
- Due to the large SIMD width of 64 Bytes vectorization is even more important for the MIC architecture than for Intel Xeon!
- The MIC architecture offers new instructions like
  - gather/scatter,
  - fused multiply-add,
  - masked vector instructions, etc.,

  which allow more loops to be parallelized on the coprocessor than on an **Intel Xeon based host**.

# Intel Xeon Phi Coprocessor : Native Compilation

To achieve good Performance - Following information should be kept in mind.

Use pragmas like

- `#pragma ivdep`,
- `#pragma vector always`,
- `#pragma vector aligned`,
- `#pragma simd`

etc. to achieve autovectorization.

Autovectorization is enabled at the default optimization level `-O2`. Requirements for vectorizable loops can be found in the references.

# Intel Xeon Phi Coprocessor : Native Compilation

# To achieve good Performance - Following information should be kept in mind.

- Let the compiler generate vectorization reports using the compiler option `-vec-report2` to see if loops were vectorized for MIC (message "\*MIC\* Loop was vectorized" etc.).
- The options `-opt-report-phase hlo` (High Level Optimizer report) or `-opt-report-phase ipo_inl` (inlining report) may also be useful.

# Intel Xeon Phi Coprocessor :Native Compilation

# To achieve good Performance - Following information should be kept in mind.

- Explicit vector programming is also possible via Intel Cilk Plus language extensions (C/C++ array notation, vector elemental functions, ...) or the new SIMD constructs from OpenMP 4.0 RC1.
- Vector elemental functions can be declared by using `__attribute__((vector))`. The compiler then generates a vectorized version of a scalar function, which can be called from a vectorized loop.
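The OpenMP SIMD constructs mentioned above can be sketched in portable C as follows. This is a minimal illustration, not the Cilk Plus `__attribute__((vector))` form: it uses `#pragma omp declare simd` and `#pragma omp simd` instead, and the names `scale_add` and `apply_scale_add` are hypothetical. A compiler without OpenMP SIMD support ignores the pragmas and runs the loop as scalar code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical elemental function: the declare simd directive asks the
   compiler to also generate a vector version callable from SIMD loops. */
#pragma omp declare simd
static float scale_add(float a, float x, float y)
{
    return a * x + y;
}

/* Hypothetical driver: applies scale_add element-wise in a SIMD loop. */
void apply_scale_add(float a, const float *x, const float *y,
                     float *out, size_t n)
{
    assert(x != NULL && y != NULL && out != NULL);

    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        out[i] = scale_add(a, x[i], y[i]);
}
```

Whether the loop was actually vectorized should still be checked with the compiler's vectorization report.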

# Intel Xeon Phi Coprocessor : Native Compilation

# To achieve good Performance - Following information should be kept in mind.

- One can use intrinsics to have full control over the vector registers and the instruction set.
- Include `<immintrin.h>` for using intrinsics.
- Hardware prefetching from the L2 cache is enabled by default.
- In addition, software prefetching is on by default at compiler optimization level `-O2` and above. Since Intel Xeon Phi is an in-order architecture, care about prefetching is more important than on out-of-order architectures.

# Intel Xeon Phi Coprocessor : Native Compilation

To achieve good Performance - Following information should be kept in mind.

The compiler prefetching can be influenced by setting the compiler switch `-opt-prefetch=n`.

Manual prefetching can be done by using intrinsics (`_mm_prefetch()`) or pragmas (`#pragma prefetch var`).

# Intel Xeon Phi Coprocessor : Prog. Env & Tips for obtaining Performance (Part-II)

#### **Optimization Framework**

A collection of methodology and tools that enable the developers to express parallelism for Multicore and Manycore Computing

**Objective:** Turning unoptimized program into a scalable, highly parallel application on multicore and manycore architecture

Step 1: Leverage Optimized Tools, Library

Step 2: Scalar, Serial Optimization / Memory Access

**Step 3: Vectorization & Compiler** 

Step 4: Parallelization

**Step 5: Scale from Multicore to Manycore** 


C-DAC hyPACK-2013

#### A Family of Parallel Programming Models Developer Choice



**Applicable to Multicore and Manycore Programming** 


Prog. on Intel Xeon-Phi: Tuning & Perf.

#### **Objective of Scalar and Serial Optimization**



- Obtain the most efficient implementation for the problem at hand
- Identify the opportunity for vectorization and parallelization
- Create a baseline to account for vectorization and parallelization gains
  - Avoid the situation where slower, unoptimized code is parallelized and creates a false impression of performance gain

#### **Algorithmic Optimizations**

- Elevate constants out of the core loops
  - The compiler can do it, but it needs your cooperation
  - Group constants together
- Avoid and replace expensive operations
  - Division by a constant can usually be replaced by multiplication by its reciprocal
- Strength reduction in hot loops
  - People like the inductive method because it is clean
  - An iterative method can strength-reduce the operations involved
  - In this example, `exp()` is replaced by a simple multiplication

```
const double dt = T / (double)TIMESTEPS;
const double vDt = V * sqrt(dt);
for (int i = 0; i <= TIMESTEPS; i++) {
    /* Closed form: one exp() call per iteration. */
    double price = S * exp(vDt * (2.0 * i - TIMESTEPS));
    cell[i] = max(price - X, 0);
}
```

```
/* Iterative form: price(i) = factor * price(i-1), so exp() leaves the loop. */
const double factor = exp(vDt * 2);
double price = S * exp(-vDt * (2 + TIMESTEPS));
for (int i = 0; i <= TIMESTEPS; i++) {
    price = factor * price;
    cell[i] = max(price - X, 0);
}
```


### **Use Compiler Optimization Switches**

| Optimization Done                              | Linux*                                                          |
|------------------------------------------------|-----------------------------------------------------------------|
| Disable optimization                           | -O0                                                             |
| Optimize for speed (no code size increase)     | -O1                                                             |
| Optimize for speed (default)                   | -O2                                                             |
| High-level loop optimization                   | -O3                                                             |
| Create symbols for debugging                   | -g                                                              |
| Multi-file inter-procedural optimization       | -ipo                                                            |
| Profile guided optimization (multi-step build) | -prof-gen, -prof-use                                            |
| Optimize for speed across the entire program   | **-fast** (same as: -ipo -O3 -no-prec-div -static -xHost)       |
| OpenMP 3.0 support                             | -openmp                                                         |
| Automatic parallelization                      | -parallel                                                       |


### **Vectorization and SIMD Execution**

# **Step 3 :**

#### SIMD

- Flynn's Taxonomy: Single Instruction, Multiple Data
- The CPU performs the same operation on multiple data elements

#### SISD

- Single Instruction, Single Data

#### Vectorization

- In the context of Intel<sup>®</sup> Architecture Processors, the process of transforming a scalar operation (SISD), which acts on a single data element, into a vector operation (SIMD), which acts on multiple data elements at once.
- Assuming that setup code does not tip the balance, this can result in more compact and efficient generated code
- For loops in "normal" or "unvectorized" code, each assembly instruction deals with the data from only a single loop iteration

#### **SIMD** Abstraction – Options Compared




#### **Get Your Code Vectorized by Intel Compiler**

- Data layout: AOS -> SOA
- Data alignment (next slide)
- Make the loop innermost
- Function calls need treatment
  - Inline it yourself
  - Ask the compiler to inline: use `__forceinline`
  - Define your own vector version
  - Call the vector math library SVML
- Adopt a jumpless algorithm
- Read/write is OK if it is contiguous
- Loop-carried dependency
#### Not a true dependency

```
for (int i = TIMESTEPS; i > 0; i--)
    #pragma simd
    #pragma unroll(4)
    for (int j = 0; j <= i - 1; j++)
        cell[j] = puXDf * cell[j + 1] + pdXDf * cell[j];
CallResult[opt] = (Basetype)cell[0];
```

| Array of Structures |    |    |
|---------------------|----|----|
| S0                  | X0 | T0 |
| S1                  | X1 | T1 |

| Structure of Arrays |    |
|---------------------|----|
| S0                  | S1 |
| X0                  | X1 |
| T0                  | T1 |
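The two layouts in the tables above can be sketched in C as follows. This is a minimal illustration under assumptions: the type names `OptionAoS` and `OptionsSoA`, the size `N_OPTIONS` and both functions are hypothetical, and the fields S, X, T stand for the stock price, strike and time values used in the option-pricing examples.

```c
#include <assert.h>
#include <stddef.h>

#define N_OPTIONS 4  /* hypothetical problem size */

/* Array of Structures: the fields of one option are adjacent in memory,
   so a loop over S alone strides over X and T. */
struct OptionAoS { float S, X, T; };

/* Structure of Arrays: each field is contiguous, which gives the
   unit-stride access that maps directly onto vector loads. */
struct OptionsSoA {
    float S[N_OPTIONS];
    float X[N_OPTIONS];
    float T[N_OPTIONS];
};

/* Convert an AoS array of n options into the SoA layout. */
void aos_to_soa(const struct OptionAoS *in, struct OptionsSoA *out, size_t n)
{
    assert(n <= N_OPTIONS);
    for (size_t i = 0; i < n; i++) {
        out->S[i] = in[i].S;
        out->X[i] = in[i].X;
        out->T[i] = in[i].T;
    }
}

/* Unit-stride loop over two fields: computes max(S - X, 0) per option. */
void intrinsic_value(const struct OptionsSoA *o, float *result, size_t n)
{
    assert(n <= N_OPTIONS);
    for (size_t i = 0; i < n; i++) {
        float d = o->S[i] - o->X[i];
        result[i] = d > 0.0f ? d : 0.0f;
    }
}
```

The conversion is typically done once, outside the hot loop, so that the compute kernel only ever sees the contiguous arrays.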

#### A true dependency

`for (j = 1; j < MAX; j++) a[j] = a[j] + c * a[j-n];`


#### Options for Parallelism on Intel® Architecture

# Step 4 :

- Intel TBB
  - C++ template library of parallel algorithms, containers
  - Load balancing via work stealing
- Intel Cilk Plus
  - Keyword extension of C/C++, serial equivalence via compiler
  - Load balancing via work stealing
- OpenMP
  - Well-known industry standard
  - Best suited when resource utilization is known at design time
- Pthreads
  - Time-tested industry standard for Unix-like systems
  - Common denominator of other high-level threading libraries

- What is available on the Intel<sup>®</sup> host processor is also available on the Intel<sup>®</sup> target coprocessor
- Many others (boost) are ported to the coprocessor
- Choose the best threading model your problem dictates

#### **Options for Parallelism – pthreads\***

- POSIX\* standard for a thread API, with a 20-year history
- Foundation for other high-level threading libraries
- Exists independently on the host and on Intel<sup>®</sup> MIC
- No extension needed to go from the host to Intel<sup>®</sup> MIC
- Advantage: Programmer has explicit control
  - From workload partition to thread creation, synchronization, load balance, affinity settings, etc.
- Disadvantage: Programmer has too much control
  - Code longevity
  - Maintainability
  - Scalability

#### **Thread Affinity using pthreads\***

- Partition the workload to avoid load imbalance
  - Understand static vs. dynamic workload partition
- Use the pthread API: define, initialize, set, destroy
  - Set CPU affinity with `pthread_setaffinity_np()`
  - Know the thread enumeration and avoid core 0
  - Core 0 boots the coprocessor and runs the job scheduler and service interrupts




### **Options for Parallelism – OpenMP\***

- Compiler directives/pragmas based threading constructs
  - Utility library functions and Environment variables
- Specify blocks of code executing in parallel
- Fork-join parallelism:
  - Master thread spawns a team of worker threads as needed
  - Parallelism grows incrementally




#### **OpenMP\* Performance, Scalability Issues**


- Manage thread creation cost
  - Create threads as early as possible; maximize the work for worker threads
  - IA threads take some time to create, but once they are up, they last until the end
- Take advantage of memory locality; use a NUMA memory manager
  - Allocate memory on the thread that will access it later on
  - Try not to allocate the memory the worker threads use in the main thread
- Ensure your OpenMP\* program works serially and compiles without `-openmp`
  - Protect OpenMP\* API calls with `_OPENMP`
  - Make sure the serial version works before enabling OpenMP\* (e.g. compile with `-openmp`)
- Minimize thread synchronization
  - Use local variables to reduce the need to access global variables

```
#pragma omp parallel for
for (int opt = 0; opt < OPT_N; opt++)
{
    /* Each iteration writes only its own elements, so no locking is needed. */
    CallResultList[opt] = 0;
    CallConfidenceList[opt] = 0;
}
```

```
#ifdef _OPENMP
int ThreadNum = omp_get_max_threads();
omp_set_num_threads(ThreadNum);
#else
int ThreadNum = 1;
#endif
```
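The two points above, guarding OpenMP API calls with `_OPENMP` and keeping per-thread work in local variables, can be combined in a small self-contained sketch. The function names `available_threads` and `sum_array` are hypothetical; the reduction clause gives each thread a private copy of `sum`, so the loop body needs no synchronization, and the file still compiles and runs serially without OpenMP.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the number of threads the runtime may use, or 1 when the
   file is compiled without OpenMP support. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Sums an array. Each thread accumulates into its own private copy of
   sum; the copies are combined once at the end of the parallel loop. */
double sum_array(const double *data, size_t n)
{
    assert(data != NULL || n == 0);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += data[i];
    return sum;
}
```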

#### Scale from Multicore to Manycore

# **Step 5 :**

#### **A Tale of Two Architectures**

|                           | Intel<sup>®</sup> Xeon<sup>®</sup> processor | Intel<sup>®</sup> Xeon Phi™ Coprocessor |
|---------------------------|----------------------------------------------|------------------------------------------|
| Sockets                   | 2                                            | 1                                        |
| Clock speed               | 2.6 GHz                                      | 1.1 GHz                                  |
| Execution style           | Out-of-order                                 | In-order                                 |
| Cores per socket          | 8                                            | Up to 61                                 |
| Hardware threads per core | 2                                            | 4                                        |
| Thread switching          | HyperThreading                               | Round Robin                              |
| SIMD width                | 8SP, 4DP                                     | 16SP, 8DP                                |
| Peak GFLOPS               | 692SP, 346DP                                 | 2020SP, 1010DP                           |
| Memory bandwidth          | 102GB/s                                      | 320GB/s                                  |
| L1 cache per core         | 32kB                                         | 32kB                                     |
| L2 cache per core         | 256kB                                        | 512kB                                    |
| L3 cache per socket       | 30MB                                         | none                                     |


### Assessing potential

#### Threads

- Code analysis: loop nesting, iteration counts, determinism
- Intel VTune<sup>™</sup> Amplifier timeline analysis: existence of application serialization
- Performance vs. threads: knee of the curve

#### Vectorization

- VTune Amplifier hot spots and compiler VEC reports
- HW PerfMon-based evaluation
- Performance vs. vectorization on/off

#### Bandwidth

- HW PerfMon-based evaluation

### More on Thread Affinity

- Bind the worker threads to certain processor core/threads
- Minimizes the thread migration and context switch
- Improves data locality; reduce coherency traffic
- Two components to the problem:
  - > How many worker threads to create?
  - How to bind worker threads to core/threads?
- Two ways to specify thread affinity
  - Environment variables `OMP_NUM_THREADS`, `KMP_AFFINITY`
  - C/C++ API: `kmp_set_defaults("KMP_AFFINITY=compact")`, `omp_set_num_threads(244);`
    - Add `#include <omp.h>` to your source file
    - Compile with `-openmp`
    - Use `libiomp5.so`

### The Optimal Thread Number

- Intel MIC maintains 4 hardware contexts per core
  - Round-robin execution policy,
  - Requires 2 threads for decent performance
  - Financial algorithms take all 4 threads to peak
- Intel Xeon processors optionally use HyperThreading
  - Execute-until-stall execution policy
  - Truly compute-intensive workloads peak with 1 thread per core
  - Finance algorithms like HyperThreading, 2 threads per core
- For an OpenMP application on a system with NCORE cores, use
  - Host only: 2 x ncore (or 1 x ncore if HyperThreading is disabled)
  - MIC native: 4 x ncore
  - **Offload:** 4 x (ncore-1); the OpenMP runtime avoids the core on which the OS runs
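The thread-count rules above reduce to a small lookup. The sketch below is only an illustration of that arithmetic; the enum `exec_mode` and the function `suggested_threads` are hypothetical names, and the result is a starting point to be confirmed by measurement.

```c
#include <assert.h>

/* Hypothetical labels for the execution modes listed above. */
enum exec_mode { HOST_HT_OFF, HOST_HT_ON, MIC_NATIVE, MIC_OFFLOAD };

/* Suggested OpenMP thread count for a system with ncore cores. */
int suggested_threads(enum exec_mode mode, int ncore)
{
    assert(ncore > 0);
    switch (mode) {
    case HOST_HT_OFF: return ncore;           /* 1 thread per core */
    case HOST_HT_ON:  return 2 * ncore;       /* 2 threads per core */
    case MIC_NATIVE:  return 4 * ncore;       /* all 4 hardware contexts */
    case MIC_OFFLOAD: return 4 * (ncore - 1); /* leave the OS core alone */
    }
    return ncore;
}
```

For a 61-core coprocessor this gives 244 threads in native mode and 240 in offload mode.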

# Intel Xeon Phi Coprocessor : Prog. Env & Tips for obtaining Performance (Part-III)


# **Prog.API - Multi-Core Systems with Devices**

Performance: Intel Xeon-Phi Coprocessor

- Vectorization is key for performance
  - Sandy Bridge, MIC, etc.
  - Compiler hints
  - Code restructuring
- Many-core nodes present scalability challenges
  - Memory contention
  - Memory size limitations

# Intel Xeon-Phi : Prog. Env. Perf. Issues

# **Options for Vectorization : Use Tools**



# Intel Xeon Phi : Coprocessors – Intel Compiler's Offload Programs

# **Optimised Offloaded Code**

Tuning & Performance :

- Using intrinsics with manual data prefetching and register blocking can still considerably increase the performance.
- Aim for suitable vectorization and write cache- and register-efficient code, i.e. values stored in registers should be reused as often as possible in order to avoid cache and memory accesses.

## Intel Xeon Phi Prog. : Tools to Measure Overheads

- Quantification of overheads: use tools on Intel Xeon Phi
- Programming on shared address space platforms (UMA/NUMA)
  - Data parallel: Fortran 2008, Pthreads, OpenMP, Intel TBB, Cilk Plus
  - Explicit message passing: MPI on a cluster of message passing multi-core systems




Intel Xeon & Xeon Phi : Execution Modes

#### Quantification of Overheads – Explicit / Implicit Data Transfer – Using Offload





- Separate executables run on both MIC and Xeon
  - e.g. standalone MPI applications
- No source code modifications most of the time
  - Recompile code for the Xeon Phi™ Coprocessor
- Autonomous Compute Node (ACN)

- "main" runs on Xeon
- Parts of the code are offloaded to MIC
- Code that can be offloaded is
  - Multi-threaded, highly parallel
  - Vectorizable
  - Able to benefit from large memory bandwidth
- Compiler assisted vs. automatic
  - `#pragma offload (...)`
# Intel Xeon-Phi : Programming Env.

# **Pros:**

- Compilation with an additional Intel compiler flag (-mmic);
- Scalability tests: fast and smooth;
- Quick analysis with Intel tools (VTune, ITAC: Intel Trace Analyzer and Collector);
- Porting time: one day, including validation of the numerical result, for an expert developer of FARM with good knowledge of the Intel compiler but only a basic knowledge of MIC;
- Best scalability with OpenMP and Hybrid.

## Xeon Phi : Programming Environment

#### Porting on MIC : Issues to be addressed

- MPI Init routine problem: CPU time increases with the number of processes; the same problem occurs when using two MICs together;
- Detailed analysis of OpenMP threads and thread affinity, and of the memory available per thread, is needed
- Execution time depends strongly on code vectorization, so compiler vectorization of data-parallel and task-parallel constructs is important
- Code restructuring and memory access patterns are key to obtaining vectorizable code and satisfactory overall performance.

#### Factors to work around

- Limited problem size or limited exposure
  - Inherent lack of available parallelism
  - Parallelism not adequately exposed by the programmer
- Excessive synchronization
  - Inhibits harvesting thread parallelism
- ISA-specific issues
  - Data structures rely excessively on scatter/gather
  - Use of 64-bit integer indices and 64-bit INT ↔ FP conversion
- Offload overhead
  - Excessive communication/computation ratio, unhidden communication
- Memory footprint and working set size
  - Limited to 8GB, unless you "overlay," e.g. with offload

#### Prefetch on Intel Multicore and Many-core

- Objective: move data from memory to the L1 or L2 cache in anticipation of a CPU load/store
- More important on the in-order Intel Xeon Phi Coprocessor
- Less important on the out-of-order Intel Xeon Processor
- Compiler prefetching is on by default for Intel<sup>®</sup> Xeon Phi<sup>™</sup> coprocessors at `-O2` and above
- Compiler prefetching is not enabled by default on Intel<sup>®</sup> Xeon<sup>®</sup> Processors
  - Use the external option `-opt-prefetch[=n]`, n = 1..4
- Use the compiler reporting options to see detailed diagnostics of prefetching per loop
  - Use `-opt-report-phase hlo -opt-report 3`

#### **Automatic Prefetches**

#### Loop Prefetch

- Compiler generated prefetches target memory access in a future iteration of the loop
- Target regular, predictable array and pointer access

#### Interactions with Hardware prefetcher

- The Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor has a hardware L2 prefetcher
- If software prefetches are doing a good job, hardware prefetching does not kick in
- References not prefetched by the compiler may get prefetched by the hardware prefetcher

# **Explicit Prefetch**

#### Use Intrinsics

- `_mm_prefetch((char *) &a[i], hint);`
  - See `xmmintrin.h` for possible hints (for L1, L2, non-temporal, ...)
  - But you have to specify the prefetch distance
- There are also gather/scatter prefetch intrinsics; see `zmmintrin.h` and the compiler user guide, e.g. `_mm512_prefetch_i32gather_ps`

#### Use a pragma / directive (easier):

- `#pragma prefetch a [:hint[:distance]]`
- You specify what to prefetch, but can choose to let the compiler figure out how far ahead to do it.

#### Use Compiler switches:

- `-opt-prefetch-distance=n1[,n2]`
- Specifies the prefetch distance (how many iterations ahead) for compiler-generated prefetches inside loops; n1 indicates the distance from memory to L2.
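Explicit prefetching with a programmer-chosen distance, as described above, can be sketched as follows. To stay portable this sketch swaps in the GCC/Clang builtin `__builtin_prefetch` in place of the `_mm_prefetch` intrinsic; the macro `PREFETCH_READ`, the function `sum_with_prefetch` and the distance of 16 elements are assumptions for illustration, and a useful distance has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* hypothetical: elements ahead of the current one */

#if defined(__GNUC__) || defined(__clang__)
/* Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels. */
#define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_READ(addr) ((void)0)  /* no-op on other compilers */
#endif

/* Sums an array while requesting the element PREFETCH_DISTANCE
   iterations ahead, so it is likely in cache when the loop reaches it. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_READ(&a[i + PREFETCH_DISTANCE]);
        sum += a[i];
    }
    return sum;
}
```

A simple streaming loop like this one is usually covered by the hardware or compiler prefetcher already; manual prefetching tends to pay off for irregular access patterns.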

## **Streaming Store**

- Avoid read for ownership for certain memory write operations
- Bypass the prefetch related to the memory read
- Use `#pragma vector nontemporal (v1,...)` to drop a hint to the compiler
- Without streaming stores: 448 bytes read/write per iteration
  - With streaming stores: 320 bytes read/write per iteration
  - Relieves bandwidth pressure; improves cache utilization
  - `-vec-report6` displays the compiler action

bs\_test\_sp.c(215): (col. 4) remark: vectorization support: streaming store was generated for CallResult.

bs\_test\_sp.c(216): (col. 4) remark: vectorization support: streaming store was generated for PutResult.

```
for (int chunkBase = 0; chunkBase < OptPerThread; chunkBase += CHUNKSIZE)
{
#pragma simd vectorlength(CHUNKSIZE)
#pragma simd
#pragma vector aligned
#pragma vector nontemporal (CallResult, PutResult)
    for (int opt = chunkBase; opt < (chunkBase + CHUNKSIZE); opt++)
    {
        float CNDD1;
        float CNDD2;
        float CallVal = 0.0f, PutVal = 0.0f;
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        /* Computation of CNDD1, CNDD2 and XexpRT is not shown in this excerpt. */
        CallVal = S * CNDD1 - XexpRT * CNDD2;
        PutVal = CallVal + XexpRT - S;
        /* Written once and not re-read here: candidates for streaming stores. */
        CallResult[opt] = CallVal;
        PutResult[opt] = PutVal;
    }
}
```

# Data Blocking

- Partition data into small blocks that fit in the L2 cache
  - Exploit data reuse in the application
  - Ensure the data remains in the cache across multiple uses
  - Using the data in cache removes the need to go to memory
  - A bandwidth-limited program may then execute at the FLOPS limit
- Simple 1D case
  - Data of size DATA\_N is used WORK\_N times by 100s of threads
  - Each thread handles a piece of work and has to traverse all the data

#### Without Blocking

- 100s of threads pound on different areas of DATA\_N
- The memory interconnect limits the performance

```
#pragma omp parallel for
for(int wrk = 0; wrk < WORK_N; wrk++)
{
    initialize_the_work(wrk);
    for(int ind = 0; ind < DATA_N; ind++)
    {
        dataptr datavalue = read_data(ind);
        result = compute(datavalue);
        aggregate = combine(aggregate, result);
    }
    postprocess_work(aggregate);
}
```

#### With Blocking

- A cacheable block of BSIZE data is processed by all 100s of threads at a time
- Each data item is read once and reused until all threads are done with it

```
for(int BBase = 0; BBase < DATA_N; BBase += BSIZE)
{
#pragma omp parallel for
    for(int wrk = 0; wrk < WORK_N; wrk++)
    {
        initialize_the_work(wrk);
        for(int ind = BBase; ind < BBase+BSIZE; ind++)
        {
            dataptr datavalue = read_data(ind);
            result = compute(datavalue);
            aggregate[wrk] = combine(aggregate[wrk], result);
        }
        postprocess_work(aggregate[wrk]);
    }
}
```
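The blocked loop structure above can be made concrete with a small runnable sketch. Everything specific here is an assumption for illustration: the work item is a weighted sum over the data, and the names `weighted_sums_blocked`, `weight` and `bsize` are hypothetical. Without OpenMP the pragma is ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* For each work item w, computes aggregate[w] = sum_i weight[w] * data[i],
   visiting the data one cache-sized block at a time. */
void weighted_sums_blocked(const float *data, size_t data_n,
                           const float *weight, float *aggregate,
                           size_t work_n, size_t bsize)
{
    assert(bsize > 0);

    for (size_t w = 0; w < work_n; w++)
        aggregate[w] = 0.0f;

    for (size_t base = 0; base < data_n; base += bsize) {
        /* The last block may be shorter than bsize. */
        size_t end = base + bsize < data_n ? base + bsize : data_n;

        /* All threads share the same block, so it is loaded from memory
           once and then reused from cache by every work item. */
        #pragma omp parallel for
        for (long w = 0; w < (long)work_n; w++) {
            float partial = 0.0f;  /* thread-local accumulator */
            for (size_t i = base; i < end; i++)
                partial += weight[w] * data[i];
            aggregate[w] += partial;
        }
    }
}
```

The block size would normally be chosen so that one block, plus the per-thread working data, fits in the L2 cache.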


## **Memory Alignment**

- Allocated memory on heap
  - `_mm_malloc(int size, int aligned)`
  - `scalable_aligned_malloc(int size, int aligned)`

#### Declared memory:

- `__attribute__((aligned(n))) float v1[];`
- `__declspec(align(n)) float v2[];`
- Use this to notify the compiler
  - `__assume_aligned(array, n);`
- Natural boundary
  - Unaligned access can fault the processor
- Cacheline boundary
  - Frequently accessed data should be on a 64-byte boundary
- 4K boundary
  - Sequentially accessed large data should be on a 4K boundary

| Instruction | Length   | Alignment |
|-------------|----------|-----------|
| SSE         | 128 Bits | 16 Bytes  |
| AVX         | 256 Bits | 32 Bytes  |
| IMCI        | 512 Bits | 64 Bytes  |
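A 64-byte aligned heap allocation, as required for the 512-bit case in the table above, can be sketched in standard C. This sketch swaps in C11 `aligned_alloc` in place of the Intel-specific `_mm_malloc` so that it builds with any C11 compiler; the names `alloc_floats_aligned64` and `is_aligned64` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64  /* bytes: 512-bit vector width and cache line size */

/* Allocates n floats on a 64-byte boundary. aligned_alloc expects the
   size to be a multiple of the alignment, so the byte count is rounded
   up. Returns NULL on failure; release the memory with free(). */
float *alloc_floats_aligned64(size_t n)
{
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return (float *)aligned_alloc(CACHE_LINE, rounded);
}

/* Returns 1 if p lies on a 64-byte boundary, 0 otherwise. */
int is_aligned64(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}
```

Allocating aligned memory does not by itself tell the compiler about the alignment; that is what the alignment hints listed above are for.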

## **Double Buffering Example**

- Transfer and work on a dataset in small pieces
- While part is being transferred, work on another part!




# Computing – Enabling Huge Memory – Implementation using Memory Mapping (mmap)

# **Memory Mapping**

**Implementation:** Matrix into Matrix Multiplication using mmap (Assume that Matrix Size A = 1,00,000 Real float and Matrix Size B = 1,00,000 Real float)

- Translation of address issued by some device (e.g., CPU or I/O device) to address sent out on memory bus (physical address)
- Mapping is performed by memory management units



#### Computing – Enabling Huge Memory – Implementation using Memory Mapping (mmap)

## Address Mapping Function (Review)




## Intel Xeon Phi :Coprocessor Offload Prog.

## Memory – Huge Pages and Pre-faulting

- IA processors support multiple page sizes; commonly 4K and 2MB
- Some applications will benefit from using huge pages
  - Applications with sequential access patterns will improve due to larger TLB "reach"
- TLB miss vs. cache miss
  - A TLB miss means walking the 4-level page table hierarchy
    - Each page walk could result in additional cache misses
  - The TLB is a scarce resource and you need to "manage" it well
- On the Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor
  - 64 entries for 4K, 8 entries for 2MB
  - Additionally, 64 entries for second level DTLB
    - Page cache for 4K, L2 TLB for 2MB pages
- Linux supports huge pages via CONFIG\_HUGETLBFS
  - 2.6.38 also has support for Transparent Huge Pages (THP)
- Pre-faulting via the MAP\_POPULATE flag to mmap()
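Pre-faulting with `MAP_POPULATE` can be sketched as below. This is a minimal illustration assuming a POSIX system; the function name `map_prefaulted` is hypothetical. `MAP_POPULATE` is Linux-specific, so the sketch falls back to an ordinary anonymous mapping where the flag is not defined, and it does not request huge pages.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* exposes MAP_ANONYMOUS and MAP_POPULATE on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON  /* older BSD-style name */
#endif

/* Maps `bytes` of zero-filled private memory. Where MAP_POPULATE is
   available, the pages are faulted in during the mmap() call instead
   of on first touch. Returns NULL on failure; release with munmap(). */
void *map_prefaulted(size_t bytes)
{
    assert(bytes > 0);

    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_POPULATE
    flags |= MAP_POPULATE;
#endif
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Pre-faulting moves the page-fault cost out of the timed compute phase, which makes measurements of the kernel itself more stable.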

## Intel Xeon Phi : The Intel Composer XE 2013

- Intel Composer XE is the development tool and SDK suite available for developing for Intel Xeon Phi
  - It includes C/C++ and Fortran compilers
  - It includes runtime libraries such as OpenMP and threading libraries, a debugging tool and the Math Kernel Library (MKL)
  - It supports various parallel programming models for Intel Xeon Phi, such as Intel Cilk Plus, Intel Threading Building Blocks (TBB), OpenMP and Pthreads

#### Intel Trace Analyzer and Collector (ITAC)

#### Intel MPI, Intel Trace Analyzer and Collector(ITAC) on MIC

- Intel Trace Collector gathers information from running programs into a trace file, and the Intel Trace Analyzer allows the collected data to be viewed and analyzed after a run.
- The Intel Trace Analyzer and Collector supports processors and coprocessors.
- The Trace Collector can integrate information from multiple sources including an instrumented Intel MPI Library and PAPI.
- A trace file can be generated from an application running on the host system and coprocessor simultaneously
- A trace file can also be generated **on the coprocessor** only