For early-career ML Engineers and Data Scientists: a primer on memory fundamentals, parallel execution, and how code is written for CPU and GPU.
This article aims to explain the fundamentals of parallel computing. We start with the basics, including understanding shared vs. distributed architectures and communication within these systems. We then explore GPU architecture and how coding frameworks (using C++ Kokkos) help map architectural concepts to code. Finally, we measure a performance metric (speedup) using the runtime data obtained from running the Kokkos code on both CPU and GPU architectures for vector-matrix multiplication, one of the most common operations in machine learning.
The central theme of the article is exploring and answering questions. It may seem like a lengthy journey, but it will be worth it. Let's get started!
I get that parallel computing saves time by running multiple operations at once. But I've heard that system time is different from human time, or wall-clock time. How is it different?
The smallest unit of time in computing is called a clock tick. It represents the minimum time required to perform an operation, such as fetching data, executing a computation, or communicating. Technically, a clock tick refers to the change of state necessary for an instruction. The state can be processor state, data state, memory state, or control signals. In a single clock tick, a complete instruction, part of an instruction, or multiple instructions may be executed.
A CPU allows a limited number of state changes per second. For example, a CPU with a 3 GHz clock speed allows 3 billion changes of state per second. There is a limit to the achievable clock speed because each clock tick generates heat, and excessive speed can damage the CPU chip due to the heat generated.
Therefore, we want to make the most of the available capacity by using parallel computing methodologies. The goal is to hide memory latency (the time it takes for the first data to arrive from memory), increase memory bandwidth (the amount of data transferred per unit of time), and increase compute throughput (the work performed per clock tick).
To compare performance, such as when calculating the efficiency of a parallel program, we use wall-clock time instead of clock ticks, since it includes all real-time overheads like memory latency and communication delays, which cannot be directly translated to clock ticks.
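To make this concrete, here is a minimal C++ sketch (just for illustration; the summation loop is a stand-in workload) that measures wall-clock time with std::chrono:
#include <chrono>
#include <iostream>

int main() {
  double sum = 0.0;

  // steady_clock measures elapsed real (wall-clock) time
  auto start = std::chrono::steady_clock::now();
  for (long i = 0; i < 100000000; ++i) {
    sum += 0.5 * i;  // stand-in computation
  }
  auto end = std::chrono::steady_clock::now();

  std::chrono::duration<double> elapsed = end - start;
  std::cout << "sum = " << sum
            << ", wall-clock time = " << elapsed.count() << " s" << std::endl;
  return 0;
}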
What does the architecture of a basic system look like?
A system can consist of a single processor, a node, or even a cluster. Some of the physical building blocks of a system are —
- Node — A physical computer unit that has multiple processor chips. Multiple nodes combine to form a cluster.
- Processor Chips — Chips containing multiple processing elements called cores.
- Core — Each core is capable of running an independent thread of execution.
In set terms, a node has a one-to-many relationship with processor chips, and each processor chip has a one-to-many relationship with cores. The image below gives a visual description of a node with processors and cores.
The non-physical components of a system include threads and processes —
- Thread — A thread is a sequence of CPU instructions that the operating system treats as a single unit for scheduling and execution purposes.
- Process — In computing, a process is a coherent unit of resource allocation, including memory, file handles, ports, and devices. A single process may manage resources for several threads. Threads can be modeled as components of a process.
So, do threads run on cores in the same system, or can they be spread across different systems for a single program? And in either case, how do they communicate? How is memory handled for these threads? Do they share it, or do they each get their own separate memory?
A single program can execute across multiple cores on the same or different systems/nodes. The design of the system and the program determines whether it aligns with the desired execution strategy.
When designing a system, three key aspects must be considered: execution (how threads run), memory access (how memory is allocated to those threads), and communication (how threads communicate, especially when they need to update the same data). It's important to note that these aspects are mostly interdependent.
Execution
Serial execution — This uses a single thread of execution to work on a single data item at any time.
Parallel execution — Here, more than one thing happens simultaneously. In computing, this can be —
- One worker — A single thread of execution operating on multiple data items simultaneously (vector instructions in a CPU). Imagine a single person sorting a deck of cards by suit. With four suits to categorize, the person must go through the entire deck to organize the cards for each suit.
- Working Together — Multiple threads of execution in a single process. It's equivalent to multiple people working together to sort a single deck of cards by suit.
- Working Independently — Multiple processes can work on the same problem, using either the same node or multiple nodes. In this scenario, each person would be sorting their own deck of cards.
- Any combination of the above.
Figure: "… for the clubs suit, so Worker B is temporarily blocked." (Ref: Cornell Virtual Workshop)
Memory Access
- Shared Memory — When a program runs on multiple cores (a single multithreaded process) on the same system, each thread within the process has access to memory in the same virtual address space.
- Distributed Memory — A distributed memory design is employed when a program uses multiple processes (whether on a single node or across different nodes). In this architecture, each process owns a portion of the data, and other processes must send messages to the owner to update their respective parts. Even when multiple processes run on a single node, each has its own virtual memory space. Therefore, such processes must use distributed memory programming to communicate.
- Hybrid Strategy — Multithreaded processes that may run on the same or different nodes, designed to use multiple cores on a single node through shared memory programming. At the same time, they can employ distributed memory strategies to coordinate with processes on other nodes. Imagine multiple people or threads working in several cubicles in the image above. Workers in the same cubicle communicate using shared memory programming, while those in different cubicles interact through distributed memory programming.
Communication
The communication mechanism depends on the memory architecture. In shared memory architectures, application programming interfaces like OpenMP (Open Multi-Processing) enable communication between threads that share memory and data. In contrast, MPI (Message Passing Interface) can be used for communication between processes running on the same or different nodes in distributed memory architectures.
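To make the shared-memory case concrete, here is a minimal OpenMP sketch (illustrative only; it assumes a compiler with OpenMP support, e.g., g++ -fopenmp), in which several threads of one process update different parts of the same arrays:
#include <omp.h>
#include <iostream>
#include <vector>

int main() {
  const int n = 1000000;
  std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

  // a, b, and c live in the process's shared address space;
  // OpenMP splits the iteration range across the threads.
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    c[i] = a[i] + b[i];
  }

  std::cout << "threads available: " << omp_get_max_threads()
            << ", c[0] = " << c[0] << std::endl;
  return 0;
}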
How can we tell if our parallelization strategy is working effectively?
There are several methods, but here we discuss efficiency and speedup. In parallel computing, efficiency refers to the proportion of available resources that are actively utilized during a computation. It is determined by comparing the actual resource utilization against the peak performance, i.e., optimal resource utilization.
Actual processor utilization refers to the number of floating point operations (FLOP) performed over a specific period.
Peak performance assumes that each processor core executes the maximum possible FLOPs during every clock cycle.
Efficiency for parallel code is the ratio of actual floating-point operations per second (FLOPS) to the peak possible performance.
Speedup is used to assess efficiency and is measured as:
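Speedup = wall-clock time of serial execution / wall-clock time of parallel execution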
Speedup cannot be greater than the number of parallel resources when applications are limited by the computing speed of the processors.
Using speedup, parallel efficiency is measured as:
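Parallel efficiency = Speedup / Number of parallel resources (e.g., cores)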
Suppose the serial execution of code took 300 seconds. After parallelizing the tasks using 50 cores, the overall wall-clock time for parallel execution was 6 seconds. In this case, the speedup is the wall-clock time for serial execution divided by the wall-clock time for parallel execution: 300s/6s = 50. We get parallel efficiency by dividing the speedup by the number of cores: 50/50 = 1. This is an example of the best-case scenario: the workload is perfectly parallelized, and all cores are used efficiently.
Will adding more computing units always improve performance if the data size or the number of tasks increases?
Only sometimes. In parallel computing, there are two types of scaling based on the problem size or the number of parallel tasks.
Strong Scaling — Increasing the number of parallel tasks while keeping the problem size constant. However, even as we increase the number of computational units (cores, processors, or nodes) to process more tasks in parallel, there is an overhead associated with communication between these units and the host program, such as the time spent sending and receiving data.
Ideally, the execution time decreases as the number of parallel tasks increases. However, if the code doesn't get faster with strong scaling, it could indicate that we are using too many tasks for the amount of work being done.
Weak Scaling — Here, the problem size increases as the number of tasks increases, so the computation per task remains constant. If your program has good weak scaling performance, you can run a problem twice as large on twice as many nodes in the same wall-clock time.
There are restrictions on what we can parallelize, since some operations can't be parallelized. Is that right?
Yes, parallelizing certain sequential operations can be quite challenging. Parallelization relies on multiple instruction streams and/or multiple data streams.
To understand what can be parallelized, let's look at SIMD in CPUs, which is achieved using vectorization.
Vectorization is a programming technique in which operations are applied to whole arrays at once rather than processing individual elements one by one. It is achieved using the vector unit in processors, which includes vector registers and vector instructions.
Consider a scenario where we iterate over an array and perform multiple operations on each element inside a for loop. When the data is independent, writing vectorizable code becomes straightforward; see the example below:
do i = 1, n
  a(i) = b(i) + c(i)
  d(i) = e(i) + f(i)
end do
In this loop, each iteration is independent: a(i) is processed independently of a(i+1), and so on. Therefore, this code is vectorizable, which allows multiple elements of array a to be computed in parallel using elements from b and c, as demonstrated below:
b: | b(i) | b(i+1) | b(i+2) | b(i+3) | ... |
c: | c(i) | c(i+1) | c(i+2) | c(i+3) | ... |
------------------------------------------------
Vectorized Addition (SIMD)
Vector Register 1 (loaded with b values):
| b(i) | b(i+1) | b(i+2) | b(i+3) | ... |
Vector Register 2 (loaded with c values):
| c(i) | c(i+1) | c(i+2) | c(i+3) | ... |
------------------------------------------------
Result in Vector Register 3:
| a(i) | a(i+1) | a(i+2) | a(i+3) | ... |
Modern compilers are generally capable of analyzing such loops and transforming them into sequences of vector operations. Difficulty arises when an operation in one iteration depends on the result of a previous iteration; in this case, automatic vectorization may lead to incorrect results. This situation is known as a data dependency.
Data dependencies commonly encountered in scientific code are —
Read After Write (RAW) — Not Vectorizable
do i = 1, n
  a(i) = a(i-1) + b(i)
end do
Write After Read (WAR) — Vectorizable
do i = 1, n
  a(i) = a(i+1) + b(i)
end do
Write After Write (WAW) — Not Vectorizable
do i = 1, n
  a(i%2) = a(i+1) + b(i)
end do
Read After Read (RAR) — Vectorizable
do i = 1, n
  a(i) = b(i%2) + c(i)
end do
Adhering to certain standard rules for vectorization (such as ensuring independent assignments in loop iterations, avoiding random data access, and preventing dependencies between iterations) can help in writing vectorizable code.
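As a small C++ illustration of these rules (a sketch for this article only; the #pragma omp simd hint is optional and assumes a compiler with OpenMP SIMD support, though most optimizing compilers can auto-vectorize this loop on their own):
#include <cstddef>
#include <vector>

// Each iteration writes a different element of 'a' and reads only b[i] and c[i]:
// contiguous access and no dependency between iterations, so the loop vectorizes.
void add_arrays(std::vector<double>& a,
                const std::vector<double>& b,
                const std::vector<double>& c) {
  const std::size_t n = a.size();
  #pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    a[i] = b[i] + c[i];
  }
}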
When data grows, it makes sense to parallelize as many parallelizable operations as possible to create scalable solutions, but that means we need bigger systems with lots of cores. Is that why we use GPUs? How are they different from CPUs, and what leads to their high throughput?
YES!
GPUs (Graphics Processing Units) have many more processing units (green) and higher aggregate memory bandwidth (the amount of data transferred per unit of time) compared to CPUs, which, on the other hand, have more sophisticated instruction processing and faster clock speeds. As seen above, CPUs have more cache memory than GPUs. However, CPUs have fewer arithmetic logic units (ALUs) and floating point units (FPUs) than GPUs. Considering these points, it is intuitive to use CPUs for complex workflows and GPUs for computationally intensive tasks.
GPUs are designed to deliver high computational throughput using their massively parallel architecture. Their computational potential can be measured in billions of floating point operations per second (GFLOPS). GPU hardware comes in the form of standard graphics cards (NVIDIA Quadro), high-end accelerator cards (NVIDIA Tesla), etc.
Two key properties of the graphics pipeline that enable parallelization and, thus, high throughput are —
- Independence of Objects — A typical graphics scene consists of many independent objects; each object can be processed separately without dependencies on the others.
- Uniform Processing Steps — The sequence of processing steps is the same for all objects.
So, multiple GPU cores work on different data at the same time, executing computations in parallel like a SIMD (Single Instruction Multiple Data) architecture. How are tasks divided between cores? Does each core run a single thread like in the CPU?
In a GPU, Streaming Multiprocessors (SMs) are similar to cores in a CPU. Cores in GPUs are similar to vector lanes in CPUs. SMs are the hardware units that house the cores.
When a function or computation, referred to as a kernel, is executed on the GPU, it is typically broken down into thread blocks. These thread blocks contain multiple threads; each SM can manage many threads across its cores. If there are more thread blocks than SMs, multiple thread blocks can be assigned to a single SM. Also, multiple threads can run on a single core.
Each SM further divides its thread blocks into groups called warps, with each warp consisting of 32 threads. These threads execute the same stream of instructions on different data elements, following a Single Instruction, Multiple Data (SIMD) model. The warp size is set to 32 because, in NVIDIA's architecture, CUDA cores are grouped into sets of 32. This allows all threads in a warp to be processed together in parallel by the 32 CUDA cores, achieving high efficiency and optimized resource utilization.
In SIMD (Single Instruction, Multiple Data), a single instruction acts uniformly on all data elements, with each data element processed in exactly the same way. SIMT (Single Instruction, Multiple Threads), which is commonly used on GPUs, relaxes this restriction: threads can be activated or deactivated, so instructions and data are processed only by active threads, while the local data of inactive threads remains unchanged.
I want to understand how we can write code for different architectures. Can similar code work for both CPU and GPU architectures? What parameters and methods can we use to ensure that the code efficiently uses the underlying hardware, whether it's a CPU or a GPU?
Code is generally written in high-level languages like C or C++ and must be converted into binary code by a compiler, since machines cannot directly process high-level instructions. While both GPUs and CPUs can execute the same kernel, as we will see in the example code, we need to use directives or parameters to compile the code for a specific architecture and generate an instruction set for it. This approach lets us use architecture-specific capabilities. To ensure compatibility, we can pass the appropriate flags to the compiler to produce binary code optimized for the desired architecture, whether it's a CPU or a GPU.
Various coding frameworks, such as SYCL, CUDA, and Kokkos, are used to write kernels or functions for different architectures. In this article, we use examples from Kokkos.
A bit about Kokkos — an open-source C++ programming model for performance portability when writing kernels: it is implemented as a template library on top of CUDA, OpenMP, and other backends, and aims to be descriptive, in the sense that we define what we want to do, rather than prescriptive (how we want to do it). Kokkos Core provides a programming model for parallel algorithms that use many-core chips and share memory across those cores.
A kernel has three components —
- Pattern — Structure of the computation: for, scan, reduction, task-graph
- Execution policy — How the computation is executed: static scheduling, dynamic scheduling, thread teams.
- Computational body — The code that performs each unit of work, e.g., the loop body.
The pattern and policy drive the computational body. In the example below, used just for illustration, 'for' is the pattern, the condition controlling the pattern (element = 0; element < n; ++element) is the policy, and the computational body is the code executed within the pattern.
for ( element = 0; element < n; ++element ) {
  total = 0;
  for ( qp = 0; qp < numQPs; ++qp ) {
    total += dot( left[element][qp], right[element][qp] );
  }
  elementValues[element] = total;
}
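For comparison, here is a minimal sketch of the same loop written with Kokkos (assuming left, right, and elementValues are Kokkos Views, hence the parentheses for indexing, and that dot() and numQPs are defined elsewhere): the pattern is parallel_for, the policy is the iteration range 0..n, and the lambda is the computational body.
// Pattern: parallel_for; Policy: range 0..n; Body: the lambda
Kokkos::parallel_for( "elementValues", n,
  KOKKOS_LAMBDA ( const int element ) {
    double total = 0;
    for ( int qp = 0; qp < numQPs; ++qp ) {
      total += dot( left( element, qp ), right( element, qp ) );
    }
    elementValues( element ) = total;
  });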
The Kokkos framework lets developers define parameters and methods based on three key factors: where the code will run (Execution Space), what memory resources will be used (Memory Space), and how data will be structured and managed (Data Structure and Data Management).
We primarily discuss how to write the Kokkos kernel for the vector-matrix product to understand how these factors are implemented for different architectures.
But before that, let's discuss the building blocks of the kernel we want to write.
Memory Space —
Kokkos provides a range of memory space options that let users control memory management and data placement on different computing platforms. Some commonly used memory spaces are —
- HostSpace — This memory space represents the CPU's main memory. It is used for computations on the CPU and is typically the default memory space when working on a CPU-based system.
- CudaSpace — CudaSpace is used for NVIDIA GPUs with CUDA. It provides memory allocation and management for GPU devices, allowing for efficient data transfer and computation.
- CudaUVMSpace — For Unified Virtual Memory (UVM) systems, such as those on some NVIDIA GPUs, CudaUVMSpace enables the allocation of memory accessible from both the CPU and the GPU without explicit data transfers. The CUDA runtime automatically handles data movement, at a performance cost.
It is also important to discuss memory layout, which refers to the organization and arrangement of data in memory. Kokkos provides several memory layout options to help users optimize data storage for various computations. Some commonly used memory layouts are —
- LayoutRight (also known as Row-Major) — The default memory layout for multi-dimensional arrays in C and C++. In LayoutRight, the rightmost index varies most rapidly in memory. If no layout is chosen, the default layout for HostSpace is LayoutRight.
- LayoutLeft (also known as Column-Major) — In LayoutLeft, the leftmost index varies most rapidly in memory. If no layout is chosen, the default layout for CudaSpace is LayoutLeft.
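As a quick illustration (a sketch; the names and dimensions are placeholders), the layout can be specified explicitly as a template argument of a Kokkos::View:
// 2D array of doubles stored column-major (LayoutLeft) in GPU memory
Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::CudaSpace> d_A( "d_A", N, M );

// 2D array of doubles stored row-major (LayoutRight) in CPU main memory
Kokkos::View<double**, Kokkos::LayoutRight, Kokkos::HostSpace> h_B( "h_B", N, M );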
In the programmatic implementation below, we define the memory space and layout as macros based on the compile-time flag ENABLE_CUDA, which is true if we want to run the code on the GPU and false for the CPU.
// ENABLE_CUDA is a compile-time argument with default value true
#define ENABLE_CUDA true

// If CUDA is enabled, run the kernel on the CUDA (GPU) architecture
#if defined(ENABLE_CUDA) && ENABLE_CUDA
#define MemSpace Kokkos::CudaSpace
#define Layout Kokkos::LayoutLeft
#else
// Default values when ENABLE_CUDA is not set or is false
#define MemSpace Kokkos::HostSpace
#define Layout Kokkos::LayoutRight
#endif
Data Structure and Data Management —
Kokkos Views — In Kokkos, a "view" is a fundamental data structure representing one-dimensional and multi-dimensional arrays, which can be used to store and access data efficiently. Kokkos views provide a high-level abstraction for managing data and are designed to work seamlessly with different execution spaces and memory layouts.
// View for a 2D array of data type double
Kokkos::View<double**> myView("myView", numRows, numCols);

// Accessing the view
myView(i, j) = 42.0;
double value = myView(i, j);
Kokkos mirroring technique for data management — Mirrors are views of equivalent arrays residing in possibly different memory spaces, which is useful when we need the data on both the CPU and the GPU. This technique is helpful for scenarios like reading data from a file on the CPU and subsequently processing it on the GPU. Kokkos' mirroring creates a mirrored view of the data, allowing seamless sharing between the CPU and GPU execution spaces and facilitating data transfer and synchronization.
To create a mirrored copy of the primary data, we can use Kokkos' create_mirror_view() function. This function generates a mirror view in another memory space (e.g., on the host for a device view) with the same data type and dimensions as the primary view.
// Intended computation:
// <y, A*x> = y^T * A * x
// Here:
// y and x are vectors.
// A is a matrix.

// Allocate y, x vectors and matrix A on the device
typedef Kokkos::View<double*, Layout, MemSpace> ViewVectorType;
typedef Kokkos::View<double**, Layout, MemSpace> ViewMatrixType;

// N and M are the number of rows and columns
ViewVectorType y( "y", N );
ViewVectorType x( "x", M );
ViewMatrixType A( "A", N, M );

// Create host mirrors of the device views
ViewVectorType::HostMirror h_y = Kokkos::create_mirror_view( y );
ViewVectorType::HostMirror h_x = Kokkos::create_mirror_view( x );
ViewMatrixType::HostMirror h_A = Kokkos::create_mirror_view( A );

// Initialize y vector on host.
for ( int i = 0; i < N; ++i ) {
  h_y( i ) = 1;
}

// Initialize x vector on host.
for ( int i = 0; i < M; ++i ) {
  h_x( i ) = 1;
}

// Initialize A matrix on host.
for ( int j = 0; j < N; ++j ) {
  for ( int i = 0; i < M; ++i ) {
    h_A( j, i ) = 1;
  }
}

// Deep copy host views to device views.
Kokkos::deep_copy( y, h_y );
Kokkos::deep_copy( x, h_x );
Kokkos::deep_copy( A, h_A );
Execution Space —
In Kokkos, the execution space refers to the specific computing environment or hardware platform where parallel operations and computations are executed. Kokkos abstracts the execution space, enabling code to be written in a descriptive manner while adapting to various hardware platforms.
We discuss two main execution spaces —
- Serial: The Serial execution space is a basic and portable option suitable for single-threaded CPU execution. It is often used for debugging, testing, and as a baseline for performance comparisons.
- Cuda: The Cuda execution space is used for NVIDIA GPUs and relies on CUDA technology for parallel processing. It enables efficient GPU acceleration and management of GPU memory.
Either the ExecSpace can be defined explicitly, or it can be determined dynamically based on the memory space, as below:
// Execution space determined based on the memory space
using ExecSpace = MemSpace::execution_space;
How do we use these building blocks to write an actual kernel? Can we use it to compare performance between different architectures?
To write a kernel and compare performance, we use the following computation:
<y, A*x> = y^T * (A * x)
Here:
y and x are vectors.
A is a matrix.
<y, A*x> represents the inner (dot) product of vector y and the result of the matrix-vector multiplication A*x.
y^T denotes the transpose of vector y.
* denotes matrix-vector multiplication.
The kernel for this operation in Kokkos —
// Use a RangePolicy.
typedef Kokkos::RangePolicy<ExecSpace> range_policy;

// The code below is run for several iterations across different
// architectures for time comparison
Kokkos::parallel_reduce( "yAx", range_policy( 0, N ),
  KOKKOS_LAMBDA ( int j, double &update ) {
    double temp2 = 0;
    for ( int i = 0; i < M; ++i ) {
      temp2 += A( j, i ) * x( i );
    }
    update += y( j ) * temp2;
  }, result );
For the above kernel, parallel_reduce serves as the pattern, range_policy defines the policy, and the actual operations constitute the computational body.
I executed this kernel on a TACC Frontera node with an NVIDIA Quadro RTX 5000 GPU. The experiments were performed with varying values of N, which is the length of the vectors y and x and the number of rows in matrix A. The computation was run 100 times to obtain measurable results, and the execution time of the kernel was recorded for both the Serial (Host) and CUDA execution spaces. I used the ENABLE_CUDA compiler flag to switch between execution environments: true for the GPU/CUDA execution space and false for the CPU/Serial execution space. The results of these experiments are presented below, with the corresponding speedup.
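For reference, the timing loop around the kernel could look roughly like the sketch below, which reuses the views and range_policy defined earlier and uses Kokkos::Timer; the repeat count and output statement here are illustrative assumptions, not the exact benchmarking code.
// Time the kernel over repeated runs (illustrative sketch)
const int nrepeat = 100;
Kokkos::Timer timer;

double result = 0;
for ( int repeat = 0; repeat < nrepeat; ++repeat ) {
  result = 0;
  Kokkos::parallel_reduce( "yAx", range_policy( 0, N ),
    KOKKOS_LAMBDA ( int j, double &update ) {
      double temp2 = 0;
      for ( int i = 0; i < M; ++i ) {
        temp2 += A( j, i ) * x( i );
      }
      update += y( j ) * temp2;
    }, result );
  Kokkos::fence();  // wait for the kernel to finish before reading the clock
}

double time = timer.seconds();
std::cout << "Average kernel time: " << time / nrepeat << " s" << std::endl;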
We notice that the speedup increases significantly with the size of N, indicating that the CUDA implementation becomes increasingly advantageous for larger problem sizes.
That's all for now! I hope this article has been helpful in getting started on the right foot in exploring the domain of computing. Understanding the basics of GPU architecture is crucial, and this article introduces one way of writing cross-architectural code that I experimented with. However, there are several other techniques and technologies worth exploring.
While I'm not a subject matter expert, this article reflects my learning journey from my brief experience working at TACC in Austin, TX. I welcome feedback and discussion, and I'd be happy to help if you have any questions or want to learn more. Please refer to the excellent resources below for further learning. Happy computing!