Parallelization Using OpenMP

Sarthak Mahajan
6 min read · Dec 9, 2022

Speed up serial for-loop computations by converting them into parallel for loops using OpenMP

Introduction

Parallel programming has been part of computing for decades, and over that time many parallel programming languages and models have been developed and adopted by industry. Today we are going to discuss a method that can speed up your code and make it more efficient: converting serial for loops into parallel for loops using OpenMP directives.

OpenMP is a programming model that lets you write parallel code while keeping a largely sequential style. Before going further, it helps to be clear about the difference between serial and parallel processing:

  • Serial processing means that all operations are performed one after another, with no overlap. For example, if you have two tasks A and B in your program, task B does not start until task A has finished, so the total running time is the sum of the two.
  • Parallelism means that some operations are performed concurrently while others wait for their turn. For example, if tasks A1 through A3 run on different processors in parallel with task B on another processor, several of them can make progress at the same time; if there are fewer processors than runnable tasks, some tasks simply wait until a processor becomes free.

Parallelism in Computers

Parallel processing is a method of executing multiple tasks at once. “Parallel” means that several processors work on the computation at the same time, while “serial” means that a single processor handles one task at a time.

In order for parallel processing to work, each processor needs access to the data its task operates on. This requires communication between the processors, which happens either through memory that they share or through messages sent back and forth between them (see Figure 1).

Figure 1: Communication among several nodes in a computer system

Subsection: Serial and Parallel processing

In the serial processing model a single instruction stream is executed at a time, whereas in the parallel processing model multiple instruction streams execute at once. The advantage of parallelism is that independent pieces of work proceed simultaneously instead of waiting for one another, so a well-parallelized algorithm usually finishes sooner than its serial counterpart, as long as the cost of creating and coordinating threads does not outweigh the time saved. For example, if the iterations of a loop are independent and your processor has 8 cores/threads, spreading those iterations across the cores with OpenMP can cut the loop’s wall-clock time to a fraction of the serial time.

Subsection: Why parallel processing? CPU performance and power consumption

We can speed up serial for-loop computations by converting them into parallel for loops using OpenMP.

OpenMP is an open specification, implemented by most compilers, that allows you to write parallel code more easily and quickly. The main goal of this article is to demonstrate how it can be used to speed up serial for-loop computations by converting them into parallel for loops.
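
As a minimal sketch (the array names and sizes here are just for illustration), the only change needed to parallelize a loop whose iterations are independent is the directive placed in front of it:

    #include <stdio.h>

    #define N 1000000

    double a[N], b[N], c[N];

    int main(void) {
        /* Fill the inputs serially. */
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            b[i] = i * 2.0;
        }

        /* The same loop, now split across all available threads by OpenMP.
           Each iteration touches different elements, so no extra
           synchronization is needed. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %f\n", c[42]);
        return 0;
    }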

What is OpenMP?

OpenMP is a programming model and a set of compiler directives for parallelizing code. The first OpenMP specification was published in 1997 by the OpenMP Architecture Review Board, a group of hardware and software vendors that still maintains and extends the standard today. The OpenMP standard defines how to use multiple threads (and therefore multiple cores) within a single process (a program) to improve performance.

OpenMP allows you to divide your program into separate tasks, which can run concurrently on multiple processors. A task is a block of code that the OpenMP runtime may hand to any available thread; because all threads live in the same process, tasks communicate through shared memory. Individual tasks and groups of tasks are created with the compiler directives #pragma omp task and #pragma omp taskgroup respectively.
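
As a small sketch of how these two directives fit together (the function and task labels are made up for illustration), one thread creates the tasks and the taskgroup makes it wait until both have finished:

    #include <stdio.h>
    #include <omp.h>

    void work(const char *name) {
        printf("%s handled by thread %d\n", name, omp_get_thread_num());
    }

    int main(void) {
        #pragma omp parallel        /* create a team of threads                 */
        #pragma omp single          /* only one thread creates the tasks        */
        {
            #pragma omp taskgroup   /* wait here until both tasks have finished */
            {
                #pragma omp task
                work("task A");
                #pragma omp task
                work("task B");
            }
            printf("both tasks are done\n");
        }
        return 0;
    }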

OpenMP in C and C++

OpenMP is a standard for parallelization in C and C++ (and Fortran). It enables you to write your code so that it is executed by multiple threads, while still letting individual threads access shared variables and call the OpenMP support functions such as omp_get_thread_num(). The compiler automatically generates the appropriate threaded code when compiling with OpenMP enabled.
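
As a brief sketch of the data-sharing clauses (the variable names are illustrative), shared gives every thread the same variable, while private gives each thread its own copy:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int base = 100;   /* shared: all threads read the same memory location */
        int id;           /* private: each thread gets its own copy            */

        #pragma omp parallel shared(base) private(id)
        {
            id = omp_get_thread_num();   /* writes only this thread's copy */
            printf("thread %d sees base = %d\n", id, base);
        }
        return 0;
    }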

How to use OpenMP?

  • OpenMP is a programming model that allows you to run multiple threads in parallel on the same computer, within a single process.
  • It is used through directives embedded in ordinary C/C++ (or Fortran) source code, and it is supported by most mainstream compilers, e.g. GCC and Clang with the -fopenmp flag, or Visual C++ with /openmp; a minimal example follows this list.
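
This small “hello world” sketch shows the workflow (the file name is arbitrary):

    /* hello_omp.c */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* Every thread in the team executes this block once. */
        #pragma omp parallel
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }

Compile and run it with, for example, gcc -fopenmp hello_omp.c -o hello_omp && ./hello_omp. Without the OpenMP flag the directive is simply ignored and the program runs single-threaded.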

Subsection: Threads, work sharing and synchronization using OpenMP directives

OpenMP directives are used to parallelize the code, share the work among threads and synchronize them.

In this section we will see how they can be used:

  • OpenMP directives tell the compiler which regions of your program should run in parallel and how the work inside them is divided among threads. At a parallel region the runtime creates (or reuses) a team of threads; work-sharing directives such as omp for then split the loop iterations across that team so that each thread processes its own portion of the data. All threads in a team share the process’s memory, but each thread has its own stack for local variables. Because data is shared, synchronization directives and clauses (critical, atomic, barrier, reduction) are needed to keep updates to shared variables consistent; using them only where they are really required keeps the synchronization overhead low. A short sketch follows this list.
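
The sketch below (the array contents are only illustrative) combines the three ideas: a parallel region creates the team of threads, omp for shares the loop iterations among them, and a reduction clause plus a critical section synchronize the updates to shared variables:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main(void) {
        double data[N], sum = 0.0, max_val = 0.0;

        for (int i = 0; i < N; i++)
            data[i] = (i % 7) * 1.5;

        #pragma omp parallel
        {
            /* Work sharing: the iterations are divided among the threads.
               The reduction clause gives each thread a private partial sum
               and combines the partial sums safely at the end of the loop. */
            #pragma omp for reduction(+:sum)
            for (int i = 0; i < N; i++)
                sum += data[i];

            /* Synchronization: only one thread at a time may update max_val. */
            #pragma omp for
            for (int i = 0; i < N; i++) {
                #pragma omp critical
                {
                    if (data[i] > max_val)
                        max_val = data[i];
                }
            }
        }

        printf("sum = %f, max = %f\n", sum, max_val);
        return 0;
    }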

Subsection: How to parallelize the code? Using #pragma omp parallel for directive

You can use the #pragma omp parallel for directive to parallelize a loop. The directive goes on the line immediately before the for loop, optionally followed by clauses:

#pragma omp parallel for [clause [clause] ...]

Among the most common clauses, num_threads(n) sets the number of threads used for the loop, ordered allows a block inside the loop body to execute in iteration order, and schedule controls how the iterations are divided among the threads.

  • For example: #pragma omp parallel for num_threads(4) ordered schedule(static), as in the sketch below.
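
A minimal sketch using these clauses (the loop body and the choice of 4 threads are arbitrary):

    #include <stdio.h>

    #define N 16

    int main(void) {
        int squares[N];

        /* Request 4 threads.  The iterations run in parallel, but the block
           marked "ordered" executes in iteration order. */
        #pragma omp parallel for num_threads(4) ordered schedule(static)
        for (int i = 0; i < N; i++) {
            squares[i] = i * i;                          /* runs in parallel */
            #pragma omp ordered
            printf("i = %2d -> %d\n", i, squares[i]);    /* printed in order */
        }
        return 0;
    }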

Differences between CUDA and OpenMP

The CUDA parallel computing platform and programming model was invented by NVIDIA. It has been widely adopted by the high performance computing (HPC) industry, including supercomputers, large-scale integrated systems, embedded devices and networked platforms.

The main difference between CUDA and OpenMP is that OpenMP is an open standard for parallel programming on general-purpose processors, while CUDA is NVIDIA’s hardware-specific programming model for GPUs:

  • OpenMP can be used on any processor or platform whose compiler supports it, and its threads communicate simply by reading and writing shared memory inside a single process, so it is easy to add to existing CPU code. CUDA, by contrast, targets NVIDIA GPUs: input data must first be copied from host memory to the GPU’s memory (typically over a PCI Express bus), the computation is expressed as kernels launched across many lightweight GPU threads, and the results must be copied back. That extra data movement adds latency and programming effort, but for problems that map well onto the GPU, CUDA can deliver far higher throughput than a CPU-only OpenMP version.

Conclusion

The conclusion is that OpenMP makes it very easy to parallelize existing CPU code and can deliver substantial speedups for little effort; whether it beats a CUDA implementation depends on the problem and on the hardware available.

However, there are some drawbacks to OpenMP:

  • Performance depends on the number of threads used and on how evenly the work divides among them. If the problem size is small or fixed, adding more threads may not help, and the thread-management overhead can even make the program slower.
  • You need to make sure that all accesses to shared data are properly synchronized so that threads do not race with one another. This synchronization adds overhead, and the overhead grows as more threads contend for the same data or for the same memory bandwidth.

We have discussed how to use OpenMP directives in your programs. A parallel for loop will be faster than the serial one if you run it on a machine with more than one CPU core and the loop does enough work to outweigh the threading overhead.
