OpenMP 6.0

(openmp.org)

97 points | by mshachkov 19 hours ago

5 comments

  • phkahler 19 hours ago
    OpenMP is one of the easiest ways to make existing code run across CPU cores. In the simplest cases you add a single #pragma to C code and it goes N times faster. That holds when the loop calls a function with no side effects. Some examples I've done:

    1) Ray tracing. Looping over all the pixels in an image and tracing rays to determine the color of each pixel. The algorithm and data structures are complex but don't change during rendering. N cores is about N times as fast.

    2) In SolveSpace we had a small loop which calls a tessellation function on a bunch of NURBS surfaces. The function was appending triangles to a list, so I made a thread-local list for each call and combined them afterwards to avoid writes to a shared data structure. Again N times faster with very little effort.

    The code is also fine to build single threaded without change if you don't have OpenMP. Your compiler will just ignore the #pragmas.
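
    A minimal sketch of the thread-local-list pattern from example 2, not SolveSpace's actual code. The names emit and collect and the 2-values-per-item bound are assumptions made for illustration. It also builds unchanged without OpenMP, because the pragmas are then ignored and the braces become an ordinary block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-item work: writes 0..2 values to out. */
static size_t emit(int item, int *out) {
    size_t k = 0;
    if (item % 2 == 0) out[k++] = item;
    if (item % 3 == 0) out[k++] = -item;
    return k;
}

/* Appends every emitted value to result (capacity must be >= 2*n).
   Returns the number of values written, or (size_t)-1 if an allocation
   failed. With more than one thread the order in result is unspecified. */
size_t collect(const int *items, int n, int *result) {
    assert(n >= 0);
    size_t total = 0;
    int failed = 0;

    #pragma omp parallel
    {
        /* Private to each thread, so filling it needs no locking. */
        int *local = malloc(((size_t)n * 2 + 1) * sizeof *local);
        size_t count = 0;

        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            if (local != NULL) count += emit(items[i], local + count);
        }

        /* Combine step: one thread at a time touches the shared list. */
        #pragma omp critical
        {
            if (local == NULL) {
                failed = 1;
            } else {
                memcpy(result + total, local, count * sizeof *local);
                total += count;
            }
        }
        free(local);
    }
    return failed ? (size_t)-1 : total;
}
```

    Each thread allocates a worst-case buffer here to keep the sketch short; a growable list would use less memory.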

    • pixelesque 15 hours ago
      > OpenMP is one of the easiest ways to make existing code run across CPU cores.

      True (or with Intel TBB). However, as someone with a lot of experience optimising HPC algorithms for rendering, geometry processing and simulation, I'd add that there are caveats: quite often, existing code that is naively parallelised this way spends a disproportionate amount of CPU time on spinlocks in OpenMP or TBB instead of doing useful work. (I've also noticed the same thing happening with Rayon in Rust.)

      Sometimes I've looked at code colleagues have "parallelised" this way, and they've said "yes, it's using multiple threads". But when you profile it with perf or VTune, it's clearly not doing much *useful* parallel work, and sometimes it's even slower than single-threaded in wall-clock terms. People just didn't check whether it was faster: they looked at the CPU usage and didn't notice the spinlocks.
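
      One way to check wall-clock time rather than CPU usage is to time the serial and parallel versions side by side. This is a sketch, not a substitute for perf or VTune. The function names are made up for illustration, and the fallback timer (C11 timespec_get) is an assumption for builds without OpenMP.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds; falls back to C11 timespec_get without OpenMP. */
static double wall_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
#endif
}

double sum_squares_serial(int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += (double)i * i;
    return s;
}

double sum_squares_parallel(int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++) s += (double)i * i;
    return s;
}

/* Returns serial_time / parallel_time, or 0.0 if the parallel run was
   too short for the timer to resolve. */
double measured_speedup(int n) {
    assert(n >= 0);
    double t0 = wall_seconds();
    double a = sum_squares_serial(n);
    double t1 = wall_seconds();
    double b = sum_squares_parallel(n);
    double t2 = wall_seconds();
    printf("serial %.6f s, parallel %.6f s (results %.0f / %.0f)\n",
           t1 - t0, t2 - t1, a, b);
    return (t2 - t1) > 0.0 ? (t1 - t0) / (t2 - t1) : 0.0;
}
```

      A ratio near or below 1.0 while every core shows as busy is the situation described above.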

    • ddavis 19 hours ago
      OpenMP is great. I’ve done something similar to your second case (thread local objects that are filled in parallel and later combined). In the case of “OpenMP off” (pragmas ignored), is it possible to avoid the overhead of the thread local object essentially getting copied into the final object (since no OpenMP means only a single thread local object)? I avoided this by implementing a separate code path, but I’m just wondering if there are any tricks I missed that would still allow a single code path.
      • Jtsummers 19 hours ago
        Give one of the threads (thread ID 0, for instance) special privileges. Its list is the one everything else is appended to, then there's only concatenation or copying if you have more than one thread.

        Or, pre-allocate the memory and let each thread write to its own subset of the final collection and avoid the combine step entirely. This works regardless of the number of threads you use so long as you know the maximum amount of memory you might need to allocate. If it has no calculable upper bound, you will need to use other techniques.
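
        A rough sketch of the pre-allocation idea, assuming each input produces at most MAX_PER_ITEM outputs. That bound, and the names emit, fill_slots and sum_slots, are hypothetical. Every iteration writes only to its own slot, so there is no combine step and the same code path works with the pragma ignored.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_PER_ITEM = 2 };  /* assumed upper bound on outputs per input */

/* Hypothetical per-item work: writes at most MAX_PER_ITEM values to slot. */
static int emit(int item, int *slot) {
    int k = 0;
    if (item % 2 == 0) slot[k++] = item;
    if (item % 3 == 0) slot[k++] = -item;
    return k;
}

/* out must hold n * MAX_PER_ITEM ints and counts must hold n ints.
   Iteration i only ever writes to its own slot and its own counts[i]. */
void fill_slots(const int *items, int n, int *out, int *counts) {
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        counts[i] = emit(items[i], out + (size_t)i * MAX_PER_ITEM);
    }
}

/* A reader walks the slots in input order, skipping the unused tail of each. */
long sum_slots(const int *out, const int *counts, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < counts[i]; k++) {
            sum += out[(size_t)i * MAX_PER_ITEM + k];
        }
    }
    return sum;
}
```

        The trade-off is memory: the buffer is sized for the worst case, and readers have to consult counts to skip empty slots.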

    • pjmlp 8 hours ago
      For C, C++ and Fortran users.

      So easiest depends on the target audience.

  • fxj 17 hours ago
    You can now (already in OpenMP 5) use it to write GPU programs. Intel's oneAPI uses OpenMP 5.x offloading to write programs for the Intel Ponte Vecchio GPUs, which are on par with the NVIDIA A100.

    https://www.intel.com/content/www/us/en/docs/oneapi/optimiza...

    GCC also provides support for NVIDIA and AMD GPUs:

    https://gcc.gnu.org/wiki/Offloading

    Here is an example of how you can use OpenMP to run a kernel on an NVIDIA A100:

    https://people.montefiore.uliege.be/geuzaine/INFO0939/notes/...

      #include <stdlib.h>
      #include <stdio.h>
      #include <omp.h>

      void saxpy(int n, float a, float *x, float *y) {
        double elapsed = -1.0 * omp_get_wtime();

        // We don't need to map the variable a as scalars are firstprivate by default
        #pragma omp target teams distribute parallel for map(to:x[0:n]) map(tofrom:y[0:n])
        for (int i = 0; i < n; i++) {
          y[i] = a * x[i] + y[i];
        }

        elapsed += omp_get_wtime();
        printf("saxpy done in %6.3lf seconds.\n", elapsed);
      }

      int main() {
        int n = 2000000;
        float *x = (float*) malloc(n*sizeof(float));
        float *y = (float*) malloc(n*sizeof(float));
        float alpha = 2.0;

        // Bail out if either allocation failed
        if (x == NULL || y == NULL) {
          free(x);
          free(y);
          return 1;
        }

        // Initialize the inputs on the host, in parallel
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
          x[i] = 1;
          y[i] = i;
        }

        saxpy(n, alpha, x, y);

        free(x);
        free(y);

        return 0;
      }
    • StrangeDoctor 15 hours ago
      I've wanted to mess around with those Intel GPUs but haven't found a great source that deals with individuals/small orders.
  • Conscat 18 hours ago
    OpenMP was pivotal to my last workplace, but because some customers required MSVC, we barely had support for OpenMP 2.0.
    • pjmlp 7 hours ago
      It is a bit better now, but OpenMP is yet another standard born in UNIX HPC clusters, so I am surprised that Microsoft bothered at all. It seems to be something to tick a check box.
  • dsp_person 18 hours ago
    I was just googling to see if there's any Emscripten/WASM implementation of OpenMP. The Emscripten GitHub issue [1] has a link to this "simpleomp" [2][3] where

    > In ncnn project, we implement a minimal openmp runtime for webassembly target

    > It only works for #pragma omp parallel for num_threads(N)

    [1] https://github.com/emscripten-core/emscripten/issues/13892

    [2] https://github.com/Tencent/ncnn/blob/master/src/simpleomp.h

    [3] https://github.com/Tencent/ncnn/blob/master/src/simpleomp.cp...
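
    For reference, a loop in the one form the quote says is supported looks roughly like this. It is a sketch only: it has not been checked against simpleomp, and the function name scale and the thread count of 4 are arbitrary choices for illustration.

```c
#include <assert.h>

/* Scales v in place. num_threads(4) requests four threads for this loop;
   the runtime may provide fewer. */
void scale(float *v, int n, float factor) {
    assert(n >= 0);
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < n; i++) {
        v[i] *= factor;
    }
}
```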

  • pornel 12 hours ago
    I used it a while ago, but got burned by very uneven support across compilers: MSVC required special tweaks, and old GCC would create crashy code without warning.

    It was okay for basic embarrassingly parallel for loops. I ended up not using any more advanced features, because apart from even worse compiler support, non-trivial multi-threading in C without any safeguards is just too easy to mess up.
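
    As a small illustration of how easy it is to get this wrong, consider a shared histogram. The sketch below uses hypothetical names (histogram, BUCKETS). Dropping the single atomic line still compiles cleanly, but lets two threads lose increments on the same bucket.

```c
#include <assert.h>

enum { BUCKETS = 4 };

/* Counts values by (value mod BUCKETS) into hist, which is reset first. */
void histogram(const unsigned *values, int n, long hist[BUCKETS]) {
    assert(n >= 0);
    for (int b = 0; b < BUCKETS; b++) hist[b] = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* The atomic makes the read-modify-write on the shared bucket safe. */
        #pragma omp atomic
        hist[values[i] % BUCKETS]++;
    }
}
```

    Nothing in the language flags the unsafe version, which is the lack of safeguards mentioned above.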