omp | Open Monograph Press is open source software | Frontend Framework library

by pkp | PHP | Version: 3_4_0-0 | License: No License

kandi X-RAY | omp Summary


omp is a PHP library typically used in User Interface, Frontend Framework, React applications. omp has no bugs, it has no vulnerabilities and it has low support. You can download it from GitHub.

Open Monograph Press

Support

              omp has a low active ecosystem.
              It has 112 star(s) with 129 fork(s). There are 27 watchers for this library.
              It had no major release in the last 6 months.
              omp has no issues reported. There are 30 open pull requests and 0 closed requests.
              It has a neutral sentiment in the developer community.
The latest version of omp is 3_4_0-0.

Quality

              omp has 0 bugs and 0 code smells.

Security

              omp has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              omp code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              omp does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
              Without a license, all rights are reserved, and you cannot use the library in your applications.

Reuse

              omp releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.
              It has 192973 lines of code, 1662 functions and 503 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

kandi has reviewed omp and discovered the below as its top functions. This is intended to give you an instant insight into the functionality omp implements, and to help you decide if it suits your requirements.
            • Create a product node
            • Process a product node
            • Create a book
            • Execute import command
• Extracts metadata from a published publication format
• Initialize field options
• Get actions for a row
• Updates a publication
• Get the pub id for a given publishing object
• Render a monograph file

            omp Key Features

            No Key Features are available at this moment for omp.

            omp Examples and Code Snippets

            No Code Snippets are available at this moment for omp.

            Community Discussions

            QUESTION

            Faulty benchmark, puzzling assembly
            Asked 2022-Mar-28 at 07:40

            Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.

            Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz) processor, I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I am measuring 127GFLOP/s, which is obviously a wrong measurement.

            Note: in order to measure the FP performance, I am measuring the op-count: n*n*n*n*6 (n^3 for matrix-matrix multiplication, performed on n slices of complex data-points i.e. assuming 6 FLOPs for 1 complex-complex multiplication) and dividing it by the average time taken for each run.

            Code snippet in main function:

            ...

            ANSWER

            Answered 2022-Mar-25 at 19:33

1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so the theoretical max is 16 double-precision FLOPs per core clock, i.e. (6 cores × 4.0 GHz) × 16 = 384 GFLOP/s. (Using vectors of 4 doubles, i.e. 256-bit wide AVX.) See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.
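For concreteness, here is a minimal sketch of that back-of-the-envelope calculation (the per-core figures are the answer's assumptions for a Skylake-derived core, not measured values):

// Sketch: theoretical peak double-precision FLOP/s for the i5-8400 figures above.
// Assumptions: 6 cores, 4.0 GHz, two 256-bit FMA units per core, 2 FLOPs per FMA lane.
#include <cstdio>

int main() {
    const double cores = 6.0;
    const double clock_hz = 4.0e9;       // optimistic all-core turbo
    const double fma_units = 2.0;        // execution ports capable of FMA
    const double doubles_per_vec = 4.0;  // a 256-bit AVX register holds 4 doubles
    const double flops_per_fma = 2.0;    // one multiply + one add
    const double peak = cores * clock_hz * fma_units * doubles_per_vec * flops_per_fma;
    std::printf("theoretical peak: %.0f GFLOP/s\n", peak / 1e9);  // ~384 GFLOP/s
    return 0;
}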

There is a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).

            There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level, e.g. gcc -g -O3 -march=native -fopenmp should run the same asm, just have more debug metadata.) Even a function invented by OpenMP should have a symbol name associated at some point.

As far as benchmark validity goes, a good litmus test is whether it scales reasonably with problem size. Unless you cross the L3 cache size boundary with a smaller or larger problem, the time should change in some reasonable way. If not, then you'd worry about the work being optimized away, or about clock-speed warm-up effects (see Idiomatic way of performance evaluation? for that and more, like page faults).

            1. Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?

            Once you're already in an if block, you unconditionally know the else block should not run, so you jmp over it instead of jcc (even if FLAGS were still set so you didn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc to it, then it jmps back to after the other side. That allows the fast path to be contiguous with no taken branches.

2. Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?

            Duplicate ret comes from "tail duplication" optimization, where multiple paths of execution that all return can just get their own ret instead of jumping to a ret. (And copies of any cleanup necessary, like restoring regs and stack pointer.)

            Source https://stackoverflow.com/questions/71618068

            QUESTION

            How do OpenMP thread ids work with recursion?
            Asked 2022-Mar-19 at 11:34

Here is a simple recursive program that splits into two for every recursive call. As expected, the result is 2 + 4 + 8 calls to rec, but the number of threads is always the same: two, and the ids bounce back and forth between 0 and 1. I expected each recursive call to retain the id, and that 8 threads would be made in the end. What exactly is going on? Is there any issue with the code?

            ...

            ANSWER

            Answered 2022-Mar-19 at 11:34

Your code is working as expected by the OpenMP standard. In the OpenMP documentation you can find the following about omp_get_num_threads:

            Summary: The omp_get_num_threads routine returns the number of threads in the current team.

            Binding: The binding region for an omp_get_num_threads region is the innermost enclosing parallel region.

            Effect: The omp_get_num_threads routine returns the number of threads in the team that is executing the parallel region to which the routine region binds. If called from the sequential part of a program, this routine returns 1.

            omp_get_thread_num has the same binding region:

            The binding region for an omp_get_thread_num region is the innermost enclosing parallel region.

            It means that omp_get_num_threads and omp_get_thread_num bind to the innermost parallel region only, so it does not matter how many nested parallel regions are used. Each of your parallel regions is defined by #pragma omp parallel num_threads(2), therefore the return value of omp_get_num_threads is 2 (as long as you have enough threads available) and the return value of omp_get_thread_num is either 0 or 1.
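To illustrate the binding rules quoted above, here is a minimal sketch (my own example, not the asker's code) of a recursive function with nested #pragma omp parallel num_threads(2) regions. Every call reports a team of 2 with thread ids 0 or 1, regardless of depth, provided nested parallelism is available (e.g. OMP_MAX_ACTIVE_LEVELS is at least the recursion depth); otherwise inner regions run with a team of 1.

// Minimal sketch: each parallel region has its own team of 2, so
// omp_get_thread_num() is always 0 or 1 and omp_get_num_threads() is 2.
#include <cstdio>
#include <omp.h>

void rec(int depth) {
    if (depth == 0) return;
    #pragma omp parallel num_threads(2)
    {
        // Both calls bind to the innermost enclosing parallel region only.
        std::printf("depth %d: thread %d of %d\n",
                    depth, omp_get_thread_num(), omp_get_num_threads());
        rec(depth - 1);
    }
}

int main() {
    rec(3);
    return 0;
}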

            Source https://stackoverflow.com/questions/71504894

            QUESTION

            Problem of sorting OpenMP threads into NUMA nodes by experiment
            Asked 2022-Mar-08 at 14:26

I'm attempting to create a std::vector<std::set<int>> with one set for each NUMA node, containing the thread ids obtained using omp_get_thread_num().

            Topo:

            Idea:

            1. Create data which is larger than L3 cache,
            2. set first touch using thread 0,
            3. perform multiple experiments to determine the minimum access time of each thread,
            4. extract the threads into nodes based on sorted access times and information about the topology.

            Code: (Intel compiler, OpenMP)

            ...

            ANSWER

            Answered 2022-Mar-03 at 20:15

To put it shortly, the benchmark is flawed.

            perform multiple experiments to determine the minimum access time of each thread

            The term "minimum access time" is unclear here. I assume you mean "latency". The thing is your benchmark does not measure the latency. volatile tell to the compiler to read store data from the memory hierarchy. The processor is free to store the value in its cache and x86-64 processors actually do that (like almost all modern processors).

            How do OMP_PLACES and OMP_PROC_BIND work?

You can find the documentation of both here and there. To put it shortly, I strongly advise you to set OMP_PROC_BIND=TRUE and OMP_PLACES="{0},{1},{2},..." based on the values retrieved from hw-loc. More specifically, you can get this from hwloc-calc, which is a really great tool (consider using --li --po, and PU, not CORE, because this is what OpenMP runtimes expect). For example, you can query the PU identifiers of a given NUMA node. Note that some machines have very weird non-linear OS PU numbering, and OpenMP runtimes sometimes fail to map the threads correctly. IOMP (the OpenMP runtime of ICC) should use hw-loc internally, but I found some bugs related to that in the past. To check that the mapping is correct, I advise you to use hwloc-ps. Note that OMP_PLACES=cores does not guarantee that threads do not migrate from one core to another (even one on a different NUMA node) unless OMP_PROC_BIND=TRUE (or a similar setting) is set. Note that you can also use numactl to control the NUMA policies of your process. For example, you can tell the OS not to use a given NUMA node or to interleave the allocations. The first-touch policy is not the only one and may not be the default one on all platforms (on some Linux platforms, the OS can move the pages between the NUMA nodes to improve locality).
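As a quick way to check what the runtime actually did with those variables, here is a small sketch (my own, assuming an OpenMP 4.5+ runtime) that prints which place of OMP_PLACES each thread was bound to:

// Run with e.g. OMP_PROC_BIND=true OMP_PLACES="{0},{1},{2},..." set as described above.
#include <cstdio>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        // omp_get_place_num() reports the place the encountering thread is bound to.
        std::printf("thread %d runs on place %d of %d\n",
                    omp_get_thread_num(), omp_get_place_num(), omp_get_num_places());
    }
    return 0;
}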

            Why is the above happening?

The code takes 4.38 ms to read 50 MiB in memory in each thread. This means 1200 MiB read from node 0, assuming the first-touch policy is applied. Thus the throughput should be about 267 GiB/s. While this seems fine at first glance, this is a pretty big throughput for such a processor, especially assuming only 1 NUMA node is used. This is certainly because part of the fetches are done from the L3 cache and not the RAM. Indeed, the cache can partially hold a part of the array and certainly does, resulting in faster fetches thanks to the cache associativity and a good cache policy. This is especially true as the cache lines are not invalidated, since the array is only read. I advise you to use a significantly bigger array to prevent this complex effect from happening.

            You certainly expect one NUMA node to have a smaller throughput due to remote NUMA memory access. This is not always true in practice. In fact, this is often wrong on modern 2-socket systems since the socket interconnect is often not a limiting factor (this is the main source of throughput slowdown on NUMA systems).

NUMA effects arise on modern platforms because of unbalanced NUMA memory node saturation and non-uniform latency. The former is not a problem in your application since all the PUs use the same NUMA memory node. The latter is not a problem either because of the linear memory access pattern, CPU caches and hardware prefetchers: the latency should be completely hidden.

            Even more puzzling are the following environments and their outputs

Using 26 threads on a 24-core machine means that 4 threads have to be executed on two cores. The thing is that hyper-threading should not help much in such a case. As a result, multiple threads sharing the same core will be slowed down. Because IOMP certainly pins threads to cores, and because of the unbalanced workload, those 4 threads will be about twice as slow.

Having 48 threads causes all the threads to be slower because of a workload that is twice as big.

            Source https://stackoverflow.com/questions/71340798

            QUESTION

            How can I realize data local spawning or scheduling of tasks in OpenMP on NUMA CPUs?
            Asked 2022-Feb-27 at 13:36

            I have this simple self-contained example of a very rudimentary 2 dimensional stencil application using OpenMP tasks on dynamic arrays to represent an issue that I am having on a problem that is less of a toy problem.
There are 2 update steps in which, for each point in the array, 3 values are added from another array: from the corresponding location as well as the upper and lower neighbour locations. The program is executed on a NUMA CPU with 8 cores and 2 hardware threads on each NUMA node. The array initializations are parallelized, and using the environment variables OMP_PLACES=threads and OMP_PROC_BIND=spread the data is evenly distributed among the nodes' memories. To avoid data races I have set up dependencies so that, for every section in the second update step, a task can only be scheduled if the relevant tasks for the sections from the first update step have executed. The computation is correct but not NUMA aware. The affinity clause seems not to be enough to change the scheduling, as it is just a hint. I am also not sure whether using single for task creation is efficient, but all I know is that it is the only way to make all tasks sibling tasks and thus make the dependencies applicable.

            Is there a way in OpenMP where I could parallelize the task creation under these constraints or guide the runtime system to a more NUMA-aware task scheduling? If not, it is also okay, I am just trying to see whether there are options available that use OpenMP in a way that it is intended and not trying to break it. I already have a version that only uses worksharing loops. This is for research.

            NUMA NODE 0 pus {0-7,16-23} NUMA NODE 1 pus {8-15,24-31}

            Environment Variables

            ...

            ANSWER

            Answered 2022-Feb-27 at 13:36

First of all, the state of OpenMP task scheduling on NUMA systems is far from great in practice. It has been the subject of many research projects in the past and there are still ongoing projects working on it. Some research runtimes consider the affinity hint properly and schedule the tasks regarding the NUMA node of the in/out/inout dependencies. However, AFAIK mainstream runtimes do not do much to schedule tasks well on NUMA systems, especially if you create all the tasks from a unique NUMA node. Indeed, AFAIK GOMP (GCC) just ignores this and actually exhibits a behavior that makes it inefficient on NUMA systems (e.g. the creation of the tasks is temporarily stopped when there are too many of them, and tasks are executed on all NUMA nodes disregarding the source/target NUMA node). IOMP (Clang/ICC) takes locality into account, but AFAIK in your case the scheduling should not be great. The affinity hint for tasks is not available upstream yet. Thus, GOMP and IOMP will clearly not behave well in your case, as tasks of different steps will often be distributed in a way that produces many remote NUMA node accesses, which are known to be inefficient. In fact, this is critical in your case as stencils are generally memory bound.

If you work with IOMP, be aware that its task scheduler tends to execute tasks on the same NUMA node where they are created. Thus, a good solution is to create the tasks in parallel. The tasks can be created in many threads bound to NUMA nodes. The scheduler will first try to execute the tasks on the same threads. Workers on the same NUMA node will try to steal tasks of the threads in the same NUMA node, and if there are not enough tasks, then from any threads. While this work-stealing strategy works relatively well in practice, there is a huge catch: tasks of different parent tasks cannot share dependencies. This limitation of the current OpenMP specification is a big issue for stencil codes (at least the ones that create tasks working on different time steps). An alternative solution is to create tasks with dependencies from one thread and create smaller tasks from these tasks, but due to the often bad scheduling of the big tasks, this approach is generally inefficient in practice on NUMA systems. In practice, on mainstream runtimes, basic statically-scheduled loops behave relatively well on NUMA systems for stencils, although they are clearly sub-optimal for large stencils. This is sad and I hope this situation will improve in the current decade.

Be aware that data initialization matters a lot on NUMA systems, as many platforms actually allocate pages on the NUMA node performing the first touch. Thus the initialization has to be parallel (otherwise all the pages could be located on the same NUMA node, causing a saturation of this node during the stencil steps). The default policy is not the same on all platforms, and some can move pages between NUMA nodes regarding their use. You can tweak the behavior with numactl. You can also fetch very useful information from the hw-loc tool. I strongly advise you to manually set the location of all OpenMP threads using OMP_PROC_BIND=True and OMP_PLACES="{0},{1},...,{n}", where the OMP_PLACES string can be generated from hw-loc regarding the actual platform.
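As a concrete illustration of the parallel first-touch initialization described above, here is a minimal sketch (my own code, assuming a first-touch page placement policy, and using a raw allocation so the pages are not touched before the parallel loop):

// Sketch: let each thread touch the pages it will later work on, so the pages
// are allocated on that thread's NUMA node (assuming first-touch placement).
#include <cstdlib>
#include <cstddef>

int main() {
    const std::size_t n = std::size_t(1) << 27;  // large enough to exceed the caches
    // malloc does not touch the pages; std::vector would zero them from one thread.
    double* a = static_cast<double*>(std::malloc(n * sizeof(double)));

    // With the threads bound (e.g. OMP_PROC_BIND=spread, OMP_PLACES=threads),
    // the same static schedule should be used by the later compute loops.
    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
        a[i] = 0.0;  // first touch happens from the owning thread

    // ... stencil/update steps reuse the same static distribution here ...

    std::free(a);
    return 0;
}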

For more information you can read this research paper (disclaimer: I am one of the authors). You can certainly find other similar research papers at the IWOMP conference and the Super-Computing conference too. You could try to use a research runtime, though most of them are not designed to be used in production (e.g. KOMP, which is not actively developed anymore; StarPU, which mainly focuses on GPUs and optimizing the critical path; OmpSs, which is not fully compatible with OpenMP but tries to extend it; PaRSEC, which is mainly designed for linear algebra applications).

            Source https://stackoverflow.com/questions/71284069

            QUESTION

            Can OpenMP's SIMD directive vectorize indexing operations?
            Asked 2022-Feb-13 at 21:08

Say I have an MxN matrix (SIG) and a list of Nx1 fractional indices (idxt). Each fractional index in idxt uniquely corresponds to the column at the same position in SIG. I would like to index to the appropriate value in SIG using the indices stored in idxt, take that value, and save it in another Nx1 vector. Since the indices in idxt are fractional, I need to interpolate in SIG. Here is an implementation that uses linear interpolation:

            ...

            ANSWER

            Answered 2022-Jan-19 at 18:55

Theoretically, it should be possible, assuming the processor supports this operation. However, in practice, this is not the case for many reasons.

First of all, mainstream x86-64 processors supporting the AVX-2 (or AVX-512) instruction set do have instructions for that: gather SIMD instructions. Unfortunately, the instruction set is quite limited: you can only fetch 32-bit/64-bit values from memory based on 32-bit/64-bit indices. Moreover, this instruction is not very efficiently implemented on mainstream processors yet. Indeed, it fetches every item separately, which is not faster than scalar code, but this can still be useful if the rest of the code is vectorized, since reading many scalar values to fill a SIMD register manually tends to be a bit less efficient (although it was surprisingly faster on old processors due to a quite inefficient early implementation of gather instructions). Note that if the SIG matrix is big, then cache misses will significantly slow down the code.

Additionally, AVX-2 is not enabled by default on mainstream processors because not all x86-64 processors support it. Thus, you need to enable AVX-2 (e.g. using -mavx2) so compilers can vectorize the loop efficiently. Unfortunately, this is not enough. Indeed, most compilers currently fail to automatically detect when this instruction can/should be used. Even if they could, the fact that IEEE-754 floating-point operations are not associative and that values can be infinity or NaN generally does not help them generate efficient code (although it should be fine here). Note that you can tell your compiler that operations can be assumed associative and that you use only finite/basic real numbers (e.g. using -ffast-math, which can be unsafe). The same thing applies to Eigen types/operators if compilers fail to completely inline all the functions (which is the case for ICC).

To speed up the code, you can try to change the type of the SIG variable to a matrix reference containing int32_t items. Another possible optimization is to split the loop into small fixed-size chunks (e.g. 32 items) and into several parts, so that the indirection is computed in a separate loop and compilers can vectorize at least some of the loops. Some compilers, like Clang, are able to do that automatically for you: they generate a fast SIMD implementation for a part of the loop and do the indirections using scalar instructions. If this is not enough (which appears to be the case so far), then you certainly need to vectorize the loop yourself using SIMD intrinsics (or possibly use SIMD libraries that do that for you).
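A rough sketch of the chunk-splitting idea follows (hypothetical code: the names SIG, idxt and out come from the question, but the plain-array, row-major MxN layout is an assumption, not the question's actual Eigen-based implementation):

// Sketch: compute the integer indices and fractions for a fixed-size chunk in a
// separate loop (easy to vectorize), then do the gather-like indexed loads.
#include <cstddef>

void interp_chunked(const float* SIG, std::size_t M, std::size_t N,
                    const float* idxt, float* out) {
    // Assumes 0 <= idxt[j] < M - 1 so that rows i0 and i0 + 1 are both valid.
    (void)M;
    constexpr std::size_t CHUNK = 32;
    std::size_t i0[CHUNK];
    float frac[CHUNK];
    for (std::size_t base = 0; base < N; base += CHUNK) {
        const std::size_t len = (N - base < CHUNK) ? N - base : CHUNK;
        // Loop 1: pure index arithmetic, straightforward for the compiler to vectorize.
        for (std::size_t k = 0; k < len; ++k) {
            const float t = idxt[base + k];
            i0[k] = static_cast<std::size_t>(t);
            frac[k] = t - static_cast<float>(i0[k]);
        }
        // Loop 2: the indexed (gather-like) loads and the linear interpolation itself.
        for (std::size_t k = 0; k < len; ++k) {
            const std::size_t j = base + k;
            const float lo = SIG[i0[k] * N + j];        // row i0, column j (row-major)
            const float hi = SIG[(i0[k] + 1) * N + j];  // next row, same column
            out[j] = lo + frac[k] * (hi - lo);
        }
    }
}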

            Source https://stackoverflow.com/questions/70775446

            QUESTION

            Are loop variables in Fortran SAVEd?
            Asked 2022-Feb-13 at 18:50

I have a SAVE statement in a subroutine or function, and I do not have IMPLICIT NONE. In that case, are loop variables in Fortran SAVEd? Example:

            ...

            ANSWER

            Answered 2022-Feb-13 at 18:50

            A DO construct of the form of the question is not a scoping unit. Further, a "loop variable" for such a construct is not a privileged entity: it's just a variable in the scope containing the loop which is used in a particular way.

j in this case is a variable which exists before the DO construct, and it exists after the DO construct.

            That j is of implicitly declared type is not relevant: a variable which is not a construct or statement entity with implicit type exists throughout the scoping unit (and possibly others) in which the reference appears (see later). That is, the implicitly typed j here exists throughout the function, not just from the point where it appears in the DO construct.

            The SAVE statement applies to the whole (non-inclusive) scoping unit. If j is a local variable (in this case, not use- or host-associated) then the SAVE saves it.

            As Steve Lionel points out in a comment, in general, things can be much more complicated. Although not relevant to this question there are exceptions to what may otherwise be seen as blanket statements above.

            Consider the similar concepts (switching to free form code for clarity)

            Source https://stackoverflow.com/questions/71100970

            QUESTION

            CMAKE_CXX_SOURCE_FILE_EXTENSIONS not working with thrust/cuda
            Asked 2022-Feb-08 at 22:27

            Thrust allows for one to specify different backends at cmake configure time via the THRUST_DEVICE_SYSTEM flag. My problem is that I have a bunch of .cu files that I want to be compiled as regular c++ files when a user runs cmake with -DTHRUST_DEVICE_SYSTEM=OMP (for example). If I change the extension of the .cu files to .cpp they compile fine (indicating that I just need tell cmake to use the c++ compiler on the .cu files). But if I add .cu to CMAKE_CXX_SOURCE_FILE_EXTENSIONS then I get a CMake Error: Cannot determine link language for target "cuda_kernels". Here's a minimal cmake example:

            ...

            ANSWER

            Answered 2022-Feb-08 at 22:27

            Why is cmake not respecting my CMAKE_CXX_SOURCE_FILE_EXTENSIONS changes?

The extension-to-language mapping for <LANG> is set as soon as <LANG> is enabled, by inspecting the value of the CMAKE_<LANG>_SOURCE_FILE_EXTENSIONS variable when the language detection module exits.

            Unfortunately, there is no blessed way to override this list for CXX as it is hard-coded in Modules/CMakeCXXCompiler.cmake.in.

            Perhaps the best way of working around the actual error would be to use the LANGUAGE source file property to tell CMake how to compile the individual CUDA files, like so:

            Source https://stackoverflow.com/questions/71041077

            QUESTION

            Solve loop data dependency with SIMD - finding transitions between -1 and +1 in an int8_t array of sgn values
            Asked 2022-Jan-30 at 15:58

I am trying to achieve a performance improvement, and I have had some good experience with SIMD. So far I have been using OMP and would like to improve my skills further using intrinsics.

            In the following scenario, I failed to improve (even vectorize) due to a data dependency of a last_value required for a test of element n+1.

The environment is x64 with AVX2, so I want to find a way to vectorize and SIMDify a function like this.

            ...

            ANSWER

            Answered 2022-Jan-27 at 07:20

            Complete vectorization is suboptimal for your case. It’s technically possible, but I think the overhead of producing that array of uint64_t values (I assume you’re compiling for 64 bit CPUs) will eat all the profit.

            Instead, you should load chunks of 32 bytes, and immediately convert them to bit masks. Here’s how:
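The answer's actual code is not reproduced here; purely as an illustration of the general idea, the following sketch (my own, assuming AVX2 and a plain int8_t array of sign values) loads 32 bytes and turns them into a 32-bit "is positive" mask, from which transitions can then be found with bit tricks:

// Sketch: 32 int8_t sign values -> one 32-bit mask (bit k set when p[k] > 0).
#include <immintrin.h>
#include <cstdint>

static inline uint32_t positive_mask32(const int8_t* p) {
    // Unaligned 32-byte load of the sign values.
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
    // Bytes greater than zero become 0xFF, all others 0x00.
    const __m256i gt = _mm256_cmpgt_epi8(v, _mm256_setzero_si256());
    // Collect the top bit of each byte into a single 32-bit mask.
    return static_cast<uint32_t>(_mm256_movemask_epi8(gt));
}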

            Source https://stackoverflow.com/questions/70853685

            QUESTION

            Is MPI_Igather thread-safe?
            Asked 2022-Jan-26 at 06:22

            I am trying to start a sequence of MPI_Igather calls (non-blocking collectives from MPI 4), do some work, and then whenever an Igather finishes, do some more work on that data.

That works fine, unless I start the Igathers from different threads on each MPI rank. In that case I often get a deadlock, even though I call MPI_Init_thread to make sure that MPI_THREAD_MULTIPLE is provided. Non-blocking collectives do not have a tag to match sends and receives, but I thought this is handled by the MPI_Request object associated with each collective operation?

            The most simple example I found failing is this:

            • given np MPI ranks, each of which has a local array of length np
            • start one MPI_Igather for each element i, gathering those elements on process i.
            • the i-loop is parallelized using OpenMP
            • then call MPI_Waitall() to finish all communication. This is where the program hangs when setting OMP_NUM_THREADS to a value larger than 1.

            I made two variants of this program: igather_threaded.cpp (the code below), which behaves as described above, and igather_threaded_v2.cpp, which gathers everything on MPI rank 0. This version does not deadlock, but the data is not ordered correctly either.

            igather_threaded.cpp:

            ...

            ANSWER

            Answered 2022-Jan-17 at 05:03

            In your scheme, because the collectives are started in different threads, they can be started in any order. Having different requests is not enough to disambiguate them: MPI insists that they are started in the same order on each process.

            You could fix this by:
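The answer's concrete suggestions are not included in this excerpt. Purely as an illustration of one possible fix consistent with the ordering rule above (my own sketch, with hypothetical buffer sizes and types, not the answer's original code), the collectives can be issued from a single thread in index order, so every rank starts them in the same sequence, and threaded work can then overlap before the final wait:

// Sketch: deterministic start order for the non-blocking gathers.
#include <mpi.h>
#include <vector>

void gather_all(const std::vector<double>& local, std::vector<double>& recv,
                int np, MPI_Comm comm) {
    std::vector<MPI_Request> reqs(np);
    // Serial loop: identical start order on every rank; no tags are needed.
    for (int i = 0; i < np; ++i) {
        MPI_Igather(&local[i], 1, MPI_DOUBLE,
                    recv.data(), 1, MPI_DOUBLE,   // recv buffer only significant on rank i
                    /*root=*/i, comm, &reqs[i]);
    }
    // ... threaded computation can overlap with the communication here ...
    MPI_Waitall(np, reqs.data(), MPI_STATUSES_IGNORE);
}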

            Source https://stackoverflow.com/questions/70697604

            QUESTION

            Default Variable Scope in OpenMP
            Asked 2022-Jan-14 at 11:02

            I am learning OpenMP in C programming and trying to understand the default scope of variables in different cases.

Given this code, I am a bit confused about the scope of pointers and their respective variable values, for example *x and x, and *y and y. From what I understand, variables of functions called from parallel regions are "private", except for static variables, which become "shared".

Following this rule, the variables *x, *y and z should be "private", since they are variables of the function 'fvg2', which is called from a parallel region. Additionally, the variables x and y should also be private.

            Is my reasoning correct?

            Also in parallel programming is it best practice to explicitly specify every variable scope instead of relying on defaults?

            Thank you.

            ...

            ANSWER

            Answered 2022-Jan-13 at 20:24

Variables declared inside a function will indeed be private, but your a array comes from a shared array variable, so variable x, being a function parameter, is shared: every call of the function gets a pointer into the same original array a.
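A small sketch of this point (hypothetical code modeled on the question's fvg2, not the asker's actual program): each thread's call gets its own copy of the pointer parameter, but all copies point into the same shared array, so the data accessed through it is shared, while locals such as z are private to each call.

#include <cstdio>
#include <omp.h>

void fvg2(double* x, int n) {
    double z = 0.0;                   // private: each thread's call has its own z
    for (int i = 0; i < n; ++i)
        z += x[i];                    // x points into the shared array a below
    std::printf("thread %d computed %f\n", omp_get_thread_num(), z);
}

int main() {
    double a[4] = {1.0, 2.0, 3.0, 4.0};  // shared: declared before the parallel region
    #pragma omp parallel
    fvg2(a, 4);                          // every thread passes a pointer to the same a
    return 0;
}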

            Source https://stackoverflow.com/questions/70702531

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install omp

            You can download it from GitHub.
            PHP requires the Visual C runtime (CRT). The Microsoft Visual C++ Redistributable for Visual Studio 2019 is suitable for all these PHP versions, see visualstudio.microsoft.com. You MUST download the x86 CRT for PHP x86 builds and the x64 CRT for PHP x64 builds. The CRT installer supports the /quiet and /norestart command-line switches, so you can also script it.

            Support

You will find detailed guides in the docs folder.
            Find more information at:

            CLONE
          • HTTPS

            https://github.com/pkp/omp.git

          • CLI

            gh repo clone pkp/omp

• SSH

            git@github.com:pkp/omp.git
