icpc | :balloon: Problems solved during practice for ICPC | Learning library

by juanplopes | C++ | Version: Current | License: No License

kandi X-RAY | icpc Summary

icpc is a C++ library typically used in Tutorial, Learning applications. icpc has no bugs, it has no vulnerabilities and it has low support. You can download it from GitHub.

:balloon: Problems solved during practice for ICPC

Support

icpc has a low active ecosystem.
It has 101 stars, 29 forks, and 21 watchers.
It had no major release in the last 6 months.
icpc has no issues reported. There are no pull requests.
It has a neutral sentiment in the developer community.
The latest version of icpc is current.

Quality

              icpc has 0 bugs and 0 code smells.

Security

              icpc has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              icpc code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              icpc does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
              Without a license, all rights are reserved, and you cannot use the library in your applications.

Reuse

              icpc releases are not available. You will need to build from source code and install.
It has 360 lines of code, 1 function, and 4 files.
              It has low code complexity. Code complexity directly impacts maintainability of the code.


            icpc Key Features

            No Key Features are available at this moment for icpc.

            icpc Examples and Code Snippets

            No Code Snippets are available at this moment for icpc.

            Community Discussions

            QUESTION

            icpc error in compiling over-aligned dynamic allocated variables
            Asked 2022-Mar-30 at 08:55

            I am trying to compile a code in C++, that uses over-aligned variables. If I try to compile the following code (a MWE)

            ...

            ANSWER

            Answered 2022-Mar-30 at 08:55

It seems that icpc fails to conform to the standard with aligned allocations. Quoting from the documentation for version 2021.5:

            In this release of the compiler, all that is necessary in order to get correct dynamic allocation for aligned data is to include a new header:

#include &lt;aligned_new&gt;

            After this header is included, a new-expression for any aligned type will automatically allocate memory with the alignment of that type.

            Link: https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming-guide/automatically-aligned-dynamic-allocation/automatically-aligned-dynamic-allocation-1.html

            Live demo: https://godbolt.org/z/5xMqKGrTG

This section is missing from the icpx documentation: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top.html.
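
For concreteness, here is a minimal sketch of the fix, assuming icpc 2021.5 or later, where the <aligned_new> header is available:

    #include <aligned_new>  // icpc: makes new-expressions respect over-alignment
    #include <cstdio>

    struct alignas(64) Packet {  // over-aligned beyond alignof(std::max_align_t)
        double data[8];
    };

    int main() {
        Packet* p = new Packet;  // with <aligned_new>, this storage is 64-byte aligned
        std::printf("%p\n", static_cast<void*>(p));
        delete p;
    }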

            Source https://stackoverflow.com/questions/71673959

            QUESTION

            Faulty benchmark, puzzling assembly
            Asked 2022-Mar-28 at 07:40

            Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.

            Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz) processor, I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I am measuring 127GFLOP/s, which is obviously a wrong measurement.

            Note: in order to measure the FP performance, I am measuring the op-count: n*n*n*n*6 (n^3 for matrix-matrix multiplication, performed on n slices of complex data-points i.e. assuming 6 FLOPs for 1 complex-complex multiplication) and dividing it by the average time taken for each run.

            Code snippet in main function:

            ...

            ANSWER

            Answered 2022-Mar-25 at 19:33

1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so the theoretical max is 16 double-precision FLOPs per core clock: 6 cores x 4.0 GHz x 16 = 384 GFLOP/s (using vectors of 4 doubles, i.e. 256-bit AVX). See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2
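
As a sanity check, the peak-throughput arithmetic can be spelled out in a tiny sketch (the core count and turbo clock are the question's i5-8400 figures):

    #include <cstdio>

    int main() {
        const double turbo_ghz = 4.0;    // clock from the question
        const int cores = 6;
        const int fma_units = 2;         // Skylake: two 256-bit FMA ports per core
        const int doubles_per_vec = 4;   // 256-bit AVX / 64-bit double
        const int flops_per_fma = 2;     // one multiply + one add
        double peak = turbo_ghz * cores * fma_units * doubles_per_vec * flops_per_fma;
        std::printf("theoretical peak: %.0f GFLOP/s\n", peak);  // prints 384
    }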

There is a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).

            There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level, e.g. gcc -g -O3 -march=native -fopenmp should run the same asm, just have more debug metadata.) Even a function invented by OpenMP should have a symbol name associated at some point.

As far as benchmark validity goes, a good litmus test is whether the time scales reasonably with problem size: as you cross the L3 cache size with smaller or larger problems, the time should change in some reasonable way. If it doesn't, you'd worry about the computation being optimized away, or about clock-speed warm-up effects (see Idiomatic way of performance evaluation? for that and more, like page faults).

1. Why are there unconditional jumps in the code (at 403ad3, 403b53, 403d78 and 403d8f)?

            Once you're already in an if block, you unconditionally know the else block should not run, so you jmp over it instead of jcc (even if FLAGS were still set so you didn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc to it, then it jmps back to after the other side. That allows the fast path to be contiguous with no taken branches.
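
As a small illustration of this point (the names here are illustrative only), the then-block of a plain if/else compiles to code that ends with an unconditional jmp over the else-block:

    int select(int x, int y) {
        int r;
        if (x > 0) {
            r = x * 2;   // then-block: ends with "jmp join" in the generated asm
        } else {
            r = y - 1;   // else-block: reached only via the conditional branch
        }
        return r;        // join point: the target of the unconditional jmp
    }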

2. Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?

            Duplicate ret comes from "tail duplication" optimization, where multiple paths of execution that all return can just get their own ret instead of jumping to a ret. (And copies of any cleanup necessary, like restoring regs and stack pointer.)

            Source https://stackoverflow.com/questions/71618068

            QUESTION

            Using Intel oneAPI MKL to perform sparse matrix with dense vector multiplication
            Asked 2022-Mar-14 at 11:03

I am developing a program that makes heavy use of the Armadillo library. I have version 10.8.2, linked against Intel oneAPI MKL 2022.0.2. At some point, I need to perform many sparse-matrix times dense-vector multiplications, both of which are defined using Armadillo structures. I have found this to be a probable bottleneck, and was curious whether replacing the Armadillo multiplication with "bare bones" sparse BLAS routines from MKL (mkl_sparse_d_mv) would speed things up. But in order to do so, I need to convert from Armadillo's SpMat to something that MKL understands. As per the Armadillo docs, sparse matrices are stored in CSC format, so I have tried mkl_sparse_d_create_csc. My attempt at this is below:

            ...

            ANSWER

            Answered 2022-Mar-14 at 11:03

Yes, the cols_end array is incorrect, as pointed out by CJR: it should contain the indices 2, 3, 4, 5. Please see the documentation for this parameter of mkl_sparse_d_create_csc:

            cols_end:

            This array contains col indices, such that cols_end[i] - ind - 1 is the last index of col i in the arrays values and row_indx. ind takes 0 for zero-based indexing and 1 for one-based indexing.

            https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/blas-and-sparse-blas-routines/inspector-executor-sparse-blas-routines/matrix-manipulation-routines/mkl-sparse-create-csc.html

            Change this line

cols_end[i] = static_cast<MKL_INT>((--X.end_col(i)).pos());

            to

cols_end[i] = static_cast<MKL_INT>((X.end_col(i)).pos());

Now recompile and run the code. I've tested it and it shows the correct results. (The original answer attached an image with the results and the compilation command.)
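
Putting the pieces together, here is a hypothetical sketch of the whole conversion, assuming Armadillo's documented public CSC members (values, row_indices, col_ptrs) and zero-based indexing; error checking is omitted. Note that cols_end is just col_ptrs shifted by one entry, which is why it reads 2, 3, 4, 5 in the example above:

    #include <armadillo>
    #include <mkl_spblas.h>
    #include <vector>

    // Wrap an arma::sp_mat (already CSC) as an MKL sparse handle.
    // The caller keeps the four buffers alive as long as the handle is used.
    sparse_matrix_t to_mkl_csc(const arma::sp_mat& X,
                               std::vector<MKL_INT>& cols_start,
                               std::vector<MKL_INT>& cols_end,
                               std::vector<MKL_INT>& row_indx,
                               std::vector<double>& values) {
        const MKL_INT n_rows = static_cast<MKL_INT>(X.n_rows);
        const MKL_INT n_cols = static_cast<MKL_INT>(X.n_cols);
        // Armadillo stores col_ptrs with n_cols + 1 entries; MKL wants the same
        // information split into a cols_start array and a cols_end array.
        cols_start.assign(X.col_ptrs, X.col_ptrs + n_cols);
        cols_end.assign(X.col_ptrs + 1, X.col_ptrs + n_cols + 1);
        row_indx.assign(X.row_indices, X.row_indices + X.n_nonzero);
        values.assign(X.values, X.values + X.n_nonzero);

        sparse_matrix_t A;
        mkl_sparse_d_create_csc(&A, SPARSE_INDEX_BASE_ZERO, n_rows, n_cols,
                                cols_start.data(), cols_end.data(),
                                row_indx.data(), values.data());
        return A;
    }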

            Source https://stackoverflow.com/questions/71456912

            QUESTION

            Old task from local ICPC "Computer Class"
            Asked 2021-Dec-09 at 14:32

There is a task over which I have been racking my brains for three days. The task is called Computer Class (not to be confused with other ICPC tasks; there are many similarly named ones).

Problem statement: There are n * m desks (arranged in m rows with n desks per row) and n * m students. Each student has a unique number from 1 to n * m (inclusive). The students must be seated so that the numbers of any two neighbors differ by more than one (desks above, below, to the left, and to the right count as neighbors).

Forgot to mention: the bounds are 1 ≤ n, m ≤ 50, and the time limit is half a second (5 seconds of real time).

As a result: I need to write a program (preferably in Python and C++) that accepts two numbers (n and m) and prints any suitable arrangement of the students, or -1 if no such arrangement is possible.

For example: given 3, 4 — a valid arrangement is printed.

Or given: 1, 2 — the output is -1.

My attempts to solve the problem: I used a method of generating all sequences of n numbers (taken from a book) and checking functions to find the desired sequence. When I did it with a dynamic array, I succeeded, or so I thought: the program took far too long whenever it had to search sequences of 15 (3 * 5) or more numbers.

After the advice I received and much thought, I came up with code that completely solves this problem. And now, how do I close the question? (here is the code):

            ...

            ANSWER

            Answered 2021-Dec-09 at 14:32

I decided to simply place all the even numbers first and then all the odd numbers, and in the few cases where this did not work I hardcoded the answer (there were only two of them: 3x3 and 2x3). You can see the code above.
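
A hypothetical sketch of that idea: fill the grid with the evens 2, 4, 6, ... and then the odds 1, 3, 5, ..., try both grid orientations, and only print a grid that passes verification. Falling back to -1 when both fills fail is a heuristic here, not a proof of impossibility (though it does match the small impossible boards such as 1x2 and 2x2):

    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    using Grid = std::vector<std::vector<int>>;

    // Every pair of side neighbors must differ by more than 1.
    static bool valid(const Grid& g) {
        const int rows = (int)g.size(), cols = (int)g[0].size();
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j) {
                if (j + 1 < cols && std::abs(g[i][j] - g[i][j + 1]) <= 1) return false;
                if (i + 1 < rows && std::abs(g[i][j] - g[i + 1][j]) <= 1) return false;
            }
        return true;
    }

    // Row-major fill: all evens 2, 4, ..., then all odds 1, 3, ...
    static Grid fill(int rows, int cols) {
        Grid g(rows, std::vector<int>(cols));
        int k = 0;
        const int total = rows * cols;
        for (int v = 2; v <= total; v += 2) { g[k / cols][k % cols] = v; ++k; }
        for (int v = 1; v <= total; v += 2) { g[k / cols][k % cols] = v; ++k; }
        return g;
    }

    static void print(const Grid& g) {
        for (const auto& row : g) {
            for (int v : row) std::printf("%d ", v);
            std::puts("");
        }
    }

    int main() {
        int n, m;  // n desks per row, m rows
        if (std::scanf("%d %d", &n, &m) != 2) return 1;
        Grid a = fill(m, n);
        if (valid(a)) { print(a); return 0; }
        Grid b = fill(n, m);  // try filling along the other axis...
        if (valid(b)) {       // ...and transpose back (adjacency is preserved)
            Grid t(m, std::vector<int>(n));
            for (int i = 0; i < m; ++i)
                for (int j = 0; j < n; ++j) t[i][j] = b[j][i];
            print(t);
            return 0;
        }
        std::puts("-1");  // heuristic fallback, not a proof of impossibility
    }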

            Source https://stackoverflow.com/questions/70258567

            QUESTION

N-body OpenCL code: error CL_OUT_OF_HOST_MEMORY with GPU card NVIDIA A6000
            Asked 2021-Oct-10 at 05:52

I would like to get an old N-body code which uses OpenCL running.

I have 2 NVIDIA A6000 cards with NVLink, a component which links these 2 GPU cards at the hardware (and maybe software?) level.

But at execution, I get the following result:

Here is the kernel code used (I have included the pragmas that I consider useful for NVIDIA cards):

            ...

            ANSWER

            Answered 2021-Aug-07 at 12:36

Your kernel code looks good and the cache-tiling implementation is correct. Just make sure that the number of bodies is a multiple of the local size, or alternatively also limit the inner for loop to the global size.

OpenCL allows the use of multiple devices in parallel, but you need to create a thread with a queue for each device separately, and you need to take care of device-device communication and synchronization manually. Data transfer happens over PCIe (you can also do remote direct memory access); you can't use NVLink with OpenCL. This should not be an issue in your case though, as you need only little data transfer compared to the amount of arithmetic.
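
Here is a minimal sketch of the one-queue-per-device pattern from the paragraph above (host side, OpenCL C API from C++; error handling trimmed, and it assumes at least one GPU on platform 0):

    #include <CL/cl.h>

    int main() {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);
        cl_device_id devices[2];
        cl_uint ndev = 0;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, devices, &ndev);
        if (ndev > 2) ndev = 2;
        for (cl_uint d = 0; d < ndev; ++d) {
            cl_int err = CL_SUCCESS;
            cl_context ctx =
                clCreateContext(nullptr, 1, &devices[d], nullptr, nullptr, &err);
            cl_command_queue q =
                clCreateCommandQueueWithProperties(ctx, devices[d], nullptr, &err);
            // ... build the program and enqueue work for device d here;
            // cross-device synchronization has to be managed by the host ...
            clFinish(q);
            clReleaseCommandQueue(q);
            clReleaseContext(ctx);
        }
    }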

            A few more remarks:

            • In many cases N-body requires FP64 to sum up the forces and resolve positions at very different length scales. However on the A6000, FP64 performance is very poor, just like on GeForce Ampere. FP32 would be significantly (~64x) faster, but is likely insufficient in terms of accuracy here. For efficient FP64 you would need an A100 or MI100.
            • Instead of 1.0/sqrt, use rsqrt. This is hardware supported and almost as fast as a multiplication.
            • Make sure to use either FP32 float (1.0f) or FP64 double (1.0) literals consistently. Using double literals with float triggers double arithmetic and casting of the result back to float which is much slower.

EDIT: To help you out with the error message: most probably the error at clCreateKernel (what value does status have after calling clCreateKernel?) hints that program is invalid. This might be because you give clBuildProgram a vector of 2 devices, but set the number of devices to only 1 and also have a context for only 1 device. Try

            Source https://stackoverflow.com/questions/68548139

            QUESTION

            How to use MAGMA with NVIDIA GPU card instead of CPU LAPACKE to inverse large matrix
            Asked 2021-Oct-04 at 17:25

I need to invert large matrices, and I would like to modify my current LAPACKE routine in order to exploit the power of an NVIDIA GPU card.

Indeed, my LAPACKE routine works well for relatively small matrices but not for large ones.

Below is the implementation of this LAPACKE routine:

            ...

            ANSWER

            Answered 2021-Aug-26 at 04:44

Try using magma_sgetri_gpu (inverse matrix in single precision, GPU interface). This function computes, in single precision, the inverse A^-1 of an m x m matrix A.
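
For context, a hypothetical sketch of the call chain (an LU factorization followed by the inversion; the exact signatures and workspace sizing should be checked against your MAGMA version's documentation, and magma_init() must have been called once at startup):

    #include <vector>
    #include <magma_v2.h>

    // Invert an n x n single-precision matrix already resident on the GPU.
    void invert_on_gpu(magmaFloat_ptr dA, magma_int_t n, magma_int_t ldda) {
        magma_int_t info = 0;
        std::vector<magma_int_t> ipiv(n);
        magma_sgetrf_gpu(n, n, dA, ldda, ipiv.data(), &info);  // LU factorize A
        magma_int_t nb = magma_get_sgetri_nb(n);               // tuned block size
        magmaFloat_ptr dwork = nullptr;
        magma_smalloc(&dwork, (size_t)n * nb);                 // GPU workspace
        magma_sgetri_gpu(n, dA, ldda, ipiv.data(), dwork, n * nb, &info);
        magma_free(dwork);                                     // dA now holds A^-1
    }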

            Source https://stackoverflow.com/questions/68780939

            QUESTION

            OpenMP-behavior: Using ICC and GCC gives significantly different run times
            Asked 2021-Sep-16 at 19:17

            For a small benchmark of OpenMP on an i7-6700K I wrote the following code:

            ...

            ANSWER

            Answered 2021-Sep-16 at 11:54

The problem comes from the collapse(2) clause and is also related to auto-vectorization. Neither compiler is able to auto-vectorize the loop with the collapse, but ICC uses a very expensive idiv instruction in the middle of the hot loop (which is very bad) while GCC produces better code. This comes from the collapse(2) clause, which is not well optimized (on many compilers). You can see this on Godbolt. Note that optimizing a kernel with a collapse(2) clause is not easy, since the compiler does not know the loop bounds and hence the associated cost (or the divisor for the modulus).

Without the collapse(2), GCC is able to vectorize the loop successfully, but surprisingly ICC is not. Fortunately, we can help ICC with the simd directive. Once it is used, the two compilers generate relatively good code. It is still not optimal, because size_t is generally 8 bytes and int is 4 bytes on mainstream x86-64 platforms, and comparing loop counters of different types makes the code harder to vectorize efficiently and prevents the best scalar instructions. You can use a temporary variable to fix that. You can see the resulting assembly code here.

Note that the assembly generated by ICC is very good once the code is fixed. The code is memory-bound, and the final code should saturate the RAM with only a few threads. Even the L3 cache should be saturated by the ICC-produced assembly if the input array fit into it.

            Here is the fixed code:
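
(The fixed code itself is elided in this capture; as an illustration only, a hypothetical loop of the shape the answer describes — collapse(2) dropped, a simd hint on the inner loop, and a single index type throughout — might look like:)

    #include <cstddef>

    // Compile with OpenMP enabled (e.g. gcc -fopenmp, icc -qopenmp).
    void scale(double* data, std::size_t ni, std::size_t nj, double f) {
        #pragma omp parallel for
        for (std::size_t i = 0; i < ni; ++i) {
            #pragma omp simd
            for (std::size_t j = 0; j < nj; ++j)
                data[i * nj + j] *= f;
        }
    }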

            Source https://stackoverflow.com/questions/69203494

            QUESTION

            Intel C++ compiler and Wnon-virtual-dtor flag gives (wrong?) warning
            Asked 2021-May-09 at 19:44

I was trying to compile some code of mine which compiled just fine with g++ (with the -Wnon-virtual-dtor flag). My IDE and clang-tidy didn't warn me either (granted, they might be wrong, of course).

When I tried to compile the same code with Intel's icpc (specifically icpc (ICC) 19.1.2.254 20200623), I got a warning which I was only now able to track down. I'm wondering whether I'm somehow at fault or whether that warning is actually incorrect in my case.

            I wrote a somewhat minimal example of my class hierarchy:

            ...

            ANSWER

            Answered 2021-May-09 at 19:44

Ok, so after also posting this question on the Intel forum: it seems to be a (very much non-critical) bug in the compiler. The workaround is to write both virtual and override.

This is discouraged by item C.128 of the C++ Core Guidelines (same link as in the comments) but is not otherwise problematic.
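
A small sketch of what the workaround looks like (the class names are illustrative only):

    struct Base {
        virtual ~Base() = default;
        virtual void f() = 0;
    };

    struct Derived : Base {
        virtual void f() override {}  // "virtual" is redundant per C.128,
                                      // but silences the spurious icpc warning
    };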

            Source https://stackoverflow.com/questions/67415240

            QUESTION

            Libtorch works with g++, but fails with Intel compiler
            Asked 2021-Feb-14 at 03:57

            I want to use a neural network developed in Python (PyTorch) in a Fortran program. My OS is Ubuntu 18.04.

            What I am doing:

1. save it as a TorchScript module: TurbNN.pt
2. call it from a C++ program: call_ts.cpp, call_ts.h
3. call the C++ program from a Fortran program (using bind(c)): main.f90

            I successfully compiled the codes using CMake (3.19.4) and g++ (7.5.0). However, I cannot compile them using Intel compilers (HPCKit 2021.1.0.2684):

            ...

            ANSWER

            Answered 2021-Feb-14 at 03:57

Do you see cxx11 in the linker errors? It looks like your libcall_ts_cpp is compiled in a way that expects the new C++11 ABI for std::string, but perhaps the library where those functions are implemented was compiled with the old ABI. Here's a PyTorch forum post about the same problem: https://discuss.pytorch.org/t/issues-linking-with-libtorch-c-11-abi/29510/11

            The solution is to download a new copy of the PyTorch libraries built with the new C++11 ABI.
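
If in doubt, a quick probe shows which ABI a translation unit is built with (this relies on the libstdc++-specific macro _GLIBCXX_USE_CXX11_ABI; it prints 1 for the new C++11 ABI and 0 for the old one):

    #include <string>   // any libstdc++ header defines the ABI macro
    #include <cstdio>

    int main() {
    #ifdef _GLIBCXX_USE_CXX11_ABI
        std::printf("%d\n", _GLIBCXX_USE_CXX11_ABI);
    #else
        std::puts("not libstdc++");
    #endif
    }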

            Source https://stackoverflow.com/questions/66192285

            QUESTION

            Intel compiler inline size
            Asked 2021-Jan-25 at 18:21

            I have been compiling my code for some time with g++ and then moved to Intel's icpc compiler. With icpc I kept getting the following warnings:

            ...

            ANSWER

            Answered 2021-Jan-25 at 18:21

My question is whether it is always good practice to remove size limits on inlining and inline as much as possible.

No, it is not always good practice to remove size limits on inlining, nor to inline as much as possible.

            Ideally, inlining should be done only when it improves performance.

            Are there situations at all where imposing an inlining limit is useful?

If a function is very large and it is called from many contexts, then inlining it into all of those contexts will bloat the executable. If the executable ends up being, say, several gigabytes because of inlining, then loading the program from disk may become the bottleneck.

In less pathological cases, the trade-offs are more subtle. The way to find optimal limits is measurement. Profile-guided optimization can give the optimizer more useful heuristics than simple hard limits.

            Source https://stackoverflow.com/questions/65889668

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install icpc

            You can download it from GitHub.

            Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, check for answers and ask them on Stack Overflow.
Clone

• HTTPS: https://github.com/juanplopes/icpc.git

• GitHub CLI: gh repo clone juanplopes/icpc

• SSH: git@github.com:juanplopes/icpc.git
