ICPC | International Clopidogrel Pharmacogenomics Consortium
kandi X-RAY | ICPC Summary
This code will read in Excel data submitted by project participants, store it in a PostgreSQL database, and export information into defined reports. This project was developed in IntelliJ IDEA and has configuration files checked in for running easily in that environment.
Top functions reviewed by kandi - BETA
- Main entry point
- Post process properties for a sample
- Calculates the creatinine category for a sample
- Calculate the summary for a subject
- Main method for testing purposes
- Normalize the value
- Closes the given Session
- Parse the command line arguments
- Starts the analysis
- Returns the first row of the Excel sheet
- Analyzes the template and generates HTML output
- Starts the report generator
- Generates the report
- Parses the command line arguments
- The main method
- Shutdown the driver manager
- Gets the validation pattern
- Looks up a metabolizer by their genotypes
- Parses and starts data
- Makes a regular expression that can be used for matching values
- Initializes the styles
- Creates a pattern that matches all sources
- Initializes the internal data sheet
ICPC Key Features
ICPC Examples and Code Snippets
Community Discussions
Trending Discussions on ICPC
QUESTION
I am trying to compile some C++ code that uses over-aligned variables. If I try to compile the following code (an MWE)
...ANSWER
Answered 2022-Mar-30 at 08:55
It seems that icpc fails to conform with the standard with aligned allocations. Quoting from the documentation for version 2021.5:
In this release of the compiler, all that is necessary in order to get correct dynamic allocation for aligned data is to include a new header:
#include <aligned_new>
After this header is included, a new-expression for any aligned type will automatically allocate memory with the alignment of that type.
Live demo: https://godbolt.org/z/5xMqKGrTG
This section is missing in the documentation of icpx: https://www.intel.com/content/www/us/en/develop/documentation/oneapi-dpcpp-cpp-compiler-dev-guide-and-reference/top.html.
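For context, here is a minimal sketch of the kind of over-aligned allocation at issue (the question's actual MWE is not shown; the Packet type, the alignas(64) value, and the alignment check are illustrative assumptions):

#include <aligned_new>   // icpc-specific header named in the quoted documentation
#include <cstdint>
#include <cstdio>

struct alignas(64) Packet {   // over-aligned: stricter than alignof(std::max_align_t)
    float data[16];
};

int main() {
    Packet* p = new Packet;   // with <aligned_new>, new honors alignas(64)
    bool ok = reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
    std::printf("64-byte aligned: %s\n", ok ? "yes" : "no");
    delete p;
}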
QUESTION
Assembly novice here. I've written a benchmark to measure the floating-point performance of a machine in computing a transposed matrix-tensor product.
Given my machine with 32GiB RAM (bandwidth ~37GiB/s) and Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz (Turbo 4.0GHz) processor, I estimate the maximum performance (with pipelining and data in registers) to be 6 cores x 4.0GHz = 24GFLOP/s. However, when I run my benchmark, I am measuring 127GFLOP/s, which is obviously a wrong measurement.
Note: in order to measure the FP performance, I am measuring the op-count: n*n*n*n*6 (n^3 for matrix-matrix multiplication, performed on n slices of complex data-points, i.e. assuming 6 FLOPs for 1 complex-complex multiplication) and dividing it by the average time taken for each run.
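As a sketch, the measurement boils down to arithmetic along these lines (the names n and avg_seconds are illustrative; the benchmark's real values come from the timed runs):

#include <cstdio>

int main() {
    const long long n = 256;          // illustrative problem size
    const double avg_seconds = 0.5;   // illustrative average runtime per run
    // n^3 complex multiplications per slice, n slices, 6 FLOPs per complex multiply
    const double flop = 6.0 * n * n * n * n;
    std::printf("%.2f GFLOP/s\n", flop / (avg_seconds * 1e9));
}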
Code snippet in main function:
...ANSWER
Answered 2022-Mar-25 at 19:33
1 FP operation per core clock cycle would be pathetic for a modern superscalar CPU. Your Skylake-derived CPU can actually do 2x 4-wide SIMD double-precision FMA operations per core per clock, and each FMA counts as two FLOPs, so theoretical max = 16 double-precision FLOPs per core clock, so 24 * 16 = 384 GFLOP/s (using vectors of 4 doubles, i.e. 256-bit wide AVX). See FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.
There is a function call inside the timed region, callq 403c0b <_Z12do_timed_runRKmRd+0x1eb> (as well as the __kmpc_end_serialized_parallel stuff).
There's no symbol associated with that call target, so I guess you didn't compile with debug info enabled. (That's separate from optimization level; e.g. gcc -g -O3 -march=native -fopenmp should run the same asm, just with more debug metadata.) Even a function invented by OpenMP should have a symbol name associated at some point.
As far as benchmark validity, a good litmus test is whether it scales reasonably with problem size: as a smaller or larger problem crosses the L3 cache size, the time should change in some reasonable way. If it doesn't, you'd worry about the work being optimized away, or about clock-speed warm-up effects (see Idiomatic way of performance evaluation? for that and more, like page faults).
- Why are there non-conditional jumps in code (at 403ad3, 403b53, 403d78 and 403d8f)?
Once you're already in an if block, you unconditionally know the else block should not run, so you jmp over it instead of using jcc (even if FLAGS were still set so you didn't have to test the condition again). Or you put one or the other block out-of-line (like at the end of the function, or before the entry point) and jcc to it, then it jmps back to after the other side. That allows the fast path to be contiguous with no taken branches.
- Why are there 3 retq instances in the same function with only one return path (at 403c0a, 403ca4 and 403d26)?
Duplicate ret instructions come from the "tail duplication" optimization, where multiple paths of execution that all return can each get their own ret instead of jumping to a shared ret (along with copies of any cleanup necessary, like restoring registers and the stack pointer).
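As a tiny illustration (not from the question's benchmark), each source-level return below can keep its own epilogue and ret after tail duplication, giving several retq instances in one function:

// Illustrative only: a compiler at -O2 may emit three separate ret
// instructions for these three return paths instead of one shared epilogue.
int classify(int x) {
    if (x < 0)  return -1;
    if (x == 0) return 0;
    return 1;
}

int main() { return classify(42); }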
QUESTION
I am developing a program that makes heavy use of the Armadillo library. I have version 10.8.2, linked against Intel oneAPI MKL 2022.0.2. At some point, I need to perform many sparse-matrix times dense-vector multiplications, both of which are defined using Armadillo structures. I have found this to be a probable bottleneck, and was curious whether replacing the Armadillo multiplication with "bare bones" sparse BLAS routines from MKL (mkl_sparse_d_mv) would speed things up. But in order to do so, I need to convert from Armadillo's SpMat to something that MKL understands. As per the Armadillo docs, sparse matrices are stored in CSC format, so I have tried mkl_sparse_d_create_csc. My attempt at this is below:
...ANSWER
Answered 2022-Mar-14 at 11:03
Yes, the cols_end array is incorrect, as pointed out by CJR. The entries should be 2,3,4,5. Please see the documentation for this parameter of the function mkl_sparse_d_create_csc:
cols_end:
This array contains col indices, such that cols_end[i] - ind - 1 is the last index of col i in the arrays values and row_indx. ind takes 0 for zero-based indexing and 1 for one-based indexing.
Change this line
cols_end[i] = static_cast<MKL_INT>((--X.end_col(i)).pos());
to
cols_end[i] = static_cast<MKL_INT>((X.end_col(i)).pos());
Now recompile and run the code. I've tested it and it is showing the correct results. Image with results and compilation command
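For reference, a hedged sketch of the whole conversion with the corrected cols_end: Armadillo already stores CSC, so col_ptrs yields both cols_start and cols_end directly. The signature follows MKL's documented mkl_sparse_d_create_csc; note that MKL wraps the arrays rather than copying them, so they must stay alive while the handle is used.

#include <armadillo>
#include <mkl_spblas.h>
#include <vector>

int main() {
    arma::sp_mat X = arma::sprandu<arma::sp_mat>(5, 4, 0.4);

    // cols_start[i] = first entry of column i; cols_end[i] = one past its last entry.
    std::vector<MKL_INT> cols_start(X.n_cols), cols_end(X.n_cols);
    std::vector<MKL_INT> row_indx(X.row_indices, X.row_indices + X.n_nonzero);
    for (arma::uword i = 0; i < X.n_cols; ++i) {
        cols_start[i] = static_cast<MKL_INT>(X.col_ptrs[i]);
        cols_end[i]   = static_cast<MKL_INT>(X.col_ptrs[i + 1]);
    }

    sparse_matrix_t A;
    sparse_status_t st = mkl_sparse_d_create_csc(
        &A, SPARSE_INDEX_BASE_ZERO,
        static_cast<MKL_INT>(X.n_rows), static_cast<MKL_INT>(X.n_cols),
        cols_start.data(), cols_end.data(), row_indx.data(),
        const_cast<double*>(X.values));

    // ... use A with mkl_sparse_d_mv here, while the vectors above are alive ...
    mkl_sparse_destroy(A);
    return st == SPARSE_STATUS_SUCCESS ? 0 : 1;
}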
QUESTION
There is a task on which I have been racking my brains for three days. The task is called Computer Class (not to be confused with other similarly named tasks from ICPC).
Problem conditions: there are n * m desks (arranged in m rows with n desks in each row) and n * m students. Each student has a unique number from 1 to n * m (inclusive). It is necessary to seat the students so that the numbers of any two neighbors differ by more than one (neighbors are those above, below, to the left, and to the right).
Forgot to mention: n and m are bounded by 1 ≤ n, m ≤ 50, and the time limit is half a second of CPU time (5 seconds of real time).
As a result, I need to write a program (preferably in Python or C++) that accepts two numbers (n and m) and outputs any suitable seating arrangement (or -1 if no such arrangement is possible).
For example: given 3, 4, a valid arrangement is output.
Or given 1, 2, the output is -1.
My attempts to solve the problem: I used a method of generating all sequences of n numbers (the method is taken from a book) and checking functions to find the desired sequence. When I did it with a dynamic array, I thought I had succeeded, but the program took far too long when it had to find an answer for a sequence of 15 (3 * 5) or more numbers.
After the advice I received and much thought, I came up with code that completely solves the problem. And now, how do I close the question? (The code is below.)
...ANSWER
Answered 2021-Dec-09 at 14:32
I decided simply to output all the even numbers first and then all the odd ones, and in the handful of cases where this did not work, I just hardcoded the answer (there were only two of them: 3x3 and 2x3). You can see the code above.
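A minimal sketch of that evens-then-odds idea (the answerer's actual code, including the hardcoded 2x3 and 3x3 grids, is not reproduced here; this sketch verifies the constructed grid and prints -1 whenever the simple construction fails, which also covers the genuinely impossible sizes):

#include <cstdlib>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 3) return 1;
    int n = std::atoi(argv[1]), m = std::atoi(argv[2]);  // n desks per row, m rows

    // Fill row by row: all even numbers first, then all odd numbers.
    std::vector<int> seq;
    for (int v = 2; v <= n * m; v += 2) seq.push_back(v);
    for (int v = 1; v <= n * m; v += 2) seq.push_back(v);
    std::vector<std::vector<int>> g(m, std::vector<int>(n));
    for (int r = 0; r < m; ++r)
        for (int c = 0; c < n; ++c) g[r][c] = seq[r * n + c];

    // Verify: orthogonal neighbors must differ by more than 1.
    bool ok = true;
    for (int r = 0; r < m; ++r)
        for (int c = 0; c < n; ++c) {
            if (c + 1 < n && std::abs(g[r][c] - g[r][c + 1]) <= 1) ok = false;
            if (r + 1 < m && std::abs(g[r][c] - g[r + 1][c]) <= 1) ok = false;
        }

    if (!ok) { std::cout << -1 << "\n"; return 0; }
    for (int r = 0; r < m; ++r) {
        for (int c = 0; c < n; ++c) std::cout << g[r][c] << ' ';
        std::cout << "\n";
    }
}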
QUESTION
I would like to get an old N-body code which uses OpenCL running.
I have 2 NVIDIA A6000 cards with NVLink, a component which binds these 2 GPU cards from a hardware (and maybe software?) point of view.
But at the execution, I get the following result:
Here is the kernel code used (I have put in the pragmas that I consider useful for NVIDIA cards):
...ANSWER
Answered 2021-Aug-07 at 12:36
Your kernel code looks good and the cache tiling implementation is correct. Only make sure that the number of bodies is a multiple of the local size, or alternatively also limit the inner for loop to the global size.
OpenCL allows usage of multiple devices in parallel. You need to make a thread with a queue for each device separately. You also need to take care of device-device communications and synchronization manually. Data transfer happens over PCIe (you can also do remote direct memory access); but you can't use NVLink with OpenCL. This should not be an issue in your case, though, as you need only a little data transfer compared to the amount of arithmetic.
A few more remarks:
- In many cases N-body requires FP64 to sum up the forces and resolve positions at very different length scales. However on the A6000, FP64 performance is very poor, just like on GeForce Ampere. FP32 would be significantly (~64x) faster, but is likely insufficient in terms of accuracy here. For efficient FP64 you would need an A100 or MI100.
- Instead of 1.0/sqrt, use rsqrt. This is hardware supported and almost as fast as a multiplication.
- Make sure to use either FP32 float (1.0f) or FP64 double (1.0) literals consistently. Using double literals with float triggers double arithmetic and casting of the result back to float which is much slower.
EDIT: To help you out with the error message: most probably the error at clCreateKernel (what value does status have after calling clCreateKernel?) hints that program is invalid. This might be because you give clBuildProgram a vector of 2 devices, but set the number of devices to only 1 and also have a context for only 1 device. Try:
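A hedged sketch of what that fix looks like: keep the device count, the device list, and the context consistent for a single device first (the kernel source and its name are placeholders, not the question's code):

#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_int status;
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    // One device in the context; for 2 GPUs, create one context/queue per device.
    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &status);

    const char* source = "__kernel void nbody(__global float4* p) { }";  // placeholder
    cl_program program = clCreateProgramWithSource(context, 1, &source, nullptr, &status);

    // Build for exactly the devices in the context: count and list must agree.
    status = clBuildProgram(program, 1, &device, "", nullptr, nullptr);
    if (status != CL_SUCCESS) { std::printf("build failed: %d\n", status); return 1; }

    cl_kernel kernel = clCreateKernel(program, "nbody", &status);
    std::printf("clCreateKernel status: %d\n", status);  // CL_SUCCESS == 0

    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseContext(context);
}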
QUESTION
I need to invert large matrices and I would like to modify my current LAPACKE routine in order to exploit the power of an NVIDIA GPU card.
Indeed, my LAPACKE routine works well for relatively small matrices but not for large ones.
Below is the implementation of this LAPACKE routine:
...ANSWER
Answered 2021-Aug-26 at 04:44
Try using magma_sgetri_gpu - inverse matrix in single precision, GPU interface.
This function computes in single precision the inverse A^-1 of an m × m matrix A.
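A hedged sketch of the call sequence (magma_sgetrf_gpu for the LU factorization, then magma_sgetri_gpu, following MAGMA's documented API; it assumes magma_init() has already been called and that dA already holds the m × m matrix on the device):

#include <magma_v2.h>

void invert_on_gpu(magma_int_t m, float* dA, magma_int_t ldda) {
    magma_int_t info;
    magma_int_t* ipiv = nullptr;
    magma_imalloc_cpu(&ipiv, m);

    magma_sgetrf_gpu(m, m, dA, ldda, ipiv, &info);   // LU factorization on the GPU

    magma_int_t lwork = m * magma_get_sgetri_nb(m);  // workspace size per the docs
    float* dwork = nullptr;
    magma_smalloc(&dwork, lwork);
    magma_sgetri_gpu(m, dA, ldda, ipiv, dwork, lwork, &info);  // dA := inverse(dA)

    magma_free(dwork);
    magma_free_cpu(ipiv);
    // info should be checked after each call; nonzero means failure.
}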
QUESTION
For a small benchmark of OpenMP on an i7-6700K I wrote the following code:
...ANSWER
Answered 2021-Sep-16 at 11:54
The problem comes from the collapse(2) clause and is also related to auto-vectorization of the code. Indeed, both compilers are unable to auto-vectorize the loop with the collapse, but ICC uses a very expensive idiv instruction in the middle of the hot loop (which is very bad) while GCC produces better code. This comes from the collapse(2) clause, which is not well optimized (by many compilers). You can see this on Godbolt. Note that optimizing a kernel using a collapse(2) clause is not easy, since the compiler does not know the bounds of the loops and hence the associated cost (nor the divisor for the modulus).
Without the collapse(2), GCC is able to vectorize the loop successfully but, surprisingly, ICC is not. Fortunately, we can help ICC using the simd directive. Once it is used, the two compilers generate relatively good code. It is still not optimal because size_t is generally 8 bytes and int is 4 bytes on mainstream x86-64 platforms, and the loop comparison between the different types makes the code harder to vectorize efficiently as well as to produce the best scalar instructions. You can use a temporary variable to fix that. You can see the resulting assembly code here.
Note that the assembly generated by ICC is very good once the code is fixed. The code is memory bound and the final code should saturate the RAM with only a few threads. Even the L3 cache should be saturated by the ICC-produced assembly if the input array fit into it.
Here is the fixed code:
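(The fixed code itself is not reproduced here. As a sketch of the changes the answer describes, the collapse(2) is dropped, a simd directive is added, and the loop counters stay in a single type; the array layout and names are assumptions.)

#include <omp.h>
#include <vector>

void scale_add(std::vector<float>& a, const std::vector<float>& b,
               int rows, int cols, float k) {
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i) {
        const size_t base = static_cast<size_t>(i) * cols;  // one temporary, one type
        #pragma omp simd
        for (int j = 0; j < cols; ++j)
            a[base + j] += k * b[base + j];
    }
}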
QUESTION
I was trying to compile some code of mine which compiled just fine in g++ (with the -Wnon-virtual-dtor flag). Also, my IDE and clang-tidy didn't warn me (I realize this might be misleading, of course).
When I tried to compile the same code with Intel's icpc (specifically icpc (ICC) 19.1.2.254 20200623), I got a warning which I was only now able to track down - I'm wondering whether I'm somehow at fault or whether that warning is actually incorrect in my case.
I wrote a somewhat minimal example of my class hierarchy:
...ANSWER
Answered 2021-May-09 at 19:44
Ok, so after also posting this question on the Intel forum - it seems to be a (very much non-critical) bug in the compiler - the workaround is to write both virtual and override.
This is discouraged by item C.128 of the C++ Core Guidelines (same link as in the comments) but is not problematic.
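A minimal sketch of that workaround (class names are placeholders):

struct Base {
    virtual ~Base() = default;
};

struct Derived : Base {
    // "virtual" is redundant alongside "override" (see C.128), but spelling out
    // both silences the icpc warning.
    virtual ~Derived() override = default;
};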
QUESTION
I want to use a neural network developed in Python (PyTorch) in a Fortran program. My OS is Ubuntu 18.04.
What I am doing:
- save it as torchscript: TurbNN.pt
- call it from c++ program: call_ts.cpp, call_ts.h
- call the C++ program from a Fortran program (using bind(C)): main.f90
I successfully compiled the codes using CMake (3.19.4) and g++ (7.5.0). However, I cannot compile them using Intel compilers (HPCKit 2021.1.0.2684):
...ANSWER
Answered 2021-Feb-14 at 03:57
Do you see cxx11 in the linker errors? It looks like your libcall_ts_cpp is compiled in a way that expects the new C++11 ABI for std::string, but perhaps the library where those functions are implemented was compiled with the old ABI. Here's a PyTorch forum post about the same problem: https://discuss.pytorch.org/t/issues-linking-with-libtorch-c-11-abi/29510/11
The solution is to download a new copy of the PyTorch libraries built with the new C++11 ABI.
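For illustration, the knob involved is libstdc++'s _GLIBCXX_USE_CXX11_ABI macro; every object linked together (call_ts.cpp and the PyTorch libraries alike) must agree on its value. A quick sketch to check what a given compiler invocation uses (assumes GCC's libstdc++):

#include <cstdio>
#include <string>  // pulls in libstdc++'s ABI configuration

int main() {
    // 1 = new C++11 ABI (std::__cxx11::basic_string), 0 = old ABI.
    // A cxx11-ABI libtorch build expects every translation unit to report 1.
    std::printf("_GLIBCXX_USE_CXX11_ABI = %d\n", _GLIBCXX_USE_CXX11_ABI);
}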
QUESTION
I have been compiling my code for some time with g++ and then moved to Intel's icpc compiler. With icpc I kept getting the following warnings:
ANSWER
Answered 2021-Jan-25 at 18:21
My question is whether it is always a good practice to remove sizes on inlining and inline as much as possible?
No, it is not always good practice to remove size limits on inlining, nor to inline as much as possible.
Ideally, inlining should be done only when it improves performance.
Are there situations at all where imposing an inlining limit is useful?
If a function is very large and it is called from many contexts, then inlining it into all of those contexts will bloat the executable. If the executable ends up being, say, several gigabytes because of inlining, then loading the program from disk may become the bottleneck.
In less pathological cases, the trade-offs are more subtle. The way to find out optimal limits is measurement. Profile guided optimisation can give the optimiser more useful heuristics than simple hard limits.
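For instance, profile-guided optimisation with g++ uses the standard -fprofile-generate / -fprofile-use pair (file names and the training workload are placeholders):

g++ -O2 -fprofile-generate app.cpp -o app
./app < training_input.txt        # run a representative workload to record a profile
g++ -O2 -fprofile-use app.cpp -o app   # re-optimise using the recorded profile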
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install ICPC
You can use ICPC like any standard Java library. Please include the jar files in your classpath. You can also use any IDE, and you can run and debug the ICPC component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.