core2 | Alloc support
kandi X-RAY | core2 Summary
Ever wanted a Cursor or the Error trait in no_std? Well, now you can have it. A 'fork' of Rust's std modules for no_std environments, with the added benefit of optionally taking advantage of alloc. The goal of this crate is to provide a stable interface for building I/O and error-trait functionality in no_std environments. The current code corresponds to the most recent stable API of Rust 1.47.0. It is also a goal to achieve a true alloc-less experience, with opt-in alloc support. This crate works on stable with some limitations in functionality, and on nightly without limitations by adding the relevant feature flag. The crate uses std by default; disable default features to get no_std mode.
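For a quick feel of what that enables, here is a minimal sketch of using a Cursor over a stack buffer in a no_std crate (assuming the crate mirrors the std::io Cursor/Read/Write surface it forks; this is not an excerpt from the crate's docs):

#![no_std]

use core2::io::{Cursor, Read, Write};

// Write into a fixed stack buffer through a Cursor, then read it back.
// No alloc is involved: the Cursor wraps a plain &mut [u8].
fn roundtrip() -> core2::io::Result<()> {
    let mut buf = [0u8; 8];
    let mut cursor = Cursor::new(&mut buf[..]);
    cursor.write_all(b"abcd")?;   // Write works without std
    cursor.set_position(0);
    let mut out = [0u8; 4];
    cursor.read_exact(&mut out)?; // and so does Read
    assert_eq!(&out, b"abcd");
    Ok(())
}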
Community Discussions
Trending Discussions on core2
QUESTION
I'm trying to make sure GCC vectorizes my loops. It turns out that by using -march=znver1 (or -march=native) GCC skips some loops even though they can be vectorized. Why does this happen? In this code, the second loop, which multiplies each element by a scalar, is not vectorised:
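(The loops in question were of roughly this shape; a hypothetical reconstruction, sketched in Rust for illustration, whereas the original question used C compiled with GCC:)

// First loop: store a constant to every element (memset-like); this one
// does get vectorized. Second loop: multiply each element by a scalar;
// this is the one reported as skipped at -march=znver1.
fn kernel(arr: &mut [u64]) {
    for v in arr.iter_mut() {
        *v = 1;
    }
    for v in arr.iter_mut() {
        *v = v.wrapping_mul(5);
    }
}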
...ANSWER
Answered 2022-Apr-10 at 02:47

The default -mtune=generic has -mprefer-vector-width=256, and -mavx2 doesn't change that. znver1 implies -mprefer-vector-width=128, because that's all the native width of the HW. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be ok; the pipeline handles 2-uop instructions efficiently. (And I think Zen 1 is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions.) But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider.
And vmovdqa ymm0, ymm1 mov-elimination only works on the low 128-bit half on Zen1. Also, normally using 256-bit vectors would imply one should use vzeroupper afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider increasing the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code if the size isn't known to be a multiple of the vector width.
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there. (Pure vertical, no mixing of element widths, constant size that's a multiple of 32 bytes.) At least on CPUs where that's fine: znver1, but not bdver2 for example, where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a vmovdqu [rdx], xmm0. https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can only hold two uint64_t elements, it (rightly or wrongly) decides it wouldn't be worth using vpsllq / vpaddd to implement qword *5 as (v<<2) + v, vs. doing it with integer in one LEA instruction.

Almost certainly wrongly in this case, since it still requires a separate load and store for every element or pair of elements. (And loop overhead, since GCC's default is not to unroll except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes it's actually quite cheap for a constant like 5 with only 2 set bits?
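(For reference, the strength reduction at issue: 5 = 0b101 has only two set bits, so v*5 folds into a shift and an add. A tiny sketch, in Rust for consistency with the other snippets here, though the discussion itself is about GCC's x86 output:)

// v*5 == (v << 2) + v, with wrapping semantics matching a wrapping multiply.
fn times5(v: u64) -> u64 {
    (v << 2).wrapping_add(v)
}

fn main() {
    assert_eq!(times5(7), 35);
    assert_eq!(times5(u64::MAX), u64::MAX.wrapping_mul(5));
}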
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, those are the kinds of factors at work in a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just do sequences of optimization passes that don't always consider the big picture and all the details at the same time.
-mprefer-vector-width=256 doesn't help: not vectorizing uint64_t *= 5 seems to be a GCC9 regression.
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64 in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
Even with -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC 9, 10, 11, or current trunk. As you say, we do with -march=znver2. https://godbolt.org/z/dMTh7Wxcq
We do get vectorization with those options for uint32_t (even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128 or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost-model decide not to vectorize, or just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t, changing to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; and arr[i] += 123 in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5, just with different operands, constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost-model isn't looking as far as final x86 asm machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1; which is exactly the same thing.
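(Collected as loops, the variants from this paragraph look like this; a sketch in Rust, while the originals were C array loops, with the vectorization outcomes as reported above for GCC:)

fn variants(arr: &mut [u64]) {
    // arr[i] += arr[i] << 2: reported as still not vectorized
    for v in arr.iter_mut() {
        *v = v.wrapping_add(*v << 2);
    }
    // arr[i] <<= 1: reported as vectorized
    for v in arr.iter_mut() {
        *v <<= 1;
    }
    // arr[i] <<= 2 and arr[i] += 123 in the same loop: reported as vectorized
    for v in arr.iter_mut() {
        *v = (*v << 2).wrapping_add(123);
    }
}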
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6
QUESTION
I'm trying to wrap my head around the issue of memory barriers right now. I've been reading and watching videos about the subject, and I want to make sure I understand it correctly, as well as ask a question or two.
Let me start by making sure I understand the problem accurately. Let's take the following classic example as the basis for the discussion: suppose we have 2 threads running on 2 different cores (this is pseudo-code!). We start with int f = 0; int x = 0; and then run those threads:
ANSWER
Answered 2022-Mar-28 at 09:06

From my point of view you missed the most important thing! Since the compiler does not see that changing x or f has any side effect, it is free to optimize all of that away. The loop with the condition f==0 will likewise compile to "nothing": the compiler only sees a constant f=0 assigned beforehand, so it can assume f==0 is always true and remove the loop.

To prevent all of that, you have to tell the compiler that something happens which is not visible from the given flow of code. That can be a call to some semaphore/mutex/... or other IPC functionality, or the use of atomic variables.

If you compile your code as written, I assume you get more or less "nothing": neither part has any visible effect, and the compiler, not seeing that the variables are used from two thread contexts, optimizes everything away.
If we implement the code as in the following example, we see that it fails and prints 0 on my system.
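(The example itself is cut off above. As a hedged sketch of the fix the answer describes, written in Rust, whose atomics use the same C++11 memory model: making the flag atomic with release/acquire ordering tells both the compiler and the CPU that the stores are observable from another thread, so they can be neither optimized away nor reordered.)

use std::sync::atomic::{AtomicBool, AtomicI32, Ordering};
use std::thread;

static X: AtomicI32 = AtomicI32::new(0);
static F: AtomicBool = AtomicBool::new(false);

fn main() {
    let writer = thread::spawn(|| {
        X.store(42, Ordering::Relaxed);
        F.store(true, Ordering::Release); // publish: everything before this store...
    });
    let reader = thread::spawn(|| {
        while !F.load(Ordering::Acquire) {} // ...is visible once this load sees true
        assert_eq!(X.load(Ordering::Relaxed), 42);
    });
    writer.join().unwrap();
    reader.join().unwrap();
}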
QUESTION
I have a large list data with 300 named elements and a data frame data6 with values. The list data looks like below:
ANSWER
Answered 2022-Feb-03 at 16:22

We may use lapply/Map to loop over the columns of 'data', apply the function and then cbind the list elements.
QUESTION
I want to test my site in a lot of threads. But when I try to do that, I see one problem: all actions happen in the last opened window, so the first window is just stuck in the background.
...ANSWER
Answered 2022-Jan-21 at 20:25

This is because you are using a static field for the Firefox driver. Static means one per all instances, so remove static here.
QUESTION
I'm trying to find the correct value for the KMACHINE setting, defined as "The machine as known by the kernel." When I manually configure the kernel (outside of Yocto) I do not enter a machine type. I do set ARCH=arm, choose a "system type" config option like CONFIG_ARCH_LPC32XX=y, or load a defconfig like lpc32xx_defconfig, but I don't know if any of those is what KMACHINE is supposed to be.
As an example, the Yocto documentation gives intel-core2-32, which does not appear anywhere in the Linux 5.15 sources.
ANSWER
Answered 2022-Jan-18 at 07:53

KMACHINE is used to select Yocto-specific metadata for building the kernel, and is not passed to the kernel build system. By default, it is set to ${MACHINE} in kernel-yocto.bbclass, and it can be overridden if a machine does not need its own metadata selection and can instead use existing metadata.
There's a better description under LINUX_KERNEL_TYPE in the manual (paraphrased): The KMACHINE and LINUX_KERNEL_TYPE variables define the search arguments used by Yocto's kernel tools to find the appropriate description within Yocto's kernel metadata with which to build out the kernel sources and configuration.
This kernel metadata is maintained by the Yocto Project, in the yocto-kernel-cache repository. It is optional, and is only used if the selected kernel recipe is a "linux-yocto" style recipe (i.e. it inherits linux-yocto.inc).
If you're using an out-of-kernel-tree defconfig to configure your kernel, it's unlikely you'll need Yocto's kernel metadata, and you therefore don't need to override KMACHINE.
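(As a concrete illustration, an override could look like the following; a hypothetical snippet, with the variable names taken from this answer and the value reusing the documentation's intel-core2-32 example:)

# hypothetical machine configuration
KMACHINE = "intel-core2-32"
LINUX_KERNEL_TYPE = "standard"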
QUESTION
#include <iostream>
#include <string>
#include <chrono>
#include <cstdlib>
#include <pthread.h>
#include <unistd.h>
using namespace std;
static inline void stick_this_thread_to_core(int core_id);
static inline void* incrementLoop(void* arg);
struct BenchmarkData {
long long iteration_count;
int core_id;
};
pthread_barrier_t g_barrier;
int main(int argc, char** argv)
{
if(argc != 3) {
cout << "Usage: ./a.out " << endl;
return EXIT_FAILURE;
}
cout << "================================================ STARTING ================================================" << endl;
int core1 = std::stoi(argv[1]);
int core2 = std::stoi(argv[2]);
pthread_barrier_init(&g_barrier, nullptr, 2);
const long long iteration_count = 100'000'000'000;
BenchmarkData benchmark_data1{iteration_count, core1};
BenchmarkData benchmark_data2{iteration_count, core2};
pthread_t worker1, worker2;
pthread_create(&worker1, nullptr, incrementLoop, static_cast<void*>(&benchmark_data1));
cout << "Created worker1" << endl;
pthread_create(&worker2, nullptr, incrementLoop, static_cast<void*>(&benchmark_data2));
cout << "Created worker2" << endl;
pthread_join(worker1, nullptr);
cout << "Joined worker1" << endl;
pthread_join(worker2, nullptr);
cout << "Joined worker2" << endl;
return EXIT_SUCCESS;
}
static inline void stick_this_thread_to_core(int core_id) {
int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
if (core_id < 0 || core_id >= num_cores) {
cerr << "Core " << core_id << " is out of assignable range.\n";
return;
}
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
pthread_t current_thread = pthread_self();
int res = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
if(res == 0) {
cout << "Thread bound to core " << core_id << " successfully." << endl;
} else {
cerr << "Error in binding this thread to core " << core_id << '\n';
}
}
static inline void* incrementLoop(void* arg)
{
BenchmarkData* arg_ = static_cast<BenchmarkData*>(arg);
int core_id = arg_->core_id;
long long iteration_count = arg_->iteration_count;
stick_this_thread_to_core(core_id);
cout << "Thread bound to core " << core_id << " will now wait for the barrier." << endl;
pthread_barrier_wait(&g_barrier);
cout << "Thread bound to core " << core_id << " is done waiting for the barrier." << endl;
long long data = 0;
long long i;
cout << "Thread bound to core " << core_id << " will now increment private data " << iteration_count / 1'000'000'000.0 << " billion times." << endl;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for(i = 0; i < iteration_count; ++i) {
++data;
__asm__ volatile("": : :"memory");
}
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
unsigned long long elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
cout << "Elapsed time: " << elapsed_time << " ms, core: " << core_id << ", iteration_count: " << iteration_count << ", data value: " << data << ", i: " << i << endl;
return nullptr;
}
...ANSWER
Answered 2022-Jan-13 at 08:40

It turns out that cores 0, 16, 17 were running at much higher frequency on my Skylake server.
QUESTION
Yocto 3.4 failed at the glibc 2.34 compilation. I think the error is happening at the linking stage:
...ANSWER
Answered 2021-Dec-09 at 08:55

I tried with the definition in dso_handle.h as below:
QUESTION
I'm trying to build the Alexa Auto SDK (https://github.com/alexa/alexa-auto-sdk/blob/3.2/builder/README.md) and I'm using an Apple Silicon M1. I installed Docker successfully, but sadly, running
./builder/build.sh android -t androidx86-64 --android-api 28
I now run into
...ANSWER
Answered 2021-Nov-20 at 08:33

I don't know if this will solve your problem, but I was facing a similar issue building the auto-sdk for Android on Intel-based macOS machines. We were able to solve the problem by increasing Docker's default RAM allocation (set to 2 GB): https://docs.docker.com/desktop/mac/ After increasing it to 6 GB it worked perfectly.
QUESTION
I am trying to build openblas.bb in a Yocto project but it fails. The machine is "qemux86-64".
ANSWER
Answered 2021-Nov-02 at 15:25

I changed to the newest version of openblas:
QUESTION
I'm on a Mac M1 machine. I'm using RStudio in Anaconda and I wanted to update the R packages with the update button. However, I got the same error for many of the packages when I tried to update. Here is one example:
...ANSWER
Answered 2021-Sep-15 at 22:16

It is simpler to avoid using install.packages() when using an R environment managed by Conda, especially when the package involves compilation. Instead, prefer using Conda for installation. In this particular case, use
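(The specific command is cut off above. As an assumption about its shape: Conda packages R libraries under an r- prefix, so the pattern would be along these lines.)

# hypothetical; substitute the actual package name
conda install -c conda-forge r-somepackage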
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install core2
Rust is installed and managed by the rustup tool. Rust has a 6-week rapid release process and supports a great number of platforms, so there are many builds of Rust available at any time. Please refer to rust-lang.org for more information.
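To pull in core2 itself, the dependency goes in Cargo.toml. A sketch (check crates.io for the current version rather than trusting the number here):

[dependencies]
core2 = { version = "0.4", default-features = false }  # default features off for no_std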