multicore | Parallel processing of R code | GPU library
kandi X-RAY | multicore Summary
Parallel processing of R code on machines with multiple cores or CPUs
multicore Key Features
multicore Examples and Code Snippets
import skimage.data
import imgaug as ia
import imgaug.augmenters as iaa
from imgaug.augmentables.batches import UnnormalizedBatch

# Number of batches and batch size for this example
nb_batches = 10
batch_size = 32

# Example augmentation sequence to run in the background
augseq = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.CropAndPad(px=(-10, 10))
])
Community Discussions
Trending Discussions on multicore
QUESTION
Modern multicore CPUs synchronize cache between cores by snooping, i.e. each core broadcasts what it is doing in terms of memory access, and watches the broadcasts generated by other cores, to cooperate in making sure writes from core A are seen by core B.
This is good in that if you have data that really does need to be shared between threads, it minimizes the amount of code you have to write to make sure it does get shared.
It's bad in that if you have data that should be local to just one thread, the snooping still happens, constantly dissipating energy to no purpose.
Does the snooping still happen if you declare the relevant variables thread_local? Unfortunately, the answer is yes, according to the accepted answer to Can other threads modify thread-local memory?
Does any currently extant platform (combination of CPU and operating system) provide any way to turn off snooping for thread-local data? Doesn't have to be a portable way; if it requires issuing OS-specific API calls, or even dropping into assembly, I'm still interested.
...ANSWER
Answered 2021-Mar-24 at 01:57
There is a basic invalidation-based protocol, MESI, which is somewhat foundational. There are other extensions of it, but it serves to minimize the number of bus transactions on a read or write. MESI encodes the states a cache line can be in: Modified, Exclusive, Shared, Invalid. A basic schematic of MESI involves two views. The dashes (-) mean possibly an internal state change, but no external operation required. From the CPU to its cache:
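As a rough illustration of that CPU-side view, a toy lookup table of MESI transitions might look like this (a simplification for illustration, not any real implementation):

```python
# Toy model of MESI for a single cache line, from the owning CPU's point
# of view. Bus-side transitions triggered by other CPUs are omitted.
# States: M(odified), E(xclusive), S(hared), I(nvalid).

# (current_state, operation) -> (next_state, bus_transaction_needed)
CPU_SIDE = {
    ("I", "read"):  ("S", True),   # BusRd: fetch the line (E if no sharers)
    ("I", "write"): ("M", True),   # BusRdX: fetch line for exclusive ownership
    ("S", "read"):  ("S", False),  # hit, no bus traffic
    ("S", "write"): ("M", True),   # upgrade: invalidate other copies
    ("E", "read"):  ("E", False),  # hit
    ("E", "write"): ("M", False),  # silent upgrade: line already exclusive
    ("M", "read"):  ("M", False),  # hit
    ("M", "write"): ("M", False),  # hit
}

def access(state, op):
    """Return (next_state, needs_bus) for a CPU read or write on a line."""
    return CPU_SIDE[(state, op)]

print(access("I", "write"))  # ('M', True)  - needs a bus transaction
print(access("E", "write"))  # ('M', False) - no bus traffic at all
```

The last line shows the point of the Exclusive state: a write to a line held exclusively upgrades it to Modified without any external operation.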
QUESTION
I've been trying to find an answer to this, and all I could find is that once a thread reaches a critical section, it locks it in front of other threads (or some other lock mechanism is used to lock the critical section).
But that implies that the threads didn't really reach the CS at exactly the same microsecond.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Can I simply assume that the program will malfunction?
Note: I am referring to multicore CPUs.
Thanks.
...ANSWER
Answered 2021-Mar-21 at 11:44
I think you are missing the point of fundamental locking primitives like semaphores. If the correct primitive is used, and used correctly, then the timing of the threads does not matter. They may well be simultaneous. The operating system guarantees that no two threads will enter the critical section. Even on multicore machines, this bit is specially implemented (with lots of trickery, even) to get that assurance.
To address your concerns specifically:
But that implies that the threads didn't really reach the CS exactly at the same microsecond.
No. The other threads could have arrived in the same microsecond, BUT if the locking mechanism is correct, then only one of the competing threads will "enter" the critical section and the others will wait.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Rare or not, if the correct locking primitive is used, and used correctly, then no two threads will enter the critical section.
Can I simply assume that the program will malfunction?
Ideally the program should not malfunction. But any code can have bugs - including your code and the operating system's semaphore code. So it is safe to assume that in some edge cases the program will indeed malfunction. But this assumption holds for any code in general.
Locking and critical sections are rather tricky to implement correctly. So for non-academic purposes we should always use the system-provided locking primitives. All operating systems expose primitives like semaphores, which most programming languages have ways to use. Some programming languages have their own lightweight implementations, which provide somewhat softer guarantees but at higher performance. As I said, when working with critical sections, it is critical to choose the correct primitive and also to use it correctly.
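As a sketch of that guarantee in software, with Python's threading.Lock standing in for an OS-provided primitive: however the threads interleave, the lock makes a racy read-modify-write deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:              # at most one thread in the critical section
            tmp = counter       # read
            counter = tmp + 1   # modify + write, protected by the lock

threads = [threading.Thread(target=worker, args=(50_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000, regardless of how the threads interleave
```

Without the `with lock:` line, two threads can read the same `tmp` value and one increment is lost, which is exactly the malfunction the question worries about.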
QUESTION
I want to get the accuracy measures for 9 different subseries, as described in my R loop below, with the following steps:
- Simulate a sample of 10 AR(1) series.
- Split the series into subseries of sizes 1, 2, 3, 4, 5, 6, 7, 8, 9 without overlapping.
- Resample the subseries 1000 times for each block size, with replacement.
- Form a new series by joining all the resampled subseries for each block size.
- Check the accuracy (ME, RMSE, MAE, MPE, MAPE) of the newly formed NINE (9) sub-series for each block size, as follows:
...ANSWER
Answered 2021-Mar-19 at 07:26
ACCURACY is a 1-row matrix:
QUESTION
Using Visual Studio 2019...
I have some huge projects and it takes a long time to build them.
I want to buy a new CPU; should I go with a fast single-core CPU (like the Intel Core i9-11900K) or should I choose a fast multicore CPU (like the AMD Threadripper 3960X)?
Does VS2019 take advantage of multicore CPUs when building/running projects?
Thanks.
...ANSWER
Answered 2021-Mar-10 at 02:58
As you wished, the VS IDE does support a multi-core build process.
VS will first get a basic performance evaluation based on your current CPU hardware. Then you should open the setting under Tools --> Options --> Projects and Solutions --> Build and Run, where you will see "maximum number of parallel project builds".
Set the number of build processes based on your CPU performance.
Obviously, it is better to use a multi-core build.
Note: the value 1 means single-core; raise the value to enable multi-core builds.
If you build C++ projects, there is a second option under Tools --> Options --> Projects and Solutions --> VC++ Project Settings --> Maximum concurrent C++ compilations.
In this situation, the value 0 means all CPUs will be used.
QUESTION
See the edit at the end for a reproducible example.
Problem description
When I run boot::censboot(data, statistic, parallel = "multicore", ncpus = 2, var = whatEver), where I've defined statistic <- function(data, var), I get error messages of type FUN(X[[i]], ...) : unused argument (var = whatEver). The issue is that statistic is not able to see the value of var.
This does not happen when I call boot::censboot(data, statistic, parallel = "no").
By debugging I can see that:
If parallel = "no", boot::censboot is running something like this:
ANSWER
Answered 2021-Feb-04 at 11:26
This is a rewrite of the original post that gives a better explanation of what went wrong, and fixes a possible bug in the workaround.
That looks like a bug in censboot. It doesn't handle the ... parameter correctly. (More explanation below.) The reason you don't get an error with parallel = 'no' is that the code follows a different path.
A workaround is to use "partial application" to create a 1-parameter statistic function, like this:
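The R code is not shown here; as an illustration of the same partial-application idea in Python, using functools.partial (the statistic and the var name are placeholders, not the original code):

```python
from functools import partial

# Hypothetical two-parameter statistic; the names are illustrative only.
def statistic(data, var):
    return sum(row[var] for row in data)

# Bind `var` up front so the resulting function takes only `data`,
# matching what a caller that supplies a single argument expects.
stat_fixed = partial(statistic, var="x")

data = [{"x": 1}, {"x": 2}, {"x": 3}]
print(stat_fixed(data))  # 6
```

The bound function carries `var` with it, so a parallel backend that only forwards `data` never needs to see the extra argument, which is the point of the workaround.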
QUESTION
I was reading a book about assembly (intermediate level) and it mentioned that some instructions, like xchg, automatically assert the processor's LOCK# signal. Searching online revealed that this gives the processor the exclusive right to any shared memory, with no specific details. That made me wonder how this right works.
- Does this mean that any other computer device, like a GPU or something else, can't access memory? And can other devices talk directly to RAM without going through the CPU first?
- How does the processor know that it's in this locked state? Is it saved in a control register or RFLAGS, for example? I can't see how this operation works on a multicore CPU.
- The websites I visited said it locks any shared memory. Does this mean that during this lock period the whole RAM is locked, or just the memory page (or part of memory, not all of it) that the instruction operates on?
ANSWER
Answered 2021-Jan-12 at 09:42
The basic problem is that some instructions read memory, modify the value read, then write a new value; and if the contents of memory change between the read and the write, (some) parallel code can end up in an inconsistent state.
A nice example is one CPU doing inc dword [foo] while another CPU does dec dword [foo]. After both instructions (on both CPUs) are executed, the value should be the same as it originally was; but both CPUs could read the old value, then both could modify it, then both could write their new value, resulting in a value 1 higher or 1 lower than you'd expect.
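That lost-update interleaving can be spelled out deterministically. A small Python simulation of the two "CPUs" sharing [foo], with plain variables standing in for registers:

```python
# Deterministic simulation of the inc/dec race described above:
# two "CPUs" each do read -> modify -> write, but both reads happen
# before either write, so one update is lost.
memory = {"foo": 5}

a = memory["foo"]        # CPU A reads 5 (its `inc` starts)
b = memory["foo"]        # CPU B reads 5, before A writes back
memory["foo"] = a + 1    # CPU A writes 6 (inc)
memory["foo"] = b - 1    # CPU B writes 4 (dec), clobbering A's update

print(memory["foo"])  # 4, not the 5 you'd expect after one inc and one dec
```

Swapping the two writes makes the final value 6 instead; either way one of the two updates vanishes.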
The solution was to use a #lock signal to prevent anything else from accessing the same piece of memory at the same time. E.g. the first CPU would assert #lock, then do its read/modify/write, then de-assert #lock; and anything else would see that #lock is asserted and have to wait until it is de-asserted before doing any memory access. In other words, it's a simple form of mutual exclusion (like a spinlock, but in hardware).
Of course "everything else has to wait" has a performance cost; so it's mostly only done when explicitly requested by software (e.g. lock inc dword [foo]
and not inc dword [foo]
) but there are a few cases where it's done implicitly - xchg
instruction when an operand uses memory, and updates to dirty/accessed/busy flags in some of the tables the CPU uses (for paging, and GDT/LDT/IDT entries). Also; later (Pentium Pro I think?), the behavior was optimized to work with cache coherency protocol so that the #lock
isn't asserted if the cache line can be put in the exclusive state instead.
Note: In the past there have been 2 CPU bugs (the Intel Pentium "0xF00F" bug and the Cyrix "Coma" bug) where a CPU could be tricked into asserting the #lock signal and never de-asserting it, causing the entire system to lock up because nothing can access any memory.
- Does this mean that any other computer device, like a GPU or something else, can't access memory? And can other devices talk directly to RAM without going through the CPU first?
Yes. If #lock is asserted (which doesn't include cases where newer CPUs can put the cache line into the exclusive state instead), anything that accesses memory has to wait for #lock to be de-asserted.
Note: Most modern devices can/do access memory directly (to transfer data to/from RAM without using the CPU to transfer data).
- How does the processor know that it's in this locked state? Is it saved in a control register or RFLAGS, for example? I can't see how this operation works on a multicore CPU.
It's not saved in the contents of any register. It's literally an electronic signal on a bus or link. For an extremely over-simplified example, assume that the bus has 32 "address" wires and 32 "data" wires, plus a #lock wire; where "asserting #lock" means that the voltage on that #lock wire goes from 0 volts up to 3.3 volts. When anything wants to read or write memory (before attempting to change the voltages on the "address" or "data" wires), it checks that the voltage on the #lock wire is 0 volts.
Note: A real bus is much more complicated and needs a few other signals (e.g. for direction of transfer, for collision avoidance, for "I/O port or physical memory", etc); and modern buses use serial lanes and not parallel wires; and modern systems use "point to point links" and not "common bus shared by all the things".
- The websites I visited said it locks any shared memory. Does this mean that during this lock period the whole RAM is locked, or just the memory page (or part of memory, not all of it) that the instruction operates on?
It's better to say that the bus is locked; where everything has to use the bus to access memory (and nothing else can use the bus when the bus is locked, even when something else is trying to use the bus for something that has nothing to do with memory - e.g. to send an IRQ to a CPU).
Of course (due to aggressive performance optimizations - primarily the "if the cache line can be put in the exclusive state instead" optimization) it's even better to say that the hardware can do anything it feels like as long as the result behaves as if there's a shared bus that was locked (even if there isn't a shared bus and nothing was actually locked).
Note: 80x86 supports misaligned accesses (e.g. you can lock inc dword [address] where the access straddles a boundary). If a memory access does straddle a boundary, the CPU needs to combine 2 or more pieces (e.g. a few bytes from the end of one cache line and a few bytes from the start of the next cache line). Modern virtual memory means that if the virtual address straddles a page boundary, the CPU needs to access 2 different virtual pages, which may have "extremely unrelated" physical addresses. If a theoretical CPU tried to implement independent locks (a different lock for each memory area), then it would also need to support asserting multiple lock signals. This can cause deadlocks - e.g. one CPU locks "memory page 1" then tries to lock "memory page 2" (and can't, because it's locked); while another CPU locks "memory page 2" then tries to lock "memory page 1" (and can't, because it's locked). To fix that, the theoretical CPU would have to use "global lock ordering" - always asserting locks in a specific order. The end result would be a significant amount of complexity (where it's likely that the added complexity costs more performance than it saves).
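The same "global lock ordering" trick is standard in software. A minimal sketch in Python (the two locks and thread names are illustrative, the granularity is arbitrary):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
done = []

def critical(first, second, name):
    # Acquire both locks in one fixed global order (here: by object id),
    # so two threads given the locks in opposite orders cannot deadlock.
    ordered = sorted([first, second], key=id)
    for lk in ordered:
        lk.acquire()
    try:
        done.append(name)
    finally:
        for lk in reversed(ordered):
            lk.release()

# t1 asks for (a, b); t2 asks for (b, a) - the classic deadlock setup.
t1 = threading.Thread(target=critical, args=(lock_a, lock_b, "t1"))
t2 = threading.Thread(target=critical, args=(lock_b, lock_a, "t2"))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(done))  # ['t1', 't2'] -- both threads finish, no deadlock
```

If each thread instead acquired the locks in the order it was handed them, the two could each grab one lock and wait forever for the other.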
QUESTION
I have a Python dataframe with columns "a, b, c, d, ..., z", and I want to get all possible combinations: "aa, ab, ac, ad, ..., az", then "ba, bb, bc, bd, ..., bz", and so on.
What I have done is a simple nested for loop:
...ANSWER
Answered 2020-Nov-29 at 21:03
I think what you are looking for is combinations from itertools, a package in the standard library.
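A minimal sketch of the itertools approach. Note that combinations yields unordered pairs without repeats, while product with repeat=2 matches the exact "aa, ab, ..., ba, ..." listing the question describes:

```python
from itertools import combinations, product

letters = "abcd"  # stand-in for the column labels a..z

# combinations: unordered pairs without repeats -> ab, ac, ad, bc, ...
pairs = ["".join(p) for p in combinations(letters, 2)]

# product(repeat=2): every ordered pair, repeats included -> aa, ab, ..., ba
all_pairs = ["".join(p) for p in product(letters, repeat=2)]

print(pairs)          # ['ab', 'ac', 'ad', 'bc', 'bd', 'cd']
print(all_pairs[:5])  # ['aa', 'ab', 'ac', 'ad', 'ba']
```

Either replaces the nested for loop with a single generator expression; which one is right depends on whether "ab" and "ba" should count as distinct.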
QUESTION
I have the possibility to do processing in an HPC environment, where task management and resource allocation are controlled by the SLURM batch job system. However, I have not found the right configuration to utilize the allocated resources efficiently within R. I have tried to allocate 20 CPUs to one task in SLURM, using the plan(multicore) function of the future package in R. After test runs with different numbers of allocated CPUs, efficiency statistics suggested that with these settings only one of the allocated CPUs was used during the test runs.
The SLURM bash script is presented below:
...ANSWER
Answered 2020-Nov-17 at 08:24
The issue was found by the HPC service provider. For an unknown reason, the OMP_PLACES=cores variable, which should tie threads/processes to specific cores, appeared to bind all processes to a single core when running multi-core R jobs. The issue was solved by rebuilding the r-environment Singularity container.
QUESTION
I have the following code that prints to my serial console in Mongoose OS. I can see the device in Amazon AWS, and in mDash as online. I get back a 0 from the pub, and it never sends a message to AWS MQTT. I have subscribed to the topic in the test section in MQTT, so I'm not sure what I am doing wrong.
...ANSWER
Answered 2020-Nov-11 at 00:17
OK, it is working again; I needed to delete the device in Amazon and run the mos aws setup command again.
So the real problem was that I was flashing my device each time I updated my code, and it would overwrite my WiFi settings, my AWS settings and my mDash settings.
Thanks to help from a different source, this solved it for me.
QUESTION
I am reproducing some simple 10-arm bandit experiments from Sutton and Barto's book Reinforcement Learning: An Introduction. Some of these require significant computation time, so I tried to take advantage of my multicore CPU.
Here is the function which I need to run 2000 times. It has 1000 sequential steps which incrementally improve the reward:
...ANSWER
Answered 2020-Nov-01 at 15:44
As we found out, it is different on Windows and Ubuntu. It is probably because of this:
spawn The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process objects run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.
Available on Unix and Windows. The default on Windows and macOS.
fork The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
Available on Unix only. The default on Unix.
Try adding this line to your code:
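The line itself is cut off in this copy; based on the quoted start-method docs, it is presumably a multiprocessing.set_start_method call, along these lines:

```python
import multiprocessing as mp

# Presumed fix: pick one start method explicitly so the program behaves
# the same on Linux (default "fork") as on Windows/macOS (default "spawn").
# force=True only matters if a start method has already been set.
mp.set_start_method("spawn", force=True)
print(mp.get_start_method())  # spawn
```

With "spawn", each worker starts a fresh interpreter and inherits nothing implicitly, so code that relied on fork's copy-everything behavior fails the same way on every platform instead of only on Windows.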
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported