multicore | Parallel processing of R code | GPU library
kandi X-RAY | multicore Summary
Parallel processing of R code on machines with multiple cores or CPUs
multicore Key Features
multicore Examples and Code Snippets
import skimage.data
import imgaug as ia
import imgaug.augmenters as iaa
from imgaug.augmentables.batches import UnnormalizedBatch

# Number of batches and batch size for this example
nb_batches = 10
batch_size = 32

# Example augmentation sequence to run in the background
augseq = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.CropAndPad(px=(-10, 10))
])
Community Discussions
Trending Discussions on multicore
QUESTION
Modern multicore CPUs synchronize cache between cores by snooping, i.e. each core broadcasts what it is doing in terms of memory access, and watches the broadcasts generated by other cores, to cooperate in making sure writes from core A are seen by core B.
This is good in that if you have data that really does need to be shared between threads, it minimizes the amount of code you have to write to make sure it does get shared.
It's bad in that if you have data that should be local to just one thread, the snooping still happens, constantly dissipating energy to no purpose.
Does the snooping still happen if you declare the relevant variables thread_local? Unfortunately, the answer is yes, according to the accepted answer to Can other threads modify thread-local memory?
Does any currently extant platform (combination of CPU and operating system) provide any way to turn off snooping for thread-local data? Doesn't have to be a portable way; if it requires issuing OS-specific API calls, or even dropping into assembly, I'm still interested.
...ANSWER
Answered 2021-Mar-24 at 01:57
There is a basic invalidation-based protocol, MESI, which is somewhat foundational. There are other extensions of it, but it serves to minimize the number of bus transactions on a read or write. MESI encodes the states a cache line can be in: Modified, Exclusive, Shared, Invalid. A basic schematic of MESI involves two views. The dashes (-) mean possibly an internal state change, but no external operation required. From the CPU to its cache:
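As a rough illustration of that CPU-side view, a toy lookup table of MESI transitions might look like this (a simplification for illustration, not any real implementation):

```python
# Toy model of MESI for a single cache line, from the owning CPU's point
# of view. Bus-side transitions triggered by other CPUs are omitted.
# States: M(odified), E(xclusive), S(hared), I(nvalid).

# (current_state, operation) -> (next_state, bus_transaction_needed)
CPU_SIDE = {
    ("I", "read"):  ("S", True),   # BusRd: fetch the line (E if no sharers)
    ("I", "write"): ("M", True),   # BusRdX: fetch line for exclusive ownership
    ("S", "read"):  ("S", False),  # hit, no bus traffic
    ("S", "write"): ("M", True),   # upgrade: invalidate other copies
    ("E", "read"):  ("E", False),  # hit
    ("E", "write"): ("M", False),  # silent upgrade: line already exclusive
    ("M", "read"):  ("M", False),  # hit
    ("M", "write"): ("M", False),  # hit
}

def access(state, op):
    """Return (next_state, needs_bus) for a CPU read or write on a line."""
    return CPU_SIDE[(state, op)]

print(access("I", "write"))  # ('M', True)  - needs a bus transaction
print(access("E", "write"))  # ('M', False) - no bus traffic at all
```

The last line shows the point of the Exclusive state: a write to a line held exclusively upgrades it to Modified without any external operation.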
QUESTION
I've been trying to find an answer to this, and all I could find is that once a thread reaches a critical section, it locks it in front of other threads (or some other lock mechanism is used to lock the critical section).
But that implies that the threads didn't really reach the CS at exactly the same microsecond.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Can I simply assume that the program will malfunction?
Note: I am referring to multicore CPUs.
Thanks.
...ANSWER
Answered 2021-Mar-21 at 11:44
I think you are missing the point of fundamental locking primitives like semaphores. If the correct primitive is used, and used correctly, then the timing of the threads does not matter. They may well be simultaneous. The operating system guarantees that no two threads will enter the critical section. Even on multicore machines, this bit is specially implemented (with lots of trickery, even) to get that assurance.
To address your concerns specifically:
But that implies that the threads didn't really reach the CS exactly at the same microsecond.
No. The other threads could have arrived in the same microsecond, BUT if the locking mechanism is correct, then only one of the competing threads will "enter" the critical section and the others will wait.
Although I guess it is quite rare, can it really happen, and what happens in this situation?
Rare or not, if the correct locking primitive is used, and used correctly, then no two threads will enter the critical section.
Can I simply assume that the program will malfunction?
Ideally the program should not malfunction. But any code can have bugs - including your code and the operating system's semaphore code. So it is safe to assume that in some edge cases the program will indeed malfunction. But this assumption holds for any code in general.
Locking and critical sections are rather tricky to implement correctly. So for non-academic purposes we should always use the system-provided locking primitives. All operating systems expose primitives like semaphores, which most programming languages have ways to use. Some programming languages have their own lightweight implementations, which provide somewhat softer guarantees but at higher performance. As I said, when working with critical sections, it is critical to choose the correct primitive and also to use it correctly.
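As a sketch of that guarantee in software, with Python's threading.Lock standing in for an OS-provided primitive: however the threads interleave, the lock makes a racy read-modify-write deterministic.

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:              # at most one thread in the critical section
            tmp = counter       # read
            counter = tmp + 1   # modify + write, protected by the lock

threads = [threading.Thread(target=worker, args=(50_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000, regardless of how the threads interleave
```

Without the `with lock:` line, two threads can read the same `tmp` value and one increment is lost, which is exactly the malfunction the question worries about.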
QUESTION
I want to get the accuracy measures for 9 different subseries, as described in my R loop below, with the following steps:
- Simulate a sample of 10 AR(1) series.
- Split the series into subseries of sizes 1, 2, 3, 4, 5, 6, 7, 8, 9 without overlapping.
- Resample the subseries 1000 times for each block size, with replacement.
- Form a new series by joining all the resampled subseries for each block size.
- Check the accuracy (ME, RMSE, MAE, MPE, MAPE) of the newly formed NINE (9) sub-series for each block size, as follows:
...ANSWER
Answered 2021-Mar-19 at 07:26
ACCURACY is a 1-row matrix:
QUESTION
Using Visual Studio 2019...
I have some huge projects and it takes a long time to build them.
I want to buy a new CPU; should I go with a fast single-core CPU (like the Intel Core i9-11900K) or should I choose a fast multicore CPU (like the AMD Threadripper 3960X)?
Does VS2019 take advantage of multicore CPUs when building/running projects?
Thanks.
...ANSWER
Answered 2021-Mar-10 at 02:58
As you wished, the VS IDE does support a multi-core build process.
VS will first get a basic performance evaluation based on your current CPU hardware. Then you should open the setting under Tools --> Options --> Projects and Solutions --> Build and Run, where you will see "maximum number of parallel project builds".
Set the number of build processes based on your CPU performance.
Obviously, it is better to use a multi-core build.
Note: the value 1 means single-core; raise the value to enable multi-core builds.
If you build C++ projects, there is a second option under Tools --> Options --> Projects and Solutions --> VC++ Project Settings --> Maximum concurrent C++ compilations.
In this situation, the value 0 means all CPUs will be used.
QUESTION
See the edit at the end for a reproducible example.
Problem description
When I run boot::censboot(data, statistic, parallel = "multicore", ncpus = 2, var = whatEver), where I've defined statistic <- function(data, var), I get error messages of type FUN(X[[i]], ...) : unused argument (var = whatEver). The issue is that statistic is not able to see the value of var.
This does not happen when I call boot::censboot(data, statistic, parallel = "no").
By debugging I can see that:
If parallel = "no", boot::censboot is running something like this:
ANSWER
Answered 2021-Feb-04 at 11:26
This is a rewrite of the original post that gives a better explanation of what went wrong, and fixes a possible bug in the workaround.
That looks like a bug in censboot. It doesn't handle the ... parameter correctly. (More explanation below.) The reason you don't get an error with parallel = 'no' is that the code follows a different path.
A workaround is to use "partial application" to create a 1-parameter statistic function, like this:
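The R code is not shown here; as an illustration of the same partial-application idea in Python, using functools.partial (the statistic and the var name are placeholders, not the original code):

```python
from functools import partial

# Hypothetical two-parameter statistic; the names are illustrative only.
def statistic(data, var):
    return sum(row[var] for row in data)

# Bind `var` up front so the resulting function takes only `data`,
# matching what a caller that supplies a single argument expects.
stat_fixed = partial(statistic, var="x")

data = [{"x": 1}, {"x": 2}, {"x": 3}]
print(stat_fixed(data))  # 6
```

The bound function carries `var` with it, so a parallel backend that only forwards `data` never needs to see the extra argument, which is the point of the workaround.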
QUESTION
I was reading a book about assembly (intermediate level) and it mentioned that some instructions, like xchg, automatically assert the processor's LOCK# signal. Searching online revealed that this gives the processor the exclusive right to any shared memory, with no specific details. That made me wonder how this right works.
- Does this mean that any other computer device, like a GPU or something else, can't access memory? And can other devices talk directly to RAM without going through the CPU first?
- How does the processor know that it's in this locked state? Is it saved in a control register or RFLAGS, for example? I can't see how this operation works on a multicore CPU.
- The websites I visited said it locks any shared memory. Does this mean that during this lock period the whole RAM is locked, or just the memory page (or part of memory, not all of it) that the instruction operates on?
ANSWER
Answered 2021-Jan-12 at 09:42
The basic problem is that some instructions read memory, modify the value read, then write a new value; and if the contents of memory change between the read and the write, (some) parallel code can end up in an inconsistent state.
A nice example is one CPU doing inc dword [foo] while another CPU does dec dword [foo]. After both instructions (on both CPUs) are executed, the value should be the same as it originally was; but both CPUs could read the old value, then both could modify it, then both could write their new value, resulting in a value 1 higher or 1 lower than you'd expect.
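That lost-update interleaving can be spelled out deterministically. A small Python simulation of the two "CPUs" sharing [foo], with plain variables standing in for registers:

```python
# Deterministic simulation of the inc/dec race described above:
# two "CPUs" each do read -> modify -> write, but both reads happen
# before either write, so one update is lost.
memory = {"foo": 5}

a = memory["foo"]        # CPU A reads 5 (its `inc` starts)
b = memory["foo"]        # CPU B reads 5, before A writes back
memory["foo"] = a + 1    # CPU A writes 6 (inc)
memory["foo"] = b - 1    # CPU B writes 4 (dec), clobbering A's update

print(memory["foo"])  # 4, not the 5 you'd expect after one inc and one dec
```

Swapping the two writes makes the final value 6 instead; either way one of the two updates vanishes.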
The solution was to use a #lock signal to prevent anything else from accessing the same piece of memory at the same time. E.g. the first CPU would assert #lock, then do its read/modify/write, then de-assert #lock; and anything else would see that #lock is asserted and have to wait until it is de-asserted before doing any memory access. In other words, it's a simple form of mutual exclusion (like a spinlock, but in hardware).
Of course "everything else has to wait" has a performance cost; so it's mostly only done when explicitly requested by software (e.g. lock inc dword [foo]
and not inc dword [foo]
) but there are a few cases where it's done implicitly - xchg
instruction when an operand uses memory, and updates to dirty/accessed/busy flags in some of the tables the CPU uses (for paging, and GDT/LDT/IDT entries). Also; later (Pentium Pro I think?), the behavior was optimized to work with cache coherency protocol so that the #lock
isn't asserted if the cache line can be put in the exclusive state instead.
Note: In the past there have been 2 CPU bugs (the Intel Pentium "0xF00F" bug and the Cyrix "Coma" bug) where a CPU could be tricked into asserting the #lock signal and never de-asserting it, causing the entire system to lock up because nothing can access any memory.
- Does this mean that any other computer device, like a GPU or something else, can't access memory? And can other devices talk directly to RAM without going through the CPU first?
Yes. If #lock is asserted (which doesn't include cases where newer CPUs can put the cache line into the exclusive state instead), anything that accesses memory has to wait for #lock to be de-asserted.
Note: Most modern devices can/do access memory directly (to transfer data to/from RAM without using the CPU to transfer data).
- How does the processor know that it's in this locked state? Is it saved in a control register or RFLAGS, for example? I can't see how this operation works on a multicore CPU.
It's not saved in the contents of any register. It's literally an electronic signal on a bus or link. For an extremely over-simplified example, assume that the bus has 32 "address" wires and 32 "data" wires, plus a #lock wire; where "asserting #lock" means that the voltage on that #lock wire goes from 0 volts up to 3.3 volts. When anything wants to read or write memory (before attempting to change the voltages on the "address" or "data" wires), it checks that the voltage on the #lock wire is 0 volts.
Note: A real bus is much more complicated and needs a few other signals (e.g. for direction of transfer, for collision avoidance, for "I/O port or physical memory", etc); and modern buses use serial lanes and not parallel wires; and modern systems use "point to point links" and not "common bus shared by all the things".
- The websites I visited said it locks any shared memory. Does this mean that during this lock period the whole RAM is locked, or just the memory page (or part of memory, not all of it) that the instruction operates on?
It's better to say that the bus is locked; where everything has to use the bus to access memory (and nothing else can use the bus when the bus is locked, even when something else is trying to use the bus for something that has nothing to do with memory - e.g. to send an IRQ to a CPU).
Of course (due to aggressive performance optimizations - primarily the "if the cache line can be put in the exclusive state instead" optimization) it's even better to say that the hardware can do anything it feels like as long as the result behaves as if there's a shared bus that was locked (even if there isn't a shared bus and nothing was actually locked).
Note: 80x86 supports misaligned accesses (e.g. you can lock inc dword [address] where the access straddles a boundary). If a memory access does straddle a boundary, the CPU needs to combine 2 or more pieces (e.g. a few bytes from the end of one cache line and a few bytes from the start of the next cache line). Modern virtual memory means that if the virtual address straddles a page boundary, the CPU needs to access 2 different virtual pages, which may have "extremely unrelated" physical addresses. If a theoretical CPU tried to implement independent locks (a different lock for each memory area), then it would also need to support asserting multiple lock signals. This can cause deadlocks - e.g. one CPU locks "memory page 1" then tries to lock "memory page 2" (and can't, because it's locked); while another CPU locks "memory page 2" then tries to lock "memory page 1" (and can't, because it's locked). To fix that, the theoretical CPU would have to use "global lock ordering" - always asserting locks in a specific order. The end result would be a significant amount of complexity (where it's likely that the added complexity costs more performance than it saves).
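The same "global lock ordering" trick is standard in software. A minimal sketch in Python (the two locks and thread names are illustrative, the granularity is arbitrary):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
done = []

def critical(first, second, name):
    # Acquire both locks in one fixed global order (here: by object id),
    # so two threads given the locks in opposite orders cannot deadlock.
    ordered = sorted([first, second], key=id)
    for lk in ordered:
        lk.acquire()
    try:
        done.append(name)
    finally:
        for lk in reversed(ordered):
            lk.release()

# t1 asks for (a, b); t2 asks for (b, a) - the classic deadlock setup.
t1 = threading.Thread(target=critical, args=(lock_a, lock_b, "t1"))
t2 = threading.Thread(target=critical, args=(lock_b, lock_a, "t2"))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(done))  # ['t1', 't2'] -- both threads finish, no deadlock
```

If each thread instead acquired the locks in the order it was handed them, the two could each grab one lock and wait forever for the other.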
QUESTION
I have a Python dataframe with columns "a, b, c, d, ..., z", and I want to get all possible combinations: "aa, ab, ac, ad, ..., az", then "ba, bb, bc, bd, ..., bz", and so on.
What I have done is a simple nested for loop:
...ANSWER
Answered 2020-Nov-29 at 21:03
I think what you are looking for is combinations from itertools, a package in the standard library.
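A minimal sketch of the itertools approach. Note that combinations yields unordered pairs without repeats, while product with repeat=2 matches the exact "aa, ab, ..., ba, ..." listing the question describes:

```python
from itertools import combinations, product

letters = "abcd"  # stand-in for the column labels a..z

# combinations: unordered pairs without repeats -> ab, ac, ad, bc, ...
pairs = ["".join(p) for p in combinations(letters, 2)]

# product(repeat=2): every ordered pair, repeats included -> aa, ab, ..., ba
all_pairs = ["".join(p) for p in product(letters, repeat=2)]

print(pairs)          # ['ab', 'ac', 'ad', 'bc', 'bd', 'cd']
print(all_pairs[:5])  # ['aa', 'ab', 'ac', 'ad', 'ba']
```

Either replaces the nested for loop with a single generator expression; which one is right depends on whether "ab" and "ba" should count as distinct.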
QUESTION
I have the possibility to do processing in an HPC environment, where task management and resource allocation are controlled by the SLURM batch job system. However, I have not found the right configuration to utilize the allocated resources efficiently within R. I have tried to allocate 20 CPUs to one task in SLURM, using the plan(multicore) function of the future package in R. After test runs with different numbers of allocated CPUs, efficiency statistics suggested that with these settings only one of the allocated CPUs was used during the test runs.
The SLURM bash script is presented below:
...ANSWER
Answered 2020-Nov-17 at 08:24
The issue was found by the HPC service provider. For an unknown reason, the OMP_PLACES=cores variable, which should tie threads/processes to specific cores, appeared to bind all processes to a single core when running multi-core R jobs. The issue was solved by rebuilding the r-environment Singularity container.
QUESTION
I have the following code that prints to my serial console in Mongoose OS. I can see the device in Amazon AWS, and in mDash as online. I get back a 0 from the pub, and it never sends a message to AWS MQTT. I have subscribed to the topic in the test section in MQTT, so I'm not sure what I am doing wrong.
...ANSWER
Answered 2020-Nov-11 at 00:17
OK, it is working again; I needed to delete the device in Amazon and run the mos aws setup command again.
So the real problem was that I was flashing my device each time I updated my code, and it would overwrite my WiFi settings, my AWS settings and my mDash settings.
Thanks to help from a different source, this solved it for me.
QUESTION
I am reproducing some simple 10-arm bandit experiments from Sutton and Barto's book Reinforcement Learning: An Introduction. Some of these require significant computation time, so I tried to take advantage of my multicore CPU.
Here is the function which I need to run 2000 times. It has 1000 sequential steps which incrementally improve the reward:
...ANSWER
Answered 2020-Nov-01 at 15:44
As we found out, it is different on Windows and Ubuntu. It is probably because of this:
spawn The parent process starts a fresh python interpreter process. The child process will only inherit those resources necessary to run the process objects run() method. In particular, unnecessary file descriptors and handles from the parent process will not be inherited. Starting a process using this method is rather slow compared to using fork or forkserver.
Available on Unix and Windows. The default on Windows and macOS.
fork The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
Available on Unix only. The default on Unix.
Try adding this line to your code:
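The line itself is cut off in this copy; based on the quoted start-method docs, it is presumably a multiprocessing.set_start_method call, along these lines:

```python
import multiprocessing as mp

# Presumed fix: pick one start method explicitly so the program behaves
# the same on Linux (default "fork") as on Windows/macOS (default "spawn").
# force=True only matters if a start method has already been set.
mp.set_start_method("spawn", force=True)
print(mp.get_start_method())  # spawn
```

With "spawn", each worker starts a fresh interpreter and inherits nothing implicitly, so code that relied on fork's copy-everything behavior fails the same way on every platform instead of only on Windows.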
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported