bottleneck | Fast NumPy array functions written in C | Functional Programming library
kandi X-RAY | bottleneck Summary
kandi X-RAY | bottleneck Summary
Fast NumPy array functions written in C
Support
Quality
Security
License
Reuse
Top functions reviewed by kandi - BETA
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of bottleneck
bottleneck Key Features
bottleneck Examples and Code Snippets
def shared_embedding_columns_v2(categorical_columns,
dimension,
combiner='mean',
initializer=None,
shared_embedding_collec
def embedding_column_v2(categorical_column,
dimension,
combiner='mean',
initializer=None,
max_sequence_length=0,
learning_rate_fn=
Community Discussions
Trending Discussions on bottleneck
QUESTION
I have implemented a Convolutional Neural Network in C and have been studying what parts of it have the longest latency.
Based on my research, the massive amounts of matricial multiplication required by CNNs makes running them on CPUs and even GPUs very inefficient. However, when I actually profiled my code (on an unoptimized build) I found out that something other than the multiplication itself was the bottleneck of the implementation.
After turning on optimization (-O3 -march=native -ffast-math
, gcc cross compiler), the Gprof result was the following:
Clearly, the convolution2D
function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.
The convolution function in question looks like this:
...ANSWER
Answered 2022-Mar-10 at 13:57Looking at the result of Cachegrind, it doesn't look like the memory is your bottleneck. The NN has to be stored in memory anyway, but if it's too large that your program's having a lot of L1 cache misses, then it's worth thinking to try to minimize L1 misses, but 1.7% of L1 (data) miss rate is not a problem.
So you're trying to make this run fast anyway. Looking at your code, what's happening at the most inner loop is very simple (load-> multiply -> add -> store), and it doesn't have any side effect other than the final store. This kind of code is easily parallelizable, for example, by multithreading or vectorizing. I think you'll know how to make this run in multiple threads seeing that you can write code with some complexity, and you asked in comments how to manually vectorize the code.
I will explain that part, but one thing to bear in mind is that once you choose to manually vectorize the code, it will often be tied to certain CPU architectures. Let's not consider non-AMD64 compatible CPUs like ARM. Still, you have the option of MMX, SSE, AVX, and AVX512 to choose as an extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice. SSE2 appeared with Pentium 4, and it supports 128-bit vectors. For this post I'll use AVX2, which supports 128-bit and 256-bit vectors. It runs fine on your CPU, and has reasonable portability these days, supported from Haswell (2013) and Excavator (2015).
The pattern you're using in the inner loop is called FMA (fused multiply and add). AVX2 has an instruction for this. Have a look at this function and the compiled output.
QUESTION
I have a numerical routine that I need to run to solve a certain equation, which contains a few nested four loops. I initially wrote this routine into Python, using numba.jit to achieve an acceptable performance. For large system sizes however, this method becomes quite slow, so I have been rewriting the routine into Fortran hoping to achieve a speed-up. However I have found that my Fortran version is much slower than the first version in Python, by a factor of 2-3.
I believe the bottleneck is a linear interpolation function that is called at each innermost loop. In the Python implementation I use numpy.interp
, which seems to be pretty fast when combined with numba.jit
. In Fortran I wrote my own interpolation function, which reads,
ANSWER
Answered 2022-Feb-06 at 15:42At a guess (and see @IanBush's comments if you want to enable us to do better than guessing), it's the line
QUESTION
I have a local python project called jive
that I would like to use in an another project. My current method of using jive
in other projects is to activate the conda env for the project, then move to my jive
directory and use python setup.py install
. This works fine, and when I use conda list
, I see everything installed in the env including jive
, with a note that jive
was installed using pip.
But what I really want is to do this with full conda. When I want to use jive
in another project, I want to just put jive
in that projects environment.yml
.
So I did the following:
- write a simple
meta.yaml
so I could use conda-build to buildjive
locally - build jive with
conda build .
- I looked at the tarball that was produced and it does indeed contain the
jive
source as expected - In my other project, add jive to the dependencies in
environment.yml
, and add 'local' to the list of channels. - create a conda env using that environment.yml.
When I activate the environment and use conda list
, it lists all the dependencies including jive
, as desired. But when I open python interpreter, I cannot import jive
, it says there is no such package. (If use python setup.py install
, I can import it.)
How can I fix the build/install so that this works?
Here is the meta.yaml, which lives in the jive
project top level directory:
ANSWER
Answered 2022-Feb-05 at 04:16The immediate error is that the build is generating a Python 3.10 version, but when testing Conda doesn't recognize any constraint on the Python version, and creates a Python 3.9 environment.
I think the main issue is that python >=3.5
is only a valid constraint when doing noarch
builds, which this is not. That is, once a package builds with a given Python version, the version must be constrained to exactly that version (up through minor). So, in this case, the package is built with Python 3.10, but it reports in its metadata that it is compatible with all versions of Python 3.5+, which simply isn't true because Conda Python packages install the modules into Python-version-specific site-packages
(e.g., lib/python-3.10/site-packages/jive
).
Typically, Python versions are controlled by either the --python
argument given to conda-build
or a matrix supplied by the conda_build_config.yaml
file (see documentation on "Build variants").
Try adjusting the meta.yaml
to something like
QUESTION
I've implemented a custom validation message on my input for the pattern
validation rule while leaving the default message for required
as is. However, when I do so, once the input becomes invalid, it never becomes valid again, even though I am meeting the pattern
criteria.
ANSWER
Answered 2022-Jan-03 at 16:18First of all, per MDN:
It's vital to set the message to an empty string if there are no errors. As long as the error message is not empty, the form will not pass validation and will not be submitted.
This agrees with that the HTML standard says:
Suffering from a custom error
When a control's custom validity error message (as set by the element's
setCustomValidity()
method orElementInternals
'ssetValidity()
method) is not the empty string.An element satisfies its constraints if it is not suffering from any of the above validity states.
Your sample does not clear the custom error if the form field is determined to be valid. As such, once the field is determined invalid, it stays so for the remainder of the session.
Moreover, you modify custom error only after the field has already been determined invalid. This means the form will still not be submitted even if you clear the message in the same handler.
A better way too accomplish your goal would be to monitor the field in the change
event handler for the field and set the custom message there:
QUESTION
Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port:
Instruction B I0 I1 M L S F0 F1.LBB0_1:
ldr q3, [x1], #16
0.5
0.5
1
ldr q4, [x2], #16
0.5
0.5
1
add x8, x8, #4
0.5
0.5
cmp x8, #508
0.5
0.5
mul v5.4s, v3.4s, v4.4s
2
mul v5.4s, v5.4s, v0.4s
2
smull v6.2d, v5.2s, v1.2s
1
smull2 v5.2d, v5.4s, v2.4s
1
smlal v6.2d, v3.2s, v4.2s
1
smlal2 v5.2d, v3.4s, v4.4s
1
uzp2 v3.4s, v6.4s, v5.4s
1
str q3, [x0], #16
0.5
0.5
1
b.lo .LBB0_1
1
Total port pressure
1
2.5
2.5
0
2
1
8
1
Although uzp2
could run on either the F0 or F1 ports, I chose to attribute it entirely to F1 due to high pressure on F0 and zero pressure on F1 other than this instruction.
There are no dependencies between loop iterations, other than the loop counter and array pointers; and these should be resolved very quickly, compared to the time taken for the rest of the loop body.
Thus, my intuition is that this code should be throughput limited, and considering the worst pressure is on F0, run in 8 cycles per iteration (unless it hits a decoding bottleneck or cache misses). The latter is unlikely given the streaming access pattern, and the fact that arrays comfortably fit in L1 cache. As for the former, considering the constraints listed on section 4.1 of the optimization manual, I project that the loop body is decodable in only 8 cycles.
Yet microbenchmarking indicates that each iteration of the loop body takes 12.5 cycles on average. If no other plausible explanation exists, I may edit the question including further details about how I benchmarked this code, but I'm fairly certain the difference can't be attributed to benchmarking artifacts alone. Also, I have tried to increase the number of iterations to see if performance improved towards an asymptotic limit due to startup/cool-down effects, but it appears to have done so already for the selected value of 128 iterations displayed above.
Manually unrolling the loop to include two calculations per iteration decreased performance to 13 cycles; however, note that this would also duplicate the number of load and store instructions. Interestingly, if the doubled loads and stores are instead replaced by single LD1
/ST1
instructions (two-register format) (e.g. ld1 { v3.4s, v4.4s }, [x1], #32
) then performance improves to 11.75 cycles per iteration. Further unrolling the loop to four calculations per iteration, while using the four-register format of LD1
/ST1
, improves performance to 11.25 cycles per iteration.
In spite of the improvements, the performance is still far away from the 8 cycles per iteration that I expected from looking at resource pressures alone. Even if the CPU made a bad scheduling call and issued uzp2
to F0, revising the resource pressure table would indicate 9 cycles per iteration, still far from actual measurements. So, what's causing this code to run so much slower than expected? What kind of effects am I missing in my analysis?
EDIT: As promised, some more benchmarking details. I run the loop 3 times for warmup, 10 times for say n = 512, and then 10 times for n = 256. I take the minimum cycle count for the n = 512 runs and subtract from the minimum for n = 256. The difference should give me how many cycles it takes to run for n = 256, while canceling out the fixed setup cost (code not shown). In addition, this should ensure all data is in the L1 I and D cache. Measurements are taken by reading the cycle counter (pmccntr_el0
) directly. Any overhead should be canceled out by the measurement strategy above.
ANSWER
Answered 2021-Nov-06 at 13:50First off, you can further reduce the theoretical cycles to 6 by replacing the first mul
with uzp1
and doing the following smull
and smlal
the other way around: mul
, mul
, smull
, smlal
=> smull
, uzp1
, mul
, smlal
This also heavily reduces the register pressure so that we can do an even deeper unrolling (up to 32 per iteration)
And you don't need v2
coefficents, but you can pack them to the higher part of v1
Let's rule out everything by unrolling this deep and writing it in assembly:
QUESTION
I've got a function func
that may cost ~50s when running on a single core. Now I want to run it on a server which has got 192-core CPUs for many times. But when I add worker processes to say, 180, the performance of each core slows down. The worst CPU takes ~100s to calculate func
.
Can someone help me, please?
Here is the pseudo code
...ANSWER
Answered 2021-Nov-17 at 17:18You are measuring the time it takes each worker to perform func()
and observe performance decrease for a single process when going from 10 processes to 180 parallel processes.
This looks quite normal to me:
- Intel cores use hyper-threading so you actually have 96 cores (in more detail - a hyper-threaded core adds only 20-30% performance). It means that 168 of your processes need to share 84 hyper-threaded cores and 12 processes get full 12 cores.
- The CPU speed is determined by throttle temperature (https://en.wikipedia.org/wiki/Thermal_design_power) and of course there is so much more space when running 10 processes vs 180 processes
- Your tasks are obviously competing for memory. They make a total of over 5TB of memory allocations and you machine has much less than that. The last mile in garbage collecting is always the most expensive one - so if your garbage collectors are squeezed and competing for memory the performance is uneven with surprisingly longer garbage collection times.
Looking at this data I would recommend you to try:
QUESTION
In a cross platform (Linux and windows) real-time application, I need the fastest way to share data between a C++ process and a python application that I both manage. I currently use sockets but it's too slow when using high-bandwith data (4K images at 30 fps).
I would ultimately want to use the multiprocessing shared memory but my first tries suggest it does not work. I create the shared memory in C++ using Boost.Interprocess and try to read it in python like this:
...ANSWER
Answered 2021-Sep-15 at 19:53So I spent the last days implementing shared memory using mmap, and the results are quite good in my opinion. Here are the benchmarks results comparing my two implementations: pure TCP and mix of TCP and shared memory.
Protocol:Benchmark consists of moving data from C++ to Python world (using python's numpy.nparray), then data sent back to C++ process. No further processing is involved, only serialization, deserialization and inter-process communication (IPC).
Case A:
- One C++ process implementing TCP communication using Boost.Asio
- One Python3 process using standard python TCP sockets
Communication is done with TCP {header + data}.
Case B:
- One C++ process implementing TCP communication using Boost.Asio and shared memory (mmap) using Boost.Interprocess
- One Python3 process using standard TCP sockets and mmap
Communication is hybrid : synchronization is done through sockets (only header is passed) and data is moved through shared memory. I think this design is great because I have suffered in the past from problem of synchronization using condition variable in shared memory, and TCP is easy to use in both C++ and Python environments.
Results: Small data at high frequency200 MBytes/s total: 10 MByte sample at 20 samples per second
Case Global CPU consumption C++ part python part A 17.5 % 10% 7.5% B 6% 1% 5% Big data at low frequency200 MBytes/s total: 0.2 MByte sample at 1000 samples per second
Case Global CPU consumption C++ part python part A 13.5 % 6.7% 6.8% B 11% 5.5% 5.5% Max bandwidth- A : 250 MBytes / second
- B : 600 MBytes / second
In my application, using mmap has a huge impact on big data at average frequency (almost 300 % performance gain). When using very high frequencies and small data, the benefit of shared memory is still there but not that impressive (only 20% improvement). Maximum throughput is more than 2 times bigger.
Using mmap is a good upgrade for me. I just wanted to share my results here.
QUESTION
Is there a faster way to do the following, where in the real application, df
has many rows (and therefore list_of_colnames
has the same number of elements):
ANSWER
Answered 2021-Nov-01 at 17:46One option may be to make use of row/column
indexing
QUESTION
I have the following class:
...ANSWER
Answered 2021-Nov-01 at 10:59Assuming B has a copy constructor, just add:
QUESTION
Let X be a set of distinct 64-bit unsigned integers std::uint64_t
, each one being interpreted as a bitset representing a subset of {1,2,...,64}.
I want a function to do the following: given a std::uint64_t
A, not necessarily in X, list all B in X, such that B is a subset of A, when A, B are interpreted as subsets of {1,2,...,64}.
(Of course, in C++ this condition is just (A & B) == B
).
Since A itself need not be in X, I believe that this is not a duplicate of other questions.
X will grow over time (but nothing will be deleted), although there will be far more queries than additions to X.
I am free to choose the data structure representing the elements of X.
Obviously, we could represent X as a std::set
or sorted std::vector
of std::uint64_t
, and I give one algorithm below. But can we do better?
What are good data structures for X and algorithms to do this efficiently? This should be a standard problem but I couldn't find anything.
EDIT: sorry if this is too vague. Obviously, if X were a std::set
we could search through all subsets of A, taking time O(2^m log |X|) with m <= N, or all elements of X in time O(|X| log |X|).
Assume that in most cases, the number of B is quite a bit smaller than both 2^m (the number of subsets of A) and |X|. So, we want some kind of algorithm to run in time much less than |X| or 2^m in such cases, ideally in time O(number of B) but that's surely too optimistic. Obviously, O(|X|) cannot be beaten in the worst case.
Obviously some memory overhead for X is expected, and memory is less of a bottleneck than time for me. Using memory roughly 10 * (the memory of X stored as a std::set
) is fine. Much more than this is too much. (Asymptotically, anything more than O(|X|) or O(|X| log |X|) memory is probably too much).
Obviously, the use of C++ is not essential: the algorithms/data structures are the important things here.
In the case that X is fixed, maybe Hasse diagrams could work.
It looks like Hasse diagrams would be quite time-consuming to construct each time X grows. (But still maybe worth a try if nothing else comes up). EDIT: maybe not so slow to update, so better than I thought.
The below is just my idea so far; maybe something better can be found?
FINAL edit: since it's closed, probably fairly - the "duplicate" question is pretty close - I won't bother with any further edits. I will probably do the below, but using a probabilistic skip list structure instead of a std::set
, and augmented with skip distances (so you can quickly calculate how many X elements remain in an interval, and thus reduce the number of search intervals, by switching to linear search when the intersection gets small). This is similar to Order Statistic Trees given in this question, but skip lists are a lot easier to reimplement than std::set
(especially as I don't need deletions).
Represent X as a std::set
or sorted std::vector
of 64-bit unsigned integers std::uint64_t
, using the ordinary numerical order, and do recursive searches within smaller and smaller intervals.
E.g., my query element is A = 10011010. Subsets of A containing the first bit lie in the inclusive interval [10000000, 10011010].
Subsets of A containing the second bit but not the first lie in the interval [00010000, 00011010].
Those with the third but not the second bit are in [00001000, 00001010].
Those with the fourth but not the third bit are in [00000010, 00000010].
Now, within the first interval [10000000, 10011010] you could make two subintervals to search, based on the second bit: [10000000, 10001010] and [10010000, 10011010].
Thus you can break it down recursively in this manner. The total length of search intervals is getting smaller all the time, so this is surely going to be better asymptotically than a trivial linear search through all of X.
E.g., if X = {00000010, 00001000, 00110111, 10011100} then only the first, third, fourth depth-1 intervals would have nonempty intersection with X. The final returned result would be [00000010, 00001000].
Of course this is unbalanced if the X elements are distributed fairly uniformly. We might want the search intervals to have roughly equal width at each depth, and they don't; above, the sizes of the four depth-1 search intervals are, I think, 27, 11, 3, 1, and for larger N the discrepancies could be much bigger.
If there are k bits in the query set A, then you'll have to construct k initial search intervals at depth 1 (searching on ONE bit), then up to 2k search intervals at depth 2, 4k at depth 3, etc.
If I've got it right, since log |X| = O(N) the number of search intervals is O(k + 2k + 4k + ... + 2^n . k) = O(k^2) = O(N^2), where 2^n = O(k), and each one takes O(N) time to construct (actually a bit less since it's the log of a smaller number, but the log doesn't increase much), so it seems like this is an O(N^3) algorithm to construct the search intervals.
Of course the full algorithm is not O(N^3), because each interval may contain many elements, so listing them all cannot be better than O(2^N) in general, but let's ignore this and assume that there are not enough elements of X to overwhelm the O(N^3) estimate.
Another issue is that std::map
cannot tell you how many elements lie within an interval (unlike a sorted std::vector
) so you don't know when to break off the partitioning and search through all remaining X elements in the interval. Of course, you have an upper bound on the number of X elements (the size of the full interval) but it may be quite poor.
EDIT: the answer to another question shows how to have a std::set
-like structure which also quickly gives you the number of elements in a range, which obviously could be adapted to std::map
-like structures. This would work well here for pruning (although annoying that, for C++, you'd have to reimplement most of std::map
!)
ANSWER
Answered 2021-Oct-29 at 04:33Treating your integers as strings of 0
and 1
, build a customized version of patricia tree, with the following rule:
- During lookup, if
1
is the current input bit at a branch, continue down both subtrees
The collection of all valid leaf nodes reached will be the answer.
ComplexityLet n be size of X,
Time: O(n)
- Worst case
-1
, when all subtrees are traversed. Complexity is bound by total number of nodes, stated below
Space: O(n)
- Number of nodes in a patricia tree is exactly 2n - 1
Given that your match condition is (A & B) == B
, a truth table is thus:
Hence, during lookup, we collect both subtrees on a branch node when the input bit is 1
.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install bottleneck
You can use bottleneck like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
Support
Reuse Trending Solutions
Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
Find more librariesStay Updated
Subscribe to our newsletter for trending solutions and developer bootcamps
Share this Page