bottleneck | Fast NumPy array functions written in C | Functional Programming library

by pydata Python Version: 1.4.0rc6 License: BSD-2-Clause

X-Ray Key Features Code Snippets(2)Community Discussions(10)Vulnerabilities Install Support

kandi X-RAY | bottleneck Summary

bottleneck is a Python library typically used in Programming Style, Functional Programming, Numpy, Pandas applications. bottleneck has no bugs, it has no vulnerabilities, it has build file available, it has a Permissive License and it has medium support. You can install using 'pip install bottleneck' or download it from GitHub, PyPI.

Fast NumPy array functions written in C

Support

Quality

Security

License

Reuse

Support

bottleneck has a medium active ecosystem.

It has 839 star(s) with 93 fork(s). There are 31 watchers for this library.

There were 3 major release(s) in the last 12 months.

There are 36 open issues and 191 have been closed. On average issues are closed in 156 days. There are 6 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of bottleneck is 1.4.0rc6

Quality

bottleneck has 0 bugs and 0 code smells.

Security

bottleneck has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

bottleneck code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

bottleneck is licensed under the BSD-2-Clause License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

bottleneck releases are available to install and integrate.

Deployable package is available in PyPI.

Build file is available. You can build the component from source.

It has 4209 lines of code, 251 functions and 41 files.

It has high code complexity. Code complexity directly impacts maintainability of the code.

Top functions reviewed by kandi - BETA

kandi's functional review helps you automatically verify the functionalities of the libraries and avoid rework.
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of bottleneck

Get all kandi verified functions for this library.

bottleneck Key Features

No Key Features are available at this moment for bottleneck.

bottleneck Examples and Code Snippets

Creates a shared_embedding_columns_collection .

python

Lines of Code : 208

License : Non-SPDX (Apache License 2.0)

Copy

def shared_embedding_columns_v2(categorical_columns,
                                dimension,
                                combiner='mean',
                                initializer=None,
                                shared_embedding_collec

Embed a categorical column .

python

Lines of Code : 153

License : Non-SPDX (Apache License 2.0)

Copy

def embedding_column_v2(categorical_column,
                        dimension,
                        combiner='mean',
                        initializer=None,
                        max_sequence_length=0,
                        learning_rate_fn=

Community Discussions

Trending Discussions on bottleneck

Convolution Function Latency Bottleneck

Achieving numpy like fast interpolation in Fortran

How to install local package with conda

Why are custom validation messages causing my HTML form elements to stay invalid?

Loop takes more cycles to execute than expected in an ARM Cortex-A72 CPU

Julia Distributed slow down to half the single core performance when adding process

Fast communication between C++ and python using shared memory

More efficient way to purrr::map2 for a large dataframe

How to implement deep and shallow copy constructors without using a boolean parameter?

Is there a good data structure for finding all stored subsets of a given bitset?

QUESTION

Convolution Function Latency Bottleneck

Asked 2022-Mar-10 at 13:57

I have implemented a Convolutional Neural Network in C and have been studying what parts of it have the longest latency.

Based on my research, the massive amounts of matricial multiplication required by CNNs makes running them on CPUs and even GPUs very inefficient. However, when I actually profiled my code (on an unoptimized build) I found out that something other than the multiplication itself was the bottleneck of the implementation.

After turning on optimization (-O3 -march=native -ffast-math, gcc cross compiler), the Gprof result was the following:

Clearly, the convolution2D function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.

The convolution function in question looks like this:

...

ANSWER

Answered 2022-Mar-10 at 13:57

Looking at the result of Cachegrind, it doesn't look like the memory is your bottleneck. The NN has to be stored in memory anyway, but if it's too large that your program's having a lot of L1 cache misses, then it's worth thinking to try to minimize L1 misses, but 1.7% of L1 (data) miss rate is not a problem.

So you're trying to make this run fast anyway. Looking at your code, what's happening at the most inner loop is very simple (load-> multiply -> add -> store), and it doesn't have any side effect other than the final store. This kind of code is easily parallelizable, for example, by multithreading or vectorizing. I think you'll know how to make this run in multiple threads seeing that you can write code with some complexity, and you asked in comments how to manually vectorize the code.

I will explain that part, but one thing to bear in mind is that once you choose to manually vectorize the code, it will often be tied to certain CPU architectures. Let's not consider non-AMD64 compatible CPUs like ARM. Still, you have the option of MMX, SSE, AVX, and AVX512 to choose as an extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice. SSE2 appeared with Pentium 4, and it supports 128-bit vectors. For this post I'll use AVX2, which supports 128-bit and 256-bit vectors. It runs fine on your CPU, and has reasonable portability these days, supported from Haswell (2013) and Excavator (2015).

The pattern you're using in the inner loop is called FMA (fused multiply and add). AVX2 has an instruction for this. Have a look at this function and the compiled output.

Source https://stackoverflow.com/questions/71401876

QUESTION

Achieving numpy like fast interpolation in Fortran

Asked 2022-Feb-06 at 15:42

I have a numerical routine that I need to run to solve a certain equation, which contains a few nested four loops. I initially wrote this routine into Python, using numba.jit to achieve an acceptable performance. For large system sizes however, this method becomes quite slow, so I have been rewriting the routine into Fortran hoping to achieve a speed-up. However I have found that my Fortran version is much slower than the first version in Python, by a factor of 2-3.

I believe the bottleneck is a linear interpolation function that is called at each innermost loop. In the Python implementation I use numpy.interp, which seems to be pretty fast when combined with numba.jit. In Fortran I wrote my own interpolation function, which reads,

...

ANSWER

Answered 2022-Feb-06 at 15:42

At a guess (and see @IanBush's comments if you want to enable us to do better than guessing), it's the line

Source https://stackoverflow.com/questions/71007062

QUESTION

How to install local package with conda

Asked 2022-Feb-05 at 04:16

I have a local python project called jive that I would like to use in an another project. My current method of using jive in other projects is to activate the conda env for the project, then move to my jive directory and use python setup.py install. This works fine, and when I use conda list, I see everything installed in the env including jive, with a note that jive was installed using pip.

But what I really want is to do this with full conda. When I want to use jive in another project, I want to just put jive in that projects environment.yml.

So I did the following:

write a simple meta.yaml so I could use conda-build to build jive locally
build jive with conda build .
I looked at the tarball that was produced and it does indeed contain the jive source as expected
In my other project, add jive to the dependencies in environment.yml, and add 'local' to the list of channels.
create a conda env using that environment.yml.

When I activate the environment and use conda list, it lists all the dependencies including jive, as desired. But when I open python interpreter, I cannot import jive, it says there is no such package. (If use python setup.py install, I can import it.) How can I fix the build/install so that this works?

Here is the meta.yaml, which lives in the jive project top level directory:

...

ANSWER

Answered 2022-Feb-05 at 04:16

The immediate error is that the build is generating a Python 3.10 version, but when testing Conda doesn't recognize any constraint on the Python version, and creates a Python 3.9 environment.

I think the main issue is that python >=3.5 is only a valid constraint when doing noarch builds, which this is not. That is, once a package builds with a given Python version, the version must be constrained to exactly that version (up through minor). So, in this case, the package is built with Python 3.10, but it reports in its metadata that it is compatible with all versions of Python 3.5+, which simply isn't true because Conda Python packages install the modules into Python-version-specific site-packages (e.g., lib/python-3.10/site-packages/jive).

Typically, Python versions are controlled by either the --python argument given to conda-build or a matrix supplied by the conda_build_config.yaml file (see documentation on "Build variants").

Try adjusting the meta.yaml to something like

Source https://stackoverflow.com/questions/70705250

QUESTION

Why are custom validation messages causing my HTML form elements to stay invalid?

Asked 2022-Jan-03 at 16:18

I've implemented a custom validation message on my input for the pattern validation rule while leaving the default message for required as is. However, when I do so, once the input becomes invalid, it never becomes valid again, even though I am meeting the pattern criteria.

...

ANSWER

Answered 2022-Jan-03 at 16:18

First of all, per MDN:

It's vital to set the message to an empty string if there are no errors. As long as the error message is not empty, the form will not pass validation and will not be submitted.

This agrees with that the HTML standard says:

Suffering from a custom error

When a control's custom validity error message (as set by the element's setCustomValidity() method or ElementInternals's setValidity() method) is not the empty string.

An element satisfies its constraints if it is not suffering from any of the above validity states.

Your sample does not clear the custom error if the form field is determined to be valid. As such, once the field is determined invalid, it stays so for the remainder of the session.

Moreover, you modify custom error only after the field has already been determined invalid. This means the form will still not be submitted even if you clear the message in the same handler.

A better way too accomplish your goal would be to monitor the field in the change event handler for the field and set the custom message there:

Source https://stackoverflow.com/questions/70514086

QUESTION

Loop takes more cycles to execute than expected in an ARM Cortex-A72 CPU

Asked 2021-Dec-03 at 06:02

Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port:

Instruction B I0 I1 M L S F0 F1 .LBB0_1: ldr q3, [x1], #16 0.5 0.5 1 ldr q4, [x2], #16 0.5 0.5 1 add x8, x8, #4 0.5 0.5 cmp x8, #508 0.5 0.5 mul v5.4s, v3.4s, v4.4s 2 mul v5.4s, v5.4s, v0.4s 2 smull v6.2d, v5.2s, v1.2s 1 smull2 v5.2d, v5.4s, v2.4s 1 smlal v6.2d, v3.2s, v4.2s 1 smlal2 v5.2d, v3.4s, v4.4s 1 uzp2 v3.4s, v6.4s, v5.4s 1 str q3, [x0], #16 0.5 0.5 1 b.lo .LBB0_1 1 Total port pressure 1 2.5 2.5 0 2 1 8 1

Although uzp2 could run on either the F0 or F1 ports, I chose to attribute it entirely to F1 due to high pressure on F0 and zero pressure on F1 other than this instruction.

There are no dependencies between loop iterations, other than the loop counter and array pointers; and these should be resolved very quickly, compared to the time taken for the rest of the loop body.

Thus, my intuition is that this code should be throughput limited, and considering the worst pressure is on F0, run in 8 cycles per iteration (unless it hits a decoding bottleneck or cache misses). The latter is unlikely given the streaming access pattern, and the fact that arrays comfortably fit in L1 cache. As for the former, considering the constraints listed on section 4.1 of the optimization manual, I project that the loop body is decodable in only 8 cycles.

Yet microbenchmarking indicates that each iteration of the loop body takes 12.5 cycles on average. If no other plausible explanation exists, I may edit the question including further details about how I benchmarked this code, but I'm fairly certain the difference can't be attributed to benchmarking artifacts alone. Also, I have tried to increase the number of iterations to see if performance improved towards an asymptotic limit due to startup/cool-down effects, but it appears to have done so already for the selected value of 128 iterations displayed above.

Manually unrolling the loop to include two calculations per iteration decreased performance to 13 cycles; however, note that this would also duplicate the number of load and store instructions. Interestingly, if the doubled loads and stores are instead replaced by single LD1/ST1 instructions (two-register format) (e.g. ld1 { v3.4s, v4.4s }, [x1], #32) then performance improves to 11.75 cycles per iteration. Further unrolling the loop to four calculations per iteration, while using the four-register format of LD1/ST1, improves performance to 11.25 cycles per iteration.

In spite of the improvements, the performance is still far away from the 8 cycles per iteration that I expected from looking at resource pressures alone. Even if the CPU made a bad scheduling call and issued uzp2 to F0, revising the resource pressure table would indicate 9 cycles per iteration, still far from actual measurements. So, what's causing this code to run so much slower than expected? What kind of effects am I missing in my analysis?

EDIT: As promised, some more benchmarking details. I run the loop 3 times for warmup, 10 times for say n = 512, and then 10 times for n = 256. I take the minimum cycle count for the n = 512 runs and subtract from the minimum for n = 256. The difference should give me how many cycles it takes to run for n = 256, while canceling out the fixed setup cost (code not shown). In addition, this should ensure all data is in the L1 I and D cache. Measurements are taken by reading the cycle counter (pmccntr_el0) directly. Any overhead should be canceled out by the measurement strategy above.

...

ANSWER

Answered 2021-Nov-06 at 13:50

First off, you can further reduce the theoretical cycles to 6 by replacing the first mul with uzp1 and doing the following smull and smlal the other way around: mul, mul, smull, smlal => smull, uzp1, mul, smlal This also heavily reduces the register pressure so that we can do an even deeper unrolling (up to 32 per iteration)

And you don't need v2 coefficents, but you can pack them to the higher part of v1

Let's rule out everything by unrolling this deep and writing it in assembly:

Source https://stackoverflow.com/questions/69855672

QUESTION

Julia Distributed slow down to half the single core performance when adding process

Asked 2021-Nov-17 at 17:18

I've got a function func that may cost ~50s when running on a single core. Now I want to run it on a server which has got 192-core CPUs for many times. But when I add worker processes to say, 180, the performance of each core slows down. The worst CPU takes ~100s to calculate func.

Can someone help me, please?

Here is the pseudo code

...

ANSWER

Answered 2021-Nov-17 at 17:18

You are measuring the time it takes each worker to perform func() and observe performance decrease for a single process when going from 10 processes to 180 parallel processes.

This looks quite normal to me:

Intel cores use hyper-threading so you actually have 96 cores (in more detail - a hyper-threaded core adds only 20-30% performance). It means that 168 of your processes need to share 84 hyper-threaded cores and 12 processes get full 12 cores.
The CPU speed is determined by throttle temperature (https://en.wikipedia.org/wiki/Thermal_design_power) and of course there is so much more space when running 10 processes vs 180 processes
Your tasks are obviously competing for memory. They make a total of over 5TB of memory allocations and you machine has much less than that. The last mile in garbage collecting is always the most expensive one - so if your garbage collectors are squeezed and competing for memory the performance is uneven with surprisingly longer garbage collection times.

Looking at this data I would recommend you to try:

Source https://stackoverflow.com/questions/70006813

QUESTION

Fast communication between C++ and python using shared memory

Asked 2021-Nov-02 at 07:11

In a cross platform (Linux and windows) real-time application, I need the fastest way to share data between a C++ process and a python application that I both manage. I currently use sockets but it's too slow when using high-bandwith data (4K images at 30 fps).

I would ultimately want to use the multiprocessing shared memory but my first tries suggest it does not work. I create the shared memory in C++ using Boost.Interprocess and try to read it in python like this:

...

ANSWER

Answered 2021-Sep-15 at 19:53

So I spent the last days implementing shared memory using mmap, and the results are quite good in my opinion. Here are the benchmarks results comparing my two implementations: pure TCP and mix of TCP and shared memory.

Protocol:

Benchmark consists of moving data from C++ to Python world (using python's numpy.nparray), then data sent back to C++ process. No further processing is involved, only serialization, deserialization and inter-process communication (IPC).

Case A:

One C++ process implementing TCP communication using Boost.Asio
One Python3 process using standard python TCP sockets

Communication is done with TCP {header + data}.

Case B:

One C++ process implementing TCP communication using Boost.Asio and shared memory (mmap) using Boost.Interprocess
One Python3 process using standard TCP sockets and mmap

Communication is hybrid : synchronization is done through sockets (only header is passed) and data is moved through shared memory. I think this design is great because I have suffered in the past from problem of synchronization using condition variable in shared memory, and TCP is easy to use in both C++ and Python environments.

Results: Small data at high frequency

200 MBytes/s total: 10 MByte sample at 20 samples per second

Case Global CPU consumption C++ part python part A 17.5 % 10% 7.5% B 6% 1% 5% Big data at low frequency

200 MBytes/s total: 0.2 MByte sample at 1000 samples per second

Case Global CPU consumption C++ part python part A 13.5 % 6.7% 6.8% B 11% 5.5% 5.5% Max bandwidth

A : 250 MBytes / second
B : 600 MBytes / second

Conclusion:

In my application, using mmap has a huge impact on big data at average frequency (almost 300 % performance gain). When using very high frequencies and small data, the benefit of shared memory is still there but not that impressive (only 20% improvement). Maximum throughput is more than 2 times bigger.

Using mmap is a good upgrade for me. I just wanted to share my results here.

Source https://stackoverflow.com/questions/69091769

QUESTION

More efficient way to purrr::map2 for a large dataframe

Asked 2021-Nov-01 at 23:22

Is there a faster way to do the following, where in the real application, df has many rows (and therefore list_of_colnames has the same number of elements):

...

ANSWER

Answered 2021-Nov-01 at 17:46

One option may be to make use of row/column indexing

Source https://stackoverflow.com/questions/69799742

QUESTION

How to implement deep and shallow copy constructors without using a boolean parameter?

Asked 2021-Nov-01 at 10:59

I have the following class:

...

ANSWER

Answered 2021-Nov-01 at 10:59

Assuming B has a copy constructor, just add:

Source https://stackoverflow.com/questions/69795658

QUESTION

Is there a good data structure for finding all stored subsets of a given bitset?

Asked 2021-Oct-29 at 08:30

Let X be a set of distinct 64-bit unsigned integers std::uint64_t, each one being interpreted as a bitset representing a subset of {1,2,...,64}.

I want a function to do the following: given a std::uint64_t A, not necessarily in X, list all B in X, such that B is a subset of A, when A, B are interpreted as subsets of {1,2,...,64}.

(Of course, in C++ this condition is just (A & B) == B).

Since A itself need not be in X, I believe that this is not a duplicate of other questions.

X will grow over time (but nothing will be deleted), although there will be far more queries than additions to X.

I am free to choose the data structure representing the elements of X.

Obviously, we could represent X as a std::set or sorted std::vector of std::uint64_t, and I give one algorithm below. But can we do better?

What are good data structures for X and algorithms to do this efficiently? This should be a standard problem but I couldn't find anything.

EDIT: sorry if this is too vague. Obviously, if X were a std::set we could search through all subsets of A, taking time O(2^m log |X|) with m <= N, or all elements of X in time O(|X| log |X|).

Assume that in most cases, the number of B is quite a bit smaller than both 2^m (the number of subsets of A) and |X|. So, we want some kind of algorithm to run in time much less than |X| or 2^m in such cases, ideally in time O(number of B) but that's surely too optimistic. Obviously, O(|X|) cannot be beaten in the worst case.

Obviously some memory overhead for X is expected, and memory is less of a bottleneck than time for me. Using memory roughly 10 * (the memory of X stored as a std::set) is fine. Much more than this is too much. (Asymptotically, anything more than O(|X|) or O(|X| log |X|) memory is probably too much).

Obviously, the use of C++ is not essential: the algorithms/data structures are the important things here.

In the case that X is fixed, maybe Hasse diagrams could work.

It looks like Hasse diagrams would be quite time-consuming to construct each time X grows. (But still maybe worth a try if nothing else comes up). EDIT: maybe not so slow to update, so better than I thought.

The below is just my idea so far; maybe something better can be found?

FINAL edit: since it's closed, probably fairly - the "duplicate" question is pretty close - I won't bother with any further edits. I will probably do the below, but using a probabilistic skip list structure instead of a std::set, and augmented with skip distances (so you can quickly calculate how many X elements remain in an interval, and thus reduce the number of search intervals, by switching to linear search when the intersection gets small). This is similar to Order Statistic Trees given in this question, but skip lists are a lot easier to reimplement than std::set (especially as I don't need deletions).

Represent X as a std::set or sorted std::vector of 64-bit unsigned integers std::uint64_t, using the ordinary numerical order, and do recursive searches within smaller and smaller intervals.

E.g., my query element is A = 10011010. Subsets of A containing the first bit lie in the inclusive interval [10000000, 10011010].

Subsets of A containing the second bit but not the first lie in the interval [00010000, 00011010].

Those with the third but not the second bit are in [00001000, 00001010].

Those with the fourth but not the third bit are in [00000010, 00000010].

Now, within the first interval [10000000, 10011010] you could make two subintervals to search, based on the second bit: [10000000, 10001010] and [10010000, 10011010].

Thus you can break it down recursively in this manner. The total length of search intervals is getting smaller all the time, so this is surely going to be better asymptotically than a trivial linear search through all of X.

E.g., if X = {00000010, 00001000, 00110111, 10011100} then only the first, third, fourth depth-1 intervals would have nonempty intersection with X. The final returned result would be [00000010, 00001000].

Of course this is unbalanced if the X elements are distributed fairly uniformly. We might want the search intervals to have roughly equal width at each depth, and they don't; above, the sizes of the four depth-1 search intervals are, I think, 27, 11, 3, 1, and for larger N the discrepancies could be much bigger.

If there are k bits in the query set A, then you'll have to construct k initial search intervals at depth 1 (searching on ONE bit), then up to 2k search intervals at depth 2, 4k at depth 3, etc.

If I've got it right, since log |X| = O(N) the number of search intervals is O(k + 2k + 4k + ... + 2^n . k) = O(k^2) = O(N^2), where 2^n = O(k), and each one takes O(N) time to construct (actually a bit less since it's the log of a smaller number, but the log doesn't increase much), so it seems like this is an O(N^3) algorithm to construct the search intervals.

Of course the full algorithm is not O(N^3), because each interval may contain many elements, so listing them all cannot be better than O(2^N) in general, but let's ignore this and assume that there are not enough elements of X to overwhelm the O(N^3) estimate.

Another issue is that std::map cannot tell you how many elements lie within an interval (unlike a sorted std::vector) so you don't know when to break off the partitioning and search through all remaining X elements in the interval. Of course, you have an upper bound on the number of X elements (the size of the full interval) but it may be quite poor.

EDIT: the answer to another question shows how to have a std::set-like structure which also quickly gives you the number of elements in a range, which obviously could be adapted to std::map-like structures. This would work well here for pruning (although annoying that, for C++, you'd have to reimplement most of std::map!)

...

ANSWER

Answered 2021-Oct-29 at 04:33

Solution

Treating your integers as strings of 0 and 1, build a customized version of patricia tree, with the following rule:

During lookup, if 1 is the current input bit at a branch, continue down both subtrees

The collection of all valid leaf nodes reached will be the answer.

Complexity

Let n be size of X,

Time: O(n)

Worst case -1, when all subtrees are traversed. Complexity is bound by total number of nodes, stated below

Space: O(n)

Number of nodes in a patricia tree is exactly 2n - 1

Rationale

Given that your match condition is (A & B) == B, a truth table is thus:

. A0 A1 B0 T T B1 F T

Hence, during lookup, we collect both subtrees on a branch node when the input bit is 1.

Source https://stackoverflow.com/questions/69721703

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install bottleneck

You can install using 'pip install bottleneck' or download it from GitHub, PyPI.
You can use bottleneck like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.

Support

For any new features, suggestions and bugs create an issue on GitHub. If you have any questions check and ask questions on community page Stack Overflow .

Find more information at: