overhead | Stupidly overcomplicated Discord framework for building high performance bots | Chat library

by Noxime | Rust | Version: Current | License: No License

kandi X-RAY | overhead Summary


overhead is a Rust library typically used in Messaging, Chat, and Framework applications. overhead has no bugs, no reported vulnerabilities, and low support. You can download it from GitHub.

Stupidly overcomplicated Discord framework for building high performance bots

Support

              overhead has a low active ecosystem.
              It has 1 star(s) with 0 fork(s). There are no watchers for this library.
              It had no major release in the last 6 months.
              overhead has no issues reported. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of overhead is current.

Quality

              overhead has 0 bugs and 0 code smells.

Security

              overhead has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              overhead code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              overhead does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
              Without a license, all rights are reserved, and you cannot use the library in your applications.

Reuse

              overhead releases are not available. You will need to build from source code and install.

            Top functions reviewed by kandi - BETA

            kandi's functional review helps you automatically verify the functionalities of the libraries and avoid rework.
Currently covering the most popular Java, JavaScript and Python libraries.

            overhead Key Features

            No Key Features are available at this moment for overhead.

            overhead Examples and Code Snippets

            No Code Snippets are available at this moment for overhead.

            Community Discussions

            QUESTION

            Why is `np.sum(range(N))` very slow?
            Asked 2022-Mar-29 at 14:31

I saw a video about the speed of loops in Python, where it was explained that doing sum(range(N)) is much faster than manually looping through the range and adding the variables together, since the former runs in C thanks to built-in functions, while in the latter the summation is done in (slow) Python. I was curious what happens when numpy is added to the mix. As I expected, np.sum(np.arange(N)) is the fastest, but sum(np.arange(N)) and np.sum(range(N)) are even slower than the naive for loop.

            Why is this?

Here's the script I used to test, some comments about the supposed causes of the slowdown where I know them (taken mostly from the video), and the results I got on my machine (Python 3.10.0, NumPy 1.21.2):

            updated script:

            ...

            ANSWER

            Answered 2021-Oct-16 at 17:42

From the CPython source code for sum, sum initially seems to attempt a fast path that assumes all inputs are of the same type. If that fails, it will just iterate:

            Source https://stackoverflow.com/questions/69584027

            QUESTION

            Why does static_cast conversion speed up an un-optimized build of my integer division function?
            Asked 2022-Mar-17 at 15:27

... or rather, why does not using static_cast slow down my function?

            Consider the function below, which performs integer division:

            ...

            ANSWER

            Answered 2022-Mar-17 at 15:27

            I'm keeping this answer up for now as the comments are useful.

            Source https://stackoverflow.com/questions/71509342

            QUESTION

            What is 'serviceability memory category' of Native Memory Tracking?
            Asked 2022-Jan-17 at 13:38

I have a Java app (JDK 13) running in a Docker container. Recently I moved the app to JDK 17 (OpenJDK 17) and found a gradual increase in memory usage by the Docker container.

During the investigation I found that the 'serviceability' memory category of NMT (Native Memory Tracking) grows constantly (15 MB per hour). I checked the page https://docs.oracle.com/en/java/javase/17/troubleshoot/diagnostic-tools.html#GUID-5EF7BB07-C903-4EBD-A9C2-EC0E44048D37 but this category is not mentioned there.

Could anyone explain what this serviceability category means and what can cause such a gradual increase? Also, there are some additional new memory categories compared to JDK 13. Maybe someone knows where I can read details about them.

Here is the result of the command jcmd 1 VM.native_memory summary:

            ...

            ANSWER

            Answered 2022-Jan-17 at 13:38

Unfortunately (?), the easiest way to know for sure what those categories map to is to look at the OpenJDK source code. The NMT tag you are looking for is mtServiceability. This would show that "serviceability" basically covers the diagnostic interfaces in the JDK/JVM: JVMTI, heap dumps, etc.

But the same kind of thing is clear from observing that the stack trace sample you are showing mentions ThreadStackTrace::dump_stack_at_safepoint -- that is something that dumps thread information, for example for jstack, heap dumps, etc. If you suspect a memory leak in that code, you might try to build an MCVE demonstrating it and submit a bug against OpenJDK, or show it to a fellow OpenJDK developer. You probably know better what your application is doing to cause thread dumps; focus there.

That being said, I don't see any obvious memory leaks in StackFrameInfo, nor can I reproduce any leak with stress tests, so maybe what you are seeing is "just" thread dumping over larger and larger thread stacks. Or you are capturing it while a thread dump is happening. Or... It is hard to say without the MCVE.

            Update: After playing with MCVE, I realized that it reproduces with 17.0.1, but not with either mainline development JDK, or JDK 18 EA, or JDK 17.0.2 EA. I tested with 17.0.2 EA before, so was not seeing it, dang. Bisection between 17.0.1 and 17.0.2 EA shows it was fixed with JDK-8273902 backport. 17.0.2 releases this week, so the bug should disappear after you upgrade.

            Source https://stackoverflow.com/questions/70709971

            QUESTION

            How to use of laziness in Scheme efficiently?
            Asked 2021-Dec-30 at 10:19

            I am trying to encode a small lambda calculus with algebraic datatypes in Scheme. I want it to use lazy evaluation, for which I tried to use the primitives delay and force. However, this has a large negative impact on the performance of evaluation: the execution time on a small test case goes up by a factor of 20x.

            While I did not expect laziness to speed up this particular test case, I did not expect a huge slowdown either. My question is thus: What is causing this huge overhead with lazy evaluation, and how can I avoid this problem while still getting lazy evaluation? I would already be happy to get within 2x the execution time of the strict version, but faster is of course always better.

            Below are the strict and lazy versions of the test case I used. The test deals with natural numbers in unary notation: it constructs a sequence of 2^24 sucs followed by a zero and then destructs the result again. The lazy version was constructed from the strict version by adding delay and force in appropriate places, and adding let-bindings to avoid forcing an argument more than once. (I also tried a version where zero and suc were strict but other functions were lazy, but this was even slower than the fully lazy version so I omitted it here.)

I compiled both programs using compile-file in Chez Scheme 9.5 and executed the resulting .so files with petite --program. Execution time (user only) for the strict version was 0.578s, while the lazy version took 11.891s, which is almost exactly 20x slower.

            Strict version ...

            ANSWER

            Answered 2021-Dec-28 at 16:24

            This sounds very like a problem that crops up in Haskell from time to time. The problem is one of garbage collection.

            There are two ways that this can go. Firstly, the lazy list can be consumed as it is used, so that the amount of memory consumed is limited. Or, secondly, the lazy list can be evaluated in a way that it remains in memory all of the time, with one end of the list pinned in place because it is still being used - the garbage collector objects to this and spends a lot of time trying to deal with this situation.

            Haskell can be as fast as C, but requires the calculation to be strict for this to be possible.

            I don't entirely understand the code, but it appears to be recursively creating a longer and longer list, which is then evaluated. Do you have the tools to measure the amount of memory that the garbage collector is having to deal with, and how much time the garbage collector runs for?

            Source https://stackoverflow.com/questions/70501342

            QUESTION

            FFMPEG's xstack command results in out of sync sound, is it possible to mix the audio in a single encoding?
            Asked 2021-Dec-16 at 21:11

I wrote a Python script that generates an xstack complex filter command. The video inputs are a mixture of several formats described here:

            I have 2 commands generated, one for the xstack filter, and one for the audio mixing.

            Here is the stack command: (sorry the text doesn't wrap!)

            ...

            ANSWER

            Answered 2021-Dec-16 at 21:11

I'm a bit confused as to how FFMPEG handles diverse framerates

It doesn't, which would cause a misalignment in your case. The vast majority of filters (essentially, any which deal with multiple sources and make use of frames), including the Concatenate filter, require that the sources have the same framerate.

            For the concat filter to work, the inputs have to be of the same frame dimensions (e.g., 1920⨉1080 pixels) and should have the same framerate.

            (emphasis added)

            The documentation also adds:

Therefore, you may at least have to add a scale or scale2ref filter before concatenating videos. A handful of other attributes have to match as well, like the stream aspect ratio. Refer to the documentation of the filter for more info.

            You should convert your sources to the same framerate first.

            Source https://stackoverflow.com/questions/70020874

            QUESTION

            Best way to abstract away an init function?
            Asked 2021-Dec-13 at 10:16

I am making a low-level library that requires initialization to work properly, which I implemented with an init function. I am wondering if there is a way to have init called automatically once the user calls a library function, ideally without:

1. Any overhead
2. Repeated calls
3. Exposed global variables (my current solution does this, which I don't quite like)

My current solution, as per a comment request:

            ...

            ANSWER

            Answered 2021-Dec-13 at 06:56

            If you're happy with a solution that is a common extension rather than part of the C standard, you can mark your init function with the constructor attribute, which ensures it will be called automatically during program initialization (or during shared library load if you eventually end up using that).
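For comparison, in Rust (the language of the library this page covers) the same run-init-exactly-once-on-first-use requirement is commonly handled with std::sync::Once rather than a link-time constructor. A minimal sketch, with a hypothetical do_work entry point standing in for a library function:

    use std::sync::Once;

    static INIT: Once = Once::new();

    // Hypothetical public entry point of the library.
    pub fn do_work() {
        // The first caller runs the closure; later callers only perform a
        // cheap atomic check, and no mutable global is exposed to users.
        INIT.call_once(|| {
            // one-time initialization goes here
        });
        // ... actual work ...
    }

call_once guarantees the closure runs at most once even under concurrent first calls, which covers the "no repeated calls" point at the cost of one atomic load per call.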

            Source https://stackoverflow.com/questions/70330627

            QUESTION

            Loop takes more cycles to execute than expected in an ARM Cortex-A72 CPU
            Asked 2021-Dec-03 at 06:02

            Consider the following code, running on an ARM Cortex-A72 processor (optimization guide here). I have included what I expect are resource pressures for each execution port:

Instruction                 | B | I0  | I1  | M | L | S | F0 | F1
.LBB0_1:                    |   |     |     |   |   |   |    |
ldr q3, [x1], #16           |   | 0.5 | 0.5 |   | 1 |   |    |
ldr q4, [x2], #16           |   | 0.5 | 0.5 |   | 1 |   |    |
add x8, x8, #4              |   | 0.5 | 0.5 |   |   |   |    |
cmp x8, #508                |   | 0.5 | 0.5 |   |   |   |    |
mul v5.4s, v3.4s, v4.4s     |   |     |     |   |   |   | 2  |
mul v5.4s, v5.4s, v0.4s     |   |     |     |   |   |   | 2  |
smull v6.2d, v5.2s, v1.2s   |   |     |     |   |   |   | 1  |
smull2 v5.2d, v5.4s, v2.4s  |   |     |     |   |   |   | 1  |
smlal v6.2d, v3.2s, v4.2s   |   |     |     |   |   |   | 1  |
smlal2 v5.2d, v3.4s, v4.4s  |   |     |     |   |   |   | 1  |
uzp2 v3.4s, v6.4s, v5.4s    |   |     |     |   |   |   |    | 1
str q3, [x0], #16           |   | 0.5 | 0.5 |   |   | 1 |    |
b.lo .LBB0_1                | 1 |     |     |   |   |   |    |
Total port pressure         | 1 | 2.5 | 2.5 | 0 | 2 | 1 | 8  | 1

            Although uzp2 could run on either the F0 or F1 ports, I chose to attribute it entirely to F1 due to high pressure on F0 and zero pressure on F1 other than this instruction.

            There are no dependencies between loop iterations, other than the loop counter and array pointers; and these should be resolved very quickly, compared to the time taken for the rest of the loop body.

            Thus, my intuition is that this code should be throughput limited, and considering the worst pressure is on F0, run in 8 cycles per iteration (unless it hits a decoding bottleneck or cache misses). The latter is unlikely given the streaming access pattern, and the fact that arrays comfortably fit in L1 cache. As for the former, considering the constraints listed on section 4.1 of the optimization manual, I project that the loop body is decodable in only 8 cycles.

            Yet microbenchmarking indicates that each iteration of the loop body takes 12.5 cycles on average. If no other plausible explanation exists, I may edit the question including further details about how I benchmarked this code, but I'm fairly certain the difference can't be attributed to benchmarking artifacts alone. Also, I have tried to increase the number of iterations to see if performance improved towards an asymptotic limit due to startup/cool-down effects, but it appears to have done so already for the selected value of 128 iterations displayed above.

            Manually unrolling the loop to include two calculations per iteration decreased performance to 13 cycles; however, note that this would also duplicate the number of load and store instructions. Interestingly, if the doubled loads and stores are instead replaced by single LD1/ST1 instructions (two-register format) (e.g. ld1 { v3.4s, v4.4s }, [x1], #32) then performance improves to 11.75 cycles per iteration. Further unrolling the loop to four calculations per iteration, while using the four-register format of LD1/ST1, improves performance to 11.25 cycles per iteration.

            In spite of the improvements, the performance is still far away from the 8 cycles per iteration that I expected from looking at resource pressures alone. Even if the CPU made a bad scheduling call and issued uzp2 to F0, revising the resource pressure table would indicate 9 cycles per iteration, still far from actual measurements. So, what's causing this code to run so much slower than expected? What kind of effects am I missing in my analysis?

            EDIT: As promised, some more benchmarking details. I run the loop 3 times for warmup, 10 times for say n = 512, and then 10 times for n = 256. I take the minimum cycle count for the n = 512 runs and subtract from the minimum for n = 256. The difference should give me how many cycles it takes to run for n = 256, while canceling out the fixed setup cost (code not shown). In addition, this should ensure all data is in the L1 I and D cache. Measurements are taken by reading the cycle counter (pmccntr_el0) directly. Any overhead should be canceled out by the measurement strategy above.

            ...

            ANSWER

            Answered 2021-Nov-06 at 13:50

First off, you can further reduce the theoretical cycles to 6 by replacing the first mul with uzp1 and doing the following smull and smlal the other way around: mul, mul, smull, smlal => smull, uzp1, mul, smlal. This also heavily reduces the register pressure, so that we can do an even deeper unrolling (up to 32 per iteration).

And you don't need the v2 coefficients; you can pack them into the higher part of v1.

            Let's rule out everything by unrolling this deep and writing it in assembly:

            Source https://stackoverflow.com/questions/69855672

            QUESTION

            Fast idiomatic Floyd-Warshall algorithm in Rust
            Asked 2021-Nov-21 at 22:48

I am trying to implement a reasonably fast version of the Floyd-Warshall algorithm in Rust. This algorithm finds the shortest paths between all pairs of vertices in a directed weighted graph.

            The main part of the algorithm could be written like this:

            ...
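For readers unfamiliar with the algorithm, its core is a triple nested loop that relaxes every (i, j) pair through each intermediate vertex k in turn. A minimal Rust sketch (illustrative only, not the asker's elided code, assuming a flattened n x n distance matrix):

    // `dist` is a flattened n x n matrix where dist[i * n + j] holds the
    // current best known distance from vertex i to vertex j.
    fn floyd_warshall(dist: &mut [u64], n: usize) {
        assert_eq!(dist.len(), n * n);
        for k in 0..n {
            for i in 0..n {
                let dik = dist[i * n + k];
                for j in 0..n {
                    // Relax the path i -> k -> j; saturating_add keeps an
                    // "infinity" value (u64::MAX) from wrapping around.
                    let through_k = dik.saturating_add(dist[k * n + j]);
                    if through_k < dist[i * n + j] {
                        dist[i * n + j] = through_k;
                    }
                }
            }
        }
    }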

            ANSWER

            Answered 2021-Nov-21 at 19:55

            At first blush, one would hope this would be enough:

            Source https://stackoverflow.com/questions/70050040

            QUESTION

            When should you use tokio::join!() over tokio::spawn()?
            Asked 2021-Oct-20 at 16:11

            Let's say I want to download two web pages concurrently with Tokio...

            Either I could implement this with tokio::spawn():

            ...

            ANSWER

            Answered 2021-Oct-20 at 03:01

The difference will depend on how you have configured the runtime. tokio::join! will run the tasks concurrently within the same task, while tokio::spawn creates a new task for each.

In a single-threaded runtime, these are effectively the same. In a multi-threaded runtime, using tokio::spawn twice like that may use two separate threads.

            From the docs for tokio::join!:

            By running all async expressions on the current task, the expressions are able to run concurrently but not in parallel. This means all expressions are run on the same thread and if one branch blocks the thread, all other expressions will be unable to continue. If parallelism is required, spawn each async expression using tokio::spawn and pass the join handle to join!.

            For IO-bound tasks, like downloading web pages, you aren't going to notice the difference; most of the time will be spent waiting for packets and each task can efficiently interleave their processing.

Use tokio::spawn when tasks are more CPU-bound and could block each other.
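A minimal sketch contrasting the two approaches; the fetch function here is a hypothetical stand-in for the page download (a real version would use an HTTP client such as reqwest):

    // Hypothetical stand-in for "download a web page".
    async fn fetch(url: &str) -> String {
        url.to_string()
    }

    #[tokio::main]
    async fn main() {
        // tokio::join!: both futures run concurrently on the current task,
        // so they share one thread and cannot run in parallel.
        let (a, b) = tokio::join!(
            fetch("https://example.com/a"),
            fetch("https://example.com/b")
        );

        // tokio::spawn: each future becomes its own task, which a
        // multi-threaded runtime may schedule on different worker threads.
        let ta = tokio::spawn(fetch("https://example.com/a"));
        let tb = tokio::spawn(fetch("https://example.com/b"));
        let (c, d) = (ta.await.unwrap(), tb.await.unwrap());

        println!("{} {} {} {}", a.len(), b.len(), c.len(), d.len());
    }

For IO-bound work like this, both versions finish in roughly the same time; the spawn version only pays off when a task can monopolize a thread.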

            Source https://stackoverflow.com/questions/69638710

            QUESTION

            Is it possible for other x86-64 emulators on M1 to leverage the same optimizations used by Rosetta 2?
            Asked 2021-Sep-04 at 20:10

            I am curious about the vastly different performance characteristics of running x86-64 binaries on the Apple M1 platform using Rosetta 2 vs. emulation, for example what Docker Desktop currently does using QEMU.

            I understand why emulation is so slow, but an explanation for why Rosetta 2 is so fast has been detailed in this Twitter thread: https://twitter.com/ErrataRob/status/1331735383193903104

            The gist of that explanation is that under usual circumstances, arm and x86 have opposite (and incompatible) memory addressing schemes which require significant emulation overhead, but the M1 chip addresses this with a hardware optimization that allows it to access memory using both addressing schemes. Effectively, when Rosetta 2-emulated instructions are being run, a flag is set to let the processor know to use the x86-style addressing scheme.

            Assuming this explanation is reasonable (and if anyone has better-sourced reporting than the above Twitter thread I would appreciate it in the comments for inclusion), is it technically plausible that this optimization could be leveraged for full hardware emulation, for example running x86-64 Linux Docker containers, or running a full x86-64 Windows desktop virtual machine a la VMware Fusion/VirtualBox? Or, does the separate operating system layer in those scenarios preclude being able to leverage the memory ordering optimization?

            Separately, is this processor mode (flags or instructions) documented and published for 3rd-party use, or is it private to Apple only?

            ...

            ANSWER

            Answered 2021-Sep-04 at 20:09

            Not memory addressing, memory ordering. i.e. for lock-free atomics used for inter-thread synchronization - in x86, every asm load/store is acquire/release respectively. (With real x86 CPUs doing speculative early loads so performance doesn't suffer under normal conditions when a single thread is operating on memory that other threads aren't writing.)
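To make the acquire/release terminology concrete, here is a small Rust sketch of a release-store/acquire-load handoff. On x86-64 both atomics compile to ordinary loads and stores because the hardware ordering is already that strong, while weakly ordered AArch64 needs explicitly ordered instructions; that is the gap a TSO-style hardware mode closes for translated x86 code:

    use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
    use std::thread;

    static DATA: AtomicU64 = AtomicU64::new(0);
    static READY: AtomicBool = AtomicBool::new(false);

    fn main() {
        let producer = thread::spawn(|| {
            DATA.store(42, Ordering::Relaxed);
            // Release: everything written before this store is visible to
            // a thread that observes READY == true with an Acquire load.
            READY.store(true, Ordering::Release);
        });

        let consumer = thread::spawn(|| {
            // Acquire: pairs with the Release store above.
            while !READY.load(Ordering::Acquire) {
                std::hint::spin_loop();
            }
            assert_eq!(DATA.load(Ordering::Relaxed), 42);
        });

        producer.join().unwrap();
        consumer.join().unwrap();
    }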

M1 has hardware support for a mode like that, as well as a weakly-ordered mode to run native AArch64 code most efficiently.

            And yes, https://github.com/saagarjha/TSOEnabler is open-source software to toggle that support. But it's a kernel extension, and code signing makes it tricky to get MacOS to allow it, and you have to sign it or disable signature verification or something:

            Supposedly, you should be able to use this if you build and sign the kernel extension (disabling SIP if you aren't using a KEXT signing certificate) and drag it into /Library/Extensions. A dialog should come up to prompt you to enable the extension in the Security & Privacy preferences pane, you allow it from there and restart, and it will be installed. (If you're not seeing it, the permissions on the extension might be wrong: try a chown -R root:wheel.) In practice this can go wrong in many ways, and I have had luck by "resetting everything" and trying to install after doing the following:

            [...] see the link for the list of steps

            So yes, it's plausible that QEMU's x86 emulation could use the same hardware support that Rosetta-2's x86 emulation does. They're both x86 emulators.

            And as you say, the main issue is Apple providing public APIs for enabling the HW mode so people don't have to install this kernel module manually; I'm sure most people wouldn't want to do that. I don't know much about the software situation, I was more interested in the CPU-architecture details.

            Source https://stackoverflow.com/questions/69056518

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install overhead

            You can download it from GitHub.
Rust is installed and managed by the rustup tool. Rust has a 6-week rapid release process and supports a great number of platforms, so there are many builds of Rust available at any time. Please refer to rust-lang.org for more information.

            Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask questions on the Stack Overflow community page.
CLONE
• HTTPS: https://github.com/Noxime/overhead.git
• CLI: gh repo clone Noxime/overhead
• SSH: git@github.com:Noxime/overhead.git
