rsp | Rapid SSH Proxy | Proxy library
kandi X-RAY | rsp Summary
Rapid SSH Proxy. Like ssh -ND, but much faster. rsp is an SSH client that implements the SOCKS5 proxy feature of the SSH protocol. The key feature of this implementation is the use of multiple connections to overcome the downsides of multiplexing many tunneled TCP connections in a single SSH session: multiple sessions are not limited by the TCP window size of a single connection, and packet loss does not affect all tunneled connections at once. To cut connection-establishment latency, rsp maintains a pool of steady connections, which is replenished at a configurable rate.
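The pooling idea can be sketched briefly. The snippet below is a conceptual illustration only, not the library's actual code (rsp itself is asynchronous Python; the names and the dial_upstream helper are hypothetical): keep a stock of pre-established upstream sessions, top it up from a timer at the configured rate, and hand one out immediately when a client connects.

```c
#include <stddef.h>

#define POOL_SIZE 16

typedef struct { int fd; } conn;        /* stand-in for an established SSH session        */

extern conn dial_upstream(void);        /* hypothetical: blocking connect + SSH handshake */

static conn pool[POOL_SIZE];
static size_t pool_len;

/* Called from a periodic timer whose interval is the configured replenish rate. */
void pool_replenish(void) {
    if (pool_len < POOL_SIZE)
        pool[pool_len++] = dial_upstream();
}

/* Serve a client: take a ready connection if one exists; otherwise pay the
   full connection-establishment latency with a synchronous dial. */
conn pool_get(void) {
    if (pool_len > 0)
        return pool[--pool_len];
    return dial_upstream();
}
```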
Top functions reviewed by kandi - BETA
- Main function for MAIN
- Drain the queue
- Wait for the release
- Put an item into the queue
- Add a status to the queue
- Schedule the dispatcher to run
- Send data
- Get a connection from the pool
- Connect to upstream
- Rebuild the pool
- Parse command line arguments
- Attempt to schedule events
- Enable the uvloop
- Configure a logger
rsp Key Features
rsp Examples and Code Snippets
Community Discussions
Trending Discussions on rsp
QUESTION
Consider the following examples for calculating the sum of an i32 array:
Example 1: simple for loop
...ANSWER
Answered 2022-Apr-09 at 09:13
It appears you forgot to tell rustc it was allowed to use AVX2 instructions everywhere, so it couldn't inline those functions. Instead, you get a total disaster where only the wrapper functions are compiled as AVX2-using functions, or something like that.
Works fine for me with -O -C target-cpu=skylake-avx512 (https://godbolt.org/z/csY5or43T), so it can inline even the AVX512VL load you used, _mm256_load_epi32 (footnote 1), and then optimize it into a memory-source operand for vpaddd ymm0, ymm0, ymmword ptr [rdi + 4*rax] (AVX2) inside a tight loop.
In GCC / clang, you get an error like "inlining failed in call to always_inline foobar" in this case, instead of working-but-slow asm. (See this for details.) This is something Rust should probably sort out before this is ready for prime time: either be like MSVC and actually inline the instruction into a function using the intrinsic, or refuse to compile like GCC/clang.
Footnote 1: See How to emulate _mm256_loadu_epi32 with gcc or clang? if you didn't mean to use AVX512.
With -O -C target-cpu=skylake (just AVX2), it inlines everything else, including vpaddd ymm, but still calls out to a function that copies 32 bytes from memory to memory with AVX vmovaps. It requires AVX512VL to inline the intrinsic, but later in the optimization process it realizes that with no masking, it's just a 256-bit load it should do without a bloated AVX-512 instruction. It's kind of dumb that Intel even provided a no-masking version of _mm256_mask[z]_loadu_epi32 that requires AVX-512, or dumb that gcc/clang/rustc consider it an AVX-512 intrinsic.
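For reference, the inlining-failure behavior GCC and clang exhibit is easy to reproduce in C; this is an illustrative snippet, not code from the question:

```c
#include <immintrin.h>

/* Built without AVX2 enabled (e.g. plain "gcc -O2 -c sum.c"), GCC reports
   "inlining failed in call to 'always_inline' '_mm256_add_epi32':
   target specific option mismatch", and clang reports a similar error,
   instead of emitting slow, non-inlined code. Compiling with -mavx2 or
   -march=haswell (or adding __attribute__((target("avx2"))) to the
   function) makes it build and inline normally. */
__m256i add8(__m256i a, __m256i b) {
    return _mm256_add_epi32(a, b);
}
```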
QUESTION
I have implemented a Convolutional Neural Network in C and have been studying what parts of it have the longest latency.
Based on my research, the massive amount of matrix multiplication required by CNNs makes running them on CPUs and even GPUs very inefficient. However, when I actually profiled my code (on an unoptimized build), I found out that something other than the multiplication itself was the bottleneck of the implementation.
After turning on optimization (-O3 -march=native -ffast-math, gcc cross compiler), the Gprof result was the following:
Clearly, the convolution2D function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.
The convolution function in question looks like this:
...ANSWER
Answered 2022-Mar-10 at 13:57
Looking at the result of Cachegrind, it doesn't look like memory is your bottleneck. The NN has to be stored in memory anyway, but if it were so large that your program had a lot of L1 cache misses, it would be worth trying to minimize them; a 1.7% L1 (data) miss rate is not a problem.
So you're trying to make this run fast anyway. Looking at your code, what happens in the innermost loop is very simple (load -> multiply -> add -> store), and it has no side effect other than the final store. This kind of code is easily parallelizable, for example by multithreading or vectorizing. Judging by the complexity of the code you already wrote, you'll know how to make it run in multiple threads, and you asked in the comments how to manually vectorize it.
I will explain that part, but one thing to bear in mind is that once you choose to manually vectorize the code, it will often be tied to certain CPU architectures. Let's not consider non-AMD64-compatible CPUs like ARM. You still have MMX, SSE, AVX, and AVX-512 to choose from as an extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice: it appeared with the Pentium 4, and it supports 128-bit vectors. For this post I'll use AVX2, which supports 128-bit and 256-bit vectors. It runs fine on your CPU and has reasonable portability these days, being supported from Haswell (2013) and Excavator (2015) onward.
The pattern you're using in the inner loop is called FMA (fused multiply-add). The FMA extension, available alongside AVX2 on mainstream CPUs, provides a single instruction for it. Have a look at this function and the compiled output.
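As a rough sketch of what a manually vectorized inner loop can look like with AVX2 and FMA intrinsics (the function name, arguments, and loop bounds are illustrative, not taken from the question's convolution code):

```c
#include <immintrin.h>

/* out[i] += in[i] * k, processing 8 floats per iteration.
   Assumes n is a multiple of 8 and the pointers do not alias; a real kernel
   also needs a scalar tail loop. Build with -mavx2 -mfma (or -march=haswell). */
static void fma_loop(float *out, const float *in, float k, int n) {
    __m256 vk = _mm256_set1_ps(k);                 /* broadcast the scalar factor */
    for (int i = 0; i < n; i += 8) {
        __m256 vin  = _mm256_loadu_ps(&in[i]);     /* load 8 inputs               */
        __m256 vout = _mm256_loadu_ps(&out[i]);    /* load 8 accumulators         */
        vout = _mm256_fmadd_ps(vin, vk, vout);     /* vout = vin * vk + vout      */
        _mm256_storeu_ps(&out[i], vout);           /* store the 8 results back    */
    }
}
```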
QUESTION
I am in the process of creating a fiber threading system in C, following https://graphitemaster.github.io/fibers/ . I have a function to set and restore context, and what I am trying to accomplish is launching a function as a fiber with its own stack. Linux, x86_64 SysV ABI.
...ANSWER
Answered 2022-Feb-25 at 05:34
Agree with comments: your stack alignment is incorrect.
It is true that the stack must be aligned to 16 bytes. However, the question is when? The normal rule is that the stack pointer must be a multiple of 16 at the site of a call instruction that calls an ABI-compliant function.
Well, you don't use a call instruction, but what that really means is that on entry to an ABI-compliant function, the stack pointer must be 8 less than a multiple of 16, or in other words an odd multiple of 8, since it assumes it was called with a call instruction that pushed an 8-byte return address. That is just the opposite of what your code does, and so the stack is misaligned for the rest of your program, which makes printf crash when it tries to use aligned move instructions.
You could subtract 8 from the sp computed in your C code.
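A minimal sketch of that adjustment (illustrative; the names are not from the question's code): align the top of the new stack down to 16 bytes, then subtract 8 so it matches the state the ABI expects on function entry, as if a call had just pushed a return address.

```c
#include <stdint.h>
#include <stddef.h>

void *fiber_entry_sp(void *stack_base, size_t stack_size) {
    uintptr_t sp = (uintptr_t)stack_base + stack_size;  /* top of the new stack               */
    sp &= ~(uintptr_t)15;   /* 16-byte aligned, as at the call site                           */
    sp -= 8;                /* minus the "return address" slot: an odd multiple of 8 on entry */
    return (void *)sp;
}
```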
Or, I'm not really sure why you go to the trouble of loading the destination address into a register, then pushing and ret, when an indirect jump or call would do. (Unless you are deliberately trying to fool the indirect branch predictor?) An indirect call will also kill the stack-alignment bird, by pushing the return address (even though it will never be used). So you could leave the rest of your code alone, and replace all the r8/ret stuff in restore_context with just
QUESTION
I'm reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e) and the following code is an example from the book:
...ANSWER
Answered 2022-Feb-03 at 04:10
(This answer is a summary of comments posted above by Antti Haapala, klutt and Peter Cordes.)
GCC allocates more space than "necessary" in order to ensure that the stack is properly aligned for the call to proc: the stack pointer must be adjusted by a multiple of 16, plus 8 (i.e. by an odd multiple of 8). See also: Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?
What's strange is that the code in the book doesn't do that; the code as shown would violate the ABI and, if proc actually relies on proper stack alignment (e.g. using aligned SSE2 instructions), it may crash.
So it appears that either the code in the book was incorrectly copied from compiler output, or else the authors of the book are using some unusual compiler flags which alter the ABI.
Modern GCC 11.2 emits nearly identical asm (Godbolt) using -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args, the former of which changes the ABI to maintain only 2^3-byte stack alignment, down from the default 2^4. (Code compiled this way can't safely call anything compiled normally, not even standard library functions.) -maccumulate-outgoing-args used to be the default in older GCC, but modern CPUs have a "stack engine" that makes push/pop single-uop, so that option isn't the default anymore; using push for stack args also saves a bit of code size.
One difference from the book's asm is a movl $0, %eax before the call, because there's no prototype, so the caller has to assume it might be variadic and pass AL = the number of FP args in XMM registers. (A prototype that matches the args passed would prevent that.) The other instructions are all the same, and in the same order as whatever older GCC version the book used, except for the choice of registers after call proc returns: it ends up using movslq %edx, %rdx instead of cltq (sign-extension of EAX into RAX).
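A tiny illustration of that point (hypothetical code, not the book's): with no prototype in scope, GCC must assume the callee could be variadic, so it sets AL to the number of XMM registers used for arguments before the call.

```c
extern void proc();        /* empty parentheses in C: no prototype, argument types unknown */

void caller(void) {
    proc(1, 2L);           /* no FP args, so GCC emits "movl $0, %eax" (or
                              "xorl %eax, %eax") before "call proc"; declaring a
                              full prototype for proc removes that instruction.  */
}
```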
CS:APP 3e global edition is notorious for errors in practice problems introduced by the publisher (not the authors), but apparently this code is present in the North American edition, too. So this may be the author's mistake / choice to use actual compiler output with weird options. Unlike some of the bad global edition practice problems, this code could have come unmodified from some GCC version, but only with non-standard options.
Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? - GCC has a missed-optimization bug where it sometimes reserves an additional 16 bytes that it truly didn't need to. That's not what's happening here, though.
QUESTION
I'm more of a C++ person than a C person, but this simple example of code is a big surprise to me:
...ANSWER
Answered 2022-Jan-17 at 12:33
This function is 100% correct.
In the C language, int foo() { means: define a function foo returning int and taking an unspecified number of parameters.
Your confusion comes from C++, where int foo(void) and int foo() mean exactly the same thing.
In the C language, to define a function which does not take any parameters, you need to define it as:
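A minimal illustration (the answer's original snippet is not reproduced here):

```c
int foo(void)          /* prototype: foo takes no parameters */
{
    return 42;
}

/* By contrast, "int foo();" declares foo with an unspecified parameter list,
   so a call such as foo(1, 2, 3) is not rejected by the compiler. */
```

(Since C23, an empty parameter list also means "no parameters", matching C++, but the answer above describes the long-standing C89/C99/C11 behavior.)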
QUESTION
I am somewhat new to Python. I have looked around but cannot find an answer that fits exactly what I am looking for.
I have a function that makes an HTTP call using the requests package. I'd like to print a '.' to the screen (or any char), say every 10 seconds, while the HTTP request executes, and stop printing when it finishes. So something like:
...ANSWER
Answered 2021-Dec-22 at 23:04
I don't really get what you're looking for, but if you want two things processed at the same time, you can use the threading module.
Example:
QUESTION
I have observed that GCC's C++ compiler generates the following assembler code:
...ANSWER
Answered 2021-Dec-22 at 18:10
Try assembling both and you'll see why.
QUESTION
I'm using godbolt to get assembly of the following program:
...ANSWER
Answered 2021-Dec-13 at 06:33
You can see the cost of instructions on most mainstream architectures here and there. Based on that, and assuming you use, for example, an Intel Skylake processor, you can see that one 32-bit imul instruction can be computed per cycle, but with a latency of 3 cycles. In the optimized code, 2 lea instructions (which are very cheap) can be executed per cycle, with a 1-cycle latency. The same applies to the sal instruction (2 per cycle, 1 cycle of latency).
This means that the optimized version can be executed with only 2 cycles of latency, while the first one takes 3 cycles of latency (not counting load/store instructions, which are the same). Moreover, the second version can be better pipelined, since the two instructions can be executed for two different input data in parallel thanks to superscalar out-of-order execution. Note that two loads can be executed in parallel too, although only one store can be executed per cycle. This means that the execution is bounded by the throughput of store instructions: overall, only one value can be computed per cycle. AFAIK, recent Intel Ice Lake processors can do two stores in parallel, like new AMD Ryzen processors. The second version is expected to be as fast as or possibly faster than the first on the chosen use case (Intel Skylake processors); it should be significantly faster on very recent x86-64 processors.
Note that the lea instruction is very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters), and it only supports specific constants for the multiplication (supported scale factors are 1, 2, 4 and 8, which means that lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.
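For example (illustrative functions, not from the question), compilers routinely use lea for small constant multiplies:

```c
int times9(int x)  { return x * 9;  }  /* typically "leal (%rdi,%rdi,8), %eax"      */
int times10(int x) { return x * 10; }  /* typically two cheap instructions, e.g.
                                          lea then add, instead of one 3-cycle imul */
```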
I can reproduce the slower execution with -O2 using GCC 11.2 (on Linux with an i5-9600KF processor).
The main source of the slowdown is the higher number of micro-operations (uops) to be executed in the -O2 version, most likely combined with the saturation of some execution ports due to poor micro-operation scheduling.
Here is the assembly of the loop with -Os:
QUESTION
I am trying to compare the methods mentioned by Peter Cordes in his answer to the question 'set all bits in CPU register to 1'.
Therefore, I wrote a benchmark to set all 13 registers to all-ones, except e/rsp, e/rbp, and e/rcx.
The code is like below. times 32 nop is used to avoid DSB and LSD influence.
ANSWER
Answered 2021-Nov-27 at 20:04The bottleneck in all of your examples is the predecoder.
I analyzed your examples with my simulator uiCA (https://uica.uops.info/, https://github.com/andreas-abel/uiCA). It predicts the following throughputs, which closely match your measurements:
      TP     Link
g1a   13.00  https://uica.uops.info/?code=...
g1b   14.00  https://uica.uops.info/?code=...
g2a   16.00  https://uica.uops.info/?code=...
g2b   17.00  https://uica.uops.info/?code=...
g3a   17.00  https://uica.uops.info/?code=...
g3b   18.00  https://uica.uops.info/?code=...
g4a   12.00  https://uica.uops.info/?code=...
g4b   12.00  https://uica.uops.info/?code=...
The trace table that uiCA generates provides some insights into how the code is executed. For g1a, for example, it generates the following trace:
You can see that for the 32 nops, the predecoder requires 8 cycles, and for the remaining instructions, it requires 5 cycles, which together corresponds to the 13 cycles that you measured.
You may notice that in some cycles, only a small number of instructions is predecoded; for example, in the fourth cycle, only one instruction is predecoded. This is because the predecoder works on aligned 16-byte blocks, and it can handle at most five instructions per cycle (note that some sources incorrectly claim that it can handle 6 instructions per cycle). You can find more details on the predecoder, for example how it handles instructions that cross a 16-byte boundary, in this paper.
If you compare this trace with the trace for g1b, you can see that the instructions after the nops now require 6 instead of 5 cycles to be predecoded, which is because several of the instructions in g1b are longer than the corresponding ones in g1a.
QUESTION
I have the following code:
...ANSWER
Answered 2021-Nov-17 at 06:27
"Lock" here is in the sense of "mutex", not specifically in reference to the x86 instruction prefix named lock.
A trivial and generic way to implement std::atomic for an arbitrary type T would be as a class containing a T member together with a std::mutex, which is locked and unlocked around every operation on the object (load, store, exchange, fetch_add, etc.). Those operations can then be done in any old way, and need not use atomic machine instructions, because the lock protects them. This implementation would not be lock free.
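The answer describes this in terms of C++'s std::atomic; the same idea can be sketched in C with a mutex (an illustration of the concept, not a real implementation), which makes it clear why such an implementation is not lock free: every operation takes a lock around a perfectly ordinary read or write.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    long value;                 /* the "T" being protected */
} locked_long;

static locked_long counter = { PTHREAD_MUTEX_INITIALIZER, 0 };

static long locked_load(locked_long *a) {
    pthread_mutex_lock(&a->lock);
    long v = a->value;          /* plain load, protected by the lock */
    pthread_mutex_unlock(&a->lock);
    return v;
}

static long locked_fetch_add(locked_long *a, long delta) {
    pthread_mutex_lock(&a->lock);
    long old = a->value;
    a->value = old + delta;     /* plain read-modify-write under the lock */
    pthread_mutex_unlock(&a->lock);
    return old;
}
```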
A downside of such an implementation, besides being slow in general, is that if two threads try to operate on the object at the same time, one of them will have to wait for the lock, which may actually block and cause it to be scheduled out for a while. Or, if a thread gets scheduled out while holding the lock, every other thread that wants to operate on the object will have to wait for the first thread to get scheduled back in and complete its work first.
So it is desirable if the machine supports truly atomic operations on T: a single instruction or sequence that other threads cannot interfere with, and which will not block other threads if interrupted (or perhaps cannot be interrupted at all). If for some type T the library has been able to specialize std::atomic with such an implementation, then that is what we mean by saying it is lock free. (It is just confusing on x86 because the atomic instructions used for such implementations are named lock. On other architectures they might be called something else, e.g. ARM64's ldxr/stxr exclusive load/store instructions.)
The C++ standard allows for types to be "sometimes lock free": maybe it is not known at compile time whether std::atomic will be lock-free, because it depends on special machine features that will be detected at runtime. It's even possible that some objects of type std::atomic are lock-free and others are not. That's why atomic_is_lock_free is a function and not a constant: it checks whether this particular object is lock-free on this particular day.
However, it might be the case for some implementations that certain types can be guaranteed, at compile time, to always be lock free. That's what is_always_lock_free is used to indicate; note that it's a constexpr bool instead of a function.
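C11 exposes the same distinction (shown here in C for illustration): the ATOMIC_*_LOCK_FREE macros are compile-time properties (0 = never, 1 = sometimes, 2 = always lock free), while atomic_is_lock_free answers for one particular object at run time.

```c
#include <stdatomic.h>
#include <stdio.h>

int main(void) {
    _Atomic long x = 0;
    printf("long is always lock free?  %d\n", ATOMIC_LONG_LOCK_FREE == 2);
    printf("this object lock free now? %d\n", (int)atomic_is_lock_free(&x));
    return 0;
}
```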
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install rsp
Support