rsp | Rapid SSH Proxy | Proxy library
kandi X-RAY | rsp Summary
Rapid SSH Proxy. Like ssh -ND, but much faster. rsp is an SSH client that implements the SOCKS5 proxy feature of the SSH protocol. The key feature of this implementation is the use of multiple connections to overcome the downsides of multiplexing many tunneled TCP connections in a single SSH session: multiple sessions are not limited by the TCP window size of a single connection, and packet loss does not affect all tunneled connections at once. To cut connection-establishment latency, rsp maintains a pool of steady connections, which is replenished at a configurable rate.
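The pooling idea can be sketched briefly. The snippet below is a conceptual illustration only, not the library's actual code (rsp itself is asynchronous Python; the names and the dial_upstream helper are hypothetical): keep a stock of pre-established upstream sessions, top it up from a timer at the configured rate, and hand one out immediately when a client connects.

```c
#include <stddef.h>

#define POOL_SIZE 16

typedef struct { int fd; } conn;        /* stand-in for an established SSH session        */

extern conn dial_upstream(void);        /* hypothetical: blocking connect + SSH handshake */

static conn pool[POOL_SIZE];
static size_t pool_len;

/* Called from a periodic timer whose interval is the configured replenish rate. */
void pool_replenish(void) {
    if (pool_len < POOL_SIZE)
        pool[pool_len++] = dial_upstream();
}

/* Serve a client: take a ready connection if one exists; otherwise pay the
   full connection-establishment latency with a synchronous dial. */
conn pool_get(void) {
    if (pool_len > 0)
        return pool[--pool_len];
    return dial_upstream();
}
```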
Top functions reviewed by kandi - BETA
- Main function for MAIN
- Drain the queue
- Wait for the release
- Put an item into the queue
- Add a status to the queue
- Schedule the dispatcher to run
- Send data
- Get a connection from the pool
- Connect to upstream
- Rebuild the pool
- Parse command line arguments
- Attempt to schedule events
- Enable the uvloop
- Configure a logger
rsp Key Features
rsp Examples and Code Snippets
Community Discussions
Trending Discussions on rsp
QUESTION
Consider the following examples for calculating the sum of an i32 array:
Example 1: simple for loop
...ANSWER
Answered 2022-Apr-09 at 09:13
It appears you forgot to tell rustc it was allowed to use AVX2 instructions everywhere, so it couldn't inline those functions. Instead, you get a total disaster where only the wrapper functions are compiled as AVX2-using functions, or something like that.
Works fine for me with -O -C target-cpu=skylake-avx512 (https://godbolt.org/z/csY5or43T), so it can inline even the AVX512VL load you used, _mm256_load_epi32 (footnote 1), and then optimize it into a memory-source operand for vpaddd ymm0, ymm0, ymmword ptr [rdi + 4*rax] (AVX2) inside a tight loop.
In GCC / clang, you get an error like "inlining failed in call to always_inline foobar" in this case, instead of working-but-slow asm. (See this for details.) This is something Rust should probably sort out before this is ready for prime time: either be like MSVC and actually inline the instruction into a function using the intrinsic, or refuse to compile like GCC/clang.
Footnote 1: See How to emulate _mm256_loadu_epi32 with gcc or clang? if you didn't mean to use AVX512.
With -O -C target-cpu=skylake (just AVX2), it inlines everything else, including vpaddd ymm, but still calls out to a function that copies 32 bytes from memory to memory with AVX vmovaps. It requires AVX512VL to inline the intrinsic, but later in the optimization process it realizes that with no masking, it's just a 256-bit load it should do without a bloated AVX-512 instruction. It's kind of dumb that Intel even provided a no-masking version of _mm256_mask[z]_loadu_epi32 that requires AVX-512, or dumb that gcc/clang/rustc consider it an AVX-512 intrinsic.
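For reference, the inlining-failure behavior GCC and clang exhibit is easy to reproduce in C; this is an illustrative snippet, not code from the question:

```c
#include <immintrin.h>

/* Built without AVX2 enabled (e.g. plain "gcc -O2 -c sum.c"), GCC reports
   "inlining failed in call to 'always_inline' '_mm256_add_epi32':
   target specific option mismatch", and clang reports a similar error,
   instead of emitting slow, non-inlined code. Compiling with -mavx2 or
   -march=haswell (or adding __attribute__((target("avx2"))) to the
   function) makes it build and inline normally. */
__m256i add8(__m256i a, __m256i b) {
    return _mm256_add_epi32(a, b);
}
```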
QUESTION
I have implemented a Convolutional Neural Network in C and have been studying what parts of it have the longest latency.
Based on my research, the massive amount of matrix multiplication required by CNNs makes running them on CPUs and even GPUs very inefficient. However, when I actually profiled my code (on an unoptimized build), I found out that something other than the multiplication itself was the bottleneck of the implementation.
After turning on optimization (-O3 -march=native -ffast-math, gcc cross compiler), the Gprof result was the following:
Clearly, the convolution2D function takes the largest amount of time to run, followed by the batch normalization and depthwise convolution functions.
The convolution function in question looks like this:
...ANSWER
Answered 2022-Mar-10 at 13:57
Looking at the result of Cachegrind, it doesn't look like memory is your bottleneck. The NN has to be stored in memory anyway, but if it were so large that your program had a lot of L1 cache misses, it would be worth trying to minimize them; a 1.7% L1 (data) miss rate is not a problem.
So you're trying to make this run fast anyway. Looking at your code, what happens in the innermost loop is very simple (load -> multiply -> add -> store), and it has no side effect other than the final store. This kind of code is easily parallelizable, for example by multithreading or vectorizing. Judging by the complexity of the code you already wrote, you'll know how to make it run in multiple threads, and you asked in the comments how to manually vectorize it.
I will explain that part, but one thing to bear in mind is that once you choose to manually vectorize the code, it will often be tied to certain CPU architectures. Let's not consider non-AMD64-compatible CPUs like ARM. You still have MMX, SSE, AVX, and AVX-512 to choose from as an extension for vectorized computation, and each extension has multiple versions. If you want maximum portability, SSE2 is a reasonable choice: it appeared with the Pentium 4, and it supports 128-bit vectors. For this post I'll use AVX2, which supports 128-bit and 256-bit vectors. It runs fine on your CPU and has reasonable portability these days, being supported from Haswell (2013) and Excavator (2015) onward.
The pattern you're using in the inner loop is called FMA (fused multiply-add). The FMA extension, available alongside AVX2 on mainstream CPUs, provides a single instruction for it. Have a look at this function and the compiled output.
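As a rough sketch of what a manually vectorized inner loop can look like with AVX2 and FMA intrinsics (the function name, arguments, and loop bounds are illustrative, not taken from the question's convolution code):

```c
#include <immintrin.h>

/* out[i] += in[i] * k, processing 8 floats per iteration.
   Assumes n is a multiple of 8 and the pointers do not alias; a real kernel
   also needs a scalar tail loop. Build with -mavx2 -mfma (or -march=haswell). */
static void fma_loop(float *out, const float *in, float k, int n) {
    __m256 vk = _mm256_set1_ps(k);                 /* broadcast the scalar factor */
    for (int i = 0; i < n; i += 8) {
        __m256 vin  = _mm256_loadu_ps(&in[i]);     /* load 8 inputs               */
        __m256 vout = _mm256_loadu_ps(&out[i]);    /* load 8 accumulators         */
        vout = _mm256_fmadd_ps(vin, vk, vout);     /* vout = vin * vk + vout      */
        _mm256_storeu_ps(&out[i], vout);           /* store the 8 results back    */
    }
}
```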
QUESTION
I am in the process of creating a fiber threading system in C, following https://graphitemaster.github.io/fibers/ . I have a function to set and restore context, and what I am trying to accomplish is launching a function as a fiber with its own stack. Linux, x86_64 SysV ABI.
...ANSWER
Answered 2022-Feb-25 at 05:34
Agree with comments: your stack alignment is incorrect.
It is true that the stack must be aligned to 16 bytes. However, the question is when? The normal rule is that the stack pointer must be a multiple of 16 at the site of a call instruction that calls an ABI-compliant function.
Well, you don't use a call instruction, but what that really means is that on entry to an ABI-compliant function, the stack pointer must be 8 less than a multiple of 16, or in other words an odd multiple of 8, since it assumes it was called with a call instruction that pushed an 8-byte return address. That is just the opposite of what your code does, and so the stack is misaligned for the rest of your program, which makes printf crash when it tries to use aligned move instructions.
You could subtract 8 from the sp computed in your C code.
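A minimal sketch of that adjustment (illustrative; the names are not from the question's code): align the top of the new stack down to 16 bytes, then subtract 8 so it matches the state the ABI expects on function entry, as if a call had just pushed a return address.

```c
#include <stdint.h>
#include <stddef.h>

void *fiber_entry_sp(void *stack_base, size_t stack_size) {
    uintptr_t sp = (uintptr_t)stack_base + stack_size;  /* top of the new stack               */
    sp &= ~(uintptr_t)15;   /* 16-byte aligned, as at the call site                           */
    sp -= 8;                /* minus the "return address" slot: an odd multiple of 8 on entry */
    return (void *)sp;
}
```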
Or, I'm not really sure why you go to the trouble of loading the destination address into a register, then pushing and ret, when an indirect jump or call would do. (Unless you are deliberately trying to fool the indirect branch predictor?) An indirect call will also kill the stack-alignment bird, by pushing the return address (even though it will never be used). So you could leave the rest of your code alone, and replace all the r8/ret stuff in restore_context with just
QUESTION
I'm reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e) and the following code is an example from the book:
...ANSWER
Answered 2022-Feb-03 at 04:10
(This answer is a summary of comments posted above by Antti Haapala, klutt and Peter Cordes.)
GCC allocates more space than "necessary" in order to ensure that the stack is properly aligned for the call to proc: the stack pointer must be adjusted by a multiple of 16, plus 8 (i.e. by an odd multiple of 8). See also: Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?
What's strange is that the code in the book doesn't do that; the code as shown would violate the ABI and, if proc actually relies on proper stack alignment (e.g. using aligned SSE2 instructions), it may crash.
So it appears that either the code in the book was incorrectly copied from compiler output, or else the authors of the book are using some unusual compiler flags which alter the ABI.
Modern GCC 11.2 emits nearly identical asm (Godbolt) using -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args, the former of which changes the ABI to maintain only 2^3-byte stack alignment, down from the default 2^4. (Code compiled this way can't safely call anything compiled normally, not even standard library functions.) -maccumulate-outgoing-args used to be the default in older GCC, but modern CPUs have a "stack engine" that makes push/pop single-uop, so that option isn't the default anymore; using push for stack args also saves a bit of code size.
One difference from the book's asm is a movl $0, %eax before the call, because there's no prototype, so the caller has to assume it might be variadic and pass AL = the number of FP args in XMM registers. (A prototype that matches the args passed would prevent that.) The other instructions are all the same, and in the same order as whatever older GCC version the book used, except for the choice of registers after call proc returns: it ends up using movslq %edx, %rdx instead of cltq (sign-extension of EAX into RAX).
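A tiny illustration of that point (hypothetical code, not the book's): with no prototype in scope, GCC must assume the callee could be variadic, so it sets AL to the number of XMM registers used for arguments before the call.

```c
extern void proc();        /* empty parentheses in C: no prototype, argument types unknown */

void caller(void) {
    proc(1, 2L);           /* no FP args, so GCC emits "movl $0, %eax" (or
                              "xorl %eax, %eax") before "call proc"; declaring a
                              full prototype for proc removes that instruction.  */
}
```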
CS:APP 3e global edition is notorious for errors in practice problems introduced by the publisher (not the authors), but apparently this code is present in the North American edition, too. So this may be the author's mistake / choice to use actual compiler output with weird options. Unlike some of the bad global edition practice problems, this code could have come unmodified from some GCC version, but only with non-standard options.
Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? - GCC has a missed-optimization bug where it sometimes reserves an additional 16 bytes that it truly didn't need to. That's not what's happening here, though.
QUESTION
I'm more of a C++ person than a C person, but this simple example of code is a big surprise to me:
...ANSWER
Answered 2022-Jan-17 at 12:33
This function is 100% correct.
In the C language, int foo() { means: define a function foo returning int and taking an unspecified number of parameters.
Your confusion comes from C++, where int foo(void) and int foo() mean exactly the same thing.
In the C language, to define a function which does not take any parameters, you need to define it as:
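A minimal illustration (the answer's original snippet is not reproduced here):

```c
int foo(void)          /* prototype: foo takes no parameters */
{
    return 42;
}

/* By contrast, "int foo();" declares foo with an unspecified parameter list,
   so a call such as foo(1, 2, 3) is not rejected by the compiler. */
```

(Since C23, an empty parameter list also means "no parameters", matching C++, but the answer above describes the long-standing C89/C99/C11 behavior.)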
QUESTION
I am somewhat new to Python. I have looked around but cannot find an answer that fits exactly what I am looking for.
I have a function that makes an HTTP call using the requests package. I'd like to print a '.' to the screen (or any char), say every 10 seconds, while the HTTP request executes, and stop printing when it finishes. So something like:
...ANSWER
Answered 2021-Dec-22 at 23:04
I don't really get what you're looking for, but if you want two things processed at the same time, you can use the threading module.
Example:
QUESTION
I have observed that GCC's C++ compiler generates the following assembler code:
...ANSWER
Answered 2021-Dec-22 at 18:10
Try assembling both and you'll see why.
QUESTION
I'm using godbolt to get assembly of the following program:
...ANSWER
Answered 2021-Dec-13 at 06:33
You can see the cost of instructions on most mainstream architectures here and there. Based on that, and assuming you use, for example, an Intel Skylake processor, you can see that one 32-bit imul instruction can be computed per cycle, but with a latency of 3 cycles. In the optimized code, 2 lea instructions (which are very cheap) can be executed per cycle, with a 1-cycle latency. The same applies to the sal instruction (2 per cycle, 1 cycle of latency).
This means that the optimized version can be executed with only 2 cycles of latency, while the first one takes 3 cycles of latency (not counting load/store instructions, which are the same). Moreover, the second version can be better pipelined, since the two instructions can be executed for two different input data in parallel thanks to superscalar out-of-order execution. Note that two loads can be executed in parallel too, although only one store can be executed per cycle. This means that the execution is bounded by the throughput of store instructions: overall, only one value can be computed per cycle. AFAIK, recent Intel Ice Lake processors can do two stores in parallel, like new AMD Ryzen processors. The second version is expected to be as fast as or possibly faster than the first on the chosen use case (Intel Skylake processors); it should be significantly faster on very recent x86-64 processors.
Note that the lea instruction is very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters), and it only supports specific constants for the multiplication (supported scale factors are 1, 2, 4 and 8, which means that lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.
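For example (illustrative functions, not from the question), compilers routinely use lea for small constant multiplies:

```c
int times9(int x)  { return x * 9;  }  /* typically "leal (%rdi,%rdi,8), %eax"      */
int times10(int x) { return x * 10; }  /* typically two cheap instructions, e.g.
                                          lea then add, instead of one 3-cycle imul */
```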
I can reproduce the slower execution with -O2 using GCC 11.2 (on Linux with an i5-9600KF processor).
The main source of the slowdown is the higher number of micro-operations (uops) to be executed in the -O2 version, most likely combined with the saturation of some execution ports due to poor micro-operation scheduling.
Here is the assembly of the loop with -Os:
QUESTION
I am trying to compare the methods mentioned by Peter Cordes in his answer to the question 'set all bits in CPU register to 1'.
Therefore, I wrote a benchmark to set all 13 registers to all-ones, except e/rsp, e/rbp, and e/rcx.
The code is like below. times 32 nop is used to avoid DSB and LSD influence.
ANSWER
Answered 2021-Nov-27 at 20:04The bottleneck in all of your examples is the predecoder.
I analyzed your examples with my simulator uiCA (https://uica.uops.info/, https://github.com/andreas-abel/uiCA). It predicts the following throughputs, which closely match your measurements:
      TP     Link
g1a   13.00  https://uica.uops.info/?code=...
g1b   14.00  https://uica.uops.info/?code=...
g2a   16.00  https://uica.uops.info/?code=...
g2b   17.00  https://uica.uops.info/?code=...
g3a   17.00  https://uica.uops.info/?code=...
g3b   18.00  https://uica.uops.info/?code=...
g4a   12.00  https://uica.uops.info/?code=...
g4b   12.00  https://uica.uops.info/?code=...
The trace table that uiCA generates provides some insights into how the code is executed. For g1a, for example, it generates the following trace:
You can see that for the 32 nops, the predecoder requires 8 cycles, and for the remaining instructions, it requires 5 cycles, which together corresponds to the 13 cycles that you measured.
You may notice that in some cycles, only a small number of instructions is predecoded; for example, in the fourth cycle, only one instruction is predecoded. This is because the predecoder works on aligned 16-byte blocks, and it can handle at most five instructions per cycle (note that some sources incorrectly claim that it can handle 6 instructions per cycle). You can find more details on the predecoder, for example how it handles instructions that cross a 16-byte boundary, in this paper.
If you compare this trace with the trace for g1b, you can see that the instructions after the nops now require 6 instead of 5 cycles to be predecoded, which is because several of the instructions in g1b are longer than the corresponding ones in g1a.
QUESTION
I have the following code:
...ANSWER
Answered 2021-Nov-17 at 06:27
"Lock" here is in the sense of "mutex", not specifically in reference to the x86 instruction prefix named lock.
A trivial and generic way to implement std::atomic for an arbitrary type T would be as a class containing a T member together with a std::mutex, which is locked and unlocked around every operation on the object (load, store, exchange, fetch_add, etc.). Those operations can then be done in any old way, and need not use atomic machine instructions, because the lock protects them. This implementation would not be lock free.
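The answer describes this in terms of C++'s std::atomic; the same idea can be sketched in C with a mutex (an illustration of the concept, not a real implementation), which makes it clear why such an implementation is not lock free: every operation takes a lock around a perfectly ordinary read or write.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    long value;                 /* the "T" being protected */
} locked_long;

static locked_long counter = { PTHREAD_MUTEX_INITIALIZER, 0 };

static long locked_load(locked_long *a) {
    pthread_mutex_lock(&a->lock);
    long v = a->value;          /* plain load, protected by the lock */
    pthread_mutex_unlock(&a->lock);
    return v;
}

static long locked_fetch_add(locked_long *a, long delta) {
    pthread_mutex_lock(&a->lock);
    long old = a->value;
    a->value = old + delta;     /* plain read-modify-write under the lock */
    pthread_mutex_unlock(&a->lock);
    return old;
}
```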
A downside of such an implementation, besides being slow in general, is that if two threads try to operate on the object at the same time, one of them will have to wait for the lock, which may actually block and cause it to be scheduled out for a while. Or, if a thread gets scheduled out while holding the lock, every other thread that wants to operate on the object will have to wait for the first thread to get scheduled back in and complete its work first.
So it is desirable if the machine supports truly atomic operations on T: a single instruction or sequence that other threads cannot interfere with, and which will not block other threads if interrupted (or perhaps cannot be interrupted at all). If for some type T the library has been able to specialize std::atomic with such an implementation, then that is what we mean by saying it is lock free. (It is just confusing on x86 because the atomic instructions used for such implementations are named lock. On other architectures they might be called something else, e.g. ARM64's ldxr/stxr exclusive load/store instructions.)
The C++ standard allows for types to be "sometimes lock free": maybe it is not known at compile time whether std::atomic will be lock-free, because it depends on special machine features that will be detected at runtime. It's even possible that some objects of type std::atomic are lock-free and others are not. That's why atomic_is_lock_free is a function and not a constant: it checks whether this particular object is lock-free on this particular day.
However, it might be the case for some implementations that certain types can be guaranteed, at compile time, to always be lock free. That's what is_always_lock_free is used to indicate; note that it's a constexpr bool instead of a function.
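C11 exposes the same distinction (shown here in C for illustration): the ATOMIC_*_LOCK_FREE macros are compile-time properties (0 = never, 1 = sometimes, 2 = always lock free), while atomic_is_lock_free answers for one particular object at run time.

```c
#include <stdatomic.h>
#include <stdio.h>

int main(void) {
    _Atomic long x = 0;
    printf("long is always lock free?  %d\n", ATOMIC_LONG_LOCK_FREE == 2);
    printf("this object lock free now? %d\n", (int)atomic_is_lock_free(&x));
    return 0;
}
```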
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install rsp
Support