sse | HTML5 Server-Sent-Events for Go | Websocket library
kandi X-RAY | sse Summary
HTML5 Server-Sent Events for Go.
Top functions reviewed by kandi - BETA
- SendBytes sends an event to the channel
- format returns the event string.
- New returns a new Streamer
sse Key Features
sse Examples and Code Snippets
@GetMapping("/stream-sse-mvc")
public SseEmitter streamSseMvc() {
    SseEmitter emitter = new SseEmitter();
    ExecutorService sseMvcExecutor = Executors.newSingleThreadExecutor();
    sseMvcExecutor.execute(() -> {
        // send events to the emitter here, then call emitter.complete()
    });
    return emitter;
}
public static void main(String... args) throws Exception {
    Client client = ClientBuilder.newClient();
    WebTarget target = client.target(url);
    try (SseEventSource eventSource = SseEventSource.target(target).build()) {
        // register an event consumer and open the SSE connection here
    }
}
public static String simpleSSEHeader() throws InterruptedException {
Client client = ClientBuilder.newBuilder()
.register(AddHeaderOnRequestFilter.class)
.build();
WebTarget webTarget = client.target(T
Community Discussions
Trending Discussions on sse
QUESTION
I'm trying to make sure gcc vectorizes my loops. It turns out that by using -march=znver1 (or -march=native) gcc skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar, is not vectorised:
...ANSWER
Answered 2022-Apr-10 at 02:47
The default -mtune=generic has -mprefer-vector-width=256, and -mavx2 doesn't change that.
znver1 implies -mprefer-vector-width=128, because that's the full native width of the HW. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be ok; the pipeline handles 2-uop instructions efficiently. (And I think the front-end is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions.) But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider.
And vmovdqa ymm0, ymm1 mov-elimination only works on the low 128-bit half on Zen1. Also, normally using 256-bit vectors would imply one should use vzeroupper afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider increasing the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code if the size isn't known to be a multiple of the vector width.
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there. (Pure vertical, no mixing of element widths, constant size that's a multiple of 32 bytes.) At least on CPUs where that's fine: znver1, but not bdver2 for example, where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a vmovdqu [rdx], xmm0. https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can only hold two uint64_t elements, it (rightly or wrongly) decides it wouldn't be worth using vpsllq / vpaddd to implement qword *5 as (v<<2) + v, vs. doing it with integer in one LEA instruction.
Almost certainly wrongly in this case, since it still requires a separate load and store for every element or pair of elements. (And loop overhead, since GCC's default is not to unroll except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes that it's actually quite cheap for a constant like 5 with only 2 set bits?
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these kinds of factors are what happen in a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just do sequences of optimization passes that don't always consider the big picture and all the details at the same time.
-mprefer-vector-width=256 doesn't help:
Not vectorizing uint64_t *= 5 seems to be a GCC9 regression.
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64 in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
Even with -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC9, 10, or 11, or current trunk. As you say, we do with -march=znver2. https://godbolt.org/z/dMTh7Wxcq
We do get vectorization with those options for uint32_t (even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128 or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost-model decide not to vectorize, or just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t, changing to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does. (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; and arr[i] += 123 in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5, just different operands, constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost-model isn't looking as far as final x86 asm machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1;, which is exactly the same thing.
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6
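For reference, here is a minimal, self-contained reconstruction (mine, not the asker's exact code) of the kind of loop this answer is about; pasting it into Godbolt with -O3 -march=znver1 vs. -march=znver2 shows the difference discussed above:

#include <stdint.h>
#include <stddef.h>

/* Second loop of the question, reconstructed: multiply every uint64_t by 5.
   GCC 9-11 with -O3 -march=znver1 keep this scalar (one LEA per element),
   while -march=znver2 (or GCC 8) vectorize it, as described above. */
void mul5(uint64_t *arr, size_t n)
{
    for (size_t i = 0; i < n; i++)
        arr[i] *= 5;
}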
QUESTION
My understanding is that AMD64 was invented by AMD as a 64-bit version of x86.
New instructions are added by both AMD and Intel (because those are the only two companies that implement AMD64). In effect, there is no central standard like there is in C++.
When new instructions are added, they are usually part of a "set" like SSE or AVX.
In my research, the designation for some instructions is inconsistent, i.e. it's not always clear which set an instruction belongs to.
What defines the instruction sets? Is there universal agreement on which instructions are in which sets, or is it decided by convention?
...ANSWER
Answered 2022-Mar-10 at 03:14
There is no such thing, and I cannot imagine how there would be.
One or more people at Intel define their instruction sets for their products, period. AMD happens to have been able to make legal clones (which they have), and as part of that agreement, or perhaps even outside it but with some penalty, they add additional instructions/features. First off, it is on them to do it and keep some sense of compatibility, if they even want to be compatible. Second, if they want to add extensions and can get away with it, it is purely one or more engineers within AMD. Then if Intel goes and makes some new instructions, it is one or more Intel engineers. As history played out, you then have other completely disconnected parties like the gnu tools folks, the microsoft tools folks and a list of others, as well as operating system folks that use tools and make their products, choosing directly or indirectly what instructions get used. And as history plays out, some of these Intel-only or AMD-only instructions may be favored by one party or another. And if that party happens to have influence (Microsoft Windows, Linux, etc.), to the point that it puts pressure on Intel or AMD to lean one way or another, then it is their management and engineering that does that, within their company. They can choose to not go with what the users want and try to push users in their direction. Simple sales of one product line or another may dictate the success or failure of each party's decisions.
I cannot think of a standard (or many) that folks actually agree on, even though they might have representatives that wear shirts with the same logo on them who participate in the standards bodies. From PCIe to Java to C++, etc. (C and C++ being really bad, since they were written first and only standardized later, with the standards being patches that leave too much to individual compiler authors' choices of interpretation). You want to win at business, you differentiate yourself from the others: I have an x86 clone that is much cheaper but performs 95% as well as Intel's. Plus I added my own stuff that Intel does not have, which I pay employees to add to open source tools, making those open source things optional to gain that feature/performance boost. That differentiates me from the competition, and for some users locks me in as their only choice.
Instruction sets for an architecture (x86 has a long line of architectures over time, arm does too and they are more organized about it imo, etc) are defined by that individual or teams within that company. End of story. At best they may have to avoid patents (yep there have been patents you have to avoid, making it hard to make a new instruction set). If two competing and compatible architectures like intel and amd (or intel team a vs intel team b, amd team a vs ...) happen to adopt each others features/instructions it is more market driven not some standards body.
Basically go look at itanium vs amd64 and how that played out.
The x86 history is a bit of a nightmare and I still cannot fathom why it still even exists (has nothing to do with the quality of the instruction set but instead how the business works), and as such attempting to put labels on things and organize them into individual boxes, really does not add any value and creates some chaos. Generation r of intel has this, generation m of amd has that, my tool supports gen r of this and gen m of that. Next year I will personally choose if I want to support the next gen of each or not. Repeat forever until the products die. You also have to choose if you want to support an older generation as those may have the same instructions but with different features/side effects despite in theory being compatible.
QUESTION
I've been doing a good amount of research on AMD64 (x86-64) instructions, and it's been kind of confusing. A lot of the time, official CPU documentation doesn't designate an instruction as part of a specific set, and the internet is sometimes split on which instruction set a specific instruction belongs to. One example of this is SFENCE, with some sources claiming that it's part of EMMX and others claiming it's part of SSE.
I'm trying to organize all of them in a spreadsheet to help with learning, but these inconsistencies are incredibly frustrating in a field that is famously technical and precise.
...ANSWER
Answered 2022-Mar-03 at 18:00
EMMX is a subset of SSE, and sfence is part of both of them.
AMD did not immediately support all SSE instructions, but at first took a subset of it that did not require the new XMM registers (see near the bottom of the PDF), which became known as EMMX. That included, for example, pavgb mm0, mm1 (but not pavgb xmm0, xmm1), and also sfence.
All instructions that are in EMMX are also in SSE, so processors that support SSE can execute EMMX code regardless of whether they "explicitly" support EMMX (which has a dedicated CPUID feature flag). The Zen 1 aka Summit Ridge you linked supports EMMX implicitly: it does not have the corresponding feature flag set, but since it supports SSE, it also ends up supporting EMMX. Before Zen, AMD processors with SSE used to set the EMMX feature flag as well.
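Since EMMX support is implied by SSE, runtime detection normally just tests the SSE feature flag. A minimal sketch (my illustration, not from the answer) using GCC/Clang's __builtin_cpu_supports, which reads the CPUID feature bits:

#include <stdio.h>

int main(void)
{
    /* __builtin_cpu_supports queries the CPUID feature bits at runtime. */
    if (__builtin_cpu_supports("sse"))
        printf("SSE present, so the EMMX subset (including sfence) is usable\n");
    else
        printf("SSE not supported\n");
    return 0;
}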
QUESTION
I'm trying to build a Server-Sent Events endpoint with FastAPI but I'm unsure if what I'm trying to accomplish is possible or how I would go about doing it.
Introduction to the problem
Basically, let's say I have a run_task(limit, task) async function that sends an async request, makes a transaction, or something similar. Let's say that for each task, run_task can return some JSON data.
I'd like to run multiple tasks (multiple run_task(limit, task)) asynchronously; to do so I'm using trio and nurseries like so:
ANSWER
Answered 2022-Jan-16 at 23:27
I decided to ultimately go with websockets rather than SSE, as I realised I needed to pass an object as data to my endpoint, and while SSE can accept query params, dealing with objects as query parameters was too much of a hassle.
Websockets with FastAPI are based on Starlette and are pretty easy to use; applying them to the problem above can be done like so:
QUESTION
I'm trying to compile my Rust code on my M1 Mac for an x86_64 Linux target. I use Docker to achieve that.
My Dockerfile:
...ANSWER
Answered 2022-Jan-18 at 17:25
It looks like the executable is actually named x86_64-linux-gnu-gcc, see https://packages.debian.org/bullseye/arm64/gcc-x86-64-linux-gnu/filelist.
QUESTION
I wrote a small program to explore out-of-bounds reads vulnerabilities in C to better understand them; this program is intentionally buggy and has vulnerabilities:
...ANSWER
Answered 2021-Dec-31 at 23:21
Since stdout is line buffered, putchar doesn't write to the terminal directly; it puts the character into a buffer, which is flushed when a newline is encountered. And the buffer for stdout happens to be located on the heap following your heap_book allocation.
So at some point in your copy, you putchar all the characters of your secretinfo method. They are now in the output buffer. A little later, heap_book[i] is within the stdout buffer itself, so you encounter the copy of secretinfo that is there. When you putchar it, you effectively create another copy a little further along in the buffer, and the process repeats.
You can verify this in your debugger. The address of the stdout buffer, on glibc, can be found with p stdout->_IO_buf_base. In my test it's exactly 160 bytes past heap_book.
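A small standalone sketch (my own, not from the answer) that makes this heap layout visible: allocate before any output, trigger the first write so glibc allocates stdout's buffer, then allocate again. The exact gap depends on the allocator and glibc version.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *heap_book = malloc(64);   /* allocated before stdout's buffer exists */

    putchar('x');                   /* first output: glibc mallocs stdout's buffer now */

    char *after = malloc(64);       /* allocated after the stdout buffer */

    /* The gap between the two allocations is roughly the stdout buffer size
       plus allocator overhead, showing the output buffer sits between them. */
    fprintf(stderr, "heap_book = %p\n", (void *)heap_book);
    fprintf(stderr, "after     = %p\n", (void *)after);
    fprintf(stderr, "gap       = %td bytes\n", after - heap_book);

    free(after);
    free(heap_book);
    return 0;
}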
QUESTION
This is what I know about SIMD. Single-instruction-multiple-data is a way of processing data that performs the same instruction over vectors of multiple values. SIMD is implemented at different levels depending on the processor of the machine (SSE, SSE2, NEON...), and every level provides a different instruction set.
We can use these instruction sets by including immintrin.h. What I haven't really understood is: when actually developing something with SIMD, should we care about checking which instruction sets are supported? What are the best practices when developing such programs? What should we do if, for example, an instruction set is not supported; should we provide a non-SIMD alternative, or will the compiler un-vectorise the whole thing for us?
ANSWER
Answered 2021-Dec-19 at 11:10
Of course we need to take care which ISA is supported, because if we use an unknown instruction then the program will be killed with an illegal-instruction signal. Besides, it allows us to optimize for each architecture: for example, on CPUs with AVX-512 we can use AVX-512 for better performance, but on an older CPU we can fall back to the appropriate version for that architecture.
What are the best practices when developing such programs?
There are no general best practices. It depends on each situation, because each compiler has different tools for this:
- If your compiler doesn't support dynamic dispatching, then you need to write separate code for each ISA and call the corresponding version for the current platform.
- Some compilers automatically dispatch to the version optimized for the running platform; for example, ICC can compile a hot loop to separate versions for SSE/AVX/AVX-512 and jump to the correct version for maximum performance.
- Some other compilers support compiling separate versions of a single function and automatically dispatching, but you need to specify which function you want to optimize. For example, in GCC, Clang and ICC you can use the attributes target and target_clones (a minimal sketch follows below). See Building backward compatible binaries with newer CPU instructions support.
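As an illustration of the target_clones approach mentioned in the last bullet (the function name and target list are mine, purely an example):

#include <stddef.h>

/* GCC (and recent Clang) emit one clone per listed target plus a resolver
   that picks the best one at load time based on the CPU's feature flags.
   "default" must always be included. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
void scale(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;   /* plain loop; each clone is auto-vectorized differently */
}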
QUESTION
- What is the purpose or intention of a MoveMask?
- What's the best place to learn how to use x86/x86-64 assembly/SSE/AVX?
- Could I have written my code more efficiently?
I have a function written in F# for .NET that uses SSE2. I've written the same thing using AVX2, but the underlying question is the same. What is the intended purpose of a MoveMask? I know that it works for my purposes; I want to know why.
I am iterating through two 64-bit float arrays, a and b, testing that all of their values match. I am using the CompareEqual method (which I believe is wrapping a call to __m128d _mm_cmpeq_pd) to compare several values at a time. I then compare that result with a Vector128 of 0.0 64-bit floats. My reasoning is that the result of CompareEqual will give a 0.0 value in the cases where the values don't match. Up to this point, it makes sense.
I then use the Sse2.MoveMask method on the result of the comparison with the zero vector. I've previously worked on using SSE and AVX for matching, and I saw examples of people using MoveMask for the purpose of testing for non-zero values. I believe this method is using the int _mm_movemask_epi8 Intel intrinsic. I have included the F# code and the assembly that is JITed.
Is this really the intention of a MoveMask, or is it just a happy coincidence that it works for these purposes? I know my code works; I want to know WHY it works.
ANSWER
Answered 2021-Nov-08 at 05:02
MoveMask just extracts the high bit of each element into an integer bitmap. You have 3 element-size options: movmskpd (64-bit), movmskps (32-bit), and pmovmskb (8-bit).
This works well with SIMD compares, which produce an output with all-zero bits in elements where the predicate is false and all-one bits in elements where the predicate is true. All-ones is a bit-pattern for -QNaN if interpreted as an IEEE-FP floating-point value, but normally you don't do that. Instead you movemask, or AND (or AND / ANDN / OR, or _mm_blend_pd), or things like that with a compare result.
movemask(v) != 0, movemask(v) == 0x3, or movemask(v) == 0 is how you check conditions like at least one element in a compare matched, or all matched, or none matched, respectively, where v is the result of _mm_cmpeq_pd or whatever. (Or just to extract signs directly without a compare.)
For other element sizes, 0xf or 0xffff to match all four or all 16 bits. Or for AVX 256-bit vectors, twice as many bits, up to filling a whole 32-bit integer with vpmovmskb eax, ymm0.
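In C intrinsics, the "all elements matched" check described above (compare, then test the movemask against the all-set value) looks something like this minimal sketch; the same idea carries over to the .NET Sse2.CompareEqual / Sse2.MoveMask wrappers:

#include <immintrin.h>
#include <stdbool.h>

/* True if both 64-bit lanes of a and b compare equal.
   _mm_cmpeq_pd sets a lane to all-ones where equal, all-zeros where not;
   _mm_movemask_pd packs the two lane sign bits into bits 0..1, so an
   all-matched result gives mask == 0x3. */
static bool both_equal(__m128d a, __m128d b)
{
    __m128d eq = _mm_cmpeq_pd(a, b);
    return _mm_movemask_pd(eq) == 0x3;
}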
What you're doing is really weird, using a 0.0 / NaN compare result as the input to another compare with vcmpeqpd xmm1, xmm1, xmm2 / vcmpeqpd xmm1, xmm1, xmm0. For the 2nd comparison, that can only be true for elements that are == 0.0 (i.e. +-0.0), because x == NaN is false for every x.
If the second vector is a constant zero (let zeroTest = Sse2.CompareEqual (comparison, zeroVector)), that's pointless: you're just inverting the compare result, which you could have done by checking a different integer condition or comparing against a different constant, not doing runtime comparisons. (0.0 == 0.0 is true, producing an all-ones output; 0.0 == -NaN is false, producing an all-zero output.)
To learn more about intrinsics and SIMD, see for example Agner Fog's optimization guide; his asm guide has a chapter on SIMD. Also, his VectorClass library for C++ has some useful wrappers, and for learning purposes seeing how those wrapper functions implement some basic things could be useful.
To learn what things actually do, see Intel's intrinsics guide. You can search by asm instruction or C++ intrinsic name.
I think MS has docs for their C# System.Runtime.Intrinsics.X86, and I assume F# uses the same intrinsics, but I don't use either language myself.
Related re: comparisons:
- Get the last line separator - pcmpeqb -> pmovmskb -> bsr to find the position of the last matching element in a vector of compare results. Bit-scan reverse on the compare mask. Often you want to scan forward to find the first match (or invert and find the first mismatch, like for memcmp), e.g. Compare 16 byte strings with SSE.
- Or popcount them if you're counting occurrences by matching against a loop-invariant vector of a broadcasted character: How can I count the occurrence of a byte in array using SIMD? - instead of movemask, use the compare result as integer 0 / -1 and SIMD-subtract it from a vector accumulator in the inner loop, then horizontal-sum the integer elements in an outer loop.
- SIMD instructions for floating point equality comparison (with NaN == NaN) - a useful exercise in understanding how NaNs work.
QUESTION
I've started working with Puppeteer and for some reason I cannot get it to work on my box. This error seems to be a common problem (SO1, SO2), but none of the solutions resolve this error for me. I have tested it with a clean node package (see reproduction) and I have taken the example from the official Puppeteer 'Getting started' webpage.
How can I resolve this error?
Versions and hardware ...ANSWER
Answered 2021-Nov-24 at 18:42
There's too much for me to put this in a comment, so I will summarize here. Maybe it will help you, or someone else. I should also mention this is for RHEL EC2 instances behind a corporate proxy (not Arch Linux), but I still feel like it may help. I had to do the following to get Puppeteer working. This is straight from my docs, but I had to hand-jam the contents because my docs are on an intranet.
I had to install all of these libraries manually. I also don't know what the Arch Linux equivalents are. Some are duplicates from your question, but I don't think they all are:
pango
libXcomposite
libXcursor
libXdamage
libXext
libXi
libXtst
cups-libs
libXScrnSaver
libXrandr
GConf2
alsa-lib
atk
gtk3
ipa-gothic-fonts
xorg-x11-fonts-100dpi
xorg-x11-fonts-75dpi
xorg-x11-utils
xorg-x11-fonts-cyrillic
xorg-x11-fonts-Type1
xorg-x11-fonts-misc
liberation-mono-fonts
liberation-narrow-fonts
liberation-sans-fonts
liberation-serif-fonts
glib2
If Arch Linux uses SELinux, you may also have to run this:
setsebool -P unconfined_chrome_sandbox_transition 0
It is also worth adding dumpio: true to your options to debug; it should give you more detailed output from Puppeteer, instead of the generic error. As I mentioned in my comment, I have the option ignoreDefaultArgs: ['--disable-extensions']. I can't tell you why because I don't remember; I think it is related to this issue, but it could also be related to my corporate proxy.
QUESTION
As far as I understand, some objects in the "data" section sometimes need alignment in x86 assembly.
An example I've come across is when using movaps in x86 SSE: I need to load a special constant for later xors into an XMM register. The XMM register is 128 bits wide, and I need to load a 128-bit memory operand into it, which also has to be aligned to 128 bits.
With trial and error, I've deduced that the code I'm looking for is:
...ANSWER
Answered 2021-Nov-18 at 18:01
In which assembly flavors do I use .align instead of align?
Most notably the GNU assembler (GAS) uses .align, but every assembler can have its own syntax. You should check the manual of whatever assembler you are actually using.
Do I need to write this keyword/instruction before every data object or is there a way to write it just once?
You don't need to write it before each object if you can keep track of the alignment as you go. For instance, in your example, you wrote align 16 and then assembled 4 dwords of data, which is 16 bytes. So following that data, the current address is again aligned to 16 and another align 16 would be unnecessary (though of course harmless). You could write something like
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install sse
Support