corde | A simple e2e library for testing Discord Bots | Bot library
kandi X-RAY | corde Summary
Corde is a small testing library for Discord.js. As there is a tool to create bots for Discord, it's useful to also have a tool to test them. Corde's objective is to be simple, fast, and readable for developers.
corde Key Features
corde Examples and Code Snippets
Community Discussions
Trending Discussions on corde
QUESTION
I'm creating an int (32 bit) vector with 1024 * 1024 * 1024 elements like so:
...ANSWER
Answered 2021-Jun-14 at 17:01
Here are some techniques.

Loop Unrolling
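The answer's own code is not reproduced in this scrape. As an illustration only, here is a minimal C sketch of loop unrolling with several independent accumulators when summing a large int array; the accumulator count and the function name are assumptions, not taken from the original answer.

```c
#include <stdint.h>
#include <stddef.h>

/* Sum a large int32_t array using 4 independent accumulators.
   Independent sums break the latency chain of a single accumulator
   and let more loads be in flight at once. */
int64_t sum_unrolled(const int32_t *a, size_t n)
{
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* handle the leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```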
QUESTION
I have read that when accessing with a stride
...ANSWER
Answered 2021-May-04 at 14:03
Re: the ultimate question: int_fast16_t is garbage for arrays because glibc on x86-64 unfortunately defines it as a 64-bit type (not 32-bit), so it wastes huge amounts of cache footprint. The question is "fast for what purpose", and glibc answered "fast for use as array indices / loop counters", apparently, even though it's slower to divide, or to multiply on some older CPUs (which were current when the choice was made). IMO this was a bad design decision.
- Cpp uint32_fast_t resolves to uint64_t but is slower for nearly all operations than a uint32_t (x86_64). Why does it resolve to uint64_t?
- How should the [u]int_fastN_t types be defined for x86_64, with or without the x32 ABI?
- Why are the fast integer types faster than the other integer types?
Generally using small integer types for arrays is good; usually cache misses are a problem, so reducing your footprint is nice even if it means using a movzx or movsx load instead of a memory source operand to use it with an int or unsigned 32-bit local. If SIMD is ever possible, having more elements per fixed-width vector means you get more work done per instruction.
But unfortunately int_fast16_t isn't going to help you achieve that with some libraries; short will, or int_least16_t. (A small sketch of the size difference follows.)
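To make the footprint trade-off concrete, a minimal check (assuming a glibc/x86-64 target, where the answer says int_fast16_t is a 64-bit type) could look like this:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* On glibc/x86-64, int_fast16_t is typically 8 bytes, while
       int_least16_t and short are 2 bytes, so an array of the "fast"
       type uses 4x the cache footprint for the same element count. */
    printf("sizeof(int_fast16_t)  = %zu\n", sizeof(int_fast16_t));
    printf("sizeof(int_least16_t) = %zu\n", sizeof(int_least16_t));
    printf("sizeof(short)         = %zu\n", sizeof(short));
    return 0;
}
```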
See my comments under the question for answers to the early part: 200 stall cycles is latency, not throughput. HW prefetch and memory-level parallelism hide that. Modern Microprocessors - A 90 Minute Guide! is excellent, and has a section on memory. See also What Every Programmer Should Know About Memory? which is still highly relevant in 2021. (Except for some stuff about prefetch threads.)
Your Update 2 with a faster PRNG
Re: why L2 isn't slower than L1: out-of-order exec is sufficient to hide L2 latency, and even your LCG is too slow to stress L2 throughput. It's hard to generate random numbers fast enough to give the available memory-level parallelism much trouble.
Your Skylake-derived CPU has an out-of-order scheduler (RS) of 97 uops, and a ROB size of 224 uops (like https://realworldtech.com/haswell-cpu/3 but larger), and 12 LFBs to track cache lines it's waiting for. As long as the CPU can keep track of enough in-flight loads (latency * bandwidth), having to go to L2 is not a big deal. Ability to hide cache misses is one way to measure out-of-order window size in the first place: https://blog.stuffedcow.net/2013/05/measuring-rob-capacity
Latency for an L2 hit is 12 cycles (https://www.7-cpu.com/cpu/Skylake.html). Skylake can do 2 loads per clock from L1d cache, but not from L2. (It can't sustain 1 cache line per clock IIRC, but 1 per 2 clocks or even somewhat better is doable).
Your LCG RNG bottlenecks your loop on its latency: 5 cycles for power-of-2 array sizes, or more like 13 cycles for non-power-of-2 sizes like your "L3" test attempts (see note 1). So that's about 1/10th the access rate that L1d can handle, and even if every access misses L1d but hits in L2, you're not even keeping more than one load in flight from L2. OoO exec + load buffers aren't even going to break a sweat. So L1d and L2 will be the same speed because they both use power-of-2 array sizes.
note 1: imul(3c) + add(1c) for x = a * x + c, then remainder = x - (x/m * m) using a multiplicative inverse, probably mul (4 cycles for high half of size_t?) + shr(1c) + imul(3c) + sub(1c). Or with a power-of-2 size, modulo is just AND with a constant like (1UL << n) - 1.
Clearly my estimates aren't quite right because your non-power-of-2 arrays are less than twice the times of L1d / L2, not 13/5 which my estimate would predict even if L3 latency/bandwidth wasn't a factor.
Running multiple independent LCGs (with different seeds) in an unrolled loop could make a difference. But a non-power-of-2 m for an LCG still means quite a few instructions, so you would bottleneck on CPU front-end throughput (and back-end execution ports, specifically the multiplier).
An LCG with multiplier (a) = ArraySize/10 is probably just barely a large enough stride for the hardware prefetcher to not benefit much from locking on to it. But normally IIRC you want a large odd number or something (been a while since I looked at the math of LCG param choices), otherwise you risk only touching a limited number of array elements, not eventually covering them all. (You could test that by storing a 1 to every array element in a random loop, then counting how many array elements got touched, i.e. by summing the array, if the other elements are 0; see the sketch after this paragraph.)
a and c should definitely not both be factors of m, otherwise you're accessing the same 10 cache lines every time to the exclusion of everything else.
As I said earlier, it doesn't take much randomness to defeat HW prefetch. An LCG with c=0, a= an odd number (maybe prime), and m=UINT_MAX might be good: literally just an imul. You can modulo to your array size on each LCG result separately, taking that operation off the critical path. At this point you might as well keep the standard library out of it and literally just use unsigned rng = 1; to start, and rng *= 1234567; as your update step. Then use arr[rng % arraysize].
That's cheaper than anything you could do with xorshift+ or xorshift*. (A sketch of this access loop follows.)
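As an illustration only, the bare-bones random-access loop the answer describes might look like this in C; the iteration count, the summing trick to keep the loads live, and the function name are assumptions:

```c
#include <stdint.h>
#include <stddef.h>

/* Random accesses driven by a bare multiplicative LCG: one imul per step,
   with the modulo by the array size kept off the LCG's critical path. */
uint64_t random_access_sum(const int32_t *arr, size_t arraysize, size_t iters)
{
    unsigned rng = 1;                 /* LCG state, as in the answer */
    uint64_t sum = 0;
    for (size_t i = 0; i < iters; i++) {
        rng *= 1234567;               /* update step from the answer */
        sum += arr[rng % arraysize];  /* random-ish load; summing keeps it live */
    }
    return sum;
}
```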
Benchmarking cache latency:
You could generate an array of random uint16_t or uint32_t indices once (e.g. in a static initializer or constructor) and loop over that repeatedly, accessing another array at those positions. That would interleave sequential and random access, and make code that could probably do 2 loads per clock with L1d hits, especially if you use gcc -O3 -funroll-loops. (With -march=native it might auto-vectorize with AVX2 gather instructions, but only for 32-bit or wider elements, so use -fno-tree-vectorize if you want to rule out that confounding factor that only comes from taking indices from an array.)
To test cache / memory latency, the usual technique is to make linked lists with a random distribution around an array. Walking the list, the next load can start as soon as (but not before) the previous load completes, because one depends on the other. This is called "load-use latency". (A sketch follows.) See also Is there a penalty when base+offset is in a different page than the base? for a trick Intel CPUs use to optimistically speed up workloads like that (the 4-cycle L1d latency case, instead of the usual 5 cycles). Semi-related: PyPy 17x faster than Python. Can Python be sped up? is another test that's dependent on pointer-chasing latency.
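A minimal sketch of that linked-list (pointer-chasing) latency test; the working-set size, iteration count, and the permutation-based list construction are assumptions chosen for illustration:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NODES (1u << 16)   /* arbitrary working-set size */

int main(void)
{
    /* next[i] holds the index of the node that follows i.
       Build one random cycle so every load depends on the previous one. */
    uint32_t *next = malloc(NODES * sizeof *next);
    uint32_t *perm = malloc(NODES * sizeof *perm);
    for (uint32_t i = 0; i < NODES; i++)
        perm[i] = i;
    for (uint32_t i = NODES - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
        uint32_t j = rand() % (i + 1);
        uint32_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (uint32_t i = 0; i < NODES; i++)
        next[perm[i]] = perm[(i + 1) % NODES];   /* link into one big cycle */

    /* The chase: each load address comes from the previous load, so
       total time / steps approximates the load-use latency. */
    uint32_t p = 0;
    for (uint64_t step = 0; step < 100u * NODES; step++)
        p = next[p];

    printf("final node: %u\n", p);   /* keep the chase from optimizing away */
    free(next);
    free(perm);
    return 0;
}
```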
QUESTION
Following are my files for HTML, .ts and JSON. As the JSON data is very extensive, I have just added a few states and their cities. My 1st dropdown shows all states. Now I want to match my 1st dropdown's selected state value with the "state" key in the "cities" object in my JSON file, so I can populate the 2nd dropdown with cities relevant to that state, and I want to do this in the function "getCitiesForSelectedState". Please help me find a solution for this.
//.ts file
...ANSWER
Answered 2021-Apr-27 at 16:44
You can do it with the $event parameter.
Make sure to compare your values safely. If your value is not in the right type or has spaces or unwanted chars, this c.state == val might not work. You can use the trim function to compare your values safely: c.state.trim() == val.trim()
QUESTION
While Apache Storm offers several metric types, I am interested in the Topology Metrics (and not the Cluster Metrics or the Metrics v2). For these, a consumer has to be registered, for example as:
...ANSWER
Answered 2021-Apr-26 at 12:06
After looking in the right place, I found the related configuration:
topology.builtin.metrics.bucket.size.secs: 10
is the way to specify that interval in storm.yaml.
QUESTION
My goal is to load a static structure into the L1D cache, then perform some operations using those structure members, and when done with the operations run invd to discard all the modified cache lines. So basically I want to create a secure environment inside the cache, so that while performing operations inside the cache, data will not be leaked into the RAM.
To do this, I have a kernel module where I place some fixed values in the members of a structure. Then I disable preemption, disable the cache for all other CPUs (except the current CPU), disable interrupts, and use __builtin_prefetch() to load my static structure into the cache. After that, I overwrite the previously placed fixed values with new values. Then I execute invd (to clear the modified cache lines), re-enable the cache on all other CPUs, and re-enable interrupts and preemption. My rationale is that, since I'm doing this while in atomic mode, INVD will remove all the changes, and after coming back from atomic mode I should see the original fixed values that I placed previously. That is, however, not happening. After coming out of atomic mode, I can see the values that I used to overwrite the previously placed fixed values. Here is my module code.
It's strange that after rebooting the PC my output changes; I just don't understand why. Now I'm not seeing any changes at all. I'm posting the full code, including some fixes @Peter Cordes suggested.
...ANSWER
Answered 2021-Mar-25 at 15:29
It looks very unsafe to call printk at the bottom of fillCache. You're about to run a few more stores and then an invd, so any modifications printk makes to kernel data structures (like the log buffer) might get written back to DRAM or might get invalidated if they're still dirty in cache. If some but not all stores make it to DRAM (because of limited cache capacity), you could leave kernel data structures in an inconsistent state.
I'd guess that your current tests with HT disabled show everything working even better than you hoped, including discarding stores done by printk, as well as discarding the stores done by changeFixedValue. That would explain the lack of log messages left for user-space to read once your code finishes.
To test this, you'd ideally want to clflush everything printk did, but there's no easy way to do that. Perhaps wbinvd, then changeFixedValue, then invd. (You're not entering no-fill mode on this core, so fillCache isn't necessary for your store / invd idea to work; see below.)
CR0.CD is per-physical-core, so having your HT sibling core disable cache also means CD=1 for the isolated core. So with HT enabled, you were in no-fill mode even on the isolated core.
With HT disabled, the isolated core is still normal.
Compile-time and run-time reordering
asm volatile("invd\n":::); without a "memory" clobber tells the compiler it's allowed to reorder it wrt. memory operations. Apparently that isn't the problem in your case, but it's a bug you should fix. (A corrected statement is sketched below.)
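As a sketch, the same statement with the memory clobber added (GNU C; the wrapper function name is made up for illustration):

```c
/* Tell the compiler that memory may be read/written across this statement,
   so it cannot reorder loads/stores around the INVD or keep values cached
   in registers across it. INVD itself still requires CPL=0. */
static inline void invd_with_clobber(void)
{
    asm volatile("invd" ::: "memory");
}
```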
Probably also a good idea to put asm("mfence; lfence" ::: "memory"); right before fillCache, to make sure any cache-miss loads and stores aren't still in flight and maybe allocating new cache lines while your code is running. Or possibly even a fully serializing instruction like asm("xor %%eax,%%eax; cpuid" ::: "eax", "ebx", "ecx", "edx", "memory");, but I don't know of anything that CPUID blocks which mfence; lfence wouldn't.
PREFETCHT0 (into L1d cache) is __builtin_prefetch(p,0,3);. This answer shows how the args map to instructions; you're using prefetchw (write-intent) or I think prefetcht1 (L2 cache), depending on compiler options.
But really, since you need this for correctness, you shouldn't be using optional hints that the HW can drop if it's busy. mfence; lfence would make it unlikely for the HW to actually be busy, but still not a bad idea.
Use a volatile read like READ_ONCE to get GCC to emit a load instruction. Or use volatile char *buf with *buf |= 0; or something to truly RMW, instead of prefetch, to make sure the line is exclusively owned without having to get GCC to emit prefetchw. (A sketch of this is below.)
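A minimal sketch of that idea, walking a buffer one cache line at a time with a volatile read-modify-write; the function name and the 64-byte line size are assumptions for illustration:

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size */

/* Touch every cache line of buf with a real RMW so each line ends up
   exclusively owned (and dirty), not just hinted by a prefetch. */
static void own_lines(volatile char *buf, size_t len)
{
    for (size_t off = 0; off < len; off += CACHE_LINE)
        buf[off] |= 0;   /* volatile access: GCC must emit the load and store */
}
```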
Perhaps worth running fillCache a couple times, just to make more sure that every line is properly in the state you want. But since your env is smaller than 4k, each line will be in a different set in L1d cache, so there's no risk that one line got tossed out while allocating another (except in case of an alias in L3 cache's hash function? But even then, pseudo-LRU eviction should keep the most-recent line reliably.)
Align your data by 128, an aligned pair of cache lines
static struct CACHE_ENV { ... } cacheEnv; isn't guaranteed to be aligned by the cache line size; you're missing C11 _Alignas(64) or GNU C __attribute__((aligned(64))). So it might be spanning more than sizeof(T)/64 lines. Or, for good measure, align by 128 for the L2 adjacent-line prefetcher. (Here you can and should simply align your buffer, but The right way to use function _mm_clflush to flush a large struct shows how to loop over every cache line of an arbitrary-sized, possibly-unaligned struct.)
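For example, a sketch of the alignment fix using the GNU C attribute (the member layout here is a placeholder, not the original module's definition):

```c
/* Align the whole environment to 128 bytes (an aligned pair of cache lines)
   so it starts on a line boundary and cooperates with the L2
   adjacent-line prefetcher. */
static struct CACHE_ENV {
    char in[1024];     /* placeholder members, not the original layout */
    char out[1024];
} cacheEnv __attribute__((aligned(128)));
```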
This doesn't explain your problem, since the only part that might get missed is the last up-to-48 bytes of env.out. (I think the global struct will get aligned by 16 by default ABI rules.) And you're only printing the first few bytes of each array.
And BTW, overwriting your buffer with 0 via memset after you're done should also keep your data from getting written back to DRAM about as reliably as INVD, but faster. (Maybe a manual rep stosb via asm to make sure it can't optimize away as a dead store; a sketch follows.)
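As an illustration only, a manual rep stosb clear that the compiler cannot drop as a dead store (GNU C inline asm; the function name and arguments are placeholders):

```c
#include <stddef.h>

/* Zero a buffer with rep stosb. The memory clobber and register operands
   mean the compiler cannot eliminate this the way it might a final memset()
   whose result is never read. */
static inline void secure_zero(void *buf, size_t len)
{
    asm volatile("rep stosb"
                 : "+D"(buf), "+c"(len)   /* RDI = dest, RCX = count, both clobbered */
                 : "a"(0)                 /* AL = 0, the byte value to store */
                 : "memory");
}
```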
No-fill mode might also be useful here to stop cache misses from evicting existing lines. AFAIK, that basically locks down the cache so no new allocations will happen, and thus no evictions. (But you might not be able to read or write other normal memory, although you could leave a result in registers.)
No-fill mode (for the current core) would make it definitely safe to clear your buffers with memset before re-enabling allocation; no risk of a cache miss during that causing an eviction. Although if your fillCache actually works properly and gets all your lines into MESI Modified state before you do your work, your loads and stores will hit in L1d cache without risk of evicting any of your buffer lines.
If you're worried about DRAM contents (rather than bus signals), then clflushopt each line after memset will reduce the window of vulnerability. (Or memcpy from a clean copy of the original if 0 doesn't work for you, but hopefully you can just work in a private copy and leave the orig unmodified. A stray write-back is always possible with your current method, so I wouldn't want to rely on it to definitely always leave a large buffer unmodified.)
Don't use NT stores for a manual memset or memcpy: that might flush the "secret" dirty data before the NT store. One option would be to memset(0) with normal stores or rep stosb, then loop again with NT stores. Or perhaps do 8x movq normal stores per line, then 8x movnti, so you do both things to the same line back to back before moving on.
If you're not using no-fill mode, it shouldn't even matter whether the lines are cached before you write to them. You just need your writes to be dirty in cache when invd runs, which should be true even if they got that way from your stores missing in cache.
You already don't have any barrier like mfence between fillCache and changeFixedValue, which is fine, but means that any cache misses from priming the cache are still in flight when you dirty it.
INVD itself is serializing, so it should wait for stores to leave the store buffer before discarding cache contents. (So putting mfence; lfence after your work, before INVD, shouldn't make any difference.) In other words, INVD should discard cacheable stores that are still in the store buffer, as well as dirty cache lines, unless committing some of those stores happens to evict anything.
QUESTION
I'm trying to execute the "invd" instruction from a kernel module. I have asked a similar question (How to execute "invd" instruction?) previously, and from @Peter Cordes's answer I understand I can't safely run this instruction on an SMP system after system boot. So, shouldn't I be able to run this instruction after boot without SMP support? Because there is no other core running, there is no chance of memory inconsistency. I have the following kernel module, compiled with the -O0 flag.
ANSWER
Answered 2021-Mar-13 at 22:45
There are 2 questions here:
a) How to execute INVD (unsafely)
For this, you need to be running at CPL=0, and you have to make sure the CPU isn't using any "processor reserved memory protections" which are part of Intel's Software Guard Extensions (an extension to allow programs to have a shielded/private/encrypted space that the OS can't tamper with, often used for digital rights management schemes but possibly usable for enhancing security/confidentiality of other things).
Note that SGX is supported in recent versions of Linux, but I'm not sure when support was introduced or how old your kernel is, or if it's enabled/disabled.
If either of these isn't true (e.g. you're at CPL=3 or there are "processor reserved memory protections"), you will get a general protection fault exception.
b) How to execute INVD Safely
For this, you have to make sure that the caches (which includes "external caches" - e.g. possibly including things like eDRAM and caches built into non-volatile RAM) don't contain any modified data that will cause problems if lost. This includes data from:
IRQs. These can be disabled.
NMI and machine check exceptions. For a running OS it's mostly impossible to stop/disable these and if you can disable them then it's like crossing your fingers while ignoring critical hardware failures (an extremely bad idea).
the firmware's System Management Mode. This is a special CPU mode the firmware uses for various things (e.g. ECC scrubbing, some power management, emulation of legacy devices) that's beyond the control of the OS/kernel. It can't be disabled.
writes done by the CPU itself. This includes updating the accessed/dirty flags in page tables (which can not be disabled), plus any performance monitoring or debugging features that store data in memory (which can be "not enabled").
With these restrictions (and not forgetting the performance problems) there are only 2 cases where INVD might be sane - early firmware code that needs to determine RAM chip sizes and configure memory controllers (where it's very likely to be useful/sane), and the instant before the computer is turned off (where it's likely to be pointless).
Guesswork
I'm guessing (based on my inability to think of any other plausible reason) that you want to construct temporary shielded/private area of memory (to enhance security - e.g. so that the data you put in that area won't/can't leak into RAM). In this case (ironically) it's possible that the tool designed specifically for this job (SGX) is preventing you from doing it badly.
QUESTION
I am having a lot of trouble making this work.
I have tried the following ways:
...ANSWER
Answered 2021-Mar-08 at 16:31
I'm not aware of any good way to do this. I recommend AT&T syntax for GNU C inline asm (or dialect-alternatives like add {%1,%0 | %0,%1} so it works both ways for GCC). Options like -masm=intel don't get clang to substitute in bare register names the way they do for GCC.
How to generate assembly code with clang in Intel syntax? is about the syntax used for -S output, and unlike GCC it's not connected to the syntax for inline-asm input to the compiler. The behaviour of --x86-asm-syntax=intel hasn't changed: it still outputs in Intel syntax, and doesn't help you with inline asm.
You can abuse %V0 or %V[i] (instead of %0 or %[i]) to print the "naked" full-register name in the template (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#x86Operandmodifiers), but that sucks because it only prints the full register name. Even for a 32-bit int that picked EAX, it will print RAX instead of EAX.
(It also doesn't work for "m" memory operands to get dword ptr [rsp + 16] or whatever the compiler's choice of addressing mode is, but it's better than nothing. Although IMO it's not better than just using AT&T syntax.)
Or you could pick hard registers like "=a"(var) and then just explicitly use EAX instead of %0. But that's worse and defeats some of the optimization benefit of the constraint system.
You do still need ".intel_syntax noprefix\n" in your template, and you should end your template with ".att_syntax" to switch the assembler back to AT&T mode to assemble the later compiler-generated asm. (Needed if you want your code to work with GCC! clang's built-in assembler doesn't merge your inline asm text into one big asm text file before assembling; it goes straight to machine code for compiler-generated instructions.) A sketch of such a template is below.
Obviously telling the compiler it can pick any register with "=r", and then actually using your own hard-coded choices, will create undefined behaviour when the compiler picks differently. You'll step on the compiler's toes and corrupt values it wanted to use later, and have it take garbage from the wrong registers as the output. IDK why you bothered to include that in your question; that would break in exactly the same way for AT&T syntax, for the same fairly obvious reason.
QUESTION
I was trying to use pytesseract to find the box positions of each letter in an image. I tried using an image and cropping it with Pillow, and it worked, but when I tried with a smaller character-size image (example), the program may recognize the characters, but cropping the image with the box coordinates gives me images like this. I also tried doubling the size of the original image, but it changed nothing.
ANSWER
Answered 2021-Mar-05 at 13:57
If we stick to the source code of image_to_boxes, we see that the returned coordinates are in the following order:
QUESTION
I came across a code where the assembly program can check if the string is a palindrome, except the string was hardcoded. I wanted to practice and modify the code to do the same thing except that the program will take a user's input for the string.
I got the main code from this post: Palindrome using NASM: answer by user1157391
Below is the code I came up with, except I keep getting the errors: line 39: invalid operand type and line 41: division operator may only be applied to scalar values
...ANSWER
Answered 2020-Dec-15 at 15:14
The read system call returns the number of bytes read in EAX. (Or a negative errno code like -EFAULT if you passed a bad pointer, -EBADF if the fd isn't open, etc.) In a real program you'd write error-handling code (and maybe retry until you actually get to EOF in case read returns early from a long input), but in a toy program you can assume that one read system call succeeds and gets all the data you want to look at.
This data is not necessarily 0-terminated, because you passed the full buffer size to read (footnote 1). It could have stored a non-zero input byte in the last character of the buffer. You can't strlen the buffer to find the length without maybe reading past the end.
But fortunately you don't need to; remember read leaves the input length in EAX (footnote 2).
So after the read syscall, add eax, ecx makes EAX a pointer to one-past-the-end of the string (like C read() + msg), while ECX is still pointing at the read arg. So you're all ready to loop them towards each other until they cross, the standard palindrome-checking algorithm.
Use cmp/jnb as the loop condition at the bottom of your palindrome loop, not the slow loop instruction. This is simpler than calculating how many iterations it will take for the pointers to cross; just loop until p < q is false, where p=start; q=end initially. Since this is homework, I'll let you choose the args to cmp.
Using a pointer to one past the end of the input, like C++ std::vector::end(), is fairly common. You'd use it by dec eax / movzx edx, byte [eax] - decrement the pointer before reading it.
(Or if you really want, work out the sub/shr details to make a counted loop with the read return value.)
Another complication: your input may include a newline. If you typed 10 bytes before newline, then the read buffer would only have those characters, no newline.
But on a shorter input, the buffer would hold a newline (0xa), which would compare unequal to the first byte. You might want to loop the end pointer (EAX) backwards until you find a non-newline, instead of special-casing cmp eax, length before adding. This will leave you with a pointer to the last byte, not one-past-last, so after doing this the main palindrome loop should load before decrementing the pointer. (The two-pointer logic is sketched in C below.)
Footnote 1: Actually you passed 11, so read itself can write past the end of your buffer. If you'd used length equ $-input to get NASM to calculate the length for you, or length equ 10 / input: resb length, you wouldn't have this problem and wouldn't have the length hard-coded in multiple places. You would mov edx, length before the read system call.
It makes no sense to reserve 10 bytes of space for length with length: resb 10. If anything you'd want 4 bytes (a dword integer), but it's a waste of instructions to keep it in memory at all. You're not close to running out of registers.
Footnote 2: It's really dumb that C functions like fgets don't tell you how many bytes they read, but fortunately the Unix system-call API doesn't suck. It's normal to know how large your data is, so take advantage of pointer+length instead of calling or implementing strlen whenever possible.
Some of those parts of the C library date back to very early C history, like maybe before it was called C. This partly explains the weird design of functions like fopen that take a string instead of an OR of bit constants (Why does C's "fopen" take a "const char *" as its second argument?), and the bad design of functions like strcpy which finds the length but chooses not to return it (strcpy() return value). It's like the library designers hated efficiency, or valued code-size to the extreme (always pass around implicit-length strings, never keep track of their lengths), or didn't realize that rolling your own copy loops when you do want the end wouldn't be viable. (Simple portable C compiles to slower asm than hand-written string functions.)
QUESTION
I am trying to implement strcmp, which is a C function, in x86-64 assembly. Here is my working code so far:
ANSWER
Answered 2020-Dec-10 at 19:20
strcmp() only guarantees the sign of the result. Something probably got optimized in the second case. You don't need to care that the magnitude is different, so it would be best if you didn't.
The compiler would be within its rights to optimize
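For reference, a minimal C version that illustrates the sign-only contract (callers should only test for negative, zero, or positive, never a specific magnitude); this is a generic sketch, not the asker's assembly:

```c
/* Returns a negative, zero, or positive value; only the sign is meaningful,
   exactly as the strcmp() contract specifies. */
static int my_strcmp(const char *a, const char *b)
{
    while (*a && *a == *b) {
        a++;
        b++;
    }
    /* Compare as unsigned char, like the real strcmp, so the sign is
       well-defined for bytes above 127. */
    return (unsigned char)*a - (unsigned char)*b;
}
```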
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install corde
Support