lsd | LSD is a streaming daemon
kandi X-RAY | lsd Summary
LSD is a streaming daemon that has been developed as a replacement for Facebook's Scribe (which has since been abandoned).
lsd Key Features
lsd Examples and Code Snippets
def lsd_sort(string_list, width):
    """
    LSD (least significant digit) algorithm implementation. This algorithm can sort
    strings of a fixed length. The LSD algorithm needs about ~7WN + 3WR array accesses
    (W is the string length, N is the number of strings, R is the alphabet size).
    """
    R, aux = 256, [None] * len(string_list)                  # extended-ASCII alphabet
    for d in range(width - 1, -1, -1):                       # least significant character first
        count = [0] * (R + 1)
        for s in string_list: count[ord(s[d]) + 1] += 1      # frequency counts
        for r in range(R): count[r + 1] += count[r]          # cumulative counts
        for s in string_list: aux[count[ord(s[d])]] = s; count[ord(s[d])] += 1  # stable distribute
        string_list[:] = aux                                  # copy back for the next digit
    return string_list
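A quick usage check of the completed snippet above (the input strings are made up; any list of equal-length strings works):

words = ["dab", "cab", "fad", "bad"]
print(lsd_sort(words, 3))   # ['bad', 'cab', 'dab', 'fad']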
function lsd(arr, letterIdx) {
  var temp;
  var count;
  letterIdx = letterIdx || 1;
  for (var i = letterIdx - 1; i >= 0; i -= 1) {      // least significant character first
    count = [];
    temp = [];
    for (var j = 0; j < arr.length; j += 1) {        // frequency counts
      var charCode = arr[j].charCodeAt(i);
      count[charCode + 1] = (count[charCode + 1] || 0) + 1;
    }
    for (var c = 0; c < count.length - 1; c += 1)    // cumulative counts
      count[c + 1] = (count[c + 1] || 0) + (count[c] || 0);
    for (j = 0; j < arr.length; j += 1) {            // stable distribution into temp
      var code = arr[j].charCodeAt(i);
      temp[count[code]++] = arr[j];
    }
    for (j = 0; j < arr.length; j += 1) arr[j] = temp[j];  // copy back
  }
  return arr;
}
Community Discussions
Trending Discussions on lsd
QUESTION
I have a function in Pinescript that returns a value based on several indicators:
...ANSWER
Answered 2022-Mar-15 at 01:23
By declaring msgLongBuy with var it initializes on the first bar in history and doesn't recalculate, which means the message doesn't stay dynamic. Try removing var from your string declaration.
QUESTION
I am benchmarking the following code:
for (T& x : v) x = x + x;
where T is int.
When compiling with -mavx2, performance fluctuates by a factor of 2 depending on some conditions. This does not reproduce with -msse4.2.
I would like to understand what's happening.
How does the benchmark work? I am using Google Benchmark. It spins the loop until the point where it is sure about the time.
The main benchmarking code:
...ANSWER
Answered 2022-Feb-12 at 20:11
Yes, data misalignment could explain your 2x slowdown for small arrays that fit in L1d. You'd hope that with every other load/store being a cache-line split, it might only slow down by a factor of 1.5x, not 2, if a split load or store cost 2 accesses to L1d instead of 1.
But it has extra effects like replays of uops dependent on the load result that apparently account for the rest of the problem, either making out-of-order exec less able to overlap work and hide latency, or directly running into bottlenecks like "split registers".
ld_blocks.no_sr counts the number of times cache-line split loads are temporarily blocked because all resources for handling the split accesses are in use.
When a load execution unit detects that the load splits across a cache line, it has to save the first part somewhere (apparently in a "split register") and then access the 2nd cache line. On Intel SnB-family CPUs like yours, this 2nd access doesn't require the RS to dispatch the load uop to the port again; the load execution unit just does it a few cycles later. (But presumably can't accept another load in the same cycle as that 2nd access.)
- https://chat.stackoverflow.com/transcript/message/48426108#48426108 - uops waiting for the result of a cache-split load will get replayed.
- Are load ops deallocated from the RS when they dispatch, complete or some other time? But the load itself can leave the RS earlier.
- How can I accurately benchmark unaligned access speed on x86_64? general stuff on split load penalties.
The extra latency of split loads, and also the potential replays of uops waiting for those load results, is another factor, but those are also fairly direct consequences of misaligned loads. Lots of counts for ld_blocks.no_sr tells you that the CPU actually ran out of split registers and could otherwise be doing more work, but had to stall because of the unaligned load itself, not just other effects.
You could also look for the front-end stalling due to the ROB or RS being full, if you want to investigate the details, but not being able to execute split loads will make that happen more. So probably all the back-end stalling is a consequence of the unaligned loads (and maybe stores if commit from store buffer to L1d is also a bottleneck.)
On a 100 KB array I reproduce the issue: 1075 ns vs 1412 ns. On 1 MB I don't think I see it.
Data alignment doesn't normally make that much difference for large arrays (except with 512-bit vectors). With a cache line (2x YMM vectors) arriving less frequently, the back-end has time to work through the extra overhead of unaligned loads / stores and still keep up. HW prefetch does a good enough job that it can still max out the per-core L3 bandwidth. Seeing a smaller effect for a size that fits in L2 but not L1d (like 100kiB) is expected.
Of course, most kinds of execution bottlenecks would show similar effects, even something as simple as un-optimized code that does some extra store/reloads for each vector of array data. So this alone doesn't prove that it was misalignment causing the slowdowns for small sizes that do fit in L1d, like your 10 KiB. But that's clearly the most sensible conclusion.
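As a rough cross-check of the alignment explanation, here is a minimal, hypothetical Python/NumPy sketch (not the Google Benchmark code from the question; the buffer size, element count and one-byte offset are made up for illustration). It times an in-place add over an aligned int32 view and over a deliberately misaligned view of the same buffer. Interpreter and NumPy overhead may hide much of the hardware effect, so treat it only as a way to construct misaligned data, not as a faithful reproduction of the benchmark:

import timeit
import numpy as np

buf = bytearray(10 * 1024 + 64)                    # ~10 KiB working set, fits in L1d
aligned = np.frombuffer(buf, dtype=np.int32, count=2048, offset=0)
misaligned = np.frombuffer(buf, dtype=np.int32, count=2048, offset=1)  # 1-byte offset

for name, v in (("aligned", aligned), ("misaligned", misaligned)):
    t = timeit.timeit(lambda v=v: np.add(v, v, out=v), number=20000)   # x = x + x, in place
    print(name, t)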
Code alignment or other front-end bottlenecks seem not to be the problem; most of your uops are coming from the DSB, according to idq.dsb_uops. (A significant number aren't, but not a big percentage difference between slow vs. fast.)
"How can I mitigate the impact of the Intel jcc erratum on gcc?" can be important on Skylake-derived microarchitectures like yours; it's even possible that's why your idq.dsb_uops isn't closer to your uops_issued.any.
QUESTION
Let's say we have a pandas dataframe:
...ANSWER
Answered 2022-Jan-31 at 18:40
Use .loc to get rows by label and .iloc to get rows by position:
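For example, with a small made-up frame (the data is illustrative, not from the question):

import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

print(df.loc["b"])          # row whose label is "b"
print(df.iloc[1])           # row at integer position 1 (also "b" here)
print(df.loc[["a", "c"]])   # several rows by label
print(df.iloc[0:2])         # rows at positions 0 and 1 (end exclusive)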
QUESTION
So I'm trying to reproduce a cool filter I did a while back in C# (emgucv) in Python cv2. Despite my hopes it's not going very smoothly. The program is supposed to highlight edges and color them with a cool-looking gradient.
The code in C#:
...ANSWER
Answered 2022-Jan-09 at 05:55
I think this is what you are trying to do in Python/OpenCV. Python HSV hue is limited to the range 0 to 180, so your angle needs to be scaled to that range. Similarly, the magnitude is greater than 255 and also needs to be scaled to the range 0 to 255. The saturation you want would be a constant 255. I use Skimage to do the scaling. I have printed out the shape and min and max values at various places to show you these issues.
I believe the process is as follows (sketched in code after the list):
- Read the input
- Convert it to gray
- Get the Sobel x and y derivatives
- Compute the magnitude and angle from the derivatives and scale mag to range 0 to 255 and angle to range 0 to 180
- Merge the angle, the magnitude and the magnitude into a 3 channel image as if HSV with angle first, then the magnitudes.
- Replace the second channel (channel 1) with 255 for the saturation
- Convert this HSV image to BGR as the result
- Save the result
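A minimal Python/OpenCV sketch of those steps, assuming placeholder file names and Skimage's rescale_intensity for the scaling (this is an illustration of the list above, not the answer's original code):

import cv2
import numpy as np
from skimage.exposure import rescale_intensity

img = cv2.imread("input.jpg")                            # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

dx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)          # Sobel x derivative
dy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)          # Sobel y derivative

mag, ang = cv2.cartToPolar(dx, dy, angleInDegrees=True)  # magnitude and angle
val = rescale_intensity(mag, out_range=(0, 255)).astype(np.uint8)                      # magnitude -> 0..255
hue = rescale_intensity(ang, in_range=(0, 360), out_range=(0, 180)).astype(np.uint8)   # angle -> 0..180
sat = np.full_like(hue, 255)                             # constant saturation

hsv = cv2.merge([hue, sat, val])                         # H from angle, S = 255, V from magnitude
bgr = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
cv2.imwrite("result.jpg", bgr)                           # placeholder output name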
Input:
QUESTION
I want to convert this piece of code in order to make it compatible with Numba. The only sort method that Numba supports is sorted(), but not with the key arg. I have to manually sort without other lib imports, or maybe with just some numpy. Could someone give me an efficient way to do this sort? Thanks
...ANSWER
Answered 2021-Dec-30 at 22:15
For this particular kind of input, you can achieve the sorting with:
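The answer's own snippet is not shown above. As one hypothetical Numba-friendly pattern (not necessarily what the accepted answer used), you can precompute the sort keys into a NumPy array and reorder the data with np.argsort, which works in nopython mode; the function name and sample data here are made up:

import numpy as np
from numba import njit

@njit
def sort_by_key(values, keys):
    order = np.argsort(keys)      # indices that would sort the keys
    return values[order]          # values reordered to match

values = np.array([30, 10, 20])
keys = np.array([3.0, 1.0, 2.0])
print(sort_by_key(values, keys))  # [10 20 30]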
QUESTION
I am trying to compare the methods mentioned by Peter Cordes in his answer to the question 'Set all bits in CPU register to 1'.
Therefore, I write a benchmark to set all 13 registers to all bits 1, except e/rsp, e/rbp, and e/rcx.
The code is like below. times 32 nop is used to avoid DSB and LSD influence.
ANSWER
Answered 2021-Nov-27 at 20:04
The bottleneck in all of your examples is the predecoder.
I analyzed your examples with my simulator uiCA (https://uica.uops.info/, https://github.com/andreas-abel/uiCA). It predicts the following throughputs, which closely match your measurements:
      TP     Link
g1a   13.00  https://uica.uops.info/?code=...
g1b   14.00  https://uica.uops.info/?code=...
g2a   16.00  https://uica.uops.info/?code=...
g2b   17.00  https://uica.uops.info/?code=...
g3a   17.00  https://uica.uops.info/?code=...
g3b   18.00  https://uica.uops.info/?code=...
g4a   12.00  https://uica.uops.info/?code=...
g4b   12.00  https://uica.uops.info/?code=...

The trace table that uiCA generates provides some insights into how the code is executed. For g1a, for example, it generates the following trace:
You can see that for the 32 nops, the predecoder requires 8 cycles, and for the remaining instructions, it requires 5 cycles, which together corresponds to the 13 cycles that you measured.
You may notice that in some cycles, only a small number of instructions is predecoded; for example, in the fourth cycle, only one instruction is predecoded. This is because the predecoder works on aligned 16-byte blocks, and it can handle at most five instructions per cycle (note that some sources incorrectly claim that it can handle 6 instructions per cycle). You can find more details on the predecoder, for example how it handles instructions that cross a 16-byte boundary, in this paper.
If you compare this trace with the trace for g1b, you can see that the instructions after the nops now require 6 instead of 5 cycles to be predecoded, which is because several of the instructions in g1b are longer than the corresponding ones in g1a.
QUESTION
I am using LSD (LineSegmentDetector) in Python with OpenCV; the problem is that I want to count the number of horizontal lines detected and the number of vertical lines detected.
...ANSWER
Answered 2021-Nov-15 at 09:34
As you know the coordinates of the endpoints, you could simply compute the slope of the line with:
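The rest of the answer is truncated above; as a hedged sketch of the idea (the helper name, input format, and angle tolerance are illustrative, not from the answer), each detected segment can be classified from its endpoint coordinates:

import numpy as np

def count_orientations(segments, tol_deg=10):
    horizontal = vertical = 0
    for x1, y1, x2, y2 in segments:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1)) % 180   # 0..180, 0 = horizontal
        if angle < tol_deg or angle > 180 - tol_deg:
            horizontal += 1
        elif abs(angle - 90) < tol_deg:
            vertical += 1
    return horizontal, vertical

segments = [(0, 0, 100, 3), (50, 0, 52, 80)]   # roughly horizontal, roughly vertical
print(count_orientations(segments))            # (1, 1)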
QUESTION
I'm trying to verify the conclusion that two fuseable pairs can be decoded in the same clock cycle, using my Intel i7-10700 and ubuntu 20.04.
The test code is arranged like below, and it is copied like 8000 times to avoid the influence of LSD and DSB (to use MITE mostly).
...ANSWER
Answered 2021-Nov-12 at 13:08
On Haswell and later, yes. On Ivy Bridge and earlier, no.
On Ice Lake and later, Agner Fog says macro-fusion is done right after decode, instead of in the decoders which required the pre-decoders to send the right chunks of x86 machine code to decoders accordingly. (And Ice Lake has slightly different restrictions: Instructions with a memory operand cannot fuse, unlike previous CPU models. Instructions with an immediate operand can fuse.) So on Ice Lake, macro-fusion doesn't let the decoders handle more than 5 instructions per clock.
Wikichip claims that only 1 macro-fusion per clock is possible on Ice Lake, but that's probably incorrect. Harold tested with my microbenchmark on Rocket Lake and found the same results as Skylake. (Rocket Lake uses a Cypress Cove core, a variant of Sunny Cove back-ported to a 14nm process, so it's likely that it's the same as Ice Lake in this respect.)
Your results indicate that uops_issued.any is about half instructions, therefore you are seeing macro-fusion of most pairs. (You could also look at the uops_retired.macro_fused perf event. BTW, modern perf has symbolic names for most uarch-specific events: use perf list to see them.)
The decoders will still produce up to four or even five uops per clock on Skylake-derived microarchitectures, though, even if they only make two macro-fusions. You didn't look at how many cycles MITE is active, so you can't see that execution stalls most of the time, until there's room in the ROB / RS for an issue-group of 4 uops. And that opens up space in the IDQ for a decode group from MITE.
You have three other bottlenecks in your loop:
- Loop-carried dependency through dec ecx: only 1/clock because each dec has to wait for the result of the previous one to be ready.
- Only one taken branch can execute per cycle (on port 6), and dec/jge is taken almost every time, except for 1 in 2^32 when ECX was 0 before the dec. The other branch execution unit on port 0 only handles predicted-not-taken branches. https://www.realworldtech.com/haswell-cpu/4/ shows the layout but doesn't mention that limitation; Agner Fog's microarch guide does.
- Branch prediction: even jumping to the next instruction, which is architecturally a NOP, is not special cased by the CPU (see: Slow jmp-instruction). (Because there's no reason for real code to do this, except for call +0 / pop, which is special cased at least for the return-address predictor stack.)
This is why you're executing at significantly less than one instruction per clock, let alone one uop per clock.
Surprisingly to me, MITE didn't go on to decode a separate test and jcc in the same cycle as it made two fusions. I guess the decoders are optimized for filling the uop cache. (A similar effect on Sandybridge / IvyBridge is that if the final uop of a decode-group is potentially fusable, like dec, decoders will only produce 3 uops that cycle, in anticipation of maybe fusing the dec next cycle. That's true at least on SnB/IvB where the decoders can only make 1 fusion per cycle, and will decode separate ALU + jcc uops if there is another pair in the same decode group. Here, SKL is choosing not to decode a separate test uop (and jcc and another test) after making two fusions.)
QUESTION
In, say, a classical Radix Sort implementation, we start sorting an array of integers from the right to the left, that is, starting from the LSD. My question is, should we even sort the leftmost column if at the next iteration all its values will be sorted again? Can one start sorting from the second column from the end?
You can find example of what I've meant at this page: https://s3.stackabuse.com/media/articles/radix-sort-in-python-4.png
EDIT: rightmost, but not leftmost.
...ANSWER
Answered 2021-Oct-14 at 08:49
Not the leftmost but the rightmost (Least Significant Digit).
Yes, we must sort by the rightmost digit at the first stage because at the second stage we consider only the second digit.
For example, if we have the array [15 13] and want to sort by the second digit (second from the right, which is 1 in both) only, there is no need to swap elements, and the array remains the same: unsorted...
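A tiny Python illustration of that [15 13] example, using a stable per-digit sort to stand in for one radix-sort pass (the helper name is made up):

def sort_by_digit(nums, d):   # stable sort by the d-th digit (0 = ones place)
    return sorted(nums, key=lambda x: (x // 10 ** d) % 10)

print(sort_by_digit([15, 13], 1))                    # [15, 13]: tens digits are equal, nothing moves
print(sort_by_digit(sort_by_digit([15, 13], 0), 1))  # [13, 15]: doing the ones pass first gives the right order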
QUESTION
I am trying to import data with a foreign key following the guide from the Django import-export library (foreign key widget). But I am getting the following error. I have tried to add an additional column with the header name id, but I still get the same error.
...ANSWER
Answered 2021-Sep-27 at 13:00
This issue comes up fairly frequently, so I'll try to give a comprehensive answer which might help others in future.
When you are importing a file using django-import-export, the file is going to be processed row by row. For each row, the import process is going to test whether the row corresponds to an existing stored instance, or whether a new instance is to be created.
In order to test whether the instance already exists, django-import-export needs to use a field (or a combination of fields) in the row being imported. The idea is that the field (or fields) will uniquely identify a single instance of the model type you are importing.
This is where the import_id_fields meta attribute comes in. You can use this declaration to indicate which field (or fields) should be used to uniquely identify the row. If you don't declare import_id_fields, then a default declaration is used, in which there is only one field: 'id'.
So we can now see the source of your error - the import process is trying to use the default 'id' field, but there is no corresponding field in your row.
To fix this, you will either need to include the 'id' field in your csv file, or if this is not possible, then choose some other field (or fields) that will uniquely identify the row.
In either case, ensure that you declare this field (or fields) in your fields attribute, for example:
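For instance, a hypothetical resource along those lines might look like this (Book, Author, and the column names are placeholders, not taken from the question):

from import_export import fields, resources
from import_export.widgets import ForeignKeyWidget

from myapp.models import Author, Book   # placeholder models


class BookResource(resources.ModelResource):
    author = fields.Field(
        column_name="author",
        attribute="author",
        widget=ForeignKeyWidget(Author, "name"),
    )

    class Meta:
        model = Book
        import_id_fields = ("name",)             # identify existing rows by "name" instead of the default "id"
        fields = ("name", "author", "published")

Whatever you list in import_id_fields must be present both as a column in the imported file and in the fields declaration.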
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported