cmp | Consent Management Platform Reference Implementation

by appnexus | JavaScript | Version: Current | License: Apache-2.0

kandi X-RAY | cmp Summary

cmp is a JavaScript library. It has no reported bugs or vulnerabilities, a permissive license, and low support. You can install it with 'npm i appnexus-cmp' or download it from GitHub or npm.

This sample CMP was designed to support the initial adoption of TCF 1.0. It is not actively maintained and will not be updated to support TCF 2.0. We strongly recommend that you adopt either a commercial CMP or another open source alternative, then register the CMP with IAB Europe so that it is recognized as sending valid signals in the advertising ecosystem. AppNexus requires the use of a CMP registered with IAB Europe if you use the TCF for GDPR/ePrivacy.

Support

cmp has a low active ecosystem.

It has 83 star(s) with 48 fork(s). There are 18 watchers for this library.

It had no major release in the last 6 months.

There are 8 open issues and 42 have been closed. On average issues are closed in 15 days. There are 20 open pull requests and 0 closed requests.

It has a neutral sentiment in the developer community.

The latest version of cmp is current.

Quality

cmp has 0 bugs and 0 code smells.

Security

cmp has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

cmp code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

cmp is licensed under the Apache-2.0 License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

cmp releases are not available. You will need to build from source code and install.

Deployable package is available in npm.

Installation instructions, examples and code snippets are available.

cmp saves you 617 person hours of effort in developing the same functionality from scratch.

It has 1436 lines of code, 0 functions and 108 files.

It has low code complexity. Code complexity directly impacts maintainability of the code.

Community Discussions

Trending Discussions on cmp

Why is the XOR swap optimized into a normal swap using the MOV instruction?

Julia - trying to compare two strings

How to sort items inside a vector

Bubble sort slower with -O3 than -O2 with GCC

Why does iteration over an inclusive range generate longer assembly in Rust?

Why `sort()` need `T` to be `Ord`?

Why is `PartialOrd` not blanket-implemented for all types that implement `Ord`?

Lifetime issue with From<&V> trait constraint

Missed optimization with string_view::find_first_of

GEMM kernel implemented using AVX2 is faster than AVX2/FMA on a Zen 2 CPU

QUESTION

Why is the XOR swap optimized into a normal swap using the MOV instruction?

Asked 2022-Mar-08 at 10:00

While testing things in Compiler Explorer, I tried out the following overflow-free function for calculating the average of two unsigned 32-bit integers:

...

ANSWER

Answered 2022-Mar-08 at 10:00

Clang does the same thing. Probably for compiler-construction and CPU architecture reasons:

Disentangling that logic into just a swap may allow better optimization in some cases; definitely something it makes sense for a compiler to do early so it can follow values through the swap.
Xor-swap is total garbage for swapping registers, the only advantage being that it doesn't need a temporary. But xchg reg,reg already does that better.

I'm not surprised that GCC's optimizer recognizes the xor-swap pattern and disentangles it to follow the original values. In general, this makes constant-propagation and value-range optimizations possible through swaps, especially for cases where the swap wasn't conditional on the values of the vars being swapped. This pattern-recognition probably happens soon after transforming the program logic to GIMPLE (SSA) representation, so at that point it will forget that the original source ever used an xor swap, and not think about emitting asm that way.
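To make the equivalence concrete, here is a minimal Rust sketch (not the code from the question; the function names are illustrative) showing that a xor-swap and a plain swap compute the same result. That equivalence is what lets an optimizer replace one with the other and then track the original values through the swap.

```rust
/// Exchanges two values with the xor trick, using no temporary.
pub fn xor_swap(mut a: u32, mut b: u32) -> (u32, u32) {
    a ^= b; // a now holds a ^ b
    b ^= a; // b now holds the original a
    a ^= b; // a now holds the original b
    (a, b)
}

/// Exchanges two values the obvious way; this is the form an optimizer
/// can reason about directly.
pub fn plain_swap(a: u32, b: u32) -> (u32, u32) {
    (b, a)
}
```

Both functions return the same pair for every input, including the case where the two values are equal.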

Hopefully sometimes that lets it then optimize down to only a single mov, or two movs, depending on register allocation for the surrounding code (e.g. if one of the vars can move to a new register, instead of having to end up back in the original locations). And whether both variables are actually used later, or only one. Or if it can fully disentangle an unconditional swap, maybe no mov instructions.

But worst case, three mov instructions needing a temporary register is still better, unless it's running out of registers. I'd guess GCC is not smart enough to use xchg reg,reg instead of spilling something else or saving/restoring another tmp reg, so there might be corner cases where this optimization actually hurts.

(Apparently GCC -Os does have a peephole optimization to use xchg reg,reg instead of 3x mov: PR 92549 was fixed for GCC10. It looks for that quite late, during RTL -> assembly. And yes, it works here: turning your xor-swap into an xchg: https://godbolt.org/z/zs969xh47)

xor-swap has worse latency and defeats mov-elimination

"with no memory reads, and the same number of instructions, I don't see any bad impacts and feels odd that it be changed. Clearly there is something I did not think through though, but what is it?"

Instruction count is only a rough proxy for one of three things that are relevant for perf analysis: front-end uops, latency, and back-end execution ports. (And machine-code size in bytes: x86 machine-code instructions are variable-length.)

It's the same size in machine-code bytes, and same number of front-end uops, but the critical-path latency is worse: 3 cycles from input a to output a for xor-swap, and 2 from input b to output a, for example.

MOV-swap has at worst 1-cycle and 2-cycle latencies from inputs to outputs, or less with mov-elimination. (Which can also avoid using back-end execution ports, especially relevant for CPUs like IvyBridge and Tiger Lake with a front-end wider than the number of integer ALU ports. And Ice Lake, except Intel disabled mov-elimination on it as an erratum workaround; not sure if it's re-enabled for Tiger Lake or not.)

Also related:

Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? - and those 3 uops can't benefit from mov-elimination. But on modern AMD xchg reg,reg is only 2 uops.

If you're going to branch, just duplicate the averaging code

GCC's real missed optimization here (even with -O3) is not doing tail-duplication, which results in about the same static code size, just a couple of extra bytes since these are mostly 2-byte instructions. The big win is that the path that swaps then becomes the same length as the other, instead of twice as long to first do a swap and then run the same 3 uops for averaging.
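As a rough illustration of duplicating the averaging code into each branch, here is a Rust sketch. The question's code is not shown here, so the averaging form low + (high - low) / 2 and both function names are assumptions.

```rust
/// Swap first so that `a` is the larger value, then average once.
pub fn average_with_swap(mut a: u32, mut b: u32) -> u32 {
    if a < b {
        std::mem::swap(&mut a, &mut b);
    }
    // a >= b here, so the subtraction cannot underflow.
    b + (a - b) / 2
}

/// Same result, but the averaging expression is duplicated into each
/// branch so that neither path needs a swap.
pub fn average_duplicated(a: u32, b: u32) -> u32 {
    if a < b {
        a + (b - a) / 2
    } else {
        b + (a - b) / 2
    }
}
```

The two functions agree on every input; the second simply makes the tail-duplicated shape explicit in the source.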

update: GCC will do this for you with -ftracer (https://godbolt.org/z/es7a3bEPv), optimizing away the swap. (That's only enabled manually or as part of -fprofile-use, not at -O3, so it's probably not a good idea to use all the time without PGO, potentially bloating machine code in cold functions / code-paths.)

Doing it manually in the source (Godbolt):

Source https://stackoverflow.com/questions/71382441

QUESTION

Julia - trying to compare two strings

Asked 2022-Feb-17 at 16:42

I am trying to compare two strings I got as input, but it gave me an error like "syntax: unexpected "="".

...

ANSWER

Answered 2022-Feb-17 at 16:42

In Julia, you should use == for comparison: https://docs.julialang.org/en/v1/manual/mathematical-operations/#Numeric-Comparisons

This is different from assignment operator =, so for example

Source https://stackoverflow.com/questions/71155941

QUESTION

How to sort items inside a vector

Asked 2022-Jan-31 at 17:33

I have a vector [(5, 1), (9, 1), (4, 2)]

I want it to be sorted as [(4, 2), (9, 1), (5, 1)]

That is, sorted in descending order by the second element of each tuple, then by the first element. Would that be possible using the sort_by(|x, y| y.cmp(x)) function?
...

ANSWER

Answered 2022-Jan-30 at 22:59

It is most certainly possible, and you have the right idea using sort_by. You can sort by the second element in basically the way you suggest.
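One way this could look, as a minimal sketch rather than the answer's original code (the function name is illustrative): compare the second elements in reverse, and fall back to the first elements in reverse when they tie.

```rust
/// Sorts tuples in descending order by the second element, breaking ties
/// in descending order by the first element.
pub fn sort_desc_by_second_then_first(items: &mut [(i32, i32)]) {
    items.sort_by(|x, y| {
        // Swapping x and y in each comparison reverses the order.
        y.1.cmp(&x.1).then_with(|| y.0.cmp(&x.0))
    });
}
```

Applied to the vector from the question, [(5, 1), (9, 1), (4, 2)], this produces [(4, 2), (9, 1), (5, 1)].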

Source https://stackoverflow.com/questions/70919627

QUESTION

Bubble sort slower with -O3 than -O2 with GCC

Asked 2022-Jan-21 at 02:41

I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3 flag made it run even slower than no flags at all! Meanwhile -O2 was making it run a lot faster as expected.

Without optimisations:
...

ANSWER

Answered 2021-Oct-27 at 19:53

It looks like GCC's naïveté about store-forwarding stalls is hurting its auto-vectorization strategy here. See also Store forwarding by example for some practical benchmarks on Intel with hardware performance counters, and What are the costs of failed store-to-load forwarding on x86? Also Agner Fog's x86 optimization guides.

(gcc -O3 enables -ftree-vectorize and a few other options not included by -O2, e.g. if-conversion to branchless cmov, which is another way -O3 can hurt with data patterns GCC didn't expect. By comparison, Clang enables auto-vectorization even at -O2, although some of its optimizations are still only on at -O3.)

It's doing 64-bit loads (and branching to store or not) on pairs of ints. This means, if we swapped the last iteration, this load comes half from that store, half from fresh memory, so we get a store-forwarding stall after every swap. But bubble sort often has long chains of swapping every iteration as an element bubbles far, so this is really bad.

(Bubble sort is bad in general, especially if implemented naively without keeping the previous iteration's second element around in a register. It can be interesting to analyze the asm details of exactly why it sucks, so it's fair enough to want to try.)

Anyway, this is pretty clearly an anti-optimization you should report on GCC Bugzilla with the "missed-optimization" keyword. Scalar loads are cheap, and store-forwarding stalls are costly. (See Can modern x86 implementations store-forward from more than one prior store? The answer is no, nor can microarchitectures other than in-order Atom efficiently load when the load partially overlaps with one previous store and partially with data that has to come from the L1d cache.)

Even better would be to keep buf[x+1] in a register and use it as buf[x] in the next iteration, avoiding a store and load. (Like good hand-written asm bubble sort examples, a few of which exist on Stack Overflow.)
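The idea of carrying the larger element forward can be sketched at the source level. This is an illustrative Rust version, not the question's C code, and whether the carried value actually stays in a register is up to the compiler. Note that it stores to buf[x] on every step, unlike a conditional-swap implementation.

```rust
/// Bubble sort that carries the current largest element in a local
/// variable instead of reloading buf[x] on each step.
pub fn bubble_sort_carry(buf: &mut [i32]) {
    let n = buf.len();
    // `end` is the last index of the still-unsorted prefix.
    for end in (1..n).rev() {
        let mut carried = buf[0];
        for x in 0..end {
            let next = buf[x + 1]; // the only load in the inner loop
            if carried > next {
                buf[x] = next; // the smaller value settles on the left
            } else {
                buf[x] = carried;
                carried = next; // the larger value keeps bubbling right
            }
        }
        buf[end] = carried; // the largest of the prefix lands at the end
    }
}
```

Each inner step performs one load and one store, and no load ever overlaps the store made in the previous step.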

If it wasn't for the store-forwarding stalls (which AFAIK GCC doesn't know about in its cost model), this strategy might be about break-even. SSE 4.1 for a branchless pmind / pmaxd comparator might be interesting, but that would mean always storing and the C source doesn't do that.

If this strategy of double-width load had any merit, it would be better implemented with pure integer on a 64-bit machine like x86-64, where you can operate on just the low 32 bits with garbage (or valuable data) in the upper half. E.g.,

Source https://stackoverflow.com/questions/69503317

QUESTION

Why does iteration over an inclusive range generate longer assembly in Rust?

Asked 2022-Jan-15 at 11:19

These two loops are equivalent in C++ and Rust:

...

ANSWER

Answered 2022-Jan-12 at 10:20

Overflow in the iterator state.

The C++ version will loop forever when given a large enough input:
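The original snippets are not reproduced here. As a rough illustration in Rust, assuming an 8-bit counter so the effect is quick to observe (both function names and the step cap are hypothetical), an inclusive range terminates even at the maximum value, while a hand-written wrapping loop with an i <= n condition does not:

```rust
/// Counts the iterations of an inclusive range; terminates even when
/// n == u8::MAX because the range iterator tracks exhaustion separately.
pub fn count_inclusive(n: u8) -> u32 {
    let mut steps = 0u32;
    for _ in 0..=n {
        steps += 1;
    }
    steps
}

/// Mimics `for (i = 0; i <= n; ++i)` with a wrapping counter. Returns true
/// if the loop was still running after 1000 steps, i.e. it would not end.
pub fn naive_loop_runs_away(n: u8) -> bool {
    let mut i: u8 = 0;
    let mut steps = 0u32;
    while i <= n {
        steps += 1;
        if steps > 1000 {
            return true; // safety cap so that this sketch itself terminates
        }
        i = i.wrapping_add(1); // wraps from 255 back to 0
    }
    false
}
```

The extra state needed to make the inclusive range stop correctly at the maximum value is the overflow handling in the iterator state that the answer refers to.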

Source https://stackoverflow.com/questions/70672533

QUESTION

Why `sort()` need `T` to be `Ord`?

Asked 2022-Jan-05 at 06:47

https://doc.rust-lang.org/src/alloc/slice.rs.html#268-270

https://doc.rust-lang.org/src/core/cmp.rs.html#1034
...

ANSWER

Answered 2022-Jan-05 at 06:47

Both historical and logical reasons.

Back in the day, before impl [T], sort first used cmp() instead of lt(). That changed about 5 years ago for optimization reasons. At that point, the constraint could have been changed from Ord to PartialOrd. And indeed, it sparked another discussion about PartialOrd and Ord.

However, there's a logical reason too: for any two indices i and j within [0..values.len()] and i <= j, you expect the following to hold if values has been sorted:
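The condition itself is not shown here. One natural reading, assumed for this sketch, is values[i] <= values[j] for all i <= j. The hypothetical helper below checks that property and shows why a merely PartialOrd type such as f64 cannot guarantee it:

```rust
/// Checks the assumed sortedness property: values[i] <= values[j] for
/// every pair of indices with i <= j. Quadratic, for illustration only.
pub fn holds_sorted_invariant<T: PartialOrd>(values: &[T]) -> bool {
    (0..values.len()).all(|i| (i..values.len()).all(|j| values[i] <= values[j]))
}
```

For integers the check behaves as expected, but any slice containing f64::NAN fails it in every arrangement, because NAN <= NAN is false. That is the kind of case a total order (Ord) rules out.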

Source https://stackoverflow.com/questions/70588237

QUESTION

Why is `PartialOrd` not blanket-implemented for all types that implement `Ord`?

Asked 2021-Dec-26 at 13:36

In the documentation for Ord, it says

Implementations must be consistent with the PartialOrd implementation [...]

That of course makes sense and can easily be achieved as in the example further down:
...

ANSWER

Answered 2021-Dec-26 at 00:40

Apparently, there is a reference to that in a GitHub issue, rust-lang/rust#63104:

This conflicts with the existing blanket impl in core.
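Without a blanket impl, the usual pattern is to delegate by hand. A minimal sketch, using a hypothetical Version type that is not taken from the question:

```rust
use std::cmp::Ordering;

/// Hypothetical type used only for illustration.
#[derive(Debug, PartialEq, Eq)]
pub struct Version {
    pub major: u32,
    pub minor: u32,
}

impl Ord for Version {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare major first, then minor when the majors are equal.
        self.major
            .cmp(&other.major)
            .then(self.minor.cmp(&other.minor))
    }
}

impl PartialOrd for Version {
    // Delegating to `cmp` keeps the two implementations consistent.
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
```

Because partial_cmp always returns Some(self.cmp(other)), the consistency requirement quoted from the Ord documentation holds by construction.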

Source https://stackoverflow.com/questions/70483536

QUESTION

Lifetime issue with From<&V> trait constraint

Asked 2021-Dec-23 at 08:04

The following code produces the lifetime errors below despite the fact that the V instance in question is owned.

...

ANSWER

Answered 2021-Dec-23 at 08:01

Use a higher-rank trait bound denoted by for<'a>:
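The original code is not shown here, so the following is only a sketch of the pattern with hypothetical types (Feet, Meters) and a hypothetical helper (convert_owned). The for<'a> bound says that T can be built from a reference of any lifetime, including a short borrow of a value the function owns.

```rust
/// Hypothetical source type.
pub struct Feet(pub f64);

/// Hypothetical target type.
pub struct Meters(pub f64);

impl<'a> From<&'a Feet> for Meters {
    fn from(f: &'a Feet) -> Self {
        Meters(f.0 * 0.3048)
    }
}

/// Takes `value` by ownership and converts from a temporary borrow of it.
/// A single caller-chosen lifetime parameter would not work here, because
/// the borrow only lives inside this function body.
pub fn convert_owned<V, T>(value: V) -> T
where
    for<'a> T: From<&'a V>,
{
    T::from(&value)
}
```

With these definitions, convert_owned::<Feet, Meters>(Feet(10.0)) compiles and yields a Meters value, even though the borrowed Feet is dropped before the function returns.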

Source https://stackoverflow.com/questions/70459054

QUESTION

Missed optimization with string_view::find_first_of

Asked 2021-Dec-22 at 07:51

Update: relevant GCC bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103798

I tested the following code:
...

ANSWER

Answered 2021-Dec-21 at 11:08

libstdc++'s std::string_view::find_first_of looks something like:

Source https://stackoverflow.com/questions/70433152

QUESTION

GEMM kernel implemented using AVX2 is faster than AVX2/FMA on a Zen 2 CPU

Asked 2021-Dec-14 at 20:40

I have tried speeding up a toy GEMM implementation. I deal with blocks of 32x32 doubles for which I need an optimized MM kernel. I have access to AVX2 and FMA.

I have two codes (in ASM, I apologize for the crudeness of the formatting) defined below; one makes use of AVX2 features, the other uses FMA.

Without going into micro benchmarks, I would like to try to develop an understanding (theoretical) of why the AVX2 implementation is 1.11x faster than the FMA version. And possibly how to improve both versions.

The codes below are for a 3000x3000 MM of doubles and the kernels are implemented using the classical, naive MM with an interchanged deepest loop. I'm using a Ryzen 3700x/Zen 2 as development CPU.

I have not tried unrolling aggressively, for fear that the CPU might run out of physical registers.

AVX2 32x32 MM kernel:
...

ANSWER

Answered 2021-Dec-13 at 21:36

Zen2 has 3 cycle latency for vaddpd, 5 cycle latency for vfma...pd. (https://uops.info/).

Your code with 8 accumulators has enough ILP that you'd expect close to two FMA per clock, about 8 per 5 clocks (if there aren't other bottlenecks) which is a bit less than the 10/5 theoretical max.

vaddpd and vmulpd actually run on different ports on Zen2 (unlike Intel), port FP2/3 and FP0/1 respectively, so it can in theory sustain 2/clock vaddpd and vmulpd. Since the latency of the loop-carried dependency is shorter, 8 accumulators are enough to hide the vaddpd latency if scheduling doesn't let one dep chain get behind. (But at least multiplies aren't stealing cycles from it.)

Zen2's front-end is 5 instructions wide (or 6 uops if there are any multi-uop instructions), and it can decode memory-source instructions as a single uop. So it might well be doing 2/clock each multiply and add with the non-FMA version.

If you can unroll by 10 or 12, that might hide enough FMA latency and make it equal to the non-FMA version, but with less power consumption and more SMT-friendly to code running on the other logical core. (10 = 5 x 2 would be just barely enough, which means any scheduling imperfections lose progress on a dep chain which is on the critical path. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for some testing on Intel.)
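The accumulator arithmetic above can be written out as a small Rust sketch. The helper names are illustrative, and the model assumes the loop is limited only by the latency of the loop-carried dependency chains and by per-clock throughput.

```rust
/// Independent accumulators needed to keep a pipelined unit busy:
/// latency (cycles) multiplied by throughput (operations per clock).
pub fn accumulators_to_saturate(latency_cycles: u32, per_clock: u32) -> u32 {
    latency_cycles * per_clock
}

/// Operations per clock sustained with a given number of accumulators:
/// each chain finishes one operation per `latency_cycles`, capped by
/// the throughput limit of the execution units.
pub fn sustained_per_clock(accumulators: u32, latency_cycles: u32, per_clock: u32) -> f64 {
    let latency_bound = accumulators as f64 / latency_cycles as f64;
    latency_bound.min(per_clock as f64)
}
```

With the Zen 2 figures quoted in the answer (5-cycle FMA latency, 3-cycle vaddpd latency, 2 per clock), saturating FMA needs 5 x 2 = 10 accumulators and vaddpd needs 3 x 2 = 6. Eight accumulators therefore sustain 8 / 5 = 1.6 FMA per clock but the full 2 per clock for vaddpd.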

(By comparison, Intel Skylake runs vaddpd/vmulpd on the same ports with the same latency as vfma...pd, all with 4c latency, 0.5c throughput.)

I didn't look at your code super carefully, but 10 YMM vectors might be a tradeoff between touching two pairs of cache lines vs. touching 5 total lines, which might be worse if a spatial prefetcher tries to complete an aligned pair. Or might be fine. 12 YMM vectors would be three pairs, which should be fine.

Depending on matrix size, out-of-order exec may be able to overlap inner loop dep chains between separate iterations of the outer loop, especially if the loop exit condition can execute sooner and resolve the mispredict (if there is one) while FP work is still in flight. That's an advantage to having fewer total uops for the same work, favouring FMA.

Source https://stackoverflow.com/questions/70340734

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities
No vulnerabilities reported

Install cmp
This produces a production build of the cmp script and the docs application:
./build/cmp.bundle.js - CMP script to include on your site
./build/docs/ - Application hosting the documentation

Support
Instructions to install the CMP as well as API docs and examples are available in the docs application included with the repo. The documentation can be viewed at: http://localhost:5000/docs/.