optimization | Routing optimization module for Itinero | Router library
kandi X-RAY | optimization Summary
A module for solving routing optimization problems, with Itinero as the routing solution.
Community Discussions
Trending Discussions on optimization
QUESTION
ANSWER
Answered 2022-Apr-08 at 15:31

Consider a typical use case of a std::any: you pass it around in your code, move it dozens of times, store it in a data structure and fetch it again later. In particular, you'll likely return it from functions a lot.

As it is now, the pointer to the single "do everything" function is stored right next to the data in the any. Given that it's a fairly small type (16 bytes on GCC x86-64), an any fits into a pair of registers. Now, if you return an any from a function, the pointer to the "do everything" function of the any is already in a register or on the stack! You can just jump directly to it without having to fetch anything from memory. Most likely, you didn't even have to touch memory at all: you know what type is in the any at the point you construct it, so the function pointer value is just a constant that's loaded into the appropriate register. Later, you use the value of that register as your jump target. This means there's no chance of mispredicting the jump, because there is nothing to predict; the value is right there for the CPU to consume.

In other words: the reason you get the jump target for free with this implementation is that the CPU must have already touched the any in some way to obtain it in the first place, meaning it already knows the jump target and can jump to it with no additional delay.

That means there really is no indirection to speak of with the current implementation if the any is already "hot", which it will be most of the time, especially if it's used as a return value.
On the other hand, if you use a table of function pointers somewhere in a read-only section (and let the any instance point to that instead), you'll have to go to memory (or cache) every single time you want to move or access it. The size of an any is still 16 bytes in this case, but fetching values from memory is much, much slower than accessing a value in a register, especially if it's not in a cache. In a lot of cases, moving an any is as simple as copying its 16 bytes from one location to another, followed by zeroing out the original instance. This is pretty much free on any modern CPU. However, if you go the pointer-table route, you'll have to fetch from memory every time, wait for the reads to complete, and then do the indirect call. Now consider that you'll often have to do a sequence of calls on the any (i.e. move, then destruct), and this will quickly add up. The problem is that you don't just get the address of the function you want to jump to for free every time you touch the any; the CPU has to fetch it explicitly. Indirect jumps to a value read from memory are quite expensive, since the CPU can only retire the jump operation once the entire memory operation has finished. That doesn't just include fetching a value (which is potentially quite fast because of caches) but also address generation, store-forwarding buffer lookup, TLB lookup, access validation, and potentially even page table walks. So even if the jump address is computed quickly, the jump won't retire for quite a long while. In general, "indirect-jump-to-address-from-memory" operations are among the worst things that can happen to a CPU's pipeline.
TL;DR: As it is now, returning an any doesn't stall the CPU's pipeline (the jump target is already available in a register, so the jump can retire pretty much immediately). With a table-based solution, returning an any will stall the pipeline twice: once to fetch the address of the move function, then another time to fetch the destructor. This delays retirement of the jump quite a bit, since it'll have to wait not only for the memory value but also for the TLB and access-permission checks.
Code memory accesses, on the other hand, aren't affected by this since the code is kept in microcode form anyway (in the µOp cache). Fetching and executing a few conditional branches in that switch statement is therefore quite fast (and even more so when the branch predictor gets things right, which it almost always does).
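To make the contrast concrete, here is a minimal, hypothetical C++ sketch of the two layouts being compared: a single stored "manager" function pointer that travels with the object, versus a pointer to a static table of operations in read-only data. Names such as any_inline, any_table and manager are illustrative only; this is not libstdc++'s actual implementation.

    #include <utility>

    // Layout A: one "do everything" manager pointer stored inline next to the
    // data, as in the implementation discussed above. The jump target travels
    // in a register together with the object itself.
    struct any_inline {
        enum class op { move, destroy };
        using manager_fn = void (*)(op, any_inline*, any_inline*);

        manager_fn manage  = nullptr;   // jump target is part of the value
        void*      storage = nullptr;   // simplified: heap-only storage

        template <class T>
        static void manager(op o, any_inline* self, any_inline* other) {
            switch (o) {
            case op::move:
                other->storage = std::exchange(self->storage, nullptr);
                other->manage  = std::exchange(self->manage, nullptr);
                break;
            case op::destroy:
                delete static_cast<T*>(self->storage);
                break;
            }
        }

        template <class T>
        explicit any_inline(T value)
            : manage(&manager<T>), storage(new T(std::move(value))) {}

        any_inline(any_inline&& rhs) noexcept {
            if (rhs.manage) rhs.manage(op::move, &rhs, this);
        }
        ~any_inline() {
            if (manage) manage(op::destroy, this, nullptr);
        }
    };

    // Layout B: the object stores a pointer to a per-type table of operations
    // kept in read-only data. Every move or destroy must first load the right
    // function pointer from that table before the indirect call, which is the
    // extra memory dependency discussed above.
    struct any_table {
        struct vtable {
            void (*move)(any_table*, any_table*);
            void (*destroy)(any_table*);
        };
        const vtable* vt      = nullptr;   // extra indirection into .rodata
        void*         storage = nullptr;
        // constructors and operations analogous to layout A, omitted
    };

Moving or destroying an any_inline only ever touches the two words it already holds (and the manager pointer is typically already in a register), whereas any_table has to dereference vt first, which is where the extra stall described above comes from.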
QUESTION
Haskell provides a convenient function forever that repeats a monadic effect indefinitely. It can be defined as follows:
ANSWER
Answered 2022-Feb-05 at 20:34

The execution engine starts off with a pointer to your loop, and lazily expands it as it needs to find out what IO action to execute next. With your definition of forever, here's what a few iterations of the loop look like in terms of "objects stored in memory":
QUESTION
I just upgraded an environment with nrwl from Angular version 11 to 12, with two Angular applications and several libraries. After the update, when I try to compile using optimization settings:
angular.json
...

ANSWER
Answered 2022-Jan-31 at 19:50

Reason for the issue

Angular expects browserslist to return an entry for each version ("safari 15.2", "safari 15.3") instead of a range ("safari 15.2-15.3"). So this is just a bug in the parsing logic for Safari browser versions, which needs to be corrected and will be fixed soon in patched versions of Angular 12/Angular 13. Link to details is here.
IMPORTANT UPDATE:
This is fixed in v12.2.16 and v13.2.1; please update if you are experiencing this issue. Users on v11 shouldn't be affected. Link to details is here. If you cannot or do not want to update for any reason, one of the workarounds below can be used.
Workarounds:
Modify .browserslistrc
Add the following lines to .browserslistrc:
QUESTION
Looking into UTF8 decoding performance, I noticed that protobuf's UnsafeProcessor::decodeUtf8 performs better than String(byte[] bytes, int offset, int length, Charset charset) for the following non-ASCII string: "Quizdeltagerne spiste jordbær med flØde, mens cirkusklovnen".

I tried to figure out why, so I copied the relevant code in String and replaced the array accesses with unsafe array accesses, the same as UnsafeProcessor::decodeUtf8.
Here are the JMH benchmark results:
ANSWER
Answered 2022-Jan-12 at 09:52

To measure the branch you are interested in, and in particular the scenario where the while loop becomes hot, I've used the following benchmark:
QUESTION
I am facing an issue while upgrading my project from Angular 8.2.1 to Angular 13. After a successful upgrade, preparing a build gives me the following error.
...

ANSWER
Answered 2021-Dec-14 at 12:45

Just remove the "extractCss": true setting from your production environment; that will resolve the problem.

The reason is that extractCss is deprecated, and its value is true by default. See more here: Extracting CSS into JS with Angular 11 (deprecated extractCss)
QUESTION
I was reading the GCC documentation on C and C++ function attributes. In the description of the error and warning attributes, the documentation casually mentions the following "trick":
error ("message")
warning ("message")
If the error or warning attribute is used on a function declaration and a call to such a function is not eliminated through dead code elimination or other optimizations, an error or warning (respectively) that includes message is diagnosed. This is useful for compile-time checking, especially together with __builtin_constant_p and inline functions where checking the inline function arguments is not possible through extern char [(condition) ? 1 : -1]; tricks.

While it is possible to leave the function undefined and thus invoke a link failure (to define the function with a message in .gnu.warning* section), when using these attributes the problem is diagnosed earlier and with exact location of the call even in presence of inline functions or when not emitting debugging information.
There's no further explanation. Perhaps it's obvious to programmers immersed in the environment, but it's not at all obvious to me, and I could not find any explanation online. What is this technique and when might I use it?
...

ANSWER
Answered 2022-Jan-23 at 04:53

I believe the premise is to have compile-time assert functionality. Suppose that you wrote
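The answer's own example code is not included in this excerpt, so here is a separate, hedged C++ sketch of the general pattern the attribute enables (the names checked_copy and copy_size_check_failed are hypothetical, not from the answer):

    #include <cstring>
    #include <cstddef>

    // Never meant to be called at run time: if a call to this survives
    // dead-code elimination, GCC reports the message below as a hard
    // compile-time error pointing at the offending call site.
    __attribute__((error("copy exceeds destination buffer")))
    void copy_size_check_failed();

    // Inline wrapper: when n is a compile-time constant the branch is folded
    // away under optimization, so the call above is eliminated whenever the
    // check passes; if the check can be proven to fail, the diagnostic fires
    // at compile time instead of at run time.
    inline void checked_copy(void* dst, std::size_t dst_size,
                             const void* src, std::size_t n)
    {
        if (__builtin_constant_p(n) && n > dst_size)
            copy_size_check_failed();   // hard compile-time error if reachable
        std::memcpy(dst, src, n);
    }

Note that the fold relies on optimization being enabled (e.g. -O1 or higher); glibc's _FORTIFY_SOURCE checks use essentially this pattern to turn provably out-of-bounds memcpy and strcpy calls into compile-time diagnostics.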
QUESTION
I have written the following very simple code which I am experimenting with in godbolt's compiler explorer:
...

ANSWER
Answered 2021-Dec-07 at 09:52

The assembly seems to be checking whether either num or den is 2**32 or larger, by shifting right by 32 bits and then checking whether the resulting number is 0. Depending on the outcome, either a 64-bit division (div rsi) or a 32-bit division (div esi) is performed.
Presumably this code is generated because the compiler writers expect the cost of the additional checks and the potential branch to be outweighed by the savings of avoiding an unnecessary, much slower 64-bit division.
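For reference, a minimal stand-in (assumed, not necessarily the asker's exact code) for the kind of function whose assembly is being discussed is a plain unsigned 64-bit division; on x86-64, GCC typically guards it with exactly this "shift right by 32 and test for zero" check:

    #include <cstdint>

    // GCC on x86-64 commonly compiles this to: test the upper 32 bits of both
    // operands; if both are zero, use the cheap 32-bit `div esi`, otherwise
    // fall back to the slower 64-bit `div rsi`.
    std::uint64_t quotient(std::uint64_t num, std::uint64_t den) {
        return num / den;
    }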
QUESTION
I was trying to find the source for a certain kind of inlining that happens in GHC, where a function that is passed as an argument to another function is inlined. For example, I may write a definition like the following (using my own List type to avoid rewrite rules):
...

ANSWER
Answered 2021-Nov-25 at 10:34

The optimization is called "call-pattern specialization" (a.k.a. SpecConstr); it specializes functions according to which arguments they are applied to. The optimization is described in the paper "Call-pattern specialisation for Haskell programs" by Simon Peyton Jones. The current implementation in GHC differs from what is described in that paper in two highly relevant ways:
- SpecConstr can apply to any call in the same module, not just recursive calls inside a single definition.
- SpecConstr can apply to functions as arguments, not just constructors. However, it doesn't work for lambdas, unless they have been floated out by full laziness.
Here is the relevant part of the Core that is produced without this optimization, using -fno-spec-constr, and with the -dsuppress-all -dsuppress-uniques -dno-typeable-binds flags:
QUESTION
Short version: If s is a string, then s = s + 'c' might modify the string in place, while t = s + 'c' can't. But how does the operation s + 'c' know which scenario it's in?
Long version:

t = s + 'c' needs to create a separate string because the program afterwards wants both the old string as s and the new string as t.

s = s + 'c' can modify the string in place if s is the only reference, as the program only wants s to be the extended string. CPython actually does this optimization, if there's space at the end for the extra character.
Consider these functions, which repeatedly add a character:
...

ANSWER
Answered 2021-Sep-08 at 00:15

Here's the code in question, from the Python 3.10 branch (in ceval.c, and called from the same file's implementation of the BINARY_ADD opcode). As @jasonharper noted in a comment, it peeks ahead to see whether the result of the BINARY_ADD will next be bound to the same name from which the left-hand addend came. In fast(), it is (the operand came from s and the result is stored into s), but in slow() it isn't (the operand came from s but the result is stored into t).
There's no guarantee this optimization will persist, though. For example, I noticed that your fast() is no faster than your slow() on the current development CPython main branch (which is the current work-in-progress toward an eventual 3.11 release).
As noted, there's no guarantee this optimization will persist. "Serious" Python programmers should know better than to rely on dodgy CPython-specific tricks, and, indeed, PEP 8 explicitly warns against relying on this specific one:
Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).
For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b ...
QUESTION
I am curious about the vastly different performance characteristics of running x86-64 binaries on the Apple M1 platform using Rosetta 2 vs. emulation, for example what Docker Desktop currently does using QEMU.
I understand why emulation is so slow, but an explanation for why Rosetta 2 is so fast has been detailed in this Twitter thread: https://twitter.com/ErrataRob/status/1331735383193903104
The gist of that explanation is that under usual circumstances, arm and x86 have opposite (and incompatible) memory addressing schemes which require significant emulation overhead, but the M1 chip addresses this with a hardware optimization that allows it to access memory using both addressing schemes. Effectively, when Rosetta 2-emulated instructions are being run, a flag is set to let the processor know to use the x86-style addressing scheme.
Assuming this explanation is reasonable (and if anyone has better-sourced reporting than the above Twitter thread I would appreciate it in the comments for inclusion), is it technically plausible that this optimization could be leveraged for full hardware emulation, for example running x86-64 Linux Docker containers, or running a full x86-64 Windows desktop virtual machine a la VMware Fusion/VirtualBox? Or, does the separate operating system layer in those scenarios preclude being able to leverage the memory ordering optimization?
Separately, is this processor mode (flags or instructions) documented and published for 3rd-party use, or is it private to Apple only?
...

ANSWER
Answered 2021-Sep-04 at 20:09

Not memory addressing, memory ordering, i.e. for lock-free atomics used for inter-thread synchronization: in x86, every asm load/store is an acquire/release operation, respectively. (Real x86 CPUs do speculative early loads, so performance doesn't suffer under normal conditions when a single thread is operating on memory that other threads aren't writing.)
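As a concrete illustration of that ordering difference (my own sketch, not part of the original answer): with C++ std::atomic, an acquire load costs nothing extra on x86-64 because every ordinary load already has acquire semantics, whereas a weakly-ordered AArch64 CPU needs a dedicated acquire-load instruction.

    #include <atomic>

    std::atomic<int> flag{0};

    // On x86-64 both functions typically compile to the same plain `mov`
    // load, because the hardware memory model already orders loads strongly
    // enough. On AArch64 in its default weakly-ordered mode, the acquire load
    // needs a dedicated `ldar` instruction (or a barrier), which is the
    // per-access cost an x86 emulator must otherwise pay to preserve x86
    // ordering for translated code.
    int load_relaxed() { return flag.load(std::memory_order_relaxed); }
    int load_acquire() { return flag.load(std::memory_order_acquire); }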
M1 has hardware support for a mode like that, as well as a weakly-ordered mode to run native AArch64 code most efficiently. See
- https://singhkays.com/blog/apple-silicon-m1-black-magic/#performance-must-suck-when-trying-to-emulate-x86-on-arm-right
- https://www.reddit.com/r/hardware/comments/i0mido/apple_silicon_has_a_runtime_toggle_for_tso_to/.
And yes, https://github.com/saagarjha/TSOEnabler is open-source software to toggle that support. But it's a kernel extension, and code signing makes it tricky to get macOS to allow it; you have to sign it or disable signature verification or something:
Supposedly, you should be able to use this if you build and sign the kernel extension (disabling SIP if you aren't using a KEXT signing certificate) and drag it into /Library/Extensions. A dialog should come up to prompt you to enable the extension in the Security & Privacy preferences pane, you allow it from there and restart, and it will be installed. (If you're not seeing it, the permissions on the extension might be wrong: try a chown -R root:wheel.) In practice this can go wrong in many ways, and I have had luck by "resetting everything" and trying to install after doing the following:
[...] see the link for the list of steps
So yes, it's plausible that QEMU's x86 emulation could use the same hardware support that Rosetta-2's x86 emulation does. They're both x86 emulators.
And as you say, the main issue is Apple providing public APIs for enabling the HW mode so people don't have to install this kernel module manually; I'm sure most people wouldn't want to do that. I don't know much about the software situation; I was more interested in the CPU-architecture details.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported