o2 | 2D Game Engine with visual WYSIWYG editor | Game Engine library
kandi X-RAY | o2 Summary
kandi X-RAY | o2 Summary
o2 - it's an open-source technology for easy making 2D games and applications for mobile and PC platforms using C++ and Lua with very flexible editor. The main features are performance, usability and effective development. Here is the test project: Now work in progress. Discord channel -
Support
Quality
Security
License
Reuse
Top functions reviewed by kandi - BETA
Currently covering the most popular Java, JavaScript and Python libraries. See a Sample of o2
o2 Key Features
o2 Examples and Code Snippets
function copyState(o1, o2) {
o2.fillStyle = o1.fillStyle;
o2.lineCap = o1.lineCap;
o2.lineJoin = o1.lineJoin;
o2.lineWidth = o1.lineWidth;
o2.miterLimit = o1.miterLimit;
o2.shadowBlur = o1.shadowBlur;
Community Discussions
Trending Discussions on o2
QUESTION
If I compile this code with GCC or Clang and enable -O2
optimizations, I still get some global object initialization. Is it even possible for any code to reach these variables?
ANSWER
Answered 2022-Mar-18 at 06:44Compiling that code with short string optimization (SSO) may be an equivalent of taking address of std::string
's member variable. Constructor have to analyze string length at compile time and choose if it can fit into internal storage of std::string
object or it have to allocate memory dynamically but then find that it never was read so allocation code can be optimized out.
Lack of optimization in this case might be an optimization flaw limited to such simple outlying examples like this one:
QUESTION
I don't have much experience in go but I have been tasked to execute a go project :)
So i need to build the go project and then execute it
Below is the error when i build the go project. Seems to be some dependency(package and io/fs) is missing
...ANSWER
Answered 2021-Aug-12 at 05:56This package requires go v1.16, please upgrade your go version or use the appropriate docker builder.
QUESTION
gcc and clang disagree about whether the following code should compile:
...ANSWER
Answered 2022-Mar-13 at 12:00It is not valid because:
a type parameter pack cannot be expanded in its own parameter clause.
As from [temp.param]/17:
If a template-parameter is a type-parameter with an ellipsis prior to its optional identifier or is a parameter-declaration that declares a pack ([dcl.fct]), then the template-parameter is a template parameter pack. A template parameter pack that is a parameter-declaration whose type contains one or more unexpanded packs is a pack expansion. ... A template parameter pack that is a pack expansion shall not expand a template parameter pack declared in the same template-parameter-list.
So consider the following invalid example:
QUESTION
While testing things around Compiler Explorer, I tried out the following overflow-free function for calculating average of 2 unsigned 32-bit integer:
...ANSWER
Answered 2022-Mar-08 at 10:00Clang does the same thing. Probably for compiler-construction and CPU architecture reasons:
Disentangling that logic into just a swap may allow better optimization in some cases; definitely something it makes sense for a compiler to do early so it can follow values through the swap.
Xor-swap is total garbage for swapping registers, the only advantage being that it doesn't need a temporary. But
xchg reg,reg
already does that better.
I'm not surprised that GCC's optimizer recognizes the xor-swap pattern and disentangles it to follow the original values. In general, this makes constant-propagation and value-range optimizations possible through swaps, especially for cases where the swap wasn't conditional on the values of the vars being swapped. This pattern-recognition probably happens soon after transforming the program logic to GIMPLE (SSA) representation, so at that point it will forget that the original source ever used an xor swap, and not think about emitting asm that way.
Hopefully sometimes that lets it then optimize down to only a single mov
, or two mov
s, depending on register allocation for the surrounding code (e.g. if one of the vars can move to a new register, instead of having to end up back in the original locations). And whether both variables are actually used later, or only one. Or if it can fully disentangle an unconditional swap, maybe no mov
instructions.
But worst case, three mov
instructions needing a temporary register is still better, unless it's running out of registers. I'd guess GCC is not smart enough to use xchg reg,reg
instead of spilling something else or saving/restoring another tmp reg, so there might be corner cases where this optimization actually hurts.
(Apparently GCC -Os
does have a peephole optimization to use xchg reg,reg
instead of 3x mov: PR 92549 was fixed for GCC10. It looks for that quite late, during RTL -> assembly. And yes, it works here: turning your xor-swap into an xchg: https://godbolt.org/z/zs969xh47)
with no memory reads, and the same number of instructions, I don't see any bad impacts and feels odd that it be changed. Clearly there is something I did not think through though, but what is it?
Instruction count is only a rough proxy for one of three things that are relevant for perf analysis: front-end uops, latency, and back-end execution ports. (And machine-code size in bytes: x86 machine-code instructions are variable-length.)
It's the same size in machine-code bytes, and same number of front-end uops, but the critical-path latency is worse: 3 cycles from input a
to output a
for xor-swap, and 2 from input b
to output a
, for example.
MOV-swap has at worst 1-cycle and 2-cycle latencies from inputs to outputs, or less with mov-elimination. (Which can also avoid using back-end execution ports, especially relevant for CPUs like IvyBridge and Tiger Lake with a front-end wider than the number of integer ALU ports. And Ice Lake, except Intel disabled mov-elimination on it as an erratum workaround; not sure if it's re-enabled for Tiger Lake or not.)
Also related:
- Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? - and those 3 uops can't benefit from mov-elimination. But on modern AMD
xchg reg,reg
is only 2 uops.
GCC's real missed optimization here (even with -O3
) is that tail-duplication results in about the same static code size, just a couple extra bytes since these are mostly 2-byte instructions. The big win is that the a
path then becomes the same length as the other, instead of twice as long to first do a swap and then run the same 3 uops for averaging.
update: GCC will do this for you with -ftracer
(https://godbolt.org/z/es7a3bEPv), optimizing away the swap. (That's only enabled manually or as part of -fprofile-use
, not at -O3
, so it's probably not a good idea to use all the time without PGO, potentially bloating machine code in cold functions / code-paths.)
Doing it manually in the source (Godbolt):
QUESTION
While debugging a problem in a numerical library, I was able to pinpoint the first place where the numbers started to become incorrect. However, the C++ code itself seemed correct. So I looked at the assembly produced by Visual Studio's C++ compiler and started suspecting a compiler bug.
CodeI was able to reproduce the behavior in a strongly simplified, isolated version of the code:
sourceB.cpp:
...ANSWER
Answered 2022-Feb-18 at 23:52Even though nobody posted an answer, from the comment section I could conclude that:
- Nobody found any undefined behavior in the bug repro code.
- At least some of you were able to reproduce the undesired behavior.
So I filed a bug report against Visual Studio 2019.
The Microsoft team confirmed the problem.
However, unfortunately it seems like Visual Studio 2019 will not receive a bug fix because Visual Studio 2022 seemingly does not have the bug. Apparently, the most recent version not having that particular bug is good enough for Microsoft's quality standards.
I find this disappointing because I think that the correctness of a compiler is essential and Visual Studio 2022 has just been released with new features and therefore probably contains new bugs. So there is no real "stable version" (one is cutting edge, the other one doesn't get bug fixes). But I guess we have to live with that or choose a different, more stable compiler.
QUESTION
This code is missing a constructor initializer list:
...ANSWER
Answered 2022-Feb-10 at 13:48You could add -Weffc++
to catch it (inspired by Scott Meyers book "Effective C++"). Strangely enough it does not refer to any other -W
option (and neither does clang++
).
The option is however considered, by some, a bit outdated by now, but in this case, it's finding a real problem.
QUESTION
I'm trying to use packages that require Rcpp
in R on my M1 Mac, which I was never able to get up and running after purchasing this computer. I updated it to Monterey in the hope that this would fix some installation issues but it hasn't. I tried running the Rcpp
check from this page but I get the following error:
ANSWER
Answered 2022-Feb-10 at 21:07Currently (2022-02-05), CRAN builds R binaries for Apple silicon using Apple clang
(from Command Line Tools for Xcode 12.4) and an experimental build of gfortran
.
If you obtain R from CRAN (i.e., here), then you need to replicate CRAN's compiler setup on your system before building R packages that contain C/C++/Fortran code from their sources (and before using Rcpp
, etc.). This requirement ensures that your package builds are compatible with R itself.
A further complication is the fact that Apple clang
doesn't support OpenMP, so you need to do even more work to compile programs that make use of multithreading. You could circumvent the issue by building R itself and all R packages from sources with LLVM clang
, which does support OpenMP, but this approach is onerous and "for experts only". There is another approach that has been tested by a few people, including Simon Urbanek, the maintainer of R for macOS. It is experimental and also "for experts only", but seems to work on my machine and is simpler than trying to build R yourself.
Warning: These instructions come with no warranty and could break at any time. They assume some level of familiarity with C/C++/Fortran program compilation, Makefile syntax, and Unix shells. As usual, sudo
at your own risk.
I will try to address compilers and OpenMP support at the same time. I am going to assume that you are starting from nothing. Feel free to skip steps you've already taken, though you might find a fresh start helpful.
I've tested these instructions on a machine running Big Sur, and at least one person has tested them on a machine running Monterey. I would be glad to hear from others.
Download an R binary from CRAN here and install. Be sure to select the binary built for Apple silicon.
Run
QUESTION
I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3
flag made it run even slower than no flags at all! Meanwhile -O2
was making it run a lot faster as expected.
Without optimisations:
...ANSWER
Answered 2021-Oct-27 at 19:53It looks like GCC's naïveté about store-forwarding stalls is hurting its auto-vectorization strategy here. See also Store forwarding by example for some practical benchmarks on Intel with hardware performance counters, and What are the costs of failed store-to-load forwarding on x86? Also Agner Fog's x86 optimization guides.
(gcc -O3
enables -ftree-vectorize
and a few other options not included by -O2
, e.g. if
-conversion to branchless cmov
, which is another way -O3
can hurt with data patterns GCC didn't expect. By comparison, Clang enables auto-vectorization even at -O2
, although some of its optimizations are still only on at -O3
.)
It's doing 64-bit loads (and branching to store or not) on pairs of ints. This means, if we swapped the last iteration, this load comes half from that store, half from fresh memory, so we get a store-forwarding stall after every swap. But bubble sort often has long chains of swapping every iteration as an element bubbles far, so this is really bad.
(Bubble sort is bad in general, especially if implemented naively without keeping the previous iteration's second element around in a register. It can be interesting to analyze the asm details of exactly why it sucks, so it is fair enough for wanting to try.)
Anyway, this is pretty clearly an anti-optimization you should report on GCC Bugzilla with the "missed-optimization" keyword. Scalar loads are cheap, and store-forwarding stalls are costly. (Can modern x86 implementations store-forward from more than one prior store? no, nor can microarchitectures other than in-order Atom efficiently load when it partially overlaps with one previous store, and partially from data that has to come from the L1d cache.)
Even better would be to keep buf[x+1]
in a register and use it as buf[x]
in the next iteration, avoiding a store and load. (Like good hand-written asm bubble sort examples, a few of which exist on Stack Overflow.)
If it wasn't for the store-forwarding stalls (which AFAIK GCC doesn't know about in its cost model), this strategy might be about break-even. SSE 4.1 for a branchless pmind
/ pmaxd
comparator might be interesting, but that would mean always storing and the C source doesn't do that.
If this strategy of double-width load had any merit, it would be better implemented with pure integer on a 64-bit machine like x86-64, where you can operate on just the low 32 bits with garbage (or valuable data) in the upper half. E.g.,
QUESTION
I want to build an Android app which will be an interface to convert C++ into assembly code for ARM Cortex M3 architecture.
I'm not an android java developer, and I do mainly arduino projects with C/C++. So I need your help to point me in good directions about how to build an android app with java in Android Studio or similar, which will be able to convert from C++ source code to ASM code M3 Cortex.
I did some research and found that I need to use ARM NONE EABI GCC compiler to generate ASM code from C++, simple like these command line instructions:
...ANSWER
Answered 2021-Dec-16 at 16:58A solution would be if in Termux app you will do next things: (more details here)
pkg install proot
pkg install proot-distro
proot-distro install debian
proot-distro login debian
After that you should be logged in a Debian environment, and you can install almost any Arm packages available on debian repositories.
For example you should be able to install this Cortex compiler:
QUESTION
I'm using godbolt to get assembly of the following program:
...ANSWER
Answered 2021-Dec-13 at 06:33You can see the cost of instructions on most mainstream architecture here and there. Based on that and assuming you use for example an Intel Skylake processor, you can see that one 32-bit imul
instruction can be computed per cycle but with a latency of 3 cycles. In the optimized code, 2 lea
instructions (which are very cheap) can be executed per cycle with a 1 cycle latency. The same thing apply for the sal
instruction (2 per cycle and 1 cycle of latency).
This means that the optimized version can be executed with only 2 cycle of latency while the first one takes 3 cycle of latency (not taking into account load/store instructions that are the same). Moreover, the second version can be better pipelined since the two instructions can be executed for two different input data in parallel thanks to a superscalar out-of-order execution. Note that two loads can be executed in parallel too although only one store can be executed in parallel per cycle. This means that the execution is bounded by the throughput of store instructions. Overall, only 1 value can only computed per cycle. AFAIK, recent Intel Icelake processors can do two stores in parallel like new AMD Ryzen processors. The second one is expected to be as fast or possibly faster on the chosen use-case (Intel Skylake processors). It should be significantly faster on very recent x86-64 processors.
Note that the lea
instruction is very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters) and it only supports some specific constant for the multiplication (supported factors are 1, 2, 4 and 8, which mean that lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea
is faster than imul
/mul
.
I can reproduce the slower execution with -O2
using GCC 11.2 (on Linux with a i5-9600KF processor).
The main source of source of slowdown comes from the higher number of micro-operations (uops) to be executed in the -O2
version certainly combined with the saturation of some execution ports certainly due to a bad micro-operation scheduling.
Here is the assembly of the loop with -Os
:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install o2
Support
Reuse Trending Solutions
Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
Find more librariesStay Updated
Subscribe to our newsletter for trending solutions and developer bootcamps
Share this Page