assembler | stateless reactive Java API for efficient implementation | Object-Relational Mapping library
kandi X-RAY | assembler Summary
A lightweight library for efficiently assembling entities by querying/merging external datasources or aggregating microservices. More specifically, it was designed as a very lightweight solution to the N + 1 query problem when aggregating data, not only from database calls (e.g. Spring Data JPA, Hibernate) but from arbitrary datasources (relational databases, NoSQL, REST, local method calls, etc.). One key feature is that the caller doesn't need to worry about the order of the data returned by the different datasources: there is no need, for example, to modify any SQL query to add an ORDER BY clause (in a relational database context), or to modify the service implementation or manually sort the results from each call before triggering the aggregation process (in a REST context). Stay tuned for more complete documentation soon, with more detailed explanations of how the library works and comparisons with other solutions; a dedicated series of blog posts is also on the way.
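To make the N + 1 problem concrete, here is a minimal hand-rolled Java sketch. It does not use the assembler library's API; the Customer/Order types and the find* methods are hypothetical stand-ins for real datasource calls. The naive loop issues one extra query per customer, while the batched version issues a single query for all orders and merges them in memory by customer ID, so the order in which the datasource returns rows does not matter.

```java
import java.util.*;
import java.util.stream.Collectors;

public class NPlusOneDemo {
    record Customer(long id, String name) {}
    record Order(long customerId, String item) {}

    // Hypothetical datasource stand-ins; in practice these would be JPA/REST/NoSQL calls.
    static List<Customer> findAllCustomers() {
        return List.of(new Customer(1, "Alice"), new Customer(2, "Bob"));
    }
    static List<Order> findOrdersByCustomerId(long id) {               // one call per customer -> N + 1
        return List.of(new Order(id, "item-" + id));
    }
    static List<Order> findOrdersByCustomerIds(Collection<Long> ids) { // one batched call
        return ids.stream().map(id -> new Order(id, "item-" + id)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Customer> customers = findAllCustomers();

        // Naive: 1 query for customers + N queries for orders.
        for (Customer c : customers) {
            System.out.println(c.name() + " -> " + findOrdersByCustomerId(c.id()));
        }

        // Batched: 1 query for customers + 1 query for all orders, merged in memory.
        // The datasource's result order does not matter because we group by customerId.
        Map<Long, List<Order>> ordersByCustomer = findOrdersByCustomerIds(
                customers.stream().map(Customer::id).toList())
            .stream()
            .collect(Collectors.groupingBy(Order::customerId));
        for (Customer c : customers) {
            System.out.println(c.name() + " -> " + ordersByCustomer.getOrDefault(c.id(), List.of()));
        }
    }
}
```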
Top functions reviewed by kandi - BETA
- Package private for testing
- Determines if an iterable is not empty
- Converts the given iterable to a stream
- Returns true if the given iterable is empty
- A consumer that accepts the second consumer
- Overrides the default implementation
- Creates a new checked consumer that matches the supplied consumer
- Overrides the visitor
- Converts mapper sources into a stream
- Converts the stream into a parallel stream
- Performs a logical OR operation that is satisfied
- A function that applies the provided function to the callback
- The main run method
- Gets the value of the wrapped property
- Returns a function that applies the three arguments
- Default function to apply a function to a function
- Returns a new Function2 that applies the supplied Function1 and function2
- Performs logical OR of two checks
- Applies a function to the function
- Converts the given mapper sources into a single - valued collection
- Determines whether two predicates are satisfied
- Adapts a function to a CheckedFunction
- A function that applies the supplied function to the supplied function
- Perform a function accepting a CheckedFunction
- Determines whether a predicate is satisfied
- Performs logical OR operation on the given predicate1 and other
assembler Key Features
assembler Examples and Code Snippets
Community Discussions
Trending Discussions on assembler
QUESTION
This code converts uppercase letters ("letters only") to lowercase and lowercase letters to uppercase. My question is that I also want to print any other characters (non-letter symbols, digits, etc.) and keep them unchanged, using the cmp and ... instructions that you see in the program.
...ANSWER
Answered 2021-Jun-13 at 16:03
You need to restrict the ranges for the uppercase and lowercase characters by specifying a lower limit and a higher limit, not just the one value (96) that your current code uses.
Uppercase characters [A,Z] are in [65,90]
Lowercase characters [a,z] are in [97,122]
The nice thing of course is that you don't actually need to write these numbers in your code. You can just write the relevant characters and the assembler will substitute them for you:
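For comparison, here is the same range logic sketched in Java; it is only an illustration of the answer's point, not the original assembly program. Character literals such as 'A' and 'Z' document the limits better than the raw codes 65 and 90.

```java
public class CaseFlip {
    // Flip the case of ASCII letters; leave every other character unchanged.
    static char flipCase(char c) {
        if (c >= 'A' && c <= 'Z') {   // uppercase range, codes 65..90
            return (char) (c + 32);   // to lowercase
        }
        if (c >= 'a' && c <= 'z') {   // lowercase range, codes 97..122
            return (char) (c - 32);   // to uppercase
        }
        return c;                     // digits, punctuation, spaces pass through
    }

    public static void main(String[] args) {
        System.out.println(flipCase('a') + " " + flipCase('Z') + " " + flipCase('7')); // A z 7
    }
}
```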
QUESTION
Using the Meson build system and GCC on a personal C11 project with GLib on Linux Fedora, I keep randomly getting the following error when compiling:
...ANSWER
Answered 2021-Jun-09 at 19:01
Meson build here; basically I have two "executable" targets with a shared source, and it seems the first overlaps the second for some reason, although they should be built sequentially.
Example :
QUESTION
I am going to write my own FORTH "engine" in GNU assembler (GAS) for Linux x86-64 (specifically for the AMD Ryzen 9 3900X that is sitting on my table).
(If it is a success, I may use a similar idea to make firmware for a retro 6502 or a similar home-brewed computer.)
I want to add some interesting debugging features, such as saving comments with the compiled code in the form of "NOP words" with attached strings, which would do nothing at runtime, but when disassembling/printing out already defined words would print those comments too, so it would not lose all the headers ( a b -- c ) and comments like ( here goes this particular little trick ), and I would be able to try to define new words with documentation, and later print all definitions in some nice way and make a new library from them, which I consider good. (And have a switch to just ignore comments for a "production release".)
I have read too much about optimization here and I am not able to understand all of it in a few weeks, so I will put off micro-optimization until it causes performance problems, and then I will start profiling.
But I want to start with at least decent architectural decisions.
What I have understood so far:
- it would be nice if the program ran mainly from the CPU cache, not from memory
- the cache is filled somehow "automagically", but having related data/code compact and as near as possible may help a lot
- I identified some areas that would be good candidates for caching and some that are not so good - I sorted them in order of importance:
- assembler code - the engine and basic words like "+" - used all the time (fixed size, .text section)
- both stacks - also used all the time (dynamic, I will probably use rsp for the data stack and implement the return stack independently - not sure yet which will be "native" and which "emulated")
- forth bytecode - the defined and compiled words - used at runtime, when the speed matters (still growing size)
- variables, constants, strings, other memory allocations (used in runtime)
- names of words ("DUP", "DROP" - used only when defining new words in compilation phase)
- comments (used once daily or so)
As there are lots of "heaps" that grow up (well, there is no "free" used, so each may as well be a stack, or a stack growing up) (and two stacks that grow down), I am unsure how to lay it out so that the CPU cache will cover it somehow decently.
My idea is to use one "big heap" (and increase it with brk() when needed), then allocate big chunks of aligned memory on it, implementing "smaller heaps" in each chunk and extending them into another big chunk when the old one is filled up.
I hope that the cache would automagically keep the most used blocks most of the time, and the less used blocks would be mostly ignored by the cache (respectively, they would occupy only small parts and get read and kicked out all the time), but maybe I am not seeing it correctly.
But maybe there is some better strategy for that?
...ANSWER
Answered 2021-Jun-04 at 23:53
Your first stops for further reading should probably be:
- What Every Programmer Should Know About Memory? re: cache
- https://agner.org/optimize/ re: everything else about writing efficient asm.
- https://uops.info/ for a better version of Agner Fog's instruction tables.
- See also other links in https://stackoverflow.com/tags/x86/info
so I will put off micro-optimization until it causes performance problems, and then I will start profiling.
Yes, probably good to start trying stuff so you have something to profile with HW performance counters, so you can correlate what you're reading about performance stuff with what actually happens. And so you get some ideas of possible details you hadn't thought of yet before you go too far into optimizing your overall design idea. You can learn a lot about asm micro-optimization by starting with something very small scale, like a single loop somewhere without any complicated branching.
Since modern CPUs use split L1i and L1d caches and first-level TLBs, it's not a good idea to place code and data next to each other. (Especially not read-write data; self-modifying code is handled by flushing the whole pipeline on any store too near any code that's in-flight anywhere in the pipeline.)
Related: Why do Compilers put data inside .text(code) section of the PE and ELF files and how does the CPU distinguish between data and code? - they don't, only obfuscated x86 programs do that. (ARM code does sometimes mix code/data because PC-relative loads have limited range on ARM.)
Yes, making sure all your data allocations are nearby should be good for TLB locality. Hardware normally uses a pseudo-LRU allocation/eviction algorithm which generally does a good job at keeping hot data in cache, and it's generally not worth trying to manually clflushopt
anything to help it. Software prefetch is also rarely useful, especially in linear traversal of arrays. It can sometimes be worth it if you know where you'll want to access quite a few instructions later, but the CPU couldn't predict that easily.
AMD's L3 cache may use adaptive replacement like Intel does, to try to keep more lines that get reused, not letting them get evicted as easily by lines that tend not to get reused. But Zen2's 512kiB L2 is relatively big by Forth standards; you probably won't have a significant amount of L2 cache misses. (And out-of-order exec can do a lot to hide L1 miss / L2 hit. And even hide some of the latency of an L3 hit.) Contemporary Intel CPUs typically use 256k L2 caches; if you're cache-blocking for generic modern x86, 128kiB is a good choice of block size to assume you can write and then loop over again while getting L2 hits.
The L1i and L1d caches (32k each), and even uop cache (up to 4096 uops, about 1 or 2 per instruction), on a modern x86 like Zen2 (https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Architecture) or Skylake, are pretty large compared to a Forth implementation; probably everything will hit in L1 cache most of the time, and certainly L2. Yes, code locality is generally good, but with more L2 cache than the whole memory of a typical 6502, you really don't have much to worry about :P
Of more concern for an interpreter is branch prediction, but fortunately Zen2 (and Intel since Haswell) have TAGE predictors that do well at learning patterns of indirect branches even with one "grand central dispatch" branch: Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore
QUESTION
From my understanding (although I am just learning this so I am likely wrong), we use URIs in RESTful APIs to identify resources.
In particular, I've read this comment many times across the web:
@RequestParam is more useful on a traditional web application where data is mostly passed in the query parameters while @PathVariable is more suitable for RESTful web services where URL contains values.
When reading this, it seems like we need to use @PathVariable to properly build a RESTful API instead of using @RequestParam (which is not RESTful, given the way it identifies resources through query parameters instead of URIs). Is this correct? If so, how can I build a RESTful API that is also capable of sorting or filtering like you would be able to do with @RequestParam?
An example is here:
...ANSWER
Answered 2021-Jun-02 at 15:21
You should examine the Pageable type in Spring. Generally, what you've shown here is called something like an item resource: it's one specific Car, specified by ID at /cars/{id}. There's nothing to sort or filter. Those are applied to the collection resource /cars, perhaps something like /cars?sort=year&page=0&size=10. Spring has integrated support for automatically extracting query parameters indicating sorting and paging and passing those instructions to Spring Data repositories.
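A short Spring sketch of the distinction described above; Car, CarRepository and CarController are hypothetical names for illustration, not taken from the question. The item resource is addressed by the URI path, while sorting and paging for the collection resource arrive as query parameters bound to Pageable.

```java
// A hedged sketch, assuming Spring Boot with Spring Web and Spring Data JPA on the classpath.
import jakarta.persistence.Entity;   // javax.persistence on older Spring Boot versions
import jakarta.persistence.Id;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@Entity
class Car {
    @Id
    Long id;
    String model;
    int year;
}

interface CarRepository extends JpaRepository<Car, Long> {}

@RestController
@RequestMapping("/cars")
class CarController {

    private final CarRepository cars;

    CarController(CarRepository cars) {
        this.cars = cars;
    }

    // Item resource: one specific Car identified by the URI path, e.g. GET /cars/42.
    @GetMapping("/{id}")
    Car one(@PathVariable long id) {
        return cars.findById(id).orElseThrow();
    }

    // Collection resource: sorting and paging arrive as query parameters,
    // e.g. GET /cars?sort=year&page=0&size=10, bound automatically to Pageable.
    @GetMapping
    Page<Car> all(Pageable pageable) {
        return cars.findAll(pageable);
    }
}
```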
QUESTION
I am starting to learn x86_64 assembly, and one thing I noticed is the usage of registers such as rdi, rbp, rax, rbx. Do they exist in the CPU or are they some sort of abstract mechanism used by the assembler?
For example if I do
...ANSWER
Answered 2021-May-30 at 01:59
CPU hardware doesn't find registers by name; it's up to the assembler to translate names like rax to 3- or 4-bit register numbers in machine code. (And the operand size implied by the register name is also encoded via the opcode and (lack of) prefixes.)
e.g. add ecx, edx assembles to 01 d1. Opcode 01 is add r/m32, r. The 2nd byte, the ModRM 0xd1 = 0b11010001, encodes the operands: the high 2 bits (11) are the addressing mode, plain register, not memory (for the dest in this case, because it's 01 add r/m32, r, not 03 add r32, r/m32).
The middle 3 bits are the /r field, and 010 = 2 is the register number for EDX.
The low 3 bits are the r/m field, and 001 = 1 is the register number of ECX.
(The numbering goes EAX, ECX, EDX, EBX, ..., probably because 8086 was designed for asm source compatibility with 8080 - i.e. "porting" on a per-instruction basis simple enough for a machine to do automatically.)
This is what the CPU is actually decoding, and what it uses to "address" its internal registers. A simple in-order CPU without register renaming could literally use these numbers directly as addresses in an SRAM that implemented the register file. (Especially if it was a RISC like MIPS or ARM. x86 is complicated because you can use the same register numbers with different widths, and you have partial registers like AH and AL mapping onto halves of AX. But still, it's just a matter of mapping register numbers to locations in SRAM, if you didn't do register renaming.)
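As a quick illustration of that field layout (a sketch added here, not part of the original answer), this small Java snippet pulls the mod, reg and r/m fields out of the 0xD1 ModRM byte from the add ecx, edx example:

```java
public class ModRmDemo {
    static final String[] REG32 = {"eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"};

    public static void main(String[] args) {
        int modrm = 0xD1;                 // second byte of "01 D1" (add ecx, edx)
        int mod = (modrm >> 6) & 0b11;    // addressing mode: 0b11 = register direct
        int reg = (modrm >> 3) & 0b111;   // /r field: source register for opcode 01
        int rm  = modrm & 0b111;          // r/m field: destination register here
        System.out.printf("mod=%d reg=%d (%s) rm=%d (%s)%n",
                mod, reg, REG32[reg], rm, REG32[rm]);
        // Prints: mod=3 reg=2 (edx) rm=1 (ecx)
    }
}
```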
For x86-64, register numbers are always 4-bit, but sometimes the leading zero is implicit, e.g. in an instruction without a REX prefix like mov eax, 60. The register number is in the low 3 bits of the opcode for that special encoding.
Physically, modern CPUs use a physical register file and a register-renaming table (RAT) to implement the architectural registers. So they can keep track of the value of RAX at multiple points in time. e.g. mov eax, 60 / push rax / mov eax, 12345 / push rax can run both mov instructions in parallel, writing to separate physical registers. But still sorting out which one each push should read from.
if that's the case, I am wondering why there are only 16 registers in the x86_64 architecture ...
A new ISA being designed for the high-performance use-cases where x86 competes would very likely have 32 integer registers. But shoehorning that into x86 machine code (like AVX-512 did for vector regs), wouldn't be worth the code-size cost.
x86-64 evolved out of 16-bit 8086, designed in 1979. Many of the design choices made then are not what you'd make if starting fresh now, with modern transistor budgets. (And not aiming for asm source-level compatibility with 8-bit 8080).
More architectural registers costs more bits in the machine code for each operand. More physical registers just means more out-of-order exec capability to handle more register renaming. (The physical register numbering is an internal detail.) This article measures practical out-of-order window size for hiding cache miss latency and compares it to known ROB and PRF sizes - in some cases the CPU runs out of physical registers to rename onto, before it fills the ROB, for that chosen mix of filler instructions.
..., doesn't more registers mean more performance?
More architectural registers does generally help performance, but there are diminishing returns. 16 avoids a lot of store/reload work vs. 8, but increasing to 32 only saves a bit more store/reload work; 16 is often enough for compilers to keep everything they want in registers.
The fact that AMD managed to extend it to 16 registers (up from 8) is already a significant improvement. Yes, 32 integer regs would be somewhat better sometimes, but couldn't be done without redesigning the machine-code format, or with much longer prefixes (like AVX-512's 4-byte EVEX prefix, which allow 32 SIMD registers, x/y/zmm0..31 for AVX-512 instructions.)
See also:
https://www.realworldtech.com/sandy-bridge/5/ - Intel Sandybridge was when Intel started using a PRF (Physical Register File) instead of keeping temporary values in the ROB (Reorder Buffer).
Related Q&As:
QUESTION
Is it possible to assemble multiple agents at once in an "assembler" block? I mean that if an agent is composed of two types of parts, and there are three parts of each type, and we have enough resources (e.g. 3 workers), is it possible to assemble these three products simultaneously? In my model they are assembled one by one, and there is no option like the queue capacity that does the same in the "service" block.
Any idea?
...ANSWER
Answered 2021-May-26 at 13:36
Assembler always works one-at-a-time. To have this work in parallel, the model would need to contain 3 separate assemblers with some sort of routing logic to ensure that parts are spread across the assemblers and don't all end up in one block to be processed serially.
QUESTION
I'm trying to find any information about the parentheses syntax for macro arguments in GNU Assembler. E.g. I have the following code:
...ANSWER
Answered 2021-May-25 at 12:51
Okay, I've found the answer. This is special syntax to escape a macro-argument name.
From the documentation:
Note that since each of the macargs can be an identifier exactly as any other one permitted by the target architecture, there may be occasional problems if the target hand-crafts special meanings to certain characters when they occur in a special position. For example:
...
problems might occur with the period character (‘.’) which is often allowed inside opcode names (and hence identifier names). So for example constructing a macro to build an opcode from a base name and a length specifier like this:
QUESTION
In MSVC there exist the intrinsics __emulu() and _umul128(). The first does u32*u32->u64 multiplication and the second u64*u64->u128 multiplication. Do the same intrinsics exist for Clang/GCC?
The closest I found are _mulx_u32() and _mulx_u64(), mentioned in Intel's Guide. But they produce the mulx instruction, which needs BMI2 support, while MSVC's intrinsics produce the regular mul instruction. Also, _mulx_u32() is not available in -m64 mode, while __emulu() and _umul128() both exist in 32-bit and 64-bit mode of MSVC.
You may try online 32-bit code and 64-bit code.
Of course, for 32-bit one may do return uint64_t(a) * uint64_t(b); (see it online), hoping that the compiler will guess correctly and optimize to a u32*u32->u64 multiplication instead of u64*u64->u64. But is there a way to be sure about this, and not rely on the compiler's guess that both arguments are 32-bit (i.e. that the higher part of uint64_t is zeroed)? To have some intrinsic like __emulu() that makes you sure about the code?
There is __int128 in GCC/Clang (see code online), but again we have to rely on the compiler's guess that we actually multiply 64-bit numbers (i.e. that the higher part of the int128 is zeroed). Is there a way to be sure without the compiler guessing, if there exist some intrinsics for that?
BTW, both uint64_t (for 32-bit) and __int128 (for 64-bit) produce the correct mul instruction instead of mulx in GCC/Clang. But again we have to rely on the compiler guessing correctly that the higher part of the uint64_t and __int128 is zeroed.
Of course I can look at the assembler code that GCC/Clang have optimized and guessed correctly, but looking at the assembler once doesn't guarantee that the same will happen always in all circumstances. And I don't know of a way in C++ to statically assert that the compiler made the correct guess about assembler instructions.
...ANSWER
Answered 2021-May-24 at 12:54
You already have the answer. Use uint64_t and __uint128_t. No intrinsics needed. This is available with modern GCC and Clang for all 64-bit targets. See Is there a 128 bit integer in gcc?
QUESTION
I would like to make a simple bootloader that writes all the background colors to successive lines on the screen.
The problem is that it only changes the color of the first line to black and the second line to blue, while it should display all 16 colors. I think there is something wrong with loop1:, but I don't know what.
Useful information:
- I am writing directly to the text video memory, starting from address 0xb8000, using the method described in this forum post.
- I am using flat assembler 1.73.27 for Windows (fasm assembler).
- I am testing my program on a real computer (booting from USB), not an emulator.
- I am not including any photos, because of this post.
My code (fasm assembly):
...ANSWER
Answered 2021-May-12 at 14:29
You are leaving loop1 when ah is not less than 0xff anymore.
The terms less/greater are used in x86 assembly when signed numbers are compared. The number 0xff treated as a signed 8-bit integer has the value -1, and ah, as a signed byte (-128..+127), starts at 0x0f + 0x10 = 0x1f. And 31 < -1 is false on the first iteration, so loop1 is abandoned after the first call to procedure1.
When comparing unsigned numbers we use the terms below/above. Instead of jl loop1 use jb loop1. Similarly in procedure1: replace jl procedure1 with jb procedure1. (https://www.felixcloutier.com/x86/jcc)
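For illustration only (not part of the original answer), the same signed-versus-unsigned distinction can be seen in Java: byte is signed, like the view of the operands that jl assumes, while Byte.toUnsignedInt gives the below/above view that jb uses.

```java
public class SignedUnsignedDemo {
    public static void main(String[] args) {
        byte ah = (byte) 0x1F;       // 31, the starting value of AH in the question
        byte limit = (byte) 0xFF;    // -1 when treated as a signed byte

        // Signed comparison, like jl: 31 < -1 is false, so the loop would exit immediately.
        System.out.println(ah < limit);                                         // false

        // Unsigned comparison, like jb: 31 < 255 is true, so the loop keeps running.
        System.out.println(Byte.toUnsignedInt(ah) < Byte.toUnsignedInt(limit)); // true
    }
}
```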
QUESTION
Please, correct me if I'm wrong anywhere...
What I want to do: I want to find a certain function inside some DLL, which is being loaded by a Windows service, during remote kernel debugging via WinDBG (WinDBG plugin in IDA + VirtualKD + VMWare VM with Windows 10 x64). I need to do it in kernel mode, because I need to switch processes and see all the memory.
What I did:
- I found an offset to the function in IDA (unfortunately, the DLL doesn't have debug symbols).
- Connected to the VM in Kernel Mode.
- Found the process of the service by iterating over the svchost processes (!process 0 0 svchost.exe) and looking at the CommandLine field in their PEBs (C:\Windows\system32\svchost.exe -k ...).
- Switched to the process (.process /i ; g), refreshed the modules list (.reload).
- Found the target DLL in the user modules list and got its base address.
The problem:
The DLL loaded into memory doesn't fully correspond to the original DLL-file, so I can't find the function there.
When I jump to the address (base + offset), there is nothing there or anywhere around it. But I found some other functions using this method, so it looks correct.
Then I tried to find the sequence of bytes belonging to the function according to the original DLL-file and also got nothing.
The function uses strings, which I found in data section, but there are no xrefs to them.
Looks like that function has completely disappeared...
What am I doing wrong?
P.S.: Also I dumped memory from ... to ... and compared it with the original file. Besides different jump addresses and offsets, sometimes the assembler code is completely missing...
ANSWER
Answered 2021-May-19 at 12:35
It appeared that the memory pages were paged out. The .pagein command did the trick.
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install assembler
You can use assembler like any standard Java library. Please include the jar files in your classpath. You can also use any IDE to run and debug the assembler component as you would with any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.
Support