ieee754 | Simple JavaScript-based IEEE 754 Encoders and Decoders
kandi X-RAY | ieee754 Summary
A single-page web application containing both an IEEE 754 Encoder (for encoding a decimal value into its IEEE-754 single and double precision representation) and an IEEE 754 Decoder (for converting a 32- or 64-bit hexadecimal representation into a decimal value). The application works entirely in JavaScript; there is no need for a server. You will need a modern browser, as the JavaScript code uses Uint8Array and friends. This application incorporates Michael Mclaughlin's big.js library.
Community Discussions
Trending Discussions on ieee754
QUESTION
I was looking at several textbooks, including Numerical Linear Algebra by Trefethen and Bau, and in the section on floating point arithmetic, they seem to say that in IEEE-754, normalized floating point numbers take the form 0.1… × 2^e. That is, the mantissa is assumed to be between 0.5 and 1.
However, in this popular online floating point calculator, it is explained that normalized floating point numbers have a mantissa between 1 and 2.
Could someone please tell me which is the correct way?
...ANSWER
Answered 2021-Jun-02 at 10:47
The following sets are identical:
- { (−1)^s · f · 2^e | s ∈ {0, 1}, f is the value of a 24-bit binary numeral with a radix point after the first digit, and e is an integer such that −126 ≤ e ≤ 127 }.
- { (−1)^s · f · 2^e | s ∈ {0, 1}, f is the value of a 24-bit binary numeral with a radix point before the first digit, and e is an integer such that −125 ≤ e ≤ 128 }.
- { (−1)^s · f · 2^e | s ∈ {0, 1}, f is the value of a 24-bit binary numeral with a radix point after the last digit, and e is an integer such that −149 ≤ e ≤ 104 }.
- { f · 2^e | f is an integer such that |f| < 2^24, and e is an integer such that −149 ≤ e ≤ 104 }.
In other words, we may put the radix point anywhere in the significand we want, simply by adjusting the range of the exponent to compensate. Which form to use may be chosen for convenience or preference.
The third form scales the significand so it is an integer, and the fourth form incorporates the sign into the significand. This form is convenient for using number theory to analyze floating-point behavior.
IEEE 754 generally uses the first form. It refers to this as “a scientific form,” reflecting the fact that, in scientific notation, we commonly write numbers with a radix point just after the first digit, as in the mass of the Earth is about 5.9722·10^24 kg. In clause 3.3, IEEE 754-2008 mentions “It is also convenient for some purposes to view the significand as an integer; in which case the finite floating-point numbers are described thus:”, followed by text equivalent to the third form above except that it is generalized (the base and other parameters are arbitrary values for any floating-point format rather than the constants I used above specifically for the binary32 format).
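To make the equivalence concrete, here is a small C++ sketch of mine (not part of the answer) that prints one binary32 value of my own choosing in both the scientific form and the integer-significand form:

#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    float x = 0.15625f;                        // exactly representable: 1.25 * 2^-3

    // Form 1: significand f in [1, 2), value = f * 2^e
    int e1;
    float f1 = std::frexp(x, &e1);             // frexp returns f in [0.5, 1), so adjust
    f1 *= 2.0f; e1 -= 1;
    std::printf("scientific form: %.6f * 2^%d\n", f1, e1);

    // Form 3: integer significand m with |m| < 2^24, value = m * 2^e
    int e3 = e1 - 23;
    int32_t m = (int32_t)std::ldexp(f1, 23);   // shift the 24 significand bits into an integer
    std::printf("integer form:    %d * 2^%d\n", m, e3);
}

Both lines describe the same value, 0.15625; only the placement of the radix point and the accompanying exponent differ.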
QUESTION
I'm working in C# and attempting to pack four bytes into a float (the context is game development, where an RGBA color is packed into a single value). To do this, I'm using BitConverter, but certain conversions seem to result in incorrect bytes. Take the following example (using bytes 0, 0, 129, 255):
ANSWER
Answered 2021-May-09 at 21:48
After research, experimentation, and discussion with friends, the root cause of this behavior (bytes changing when converted to and from a float) seems to be signaling vs. quiet NaNs (as Hans Passant also pointed out in a comment). I'm no expert on signaling and quiet NaNs, but from what I understand, quiet NaNs have the highest-order bit of the mantissa set to one, while signaling NaNs have that bit set to zero. See the following image (taken from https://www.h-schmidt.net/FloatConverter/IEEE754.html) for reference. I've drawn four colored boxes around each group of eight bits, as well as an arrow pointing to the highest-order mantissa bit.
Of course, the question I posted wasn't about floating-point bit layout or signaling vs. quiet NaNs, but simply asking why my encoded bytes were seemingly modified. The answer is that the C# runtime (or at least I assume it's the C# runtime) internally converts all signaling NaNs to quiet, meaning that the byte encoded at that position has its second bit swapped from zero to one.
For example, the bytes 0, 0, 129, 255 (encoded in the reverse order, I think due to endianness) put the value 129 in the second byte (the green box). 129 in binary is 10000001, so flipping its second-highest bit gives 11000001, which is 193 (exactly what I saw in my original example). This same pattern (the encoded byte having its value changed) applies to all bytes in the range 129-191 inclusive. Bytes 128 and lower aren't NaNs, while bytes 192 and higher are NaNs but don't have their value modified, because their second bit (placed at the highest-order mantissa bit) is already one.
So that answers why this behavior occurs, but in my mind, there are two questions remaining:
- Is it possible to disable this behavior (converting signaling NaNs to quiet) in C#?
- If not, what's the workaround?
The answer to the first question seems to be no (I'll amend this answer if I learn otherwise). However, it's important to note that this behavior doesn't appear consistent across all .NET versions. On my computer, NaNs are converted (i.e. my encoded bytes changed) on every .NET Framework version I tried (starting with 4.8.0, then working back down). NaNs appear to not be converted (i.e. my encoded bytes did not change) in .NET Core 3 and .NET 5 (I didn't test every available version). In addition, a friend was able to run the same sample code on .NET Framework 4.7.2, and surprisingly, the bytes were not modified on his machine. The internals of different C# runtimes aren't my area of expertise, but suffice it to say there's variance among versions and computers.
The answer to the second question is to, as others have suggested, simply avoid the float conversion entirely. Instead, each set of four bytes (representing RGBA colors in my case) can either be encoded in an integer or added to a byte array directly.
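A sketch of that workaround (the question is about C#, but the idea carries over directly; shown here in C++): pack the four bytes into a 32-bit integer instead of reinterpreting them as a float, so no NaN quieting can ever occur.

#include <cstdint>
#include <cstdio>

// Pack and unpack RGBA bytes in a uint32_t; no float is involved,
// so signaling-NaN bit patterns cannot be silently quieted.
uint32_t pack_rgba(uint8_t r, uint8_t g, uint8_t b, uint8_t a) {
    return (uint32_t)r << 24 | (uint32_t)g << 16 | (uint32_t)b << 8 | a;
}

int main() {
    uint32_t packed = pack_rgba(0, 0, 129, 255);
    std::printf("packed = 0x%08X\n", (unsigned)packed);                 // 0x000081FF
    std::printf("b = %u, a = %u\n",
                (unsigned)((packed >> 8) & 0xFF),                       // 129, unchanged
                (unsigned)(packed & 0xFF));                             // 255
}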
QUESTION
As per the standard, ES implements numbers as IEEE 754 doubles.
And per https://www.binaryconvert.com/result_double.html?decimal=053055050054055049056048053048053054056053048051050057054 and other programming languages (https://play.golang.org/p/5QyT7iPHNim), it looks like the value 5726718050568503296 can be represented exactly without losing precision.
Why does it lose 3 significant digits in JS (reproduced in the latest stable Google Chrome and Firefox)?
This question was initially triggered by the "replicate javascript unsafe numbers in golang" question.
The value is definitely representable in double IEEE754; see how naked bits are converted to a float64 in Go: https://play.golang.org/p/zMspidoIh2w
...ANSWER
Answered 2021-Apr-23 at 11:15
The default rule for JavaScript when converting a Number value to a decimal numeral is to use just enough digits to distinguish the Number value. Specifically, this arises from step 5 in clause 7.1.12.1 of the ECMAScript 2017 Language Specification, per the linked answer. (It is 6.1.6.1.20 in the 2020 version.)
So while 5,726,718,050,568,503,296 is representable, printing it yields “5726718050568503000” because that suffices to distinguish it from the neighboring representable values, 5,726,718,050,568,502,272 and 5,726,718,050,568,504,320.
You can request more precision in the conversion to string with .toPrecision, as in x.toPrecision(21).
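A quick check of the representability claim (my own illustration, in C++; a JavaScript Number and a C++ double are both IEEE 754 binary64):

#include <cstdint>
#include <cstdio>

int main() {
    uint64_t n = 5726718050568503296ULL;
    double d = (double)n;                      // conversion to binary64
    // Round-tripping back to an integer shows no precision is lost for this value.
    std::printf("exact: %s\n", (uint64_t)d == n ? "yes" : "no");
    std::printf("%.0f\n", d);                  // prints every digit: 5726718050568503296
}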
QUESTION
I noticed that the C++ standard library has separate functions for round and lround rather than just having you use long(round(x)) for the latter.
Looking into the implementation in glibc, I find that indeed, for platforms using IEEE754 floating point, the version that returns an integer will directly manipulate the bits from within the floating point representation, and not do the rounding using floating point operations (e.g. adding ±0.5).
What is the benefit of having a distinct implementation when you want the result as an integer type? Is this supposed to be faster, or more accurate? If it is better to use integer math on the underlying representation, why not just always do it that way even if returning the result as a double?
ANSWER
Answered 2021-Apr-20 at 00:10
One reason is that adding .5 is insufficient. Let’s say you add .5 and then truncate to an integer. (How? Is there an instruction for that? Or are you doing more work?) If x is ½ − 2^−54 (the greatest representable value less than ½), adding .5 yields 1, because the mathematical sum, 1 − 2^−54, is exactly halfway between the nearest two representable values, 1 − 2^−53 and 1, and the common default rounding mode, round-to-nearest-ties-to-even, rounds that to 1. But the correct result for lround(x) is 0.
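A small C++ demonstration of that exact case (my own sketch, not from the answer):

#include <cmath>
#include <cstdio>

int main() {
    double x = std::nextafter(0.5, 0.0);       // greatest double less than 0.5, i.e. 0.5 - 2^-54
    long naive = (long)(x + 0.5);              // add-then-truncate: the sum rounds up to 1.0
    long correct = std::lround(x);             // lround rounds to nearest, ties away from zero
    std::printf("naive = %ld, lround = %ld\n", naive, correct);   // naive = 1, lround = 0
}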
And, of course, lround is specified to round ties away from zero, regardless of the current rounding mode. You could set the rounding mode, do some arithmetic, and restore the rounding mode, but there are problems with this.
One is that changing the rounding mode is typically a time-consuming operation. The rounding mode is a global state that affects most floating-point instructions. So the processor has to ensure all pending instructions complete with the prior mode, change the global state, and ensure all later instructions start after that change.
If you are lucky, you might have a processor with per-instruction rounding modes or something similar, and then you can use any rounding mode you like without time penalty. Hewlett Packard has some processors like that. However, “round away from zero” is an uncommon mode. Most processors have round-to-nearest-ties-to-even, round toward zero, round down (toward −∞), and round up (toward +∞), and round-to-odd is becoming popular for its value in avoiding double-rounding errors. But round away from zero is rare.
Another reason is that doing floating-point instructions alters the floating-point status flags and may generate traps, but it is desired that library routines behave as single operations. For example, if we add .5 and rounding occurs, the inexact flag will be raised, since the floating-point addition with .5 produced a result different from the mathematical sum. But to the user of lround, no inexact condition ever occurs; lround is defined to return a value rounded to an integer, and it always does so: within the long range, it never returns a computed result different from its ideal mathematical definition. So if lround(x) raised the inexact flag, that would be incorrect behavior. To avoid it, an implementation that used floating-point instructions would have to save the current floating-point flags, do its work, and restore the flags before returning.
QUESTION
Suppose I have a series of small random floating point numbers in either double or float format which are guaranteed to be non-zero, in a CPU which follows the IEEE754 standard, and I make multiplications between two of these small numbers.
If both numbers are different from zero but very small (below machine epsilon), is it possible that a multiplication result would yield zero or negative zero, such that if I interpret the result as a C++ boolean, it would translate into false?
ANSWER
Answered 2021-Apr-07 at 22:53
Yes. You can demonstrate that by experiment:
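The experiment itself is not shown above; a minimal C++ demonstration of the same effect (my own sketch) could look like this:

#include <cstdio>

int main() {
    double a = 1e-200, b = 1e-200;     // both strictly nonzero
    double p = a * b;                  // mathematically 1e-400, far below the smallest double: underflows to 0
    std::printf("p = %g, as bool: %s\n", p, p ? "true" : "false");   // p = 0, as bool: false
}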
QUESTION
#include <bitset>
#include <cstdint>
#include <cstdio>
#include <iomanip>
#include <iostream>
#include <sstream>
using namespace std;
int main()
{
    // Exponent field (bits 62..52) is all zeros, so this bit pattern is a subnormal double.
    uint64_t int_value = 0b0000000000001101000011110100001011000000110111111010111101110011;
    double double_value = (*((double *)((void *)&int_value)));
    printf("double initiate value: %e\n", double_value);
    cout << "sign " << setw(11) << "exp"
         << " " << setw(52) << "frac" << endl;
    for (int i = 0; i < 10; i++)
    {
        stringstream ss;
        ss << bitset<64>((*((uint64_t *)((void *)&double_value))));
        auto str = ss.str();
        cout << setw(4) << str.substr(0, 1) << " " << setw(11) << str.substr(1, 11) << " " << str.substr(12, 52) << endl;
        double_value *= 2;
    }
}
...ANSWER
Answered 2021-Mar-18 at 11:14
You are running into denormalised numbers. When the exponent is zero, but the mantissa is not, then the mantissa is used as is without an implicit leading 1 digit. This is done so that the representation can handle very small numbers that are smaller than what the smallest exponent could represent. So in the first two rows of your example:
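The example rows are not reproduced above. As general context, here is a small sketch of mine (not part of the original answer) showing that the exponent field drops to zero once a value falls below the smallest normal double:

#include <cfloat>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Extract the 11-bit exponent field of a double.
static unsigned exp_field(double d) {
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    return (unsigned)((bits >> 52) & 0x7FF);
}

int main() {
    double smallest_normal = DBL_MIN;            // 2^-1022, exponent field = 1
    double subnormal = smallest_normal / 2;      // 2^-1023: exponent field = 0, no implicit leading 1
    std::printf("normal    exponent field: %u\n", exp_field(smallest_normal));  // 1
    std::printf("subnormal exponent field: %u\n", exp_field(subnormal));        // 0
}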
QUESTION
I'm trying to store cryptocurrency values inside a sqlite database. I read that it is not correct to store those values as float or double because of the loss of precision caused by IEEE754. For this reason I saved these values as biginteger in my database. (And I multiply or divide by 10^8 or 10^(-8) in my app before reading or storing the values.)
...ANSWER
Answered 2021-Mar-12 at 17:29
The key is to just multiply the dividend instead of multiplying the result.
If both total_fiat_amount - commission and crypto_fiat_price are monetary values with a maximum of two digits after the decimal point, you don't need to multiply both by 10^8 but only by 10^2.
In that case, the result would be accurate to 0 decimal places after the decimal point.
If you want to have 8 decimal places of precision after the decimal point, you can multiply the dividend by 10^8 before running the division.
If you store total_fiat_amount, commission and crypto_fiat_price in cents, you could use this:
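The snippet the answer refers to is not shown above. As a stand-in, here is a C++ sketch of the same idea; the variable names mirror the columns mentioned in the question, and the concrete values are made up for illustration:

#include <cstdint>
#include <cstdio>

int main() {
    // Monetary inputs stored as integer cents (2 decimal places), as suggested above.
    int64_t total_fiat_amount = 150000;   // 1500.00 in cents
    int64_t commission        = 1500;     //   15.00 in cents
    int64_t crypto_fiat_price = 4200042;  // 42000.42 in cents

    // Scale the dividend by 10^8 *before* dividing, so the quotient keeps
    // 8 decimal places of precision (a satoshi-style fixed-point amount).
    int64_t crypto_amount = (total_fiat_amount - commission) * 100000000LL / crypto_fiat_price;

    std::printf("crypto amount = %lld.%08lld\n",
                (long long)(crypto_amount / 100000000LL),
                (long long)(crypto_amount % 100000000LL));
}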
QUESTION
I am porting C code to Delphi and found an issue in the way the compilers (Delphi 10.4.1 and MSVC 2019, both targeting the x32 platform) handle comparison of +NAN to zero. Both compilers use IEEE754 representation for double floating point values. I found the issue because the C code I am porting to Delphi is delivered with a bunch of data to validate the code's correctness.
The original source code is complex but I was able to produce a minimal reproducible application in both Delphi and C.
C-Code:
...ANSWER
Answered 2021-Feb-01 at 13:24
First of all, your Delphi program does not behave as you describe, at least on the Delphi version readily available to me, XE7. When your program is run, an invalid operation floating point exception is raised. I'm going to assume that you have actually masked floating point exceptions.
Update: It turns out that at some time between XE7 and 10.3, Delphi 32 bit codegen switched from fcom to fucom, which explains why XE7 sets the IA floating point exception, but 10.3 does not.
Your Delphi code is very far from minimal. Let's try to make a truly minimal example. And let's look at other comparison operators.
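As background for those comparisons, here is a small C++ sketch of mine (not the Delphi/C example from the answer) showing how a quiet NaN behaves with the ordinary comparison operators: every ordered comparison and the equality test is false, and only inequality is true.

#include <cmath>
#include <cstdio>

int main() {
    double qnan = std::nan("");   // a quiet NaN
    double zero = 0.0;

    // All ordered comparisons involving a NaN evaluate to false.
    std::printf("nan >  0 : %d\n", qnan > zero);    // 0
    std::printf("nan <  0 : %d\n", qnan < zero);    // 0
    std::printf("nan == 0 : %d\n", qnan == zero);   // 0
    std::printf("nan != 0 : %d\n", qnan != zero);   // 1
}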
QUESTION
I am implementing a converter from IEEE 754 32-bit floats to S15.16 fixed point in an FPGA. The IEEE-754 standard represents the number as:
where s represents the sign, exp is the denormalized exponent, and m is the mantissa. All these values, taken separately, are represented in fixed point.
Well, the simplest way is to take the IEEE-754 value, multiply it by 2**16, and finally round to nearest to minimize the truncation error.
Problem: I'm doing this in an FPGA device, so I can't do it this way.
Solution: use the binary representations of the values to perform the conversion via bitwise operations.
From the previous expression, and given that the exponent and mantissa are in fixed point, it follows that I can compute it as:
Because powers of two are shifts in fixed point, it is possible to rewrite the expression as (in Verilog notation):
...ANSWER
Answered 2021-Jan-30 at 23:46
The ISO-C99 code below demonstrates one possible way of doing the conversion. The significand (mantissa) bits of the binary32 argument form the bits of the s15.16 result. The exponent bits tell us whether we need to shift these bits right or left to move the least significant integer bit to bit 16. If a left shift is required, rounding is not needed. If a right shift is required, we need to capture any less significant bits discarded. The most significant discarded bit is the round bit; all others collectively represent the sticky bit. Using the literal definition of the rounding mode, we need to round up if (1) both the round bit and the sticky bit are set, or (2) the round bit is set and the sticky bit clear (i.e., we have a tie case) but the least significant bit of the intermediate result is odd.
Note that real hardware implementations often deviate from such a literal application of the rounding-mode logic. One common scheme is to first increment the result when the round bit is set. Then, if such an increment occurred, clear the least significant bit of the result if the sticky bit is not set. It is easy to see that this achieves the same effect by enumerating all possible combinations of round bit, sticky bit, and result LSB.
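The ISO-C99 code the answer refers to is not included on this page. The following is my own C++ sketch of the algorithm as described, handling normal, in-range inputs only (NaN, infinity, subnormals, and overflow saturation are left out):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Convert a binary32 float to S15.16 fixed point with round-to-nearest-even.
// Sketch only: assumes a normal, in-range input.
int32_t float_to_s15_16(float a) {
    uint32_t bits;
    std::memcpy(&bits, &a, sizeof bits);

    uint32_t sign = bits >> 31;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127;      // unbiased exponent
    uint32_t sig  = (bits & 0x007FFFFF) | 0x00800000;          // 24-bit significand with implicit 1

    // Value is sig * 2^(exp-23); the S15.16 result is round(value * 2^16),
    // i.e. sig shifted by (exp - 23 + 16) = (exp - 7) bits.
    int shift = exp - 7;
    uint32_t mag;
    if (shift >= 0) {
        mag = sig << shift;                                    // left shift: exact, no rounding needed
    } else {
        int s = -shift;
        if (s > 31) return 0;                                  // magnitude underflows to zero in S15.16
        uint32_t kept   = sig >> s;
        uint32_t round  = (sig >> (s - 1)) & 1;                // most significant discarded bit
        uint32_t sticky = (sig & ((1u << (s - 1)) - 1)) != 0;  // OR of the remaining discarded bits
        mag = kept + (round & (sticky | (kept & 1)));          // round to nearest, ties to even
    }
    return sign ? -(int32_t)mag : (int32_t)mag;
}

int main() {
    std::printf("%d\n", float_to_s15_16(1.25f));       // 81920  (1.25 * 65536)
    std::printf("%d\n", float_to_s15_16(-0.5f));       // -32768
    std::printf("%d\n", float_to_s15_16(3.14159f));    // about 205887
}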
QUESTION
I am trying to convert 4 hex values to a float.
The hex values are, for example, 3F A0 00 00.
In binary representation they would correspond to 00111111 10100000 00000000 00000000.
If these 4 binary values are interpreted as one 32-bit float (according to IEEE754), the decimal value of the float should be 1.25.
However I am struggling to automatically make the conversion from hex values to a decimal float in C++ (I am using Qt as the framework).
Can anybody help me please?
Thanks!
...ANSWER
Answered 2021-Jan-27 at 13:47
#include <cstdint>
#include <cstring>
#include <iostream>
#include <iterator>
using namespace std;
int main()
{
    unsigned char bytes[] = {0x3F, 0xA0, 0x00, 0x00};
    uint32_t x = bytes[0];
    for (std::size_t i = 1; i < std::size(bytes); ++i) x = (x << 8) | bytes[i];
    static_assert(sizeof(float) == sizeof(uint32_t), "Float and uint32_t sizes don't match. Check another int type");
    float f{};
    memcpy(&f, &x, sizeof(x));
    // or, since C++20 if available: float f = std::bit_cast<float>(x);
    cout << "f = " << f << endl;
}
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.