
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Floating-Point Representation (IEEE 754)

College · Depth 82 in the knowledge graph

78 topics build on this

338 prerequisites beneath it



Core Idea

IEEE 754 floating-point represents real numbers in binary scientific notation: a sign bit, a biased exponent, and a significand (mantissa). A 32-bit single-precision float has 1 sign bit, 8 exponent bits, and 23 mantissa bits. The format can represent very large and very small numbers but introduces rounding errors because most real numbers cannot be represented exactly in a finite number of bits. Special values like infinity, negative zero, and NaN (Not a Number) handle edge cases in computation.

How It's Best Learned

Encode and decode several floating-point values by hand using the IEEE 754 formula. Explore precision loss by evaluating 1.0 + x == 1.0 in a programming language for progressively smaller x and noting where the comparison first becomes true. Use visualization tools to see the distribution of representable values.
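One way to run that experiment is sketched below. It assumes Python, whose built-in float is normally a 64-bit double, so the threshold it finds is the double-precision one rather than the single-precision one. The function name is made up for illustration.

```python
import sys

def smallest_visible_power_of_two():
    """Return the smallest power of two x for which 1.0 + x != 1.0."""
    x = 1.0
    # Keep halving while the next smaller power of two still changes the sum.
    while 1.0 + x / 2 != 1.0:
        x /= 2
    return x

found = smallest_visible_power_of_two()
print(found)
print(found == sys.float_info.epsilon)  # compare with the value Python reports
print(1.0 + found / 2 == 1.0)           # half of it is rounded away
```

Repeating the loop with 1000.0 in place of 1.0 is a quick way to see that the threshold depends on the magnitude of the larger operand.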

Common Misconceptions

Floating-point rounding errors are not bugs in the processor; they are inherent to finite-precision real-number approximation.
0.1 cannot be represented exactly in binary floating point, just as 1/3 cannot be represented exactly in decimal.
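The second point can be checked directly. The sketch below assumes Python and uses the standard `decimal` and `fractions` modules to reveal the value that is actually stored when you write 0.1 (in double precision, since that is what a Python float is).

```python
from decimal import Decimal
from fractions import Fraction

# Converting the float 0.1 (not the string "0.1") exposes the stored value exactly.
exact = Decimal(0.1)
as_ratio = Fraction(0.1)

print(exact)                          # a long decimal slightly above 0.1
print(as_ratio.denominator == 2**55)  # the stored value is an integer over a power of two
print(0.1 + 0.2 == 0.3)               # the small errors do not cancel
```

Because every finite binary float is an integer divided by a power of two, and 1/10 is not of that form, the nearest representable neighbor is stored instead.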

Explainer

You already know how two's complement represents signed integers by fixing a set number of bits and interpreting them with positional place values. But fixed-width integers cannot represent fractions, and their range runs out long before numbers like 6.022 × 10²³ (a 64-bit signed integer tops out near 9.2 × 10¹⁸). Floating-point representation extends the idea of binary encoding to approximate real numbers, using the same principle as scientific notation: separate a number into a significand (the meaningful digits) and an exponent (the scale). In decimal scientific notation, 0.0042 becomes 4.2 × 10⁻³. IEEE 754 does the same thing in binary: a normalized number is stored as ±1.mantissa × 2^exponent.
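As a small worked example of binary scientific notation (the value 6.5 is chosen here only for illustration), the decomposition goes as follows:

```latex
\begin{aligned}
6.5 &= 110.1_2 = 1.101_2 \times 2^{2} \\
\text{sign} &= +, \qquad \text{exponent} = 2, \qquad
\text{significand} = 1.101_2 = 1 + \tfrac{1}{2} + \tfrac{1}{8} = 1.625 \\
1.625 \times 2^{2} &= 6.5
\end{aligned}
```

Shifting the binary point two places to the left plays the same role as shifting the decimal point in 0.0042 = 4.2 × 10⁻³.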

A 32-bit single-precision float divides its bits into three fields: 1 sign bit (0 for positive, 1 for negative), 8 exponent bits, and 23 mantissa bits. The exponent uses a biased encoding: the stored value is the actual exponent plus 127, so a stored exponent of 130 means an actual exponent of 3. This avoids the need for two's complement in the exponent field and makes comparison simpler, because nonnegative floats sort in the same order as their bit patterns read as unsigned integers. The mantissa stores only the fractional part of the significand because the leading 1 is implicit. Any nonzero binary number in normalized scientific notation starts with 1 (there is only one nonzero digit in binary), so there is no need to store it. This trick gives you 24 bits of precision from 23 stored bits. The stored exponents 0 and 255 are reserved: 0 encodes zero and subnormal numbers, which have no implicit leading 1, and 255 encodes infinity and NaN.
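The field layout can be checked against real bit patterns. The following is a minimal sketch, assuming Python and its standard `struct` module; the helper names `decode_float32` and `rebuild_normal` are hypothetical, and the rebuild step only handles normal numbers (stored exponent 1 through 254).

```python
import struct

def decode_float32(x):
    """Split x, rounded to single precision, into (sign, stored_exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # top bit
    stored_exponent = (bits >> 23) & 0xFF  # next 8 bits, still biased by 127
    fraction = bits & 0x7FFFFF             # low 23 bits, leading 1 not stored
    return sign, stored_exponent, fraction

def rebuild_normal(sign, stored_exponent, fraction):
    """Apply (-1)^sign * 1.fraction * 2^(stored_exponent - 127) for normal numbers."""
    significand = 1 + fraction / 2**23     # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (stored_exponent - 127)

fields = decode_float32(10.0)              # 10.0 = 1.25 * 2^3
print(fields)
print(rebuild_normal(*fields))
```

For 10.0 the stored exponent comes out as 130, matching the bias example above, and the fraction field holds only the .25 part of the significand 1.25.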

The consequence of finite precision is rounding error. The representable floating-point values are not evenly distributed across the number line: they are densely packed near zero and increasingly sparse as magnitude grows. Between 1.0 and 2.0, there are 2²³ (about 8 million) representable values. Between 2.0 and 4.0, there are the same 2²³ values spread over twice the range, so the gap between consecutive representable numbers doubles. This means that adding a tiny number to a large number can produce no change at all: under the default round-to-nearest mode, if the small number is less than half the gap size at the large number's magnitude, it gets rounded away. The expression `1.0 + 1e-8 == 1.0` evaluates to true in single precision, not because of a bug, but because 10⁻⁸ is less than half the spacing between representable values just above 1.0, which is 2⁻²³ ≈ 1.19 × 10⁻⁷.
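The gap sizes can be measured rather than taken on faith. The sketch below assumes Python, which has no built-in single-precision type, so it emulates one by round-tripping values through `struct`; sums are computed in double precision and then rounded to single, which gives the same answers as true single-precision addition for the cases shown. The helper names are hypothetical.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def gap_above_f32(x):
    """Distance from x to the next larger float32 (x: positive, finite, exact in float32)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    # For positive floats, adding 1 to the bit pattern steps to the next value up.
    (next_up,) = struct.unpack(">f", struct.pack(">I", bits + 1))
    return next_up - x

for x in (1.0, 2.0, 1024.0, 16777216.0):
    print(x, gap_above_f32(x))

print(to_f32(1.0 + 1e-8) == 1.0)               # below half the gap: absorbed
print(to_f32(1.0 + 1e-7) == 1.0)               # above half the gap: survives
print(to_f32(16777216.0 + 1.0) == 16777216.0)  # at 2^24 the gap is already 2.0
```

The last line shows the same absorption effect with whole numbers: once the gap reaches 2.0, adding 1.0 lands on a tie and rounds back to the starting value.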

IEEE 754 also defines special values to handle exceptional cases gracefully. Positive and negative infinity result from overflow and from operations like dividing a nonzero number by zero. NaN (Not a Number) represents undefined results like 0/0 or √(−1) and has the unique property that NaN ≠ NaN. Negative zero exists because the sign bit is independent of the magnitude. It compares equal to positive zero but preserves sign information for certain mathematical operations. These special values ensure that, under the default exception handling, floating-point arithmetic does not trap or halt unexpectedly; every operation produces a defined result, even if that result is "this computation is undefined."
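These behaviors are easy to observe, with one caveat about the language used. The sketch below assumes Python, which by its own design raises `ZeroDivisionError` for `1.0 / 0.0` and `ValueError` for `math.sqrt(-1)` instead of returning the IEEE 754 default results, so the special values are produced here through overflow and through infinity minus infinity.

```python
import math

overflowed = 1e308 * 10          # too large for a double, so the result is infinity
undefined = math.inf - math.inf  # no meaningful answer, so the result is NaN
neg_zero = -0.0

print(overflowed == math.inf)
print(math.isnan(undefined), undefined == undefined)  # NaN is not equal to itself
print(neg_zero == 0.0)                                # the two zeros compare equal
print(math.copysign(1.0, neg_zero))                   # but the sign bit is still there
```

Because NaN never compares equal to anything, `math.isnan` (or the equivalent in another language) is the reliable way to test for it.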


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Boolean Algebra and Fundamental Laws → Logic Gates Fundamentals → Implementing Boolean Functions with Gates → Karnaugh Map Simplification → Combinational Circuit Design → Flip-Flops and Latches → Binary Counters: Design and Analysis → Binary Arithmetic → Fixed-Point Number Representation → Two's Complement Representation → Floating-Point Representation (IEEE 754)

Longest path: 83 steps · 338 total prerequisite topics

Prerequisites (4)

Two's Complement Representation: hard
Introduction to Scientific Notation: soft
Introduction to Exponents: soft
Binary Arithmetic: soft

Leads To (2)

Arithmetic Logic Unit (ALU): soft
Machine Epsilon and Unit Roundoff: hard