Floating Point in computers
Comply with standards:
IEEE 754
ISO/IEC 559
Timeline
• Introduction: quite short
• Binary review: not so long
• Integer Arithmetic: 1/3
• Floating Point: 1/3
• Floating Point Arithmetic: 1/3
• Other issues: extra short
Introduction
• Who does computer arithmetic?
• Intel’s spare money
• How is it done in hardware?
• How Integer relates to Floating point
• Now, we go back to “computer structure”
Binary numbers
• What is 1 0 0 1 0 1 1 . 0 0 1 0 1 ?
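Place values: $2^6\ 2^5\ 2^4\ 2^3\ 2^2\ 2^1\ 2^0\ .\ 2^{-1}\ 2^{-2}\ 2^{-3}\ 2^{-4}\ 2^{-5}$
$64 + 8 + 2 + 1 + \frac{1}{8} + \frac{1}{32} = 75\frac{5}{32}$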
Signed Binary Integers
• Sign-magnitude
• 2’s complement
• 1’s complement
• biased
Sign-Magnitude
• High order bit = Sign
• 0101 = 5
• 1101 = -5
• Two zeros: 0000 and 1000
2’s complement
• Number + Negative = $2^n$
• 0101 = 5
• 1011 = -5
• Easy addition (drop carry)
• Formula: $-a_{n-1}2^{n-1} + a_{n-2}2^{n-2} + \dots + a_1 2^1 + a_0$
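A minimal Python sketch of the 4-bit examples on this slide. The helper names `to_twos` and `from_twos` are hypothetical; the decoder follows the weighted-sum formula, with the top bit carrying weight $-2^{n-1}$.

```python
def to_twos(value, n):
    """Encode a signed integer as an n-bit 2's complement pattern."""
    return value & ((1 << n) - 1)


def from_twos(bits, n):
    """Decode an n-bit pattern: the top bit weighs -2^(n-1), the rest are positive."""
    top = (bits >> (n - 1)) & 1
    rest = bits & ((1 << (n - 1)) - 1)
    return rest - (top << (n - 1))


print(format(to_twos(-5, 4), "04b"))   # bit pattern stored for -5
print(from_twos(0b1011, 4))            # value decoded from 1011
print(to_twos(5, 4) + to_twos(-5, 4))  # number + negative, as raw patterns
```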
1’s Complement
• Negative: complement every bit
• 0101 = 5
• 1010 = -5
• Two zeros: 0000 and 1111
• Number + Negative = $2^n - 1$
Biased
• Binary = Number + Bias
• Bias = 5:
1010 = 5, since 5+5 = 10
0000 = -5, since (-5)+5 = 0
• Relative order remains
Integer Arithmetic
Adding (unsigned) Integers
• Elementary school:
  1 1 0 0 1 1 0 1
+ 1 0 0 0 0 1 1 0
-----------------
1 0 1 0 1 0 0 1 1
• Result has n+1 bits!
Adding Integers - hardware
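Half Adder: inputs a, b; outputs s, Cout
$s = a\bar{b} + \bar{a}b$
$C_{out} = ab$
Full Adder: inputs a, b, Cin; outputs s, Cout
2 logical levels
$s = a\bar{b}\bar{C}_{in} + \bar{a}b\bar{C}_{in} + \bar{a}\bar{b}C_{in} + abC_{in}$
$C_{out} = ab + aC_{in} + bC_{in}$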
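Ripple carry Adder
[Figure: n full adders in a chain. Stage i takes $a_i$, $b_i$ and the carry from stage i-1 and produces $s_i$; stage 0 takes Cin and stage n-1 produces Cout.]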
• Slow - 2n logical levels
• Small constant (CMOS)
• Other ways exist
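The ripple carry idea can be sketched in Python by chaining a one-bit full adder. The function names are hypothetical, and the loop stands in for the hardware carry chain; it says nothing about gate delays.

```python
def full_adder(a, b, c_in):
    """One-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out


def ripple_add(x, y, n):
    """Add two n-bit unsigned integers bit by bit; returns (n-bit sum, carry out)."""
    carry = 0
    total = 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


total, carry = ripple_add(0b11001101, 0b10000110, 8)
print(format(total, "08b"), carry)
```

The carry out is the extra (n+1)-th bit of the result.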
Adding Signed Integers
• In 2’s complement:
$b + (-a) = b + (2^n - a) = 2^n + (b - a)$
$(-b) + (-a) = (2^n - b) + (2^n - a) = 2^n + (2^n - (b + a))$
• hence - add as integers, discard carry out
• Example: 0011 + 1100 = ?
Subtracting Integers
• Add the negation
• Negating 2’s complement: invert every bit, then add 1
0101 (5) → 1010 + 1 = 1011 (-5)
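A short Python sketch of subtraction as "add the negation", assuming n-bit patterns held in plain integers. `negate` and `subtract` are hypothetical names; the mask plays the role of discarding the carry out.

```python
def negate(x, n):
    """2's complement negation of an n-bit pattern: invert the bits, add 1."""
    mask = (1 << n) - 1
    return (~x + 1) & mask


def subtract(b, a, n):
    """Compute b - a by adding the negation and discarding the carry out."""
    mask = (1 << n) - 1
    return (b + negate(a, n)) & mask


print(format(negate(0b0101, 4), "04b"))       # negation of 5
print(format(subtract(0b0011, 0b0100, 4), "04b"))  # 3 - 4 as a 4-bit pattern
```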
Integer (unsigned) Multiplication
• Elementary school:
        1 1 0 1
      * 1 0 0 1
        -------
        1 1 0 1
      0 0 0 0
    0 0 0 0
  1 1 0 1
---------------
0 1 1 1 0 1 0 1
• Result is 2n bits!
Hardware Multiplier
• P=0
• loop:
(i) if A0=1, add B to P
(ii) right-shift P & A
[Figure: n-bit registers P and A shifted right together, with a carry bit; n-bit register B is added into P.]
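The loop above can be sketched in Python with integers standing in for the registers. This is an illustration under the assumption that P may temporarily hold n+1 bits (the carry); `shift_add_multiply` is a hypothetical name.

```python
def shift_add_multiply(a, b, n):
    """Multiply two n-bit unsigned integers with the P/A shift-and-add loop."""
    p = 0
    for _ in range(n):
        if a & 1:
            p += b  # may carry into bit n of P
        # Right-shift the pair (P, A): the low bit of P enters the top of A.
        a = (a >> 1) | ((p & 1) << (n - 1))
        p >>= 1
    return (p << n) | a  # 2n-bit product: P holds the high half, A the low half


print(format(shift_add_multiply(0b1101, 0b1001, 4), "08b"))
```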
Integer (unsigned) Division
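• Elementary school: 1101 / 11
Bring down 1: 1 < 11, quotient bit 0
Bring down 1: 11 - 11 = 00, quotient bit 1
Bring down 0: 0 < 11, quotient bit 0
Bring down 1: 1 < 11, quotient bit 0
Result: 0100, Rem 1
Dec: 13/3=4, Rem 1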
Hardware Divider
• P=0
• loop:
(i) left-shift P & A
(ii) Sub. B from P:
positive: a0=1
negative: a0=0, restore P (add B)
[Figure: (n+1)-bit register P and n-bit register A shifted left together; (n+1)-bit register B is subtracted from P.]
Example
• 13 / 3 = 4 (1)
• n=4
• A=1101 B=00011 P=00000
Start: P = 00000, A = 1101, B = 00011
End: P = 00001 (Remainder), A = 0100 (Quotient), B = 00011
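A Python sketch of the restoring divider, using integers for the P, A and B registers. `restoring_divide` is a hypothetical name, and a negative Python integer stands in for a negative P register.

```python
def restoring_divide(a, b, n):
    """Divide n-bit unsigned a by b; returns (quotient, remainder)."""
    p = 0
    for _ in range(n):
        # Left-shift the pair (P, A): the top bit of A enters the bottom of P.
        p = (p << 1) | ((a >> (n - 1)) & 1)
        a = (a << 1) & ((1 << n) - 1)
        p -= b
        if p >= 0:
            a |= 1   # positive: new quotient bit a0 = 1
        else:
            p += b   # negative: a0 stays 0, restore P
    return a, p


quotient, remainder = restoring_divide(0b1101, 0b0011, 4)
print(format(quotient, "04b"), format(remainder, "05b"))
```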
Division - remarks
• Non-restoring Algorithm
• Load P only if positive
• Check for 0
• (Total) Result is 2n bits!
Integer arithmetic - remarks
• Signed Multiply and Division
– Algorithms exist
– We will not use them
• What to do with extra bits?
• Faster methods
Floating Point
Non Integers - Other Methods
• Fixed Point
– example: # # # . #
– Binary point shifted
– Integer arithmetic (extra shifting)
– Small number magnitude
• Rational
– a/b ($a, b \in \mathbb{Z}$)
Floating Point
• Exponent + Significand (= Mantissa)
• $x = s \cdot 2^e$
• Example: s=101, e=011 (binary)
$x = 101_2 \cdot 2^{11_2} = 5 \cdot 2^3 = 40 = 101000_2$
Uniqueness
• Denormal Numbers: $123.456 \cdot 10^7$
$0.123 \cdot 10^4$
• Normalized: $\#.\#\#\# \cdot 10^{\#}$
$1.123 \cdot 10^4$
• What about 0 ?
Floating Point Standard
• Why Standardize?
– Hardware accelerators
– Software compatibility
– Build Software Libraries
– etc.
• IEEE 754-1985 ISO/IEC 559
• Includes: Structure, Arithmetic results
Float Types
• 4 Precision Types:
– Single
– Single extended
– Double
– Double extended
Single Precision
• 32 bits: Sign (1) | Exponent (8) | Significand (23)
• Exponent (e): Biased ( + 127)
• Significand (f): Fixed fraction: 0 . # # # …
• Number: $1.f \cdot 2^{e-127}$
Single Precision - Example
• 1 10000001 01000000000000000000000
• 10000001 = 129, and 129-127=2
• 01000… = 0.01000…, and $1.01_2 = 1.25$
• $X = -1.25 \cdot 2^2$
• X = - 5
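The same decoding can be sketched in Python. `decode_single` is a hypothetical helper that handles normalized numbers only (stored exponent 1 to 254); the standard `struct` module is used as an independent check.

```python
import struct


def decode_single(bits):
    """Split a 32-bit pattern into sign, biased exponent e, fraction f, and value."""
    sign = bits >> 31
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    # Normalized numbers only: hidden leading 1, exponent bias 127.
    value = (-1.0) ** sign * (1.0 + f / 2**23) * 2.0 ** (e - 127)
    return sign, e, f, value


# sign = 1, exponent = 10000001, fraction = 0100...0
bits = (1 << 31) | (0b10000001 << 23) | (0b01 << 21)
print(decode_single(bits))
print(struct.unpack(">f", bits.to_bytes(4, "big"))[0])
```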
Single Precision - Range
• Emax = 127 (e = 254)
• Emin = -126 (e = 1)
• Why |Emin|<|Emax|?
– $1/2^{E_{min}}$ does not overflow
• Why Biased notation?
• What about 0 and 255 ?
Floating Point Precision
Format | Single | Single Extend | Double | Double Extend
Format Width | 32 | 43 | 64 | 80
Precision | 24 | 32 | 53 | 64
Emax | +127 | +1023 | +1023 | +16383
Emin | -126 | -1022 | -1022 | -16382
Exp. Width | 8 | 11 | 11 | 15
Exp. Bias | 127 | - | 1023 | -
Examples
• We shall use base 10 sometimes:
• f will have 3 digits
• Emax will be 98
• Emin will be -97
• Ex: $5.34 \cdot 10^{70}$
NaN
• Not a Number
• Result of illegal computation:
– $0/0$, $\infty + (-\infty)$, $0 \cdot \infty$, $\mathrm{rem}(x, 0)$, $\mathrm{rem}(\infty, y)$
– Any computation involving a NaN
• e = Emax + 1 & f ≠ 0
• # 11111111 #######################
• Many NaN’s (different f’s)
NaN’s in use
• Zero finder outside domain
– f(x) = sqrt(x) - 1
• Works since all computations involving NaN return NaN
• No exception caused !
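A rough Python illustration of probing f(x) = sqrt(x) - 1 outside its domain. Python's `math.sqrt` raises an exception for negative input, so this sketch substitutes NaN by hand to mimic the IEEE behaviour; `best_probe` is a hypothetical, deliberately naive zero finder.

```python
import math


def f(x):
    # Assumption: return NaN for x < 0, as IEEE sqrt would, instead of raising.
    root = math.sqrt(x) if x >= 0 else math.nan
    return root - 1.0  # NaN - 1.0 is still NaN


def best_probe(points):
    """Return the probe point where |f| is smallest, ignoring NaN results."""
    best_x, best_y = None, math.inf
    for x in points:
        y = abs(f(x))
        if not math.isnan(y) and y < best_y:
            best_x, best_y = x, y
    return best_x


print(f(-4.0))
print(best_probe([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]))
```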
Zero’s
• 0 00000000 00000000000000000000000 ?
• this is NOT $1.0 \cdot 2^{E_{min}}$
• 1 00000000 00000000000000000000000 ?
0 is signed! +0 and -0 both exist!
• What is the difference?
Signed 0’s
• +0 = -0 BUT:
• Multiply/Divide keep sign rules:
$3 \cdot (+0) = (+0)$, $3 \cdot (-0) = (-0)$
• Motivation:
– Using inf correctly (described later)
– log(x): log(0) = -inf, log(negative) = NaN
– log(x) if x = (-0) ?
± inf
• More logic:
$1/(+0) = +\infty$, $1/(-0) = -\infty$, $1/(+\infty) = +0$, $1/(-\infty) = -0$
• e = Emax + 1 & f = 0
• # 11111111 00000000000000000000000
Inf usage Example
$\cos^{-1}(x) = 2\tan^{-1}\sqrt{\frac{1-x}{1+x}}$
(If $\tan^{-1}$ is defined properly)
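A Python sketch of this identity at x = -1, where the argument becomes 2/0. Python raises `ZeroDivisionError` on float division by zero, so the hypothetical helper `ieee_div` emulates the IEEE result (a signed infinity); `math.atan` already accepts infinity.

```python
import math


def ieee_div(a, b):
    """Divide like IEEE 754: nonzero / zero gives a signed infinity, 0/0 gives NaN."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0 or math.isnan(a):
            return math.nan
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)


def acos_via_atan(x):
    """arccos(x) = 2 * arctan(sqrt((1 - x) / (1 + x)))."""
    return 2.0 * math.atan(math.sqrt(ieee_div(1.0 - x, 1.0 + x)))


print(acos_via_atan(-1.0), math.pi)
print(ieee_div(1.0, -0.0))
```

With infinity available, x = -1 needs no special case.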
More on 0’os and inf’s
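• General Rule for 0/inf arithmetic:
– Take appropriate limit:
$3/(+0) = \lim_{x \to 0^+} 3/x = +\infty$
$3/(+\infty) = \lim_{x \to +\infty} 3/x = +0$
• 1/(1/x) where x=0 or inf
• Why not Max # instead?
$x = 3 \cdot 10^{70}$, $y = 4 \cdot 10^{70}$
$\sqrt{x^2 + y^2} \to \sqrt{9.99 \cdot 10^{98}} = 3.16 \cdot 10^{49}$
answer: $5 \cdot 10^{70}$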
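The "Max # instead" question can be tried with Python's `decimal` module, configured as the 3-digit base-10 format used in these slides (Emax = 98). This is a sketch: a context with `ROUND_DOWN` and no traps happens to replace an overflow by the largest finite number, which is used here to stand in for a "Max #" policy, while the default rounding returns infinity.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN


def hypot_in(ctx, x, y):
    """sqrt(x*x + y*y), with every step rounded in the given context."""
    xx = ctx.multiply(x, x)
    yy = ctx.multiply(y, y)
    return ctx.sqrt(ctx.add(xx, yy))


x, y = Decimal("3E70"), Decimal("4E70")
to_inf = Context(prec=3, Emin=-97, Emax=98, rounding=ROUND_HALF_EVEN, traps=[])
to_max = Context(prec=3, Emin=-97, Emax=98, rounding=ROUND_DOWN, traps=[])

print(hypot_in(to_inf, x, y))  # overflow becomes infinity: visibly not a real answer
print(hypot_in(to_max, x, y))  # overflow becomes the largest number: plausible but wrong
```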
Zero’s and inf’s - yet again
• $x/(x^2+1)$ is bad! Why?
• $1/(x+x^{-1})$ is better
• Do we need to check for x=0?
• Using 2 zeros and inf’s saves some special case checks.
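A quick Python check of the two formulas for a large x, where $x^2$ overflows. The names `bad` and `better` are hypothetical. Note that Python itself raises `ZeroDivisionError` for `1.0 / 0.0`, so the x = 0 case is not exercised here.

```python
import math


def bad(x):
    return x / (x * x + 1.0)   # x * x overflows to inf for large x


def better(x):
    return 1.0 / (x + 1.0 / x)  # no intermediate overflow for large x


x = 1e200
print(bad(x))     # the overflowed denominator drives the result to zero
print(better(x))  # close to 1/x
```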
Denormalized numbers
• Example:
– $x = 1.23 \cdot 10^{-98}$, $y = 1.11 \cdot 10^{-98}$
– $x - y = 1.20 \cdot 10^{-99} = 0$
– so: x-y=0 but: x ≠ y
– think of: if (x ≠ y) then z=1/(x-y)
• Solution:
– use denormalized numbers!
Denormal Numbers
• Smallest normal: $1.0 \cdot 2^{E_{min}}$
• Below, use denormal: $0.f \cdot 2^{E_{min}}$
• e = Emin - 1 & f ≠ 0
• # 00000000 #######################
• Gradual underflow:
$1.23 \cdot 10^{-4}$ ( /10 )
$0.12 \cdot 10^{-4}$ ( /10 )
$0.01 \cdot 10^{-4}$ ( /10 )
0
Denormal Numbers
• Back to our Example:
– $x = 1.23 \cdot 10^{-98}$, $y = 1.11 \cdot 10^{-98}$
– $x - y = 0.12 \cdot 10^{-98}$
– and this is not 0 !
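The same effect can be observed with Python floats (IEEE double precision on common platforms). This sketch assumes the platform does not flush denormals to zero; the values are chosen as exact multiples of the smallest normal number.

```python
import sys

tiny = sys.float_info.min            # smallest normal double
x = 1.5 * tiny
y = 1.25 * tiny
diff = x - y                         # below the normal range, stored as a denormal

print(x != y, diff != 0.0, diff < tiny)

smallest = tiny * sys.float_info.epsilon  # smallest positive denormal
print(smallest, smallest / 2)             # one more halving underflows to zero
```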
Flush to 0 Vs Gradual Underflow
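[Figure: two number lines with ticks at 0, $2^{-4}$, $2^{-3}$, $2^{-2}$, $2^{-1}$, one for flush to 0 and one for gradual underflow.]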
Special Values - Summary
Exponent | Fraction | Represents
Emin-1 | f=0 | ±0
Emin-1 | f≠0 | $0.f \cdot 2^{E_{min}}$
Emin ≤ e ≤ Emax | ---- | $1.f \cdot 2^e$
Emax+1 | f=0 | ±inf
Emax+1 | f≠0 | NaN
Rounding
• Why is rounding needed?
• Infinitely many numbers → finite representation
• Integers only overflow
• Almost all operations need rounding
• IEEE - specifies algorithms for arithmetic
Numbers need rounding
• Out of range:
– $x > 2 \cdot 2^{E_{max}}$, $x < 1 \cdot 2^{E_{min}}$
• Between 2 floats:
– $0.1_{10} = 0.00011001100\ldots_2 = 1.1001100\ldots \cdot 2^{-4}$
– $1.1001 \cdot 2^{-4}$
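In Python, whose floats are IEEE doubles on common platforms, the stored value of 0.1 can be inspected directly. This is a small sketch using only the standard library.

```python
from fractions import Fraction

print((0.1).hex())                       # stored significand and exponent, in hex
print(Fraction(0.1) == Fraction(1, 10))  # is the stored value exactly one tenth?
print(0.1 + 0.2 == 0.3)                  # rounding shows up in ordinary arithmetic
```

The hex form shows the repeating 1001 pattern cut off and rounded in the last digit.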
Measuring Error
• ULPS (units in last place)
– $1.12 \cdot 10^{-1}$ Vs 0.1124 : 0.4 ulps
– $1.12 \cdot 10^{-1}$ Vs 0.1118 : 0.2 ulps
• Relative Error
– Difference/Original
– $1.12 \cdot 10^{-1}$ Vs 0.1124 : Err=0.0004/0.1124=0.0036
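Both error measures can be sketched with Python's `decimal` module for a 3-digit decimal format. The function names are hypothetical, and the inputs (stored value 0.112 against exact values 0.1124 and 0.1118) are assumed for illustration.

```python
from decimal import Decimal


def ulp_error(stored, exact, digits=3):
    """Error of `stored` measured in units in the last place of a `digits`-digit format."""
    stored, exact = Decimal(stored), Decimal(exact)
    ulp = Decimal(1).scaleb(stored.adjusted() - (digits - 1))
    return abs(stored - exact) / ulp


def relative_error(stored, exact):
    """Difference divided by the exact value."""
    stored, exact = Decimal(stored), Decimal(exact)
    return abs(stored - exact) / exact


print(ulp_error("0.112", "0.1124"))
print(ulp_error("0.112", "0.1118"))
print(relative_error("0.112", "0.1124"))
```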
Calculate Using Rounding
• Benign cancellation
– Calculate 10.1-9.93 (= 0.17)
$1.01 \cdot 10^1$
$- 0.99 \cdot 10^1$
$= 0.02 \cdot 10^1 = 2.00 \cdot 10^{-1}$
– 30 ulps!
Rounding problems
• Catastrophic cancellation
– $b^2 - 4ac$
– both $b^2$ and $4ac$ are rounded
– the (-) exposes the error
– b=3.34 a=1.22 c=2.28
$b^2 = 11.2$, $4ac = 11.1$, $b^2 - 4ac = 0.10$
correct=0.0292 (70.8 ulps)
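The numbers on this slide can be reproduced with Python's `decimal` module by rounding every operation to 3 significant digits. This is a sketch: the 3-digit context stands in for the slide's toy decimal format.

```python
from decimal import Decimal, localcontext

a, b, c = Decimal("1.22"), Decimal("3.34"), Decimal("2.28")
exact = b * b - 4 * a * c        # default context: enough digits to be exact here

with localcontext() as ctx:
    ctx.prec = 3                 # every operation now rounds to 3 digits
    b2 = b * b
    ac4 = 4 * a * c
    rounded = b2 - ac4           # the subtraction itself is exact

ulps = abs(rounded - exact) / Decimal("0.001")
print(b2, ac4, rounded, exact, ulps)
```

The subtraction adds no error of its own; it only exposes the rounding already present in `b2` and `ac4`.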
IEEE Arithmetic
• Requirement:
+ - × ÷ √ should be EXACTLY rounded
remainder should be EXACTLY rounded
Integer conv. should be EXACTLY rounded
• Not all (transcendental, binary to decimal)
• “Tie break” - Round to Even
Round to Even
• How will 1.005 be rounded ?
– Round Up: 1.01
– Round Even: 1.00
• Why? Example:
– $x_i = x_{i-1} + y - y$, $x_0 = 1.00$, $y = 0.125$
– Round up: 1.00, 1.01, 1.02, ….
– Round even: 1.00, 1.00, 1.00, ….
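The two tie-breaking rules can be compared with Python's `decimal` module, which exposes both as rounding modes. This is a small sketch for the 1.005 example only, not for the drift sequence.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

x = Decimal("1.005")
cent = Decimal("0.01")

up = x.quantize(cent, rounding=ROUND_HALF_UP)      # ties go away from zero
even = x.quantize(cent, rounding=ROUND_HALF_EVEN)  # ties go to the even last digit

print(up, even)
```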
Float Multiplication
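$(s_1 \cdot 2^{e_1}) \cdot (s_2 \cdot 2^{e_2}) = (s_1 \cdot s_2) \cdot 2^{e_1 + e_2}$
$s_1 \cdot s_2$: Integer multiply
$e_1 + e_2$: Biased addition
• “Biased addition”:
$(e_1 - 127) + (e_2 - 127) = (e_1 + e_2 - 127) - 127$, so $e_3 = e_1 + e_2 - 127$
detect Overflow: Use n+1 bit adder
detect Underflow: Harder (Denormals)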
Rounding Multiplication
$1.23 \times 6.78 = 8.3394$: Round to 8.34
$2.83 \times 4.47 = 12.6501$: Round to $1.27 \cdot 10^1$
$1.28 \times 7.81 = 9.9968$: Round to $1.00 \cdot 10^1$
1.00011: Round bit 0
1.00100: Round bit 1, all rest 0
1.00101: Round bit 1, rest not 0
0.11010: Shift needed
Round, Guard, Sticky
0 . 1 1 0 1 0 0 0 1 0
number guard round sticky
1 . 0 0 1 0 0 0 1 0 0
number round sticky
Rounding Multiplication
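[Figure: the multiplier registers P and A (shifted together, with carry) and B, each n bits wide.]
Product Results: $x_0 x_1 . x_2 x_3 x_4 x_5\ g\ r\ s\ s\ s\ s$
Case 1: $x_0 = 0$, shift: $x_1 . x_2 x_3 x_4 x_5\ g$
Case 2: $x_0 = 1$, inc. exp: $x_0 . x_1 x_2 x_3 x_4 x_5$
Labels: Round digit, Sticky bit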
Rounding rules
• r=0 → rounded OK
• r=1, s=1 → add 1 to LSB
• r=1, s=0 → add 1 if LSB=1
• Denormals → Extra shifting
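These rules can be sketched in Python on an integer significand that carries extra low-order bits. `round_rs` is a hypothetical name; the sketch ignores the case where rounding up carries out of the top bit.

```python
def round_rs(value, drop):
    """Drop the lowest `drop` bits (drop >= 1) of value, rounding to nearest even."""
    kept = value >> drop
    r = (value >> (drop - 1)) & 1                       # first discarded bit
    s = int((value & ((1 << (drop - 1)) - 1)) != 0)     # OR of the remaining discarded bits
    if r == 1 and (s == 1 or (kept & 1) == 1):
        kept += 1
    return kept


print(format(round_rs(0b100001, 2), "b"))  # r=0
print(format(round_rs(0b100011, 2), "b"))  # r=1, s=1
print(format(round_rs(0b100010, 2), "b"))  # r=1, s=0, LSB=0
print(format(round_rs(0b100110, 2), "b"))  # r=1, s=0, LSB=1
```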
Float addition
• Compute all digits and round?
– $1.00 \cdot 2^{20} + 1.00 \cdot 2^{-20} = 10000000\ldots0000001$
– too long!
• Use Round and Sticky bits:
– shift to same exponent
– r = first discarded digit
– s = OR of rest discarded
Float addition - example
Calculate: $1.10011 \cdot 2^0 + 1.10001 \cdot 2^{-5}$
Shift exponents: $1.10011 \cdot 2^0 + 0.0000110001 \cdot 2^0$
  1.10011
+  .00001
---------
  1.10100
r=1 s=0|0|0|1=1
r=1, s=1 Round needed! 1.10101
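The worked example can be replayed in Python with integer significands (5 fraction bits each). `add_aligned` is a hypothetical helper; it assumes both operands are positive and that the sum does not carry out of the leading bit, so no renormalization is shown.

```python
def add_aligned(sig_big, sig_small, shift):
    """Add sig_small * 2^-shift to sig_big, rounding with round (r) and sticky (s) bits."""
    kept = sig_small >> shift                      # bits that survive the alignment
    dropped = sig_small & ((1 << shift) - 1)       # bits shifted out
    r = (dropped >> (shift - 1)) & 1 if shift > 0 else 0
    s = int((dropped & ((1 << (shift - 1)) - 1)) != 0) if shift > 1 else 0
    total = sig_big + kept
    if r == 1 and (s == 1 or (total & 1) == 1):    # round to nearest, ties to even
        total += 1
    return total, r, s


total, r, s = add_aligned(0b110011, 0b110001, 5)   # 1.10011 + 1.10001 * 2^-5
print(format(total, "b"), r, s)
```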
Signed Addition/Subtraction
• Simplest way - convert to 2’s compl.
• Cancellation of high order bit - shift
• more bits cancel - How many guard digits?
1.000001.111110.11111
+1.000000.00000101111
-1.11111010001cmpl
Float Division
$(s_1 \cdot 2^{e_1}) / (s_2 \cdot 2^{e_2}) = (s_1 / s_2) \cdot 2^{e_1 - e_2}$
$s_1 / s_2$: Integer division
$e_1 - e_2$: Biased subtraction
• Very similar to Multiplication
• Dividing using integer divide
• Compute 2 more bits (round, guard)
• Use remainder as sticky bit (Why?)
• Sign bit: XOR
More on floats
Rounding modes
• IEEE specifies 4 modes:
– Nearest (default)
– towards 0
– towards +inf
– towards -inf
• affects overflow (How?)
Exceptions
• Set a flag at:
– Underflow: $-1.0 \cdot 2^{E_{min}} < x < 1.0 \cdot 2^{E_{min}}$
– Overflow: magnitude of x beyond the largest finite number ($E_{max}$ exceeded)
– divide by 0: 1/0
– inexact: Rounding was needed
– invalid: operations that return NaN
• flags are sticky
Speeding up
• Different algorithms may be used
• Result should be exact
• divide: SRT algorithm in Pentium
– 5/2048 entries in a table
– 1/9,000,000 chance
– check:
Precision
• Why extended precisions?
– Return higher accuracy (D*D → ext. → D)
– use for computations: $\sqrt{x^2 + y^2}$
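The benefit of wider intermediates can be sketched with Python's `decimal` module. The `narrow` context is the 3-digit, Emax = 98 toy format from these slides; the `wide` context (6 digits, Emax = 998) is an arbitrary stand-in for an extended format, not a format defined by the standard.

```python
from decimal import Context, Decimal

narrow = Context(prec=3, Emin=-97, Emax=98, traps=[])
wide = Context(prec=6, Emin=-997, Emax=998, traps=[])


def hypot_wide(x, y):
    """sqrt(x*x + y*y) with intermediates in `wide`, final result rounded to `narrow`."""
    xx = wide.multiply(x, x)
    yy = wide.multiply(y, y)
    root = wide.sqrt(wide.add(xx, yy))
    return narrow.plus(root)       # unary plus rounds into the narrow format


x, y = Decimal("3E70"), Decimal("4E70")
print(hypot_wide(x, y))
print(narrow.multiply(x, x))       # the same square overflows in the narrow format
```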