The 'doubledouble' type  CodeProject
: 11
Introduction
Many scientific applications require accuracy greater than that provided by ‘double precision’. However, no hardware implementation of ‘quadruple precision’ exists, and the cost of implementation in software is often prohibitive. This article describes an effective compromise, implemented using the ‘doubleprecision’ type.
Disclaimer
The current implementation has not been thoroughly tested. While every attempt has been made to ensure accuracy, there are probably cases in which the code does not behave as expected. Do not trust your career or reputation to this code without thorough testing!
Enhancements
Version 1.1
 Accuracy improvements for most functions (the acos(), atan(), asinh() functions still need improvement).
 A test program, TestQD, has been added to the solution in order to provide basic "sanity checks" of the functions.
 Bug fixes  fmod(), round(), stream output.
 The fma() function has been renamed Fma() in order to avoid incompatibility with the VS2013 runtime library.
Version 1.1.1
 Bug fix  pow(negative, integer) now returns the correct value
 Bug fix  0.0 / 0.0 == NaN
 The Visual Studio solution provided uses Visual Studio 2013. For those still using VS 2012/2010, the solution and project files from version 1.1 should still work (no file names have been changed/added).
Version 1.1.3
 Bug fixes  (unsigned) int, (unsigned) long, and (unsigned) long long constructors
 Bug fixes  toInt(), toLong(), and toLongLong() converters
Background
I came across this solution for extending precision when performing a simulation of Brownian movement. The program required updating the positions of atoms in the simulation by extremely small amounts, which meant that the increment was rounded (and sometimes  shifted completely out). The only way to ensure that the calculations were accurate was to use higher precision.
Using the code
The code is provided as a Visual Studio 2010 project. The files qd\config.h and include\qd\qd_config.h contain macros that should be modified in order to support different compilers. In particular, some of these APIs are present in the standard C++11 library (but not in earlier versions).
With one exception, the code is designed as a dropin replacement for the 'double' type. For example, if your original code is:
#include <iostream>
#include <math>
int main(void)
{
double two = 2.0;
std::cout << std::sqrt( two );
return 0;
}
then the code could be rewritten to use the 'doubledouble' type as follows:
#include <iostream>
#include "dd_real.h"
int main(void)
{
dd_real two = "2.0";
std::cout << std::sqrt( two );
return 0;
}
The exception is initialization to a constant. In order to initialize a variable to full 'doubledouble' precision, the value must be passed as a string. This is because a value that is not passed as a string will be rounded at compilation time to 'double' precision, which misses the point.
The IEEE 754 Standard for FloatingPoint Arithmetic vs 'doubledouble'
The IEEE 7542008 Standard for FloatingPoint Arithmetic ^{[1]} defines three types for binary arithmetic – binary32, binary64, and binary128. Their properties are summarized in table 1.
Sign  Exponent  Mantissa  Hardware?  

binary32  1 bit  8 bits  24 bits  Y  'float' 
binary64  1 bit  11 bits  53 bits  Y  'double' 
binary128  1 bit  15 bits  113 bits  N 
 In the binary representations, the MSB of the mantissa is implicit. This accounts for the ‘extra’ bit in the above representations.
 While the binary32 and binary64 types are typically implemented in hardware, binary128 currently has no hardware implementation. Given that software implementations of binary64 are typically 10 to 100 times slower than hardware implementations, this implies that a software implementation of binary128 will be prohibitively slow.
The doubledouble type is a compromise between the extra precision of binary128 and the speed of binary64. It relies on properties of IEEE 754 floating point numbers to represent numbers in pairs of binary64 values, such that the sum of the pair represents the number. For best results, there should be no overlap between the bits in the pair, i.e. the absolute value of the smaller number should be no larger than 2^{53} of the absolute value of the larger.
The precision achieved in this manner is slightly lower than binary128 (106 bits rather than 113), but the exponent range is slightly smaller than that of binary64. The reason for this is that many operations will only work if neither number in the pair underflows. This means that the larger number must be at least 2^{53}*epsilon (the smallest binary64 number).
In this article, I present the basic algorithms without proof. Those interested in the proofs may find them in the references.
Preconditions
The implementation of doubledouble relies on the following conditions:

All rounding is to “nearest or even”

No “hidden bits” are involved in the calculation – operations are always rounded to binary64 with no intermediate step(s)
[2] implies that using an 80x87 in its standard mode (64bit precision) won’t work; it has to be set to 53bit precision. On the other hand, a program using SSE2 instructions will work fine. Note that Windows by default places the 80x87 in 53bit mode, in an attempt to be compatible with the SSE/SSE2 instructions.
Basic Operations ^{[2]}
The implementation of ‘doubledouble’ arithmetic relies on the ability to perform certain floatingpoint operations exactly. In all of the following, the notation RN( ) implies that the operation should be performed with rounding to nearest or even. It must be emphasized that these operations are only valid if neither overflow nor underflow occur during the operations. For example, addition of two likesigned infinities will produce a result of (inf, NaN), rather than the expected (inf, 0).
It is possible to make these basic operations behave as expected, but only at the cost of an additional test and (possible) branch.
The basic operations are as follows:
Addition/Subtraction ^{[3, 4, 5]}
Adds/subtracts two binary64 numbers, returning an exact result – the larger in magnitude is the rounded result of the addition/subtraction, and the smaller number is the residue from the exact result.
Two variants exist:
(s, t) = Fast2Sum(a, b)^{[3]} – requires that exponent(a) >= exponent(b)
s = RN( a + b ) t1 = RN( s  a ) t = RN( b  t1 )
(s, t) = 2Sum(a, b)^{[4,5]} – no preconditions on a, b
s = RN( a + b ) t1 = RN( s  b ) t2 = RN( s  t1 ) t3 = RN( a  t1 ) t4 = RN( b  t2 ) t = RN( t3 + t4 )
Veltkamp Splitting ^{[3]}
Takes a floatingpoint number x of precision p, and splits it into two floatingpoint numbers (x_{h}, x_{l}) so that:
 x_{h} has ps significant bits
 x_{l} has s significant bits
This assumes that no overflow occurs during the calculations. A quirk of binary floatingpoint is that x_{l} will actually fit into s1 bits (the sign bit of x_{l} is used as an additional bit). For binary64 (p=53), using s=27 means that both halves of the number will fit into 26 bits.
(x_{h}, x_{l}) = Split(x, s)
Define: C = 2^s + 1
t1 = RN( C * x )
t2 = RN( x – t1 )
xh = RN( t1 + t2 )
xl = RN( x – xh )
Multiplication ^{[2, 3]}
Multiplies two binary64 numbers, returning an exact result  the larger in magnitude is the rounded result of the multiplication, while the smaller is the resiudue from the exact result.
(r1, r2) = 2Mult(x, y) – AKA Dekker Product
Define: s = 27
(xh, xl) = Split( x, s )
(yh, yl ) = Split( y, s )
r1 = RN( x * y )
t1 = RN( r1 + RN( xh * yh ) )
t2 = RN( t1 + RN( xh * yl ) )
t3 = RN( t2 + RN( xl * yh ) )
r2 = RN( t3 + RN( xl * yl ) )
If a hardware fusedmultiplyaccumulate instruction is available, the following, much faster, implementation is possible:
(r1, r2) = 2MultFMA(x, y)
r1 = RN( x * y ) r2 = fma( x, y, r1 )
The ‘doubledouble’ Type
In the following, I base my examples on the code published by Bailey et al. in their QD library ^{[6,7]}. While the basic implementation is theirs, I have made the following enhancements to the code:
Functional:
 The implementations of trigonometric and hyperbolic functions have been rewritten for greater accuracy and speed. In particular, the reduction of trigonometric function arguments is now performed using the excellent code in fdlibm ^{[8]}
 Additional functions, not defined in the original QD library, have been added (e.g. expm1, logp1, cbrt, etc.)
Code organization:
 The original code may be configured to use faster, but less accurate basic functions (addition, multiplication, division). The less accurate versions of these functions have been removed
 All code for the ‘doubledouble’ type (with exception of class dd_real) has been moved into the qd namespace
 All helper functions have been moved into the dd_real class, so as not to pollute the global namespace
 If a mathematical function is defined in the C++ standard library for the ‘double’ type, it is also defined in the std namespace for the dd_real type
General
It must be emphasized that the ‘doubledouble’ type is an approximation to a 106bit floatingpoint type. It differs from the IEEE 754 binary128 (113 bits) type in many ways:
 Correct rounding is not guaranteed
While every attempt is made to provide ‘faithful’ rounding (i.e. return one of the values closest to the infinitelyprecise value), no guarantee is made that the returned value is actually the closest of the two.
A number such as 1 + 2^{200} may easily be represented in ‘doubledouble’ format, but is not a valid IEEE 754 binary128 number. This ‘wobbling’ precision complicates error analysis and other issues.
Full implementation of infinity and NaN arithmetic requires testing for normal operands and for normal results at all stages of the operation. It is possible to write a cheap test that will work for nonnormal (infinite / nan ) and for most normal operands, but it is possible that results within (1  2^{53}) of the largest magnitude will trigger an erroneous overflow condition.
 Theorems and lemmas relating to floatingpoint
The IEEE 754 floating point type obeys a large number of theorems that make error analysis etc. more tractable. Most of these theorems do not apply to 'doubledouble'
Basic Arithmetic
The basic idea is to use the exact operations defined above to calculate an approximation to the exact operation. Note that only the sum of the two components of the result is meaningful. In particular, the high part of the result does not necessarily equal the result of operating on the high parts, e.g. z_{h} <> x_{h} + y_{h}!
Note that optimized versions of the operators exist, e.g. for adding a 'double' to a 'doubledouble', etc.
Addition / Subtraction
(z_{h}, z_{l}) = ddAdd( (x_{h}, x_{l}), (y_{h}, y_{l}) )
(sh, sl) = 2Sum(xh, yh) (th, tl) = 2Sum(xl, yl) c = RN(sl + th) (vh, vl) = Fast2Sum(sh, c) w = RN( tl + vl ) (zh, zl) = Fast2Sum(vh, w)
Multiplication
(p_{h}, p_{l}) = ddMul( (x_{h}, x_{l}), (y_{h}, y_{l}) )
Note that the product of x_{l} and y_{l} is not calculated. This is because it can never contribute to the lower half of the result.
(ph, pl) = 2Prod(xh, yh) pl = RN(pl + RN(xh * yl)) pl = RN(pl + RN(xl * yh))
Division
(p_{h}, p_{l}) = ddDiv( (x_{h}, x_{l}), (y_{h}, y_{l}) )
No exact method for performing division exists. Instead, a first approximation is calculated as x_{h} / y_{h}, which is then refined by NewtonRaphson iteration.
Polynomial Functions
sqrt() and cbrt() are calculated using variants of NewtonRaphson iteration.
hypot() is calculated using an algorithm that ensures accuracy for all values.
pow() is calculated as follows:
 If the exponent is integral and smaller than LLONG_MAX, the value is calculated using successive squaring and multiplications.
 If the exponent is nonintegral, but smaller than LLONG_MAX, the exponent is split into integral and nonintegral components. The integral exponent is calculated as above, and the fractional exponent is calculated as exp( x* log( frac(y) ).
 If the exponent is larger than LLONG_MAX, it is calculated as exp( x * log( y ) )
Trigonometric Functions
For trigonometric functions, the argument is first reduced to the range [pi/4, pi/4] using the pi/2 remainder algorithm provided in fdlibm ^{[8]}. This is then divided by 8 in order to get a value in the range [pi/32, pi/32]. The sine and cosine are calculated as a polynomial, and the tangent is calculated as a ratio between the sine and cosine.
For the inverse functions, NewtonRaphson iterations are performed in order to calculate, e.g. y where tan( y ) – x = 0.
Logarithms
The logarithm functions (log, log10, log2) are calculated by multiplying x by the appropriate constant, and then calculating log2(C * x). We use log2() as the base function because it allows us to easily extract the integer component of the result (using the frexp() API), leaving only the logarithm of the fraction to be calculated.
Exponentials
The functions exp, exp2, expm1 are all calculated by reducing the argument to less than 1, and calculating the Taylor series. For expm1, the zeroth term of the Taylor series is not added.
Note that if the argument of expm1 is large enough, the result is identical to exp(x)  1.0.
Hyperbolic Functions
With the exception of sinh(x) where x is close to zero, these functions are calculated by using the exp() function as appropriate. For sinh(x) where x is close to zero, the Taylor series for the function is used in an attempt to reduce catastrophic cancellation.
Utility Functions
ldexp, frexp, fmod, modf, fma, copysign, etc. all exist and are defined in a manner identical to that of the similar functions in the C++ Standard.
std::numeric_limits is fully defined.
Constants
The functions dd_pi(), dd_e(), dd_ln10(), etc. return the expected values, rounded to nearest or even.
Known Bugs
 The accuracy of some of the inverse trigonometric (acos(), atan()) and inverse hyperbolic functions (asin()) needs to be improved.
References

IEEE Standard for FloatingPoint Arithmetic. IEEE Standard 7542008, Aug. 2008

Handbook of FloatingPoint Arithmetic. J.M. Muller et al., Birkhauser 201x

A floatingpoint technique for extending the available precision. T.J. Dekker, Numerische Mathematik 18(3):224242, 1971

The Art of Computer Programming vol. 2. D. Knuth, AddisonWesley Reading MA, 3^{rd} Edition, 1998

Quasidouble precision in floatingpoint addition. O. Moller. BIT 5:3750, 1965

Algorithms for quaddouble precision floatingpoint arithmetic. Y. Hida, X.S. Li, D.H. Bailey, Proceedings of the 15th IEEE Symposium on Computer Arithmetic pp 155162, 2001

The original QD library is available at http://crd.lbl.gov/~dhbailey/mpdist

The fdlibm library is available at http://www.netlib.org/fdlibm

Software Manual for the Elementary Functions. W. J. Cody Jr. & W. Waite, PrenticeHall, 1980
History
20150314 Original version 20150331 Bug fixes, basic test program