Efficient way to round double-precision numbers to a lower precision given in number of bits

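In C#, I want to round doubles to a lower precision so that I can store them in buckets of varying size in an associative array. Unlike the usual rounding, I want to round to a number of significant bits. Large numbers will therefore change more in absolute terms than small numbers, but they will tend to change by the same proportion. So if I want to round to 10 binary digits, I find the ten most significant bits and zero out all the lower bits, possibly adding a small number first so that the result rounds up.

I prefer "halfway" numbers to be rounded up.

If it were an integer type, this would be a possible algorithm: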

1. Find: the zero-based index H of the most significant binary digit that is set.
2. Compute: B = H - P, where P is the number of significant digits of precision to round to and B is the binary digit at which rounding starts (B = 0 is the ones place, B = 1 is the twos place, etc.). If B is negative, x already fits in P digits and is left unchanged.
3. Add: x = x + 2^B. This forces a carry if necessary (halfway values round up).
4. Zero out: x = x - (x mod 2^(B+1)). This clears the B place and all lower digits.
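The steps above can be sketched in C# for unsigned 64-bit integers. This is a minimal sketch, assuming .NET Core 3.0 or later for `System.Numerics.BitOperations.Log2`; the names `IntegerRounding` and `RoundToSignificantBits` are illustrative and not from the post. It treats step 4 as clearing bit B and everything below it, and it assumes that x + 2^B does not overflow.

```csharp
using System;
using System.Numerics;

public static class IntegerRounding
{
    // Rounds x to at most `precision` significant binary digits,
    // rounding halfway values up. Hypothetical helper for illustration.
    public static ulong RoundToSignificantBits(ulong x, int precision)
    {
        if (precision < 1)
            throw new ArgumentOutOfRangeException(nameof(precision));
        if (x == 0)
            return 0;

        int h = BitOperations.Log2(x);   // step 1: index of the highest set bit
        int b = h - precision;           // step 2: highest bit that is discarded
        if (b < 0)
            return x;                    // x already fits in `precision` bits

        ulong half = 1UL << b;           // 2^B
        ulong rounded = x + half;        // step 3: assumes no overflow near ulong.MaxValue
        ulong lowMask = (half << 1) - 1; // bits B..0
        return rounded & ~lowMask;       // step 4: clear bit B and everything below
    }
}
```

With `precision` set to 3, the values 9, 11, 13 and 15 become 10, 12, 14 and 16, while 8 and anything below stay unchanged.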

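The problem is finding an efficient way to locate the highest bit set. If I were using integers, there are cool bit hacks for finding the MSB. I would rather not call Round(Log2(x)) if I can help it. This function will be called many millions of times.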
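For a double, the position of the highest set bit is already stored in the exponent field, so no logarithm is needed. The sketch below reads it with `BitConverter.DoubleToInt64Bits`; the names `DoubleBits` and `HighestBitIndex` are illustrative, and the sketch deliberately rejects zero, subnormals, infinities and NaN rather than guessing how they should be handled.

```csharp
using System;

public static class DoubleBits
{
    // Returns floor(log2(|d|)) for normal, finite, nonzero doubles by
    // reading the 11-bit biased exponent (bits 52..62 of the encoding).
    public static int HighestBitIndex(double d)
    {
        long bits = BitConverter.DoubleToInt64Bits(d);
        int biasedExponent = (int)((bits >> 52) & 0x7FF);

        // 0 means zero or subnormal; 0x7FF means infinity or NaN.
        if (biasedExponent == 0 || biasedExponent == 0x7FF)
            throw new ArgumentException(
                "Zero, subnormal, infinite and NaN inputs are not handled in this sketch.",
                nameof(d));

        return biasedExponent - 1023; // remove the IEEE 754 exponent bias
    }
}
```

Because the significand of a normal double always starts at that bit, rounding to P significant bits never needs the index itself: it is enough to operate on a fixed number of low significand bits.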

Note: I have read this SO question:

What is a good way to round double-precision values to a (somewhat) lower precision?

It is for C++. I am using C#.

Update:

Here is my code as I am using it (modified from what the answerer supplied):

    /// <summary>
    /// Round numbers to a specified number of significant binary digits.
    ///
    /// For example, to 3 places, numbers from zero to seven are unchanged, because they only require 3 binary digits,
    /// but larger numbers lose precision:
    ///
    ///      8    1000 =>  1000    8
    ///      9    1001 =>  1010   10
    ///     10    1010 =>  1010   10
    ///     11    1011 =>  1100   12
    ///     12    1100 =>  1100   12
    ///     13    1101 =>  1110   14
    ///     14    1110 =>  1110   14
    ///     15    1111 => 10000   16
    ///     16   10000 => 10000   16
    ///
    /// This is different from rounding in that we are specifying the place where rounding occurs as the distance to the right
    /// in binary digits from the highest bit set, not the distance to the left from the zero bit.
    /// </summary>
    /// <param name="d">Number to be rounded.</param>
    /// <param name="digits">Number of binary digits of precision to preserve.</param>
    public static double AdjustPrecision(this double d, int digits)
    {
        // TODO: Not sure if this will work for both normalized and denormalized doubles. Needs more research.
        // Note: the mask below truncates the low bits; it does not round to nearest.
        var shift = 53 - digits; // IEEE 754 doubles have 53 bits of significand, but one bit is "implied" and not stored.
        ulong significandMask = (0xffffffffffffffffUL >> shift) << shift;
        var local_d = d;
        unsafe
        {
            // double -> fixed point (sorta)
            ulong toLong = *(ulong*)(&local_d);
            // mask off your least-sig bits
            var modLong = toLong & significandMask;
            // fixed point -> float (sorta)
            local_d = *(double*)(&modLong);
        }
        return local_d;
    }
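Since masking alone truncates, one way to get the "halfway rounds up" behaviour is to add half of the last kept bit to the raw encoding before masking. The following is a sketch under stated assumptions, not the code from the post: `BitRounding` and `RoundToBits` are made-up names, it uses `BitConverter` instead of pointers, and ties round away from zero in magnitude (so -9 becomes -10). A carry out of the significand increments the exponent field, which is the correct result for IEEE 754 encodings.

```csharp
using System;

public static class BitRounding
{
    // Rounds d to `digits` significant binary digits (1..53),
    // rounding halfway cases away from zero. Illustrative only.
    public static double RoundToBits(double d, int digits)
    {
        if (digits < 1 || digits > 53)
            throw new ArgumentOutOfRangeException(nameof(digits));
        if (double.IsNaN(d) || double.IsInfinity(d))
            return d;                      // leave special values untouched

        int shift = 53 - digits;           // number of low significand bits to drop
        if (shift == 0)
            return d;                      // full precision requested

        long bits = BitConverter.DoubleToInt64Bits(d);
        long half = 1L << (shift - 1);     // half of the last kept bit
        long mask = -1L << shift;          // ones above the dropped bits

        // Adding `half` may carry into the exponent field; that yields
        // the next power of two (or infinity at the top of the range).
        return BitConverter.Int64BitsToDouble((bits + half) & mask);
    }
}
```

For example, `BitRounding.RoundToBits(11.0, 3)` gives 12, matching the table in the doc comment above.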

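Update 2: Dekker's algorithm

I derived this from Dekker's algorithm, thanks to the other respondent. It rounds to the closest value instead of truncating as the code above does, and it uses only safe code: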

    private static double[] PowersOfTwoPlusOne;

    static NumericalAlgorithms()
    {
        PowersOfTwoPlusOne = new double[54];
        for (var i = 0; i < PowersOfTwoPlusOne.Length; i++)
        {
            if (i == 0)
                PowersOfTwoPlusOne[i] = 1; // Special case.
            else
            {
                long two_to_i_plus_one = (1L << i) + 1L;
                PowersOfTwoPlusOne[i] = (double)two_to_i_plus_one;
            }
        }
    }

    public static double AdjustPrecisionSafely(this double d, int digits)
    {
        double t = d * PowersOfTwoPlusOne[53 - digits];
        double adjusted = t - (t - d);
        return adjusted;
    }

Update 2: Timings

I ran a test and found that Dekker's algorithm is more than TWICE as fast!

Number of calls in test: 100,000,000
Unsafe time = 1.922 (sec)
Safe time = 0.799 (sec)
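For anyone who wants to run a similar comparison, a timing harness might look like the sketch below. It is not the test that produced the numbers above: the names `RoundingBenchmark`, `TruncateBits` and `DekkerRound` are hypothetical, and a `BitConverter`-based mask stands in for the unsafe pointer version so that the sketch compiles without the unsafe switch. Results will vary with the runtime, JIT and hardware.

```csharp
using System;
using System.Diagnostics;

public static class RoundingBenchmark
{
    // Truncating mask, using BitConverter in place of pointer casts.
    static double TruncateBits(double d, int digits)
    {
        long mask = -1L << (53 - digits);
        return BitConverter.Int64BitsToDouble(BitConverter.DoubleToInt64Bits(d) & mask);
    }

    // Dekker-style rounding with a precomputed (2^b + 1) factor.
    static double DekkerRound(double d, double scalePlusOne)
    {
        double t = d * scalePlusOne;
        return t - (t - d);
    }

    // Times `calls` invocations of each method; digits must be in 1..52.
    public static (double MaskSeconds, double DekkerSeconds) Run(int calls, int digits)
    {
        double scalePlusOne = (double)(1L << (53 - digits)) + 1.0;
        double sink = 0.0; // accumulate results so the loops are not optimized away

        var sw = Stopwatch.StartNew();
        for (int i = 1; i <= calls; i++)
            sink += TruncateBits(i * 1.000001, digits);
        double maskSeconds = sw.Elapsed.TotalSeconds;

        sw.Restart();
        for (int i = 1; i <= calls; i++)
            sink += DekkerRound(i * 1.000001, scalePlusOne);
        double dekkerSeconds = sw.Elapsed.TotalSeconds;

        GC.KeepAlive(sink);
        return (maskSeconds, dekkerSeconds);
    }
}
```

Calling `RoundingBenchmark.Run(100_000_000, 10)` in a release build, after a warm-up call, would be the closest analogue to the test described above.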

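Dekker's algorithm splits a floating-point number into high and low parts. If there are s bits in the significand (53 in IEEE 754 64-bit binary), then *x0 receives the high s - b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2^b. If b is known at compile time, e.g. the constant 43, you can replace Scale with 0x1p43. Otherwise, you must produce 2^b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic is fine too. It rounds ties to even, which is not what you requested (ties upward). Is that necessary?

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision (not greater).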

    void Split(double *x0, double *x1, double x)
    {
        double d = x * (Scale + 1);
        double t = d - x;
        *x0 = d - t;
        *x1 = x - *x0;
    }
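When b is only known at run time, 2^b can be built exactly by writing the biased exponent straight into the encoding. The C# sketch below shows that together with a port of Split that returns both parts. The names `DekkerSplit` and `PowerOfTwo` are illustrative, and the sketch assumes the runtime evaluates the arithmetic in plain double precision and that x * (2^b + 1) does not overflow.

```csharp
using System;

public static class DekkerSplit
{
    // Builds 2^b exactly from its IEEE 754 encoding (normal range only).
    public static double PowerOfTwo(int b)
    {
        if (b < -1022 || b > 1023)
            throw new ArgumentOutOfRangeException(nameof(b));
        return BitConverter.Int64BitsToDouble((long)(b + 1023) << 52);
    }

    // Splits x so that High holds the top 53 - b significand bits
    // (rounded to nearest, ties to even) and Low holds the remainder.
    public static (double High, double Low) Split(double x, int b)
    {
        double scale = PowerOfTwo(b);
        double d = x * (scale + 1.0);
        double t = d - x;
        double high = d - t;
        double low = x - high;
        return (high, low);
    }
}
```

With b = 50 (three significant bits kept), `Split(9.0, 50)` returns (8, 1) and `Split(11.0, 50)` returns (12, -1), which shows the ties-to-even behaviour: 9 goes down to 8 rather than up to 10.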

Interesting... I have never heard of a need for this, but I think you can "do it" with some funky unsafe code...

    void Main()
    {
        // how many bits you want "saved"
        var maxBits = 20;
        // create a mask like binary 1111...000 where # of 1's == maxBits
        var shift = (sizeof(int) * 8) - maxBits;
        uint maxBitsMask = (0xffffffff >> shift) << shift;
        // some floats
        var floats = new[] { 1.04125f, 2.19412347f, 3.1415926f };
        foreach (var f in floats)
        {
            var localf = f;
            unsafe
            {
                // float -> fixed point (sorta); read as uint so the mask
                // operation stays 32 bits wide
                uint toInt = *(uint*)(&localf);
                // mask off your least-sig bits
                uint modInt = toInt & maxBitsMask;
                // fixed point -> float (sorta)
                localf = *(float*)(&modInt);
            }
            Console.WriteLine("Was {0}, now {1}", f, localf);
        }
    }

And with doubles:

    void Main()
    {
        var maxBits = 50;
        var shift = (sizeof(long) * 8) - maxBits;
        var maxBitsMask = (0xffffffffffffffff >> shift) << shift;
        var doubles = new[] { 1412.04125, 22.19412347, 3.1415926 };
        foreach (var d in doubles)
        {
            var local = d;
            unsafe
            {
                var toLong = *(ulong*)(&local);
                var modLong = toLong & maxBitsMask;
                local = *(double*)(&modLong);
            }
            Console.WriteLine("Was {0}, now {1}", d, local);
        }
    }

Whoa... I got un-accepted. 🙂

For completeness, here is a version using Jeppe's approach, which avoids unsafe code:

    void Main()
    {
        var maxBits = 50;
        var shift = (sizeof(long) * 8) - maxBits;
        var maxBitsMask = (long)((0xffffffffffffffff >> shift) << shift);
        var doubles = new[] { 1412.04125, 22.19412347, 3.1415926 };
        foreach (var d in doubles)
        {
            var asLong = BitConverter.DoubleToInt64Bits(d);
            var modLong = asLong & maxBitsMask;
            var local = BitConverter.Int64BitsToDouble(modLong);
            Console.WriteLine("Was {0}, now {1}", d, local);
        }
    }
