Base-N encoding of a byte array

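A few days ago I came across this CodeReview question about Base-36 encoding a byte array. However, the answers there did not cover decoding back into a byte array, or reusing the approach to encode in a different base (radix).

The answer to the linked question uses BigInteger, so as far as the implementation goes, the base and its digits can be parameterized.

The problem with BigInteger is that it treats the input as an integer, whereas our input (a byte array) is just an opaque series of values.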

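If the byte array ends in a series of zero bytes, e.g. {0xFF,0x7F,0x00,0x00}, those bytes are lost when using the algorithm from the answer (only {0xFF,0x7F} is encoded).
If the last non-zero byte has its sign bit set, the zero byte that follows it is consumed, because it is treated as the BigInteger's sign delimiter. So {0xFF,0xFF,0x00,0x00} is encoded only as {0xFF,0xFF,0x00}.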
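Both cases can be reproduced without any encoder at all. The following is a minimal sketch, assuming only the documented behavior of System.Numerics.BigInteger (little-endian, two's-complement byte arrays); the class and method names are made up for illustration.

```csharp
using System;
using System.Numerics;

public static class TrailingZeroDemo
{
    // Round-trips a byte array through BigInteger, which is what a
    // BigInteger-based encoder/decoder pair does implicitly.
    public static byte[] RoundTrip(byte[] input)
    {
        // BigInteger(byte[]) reads the array as a little-endian
        // two's-complement integer, so trailing bytes are the most significant.
        var value = new BigInteger(input);
        // ToByteArray() returns the fewest bytes that still carry the sign.
        return value.ToByteArray();
    }

    public static void Main()
    {
        byte[] signBitClear = { 0xFF, 0x7F, 0x00, 0x00 };
        byte[] signBitSet = { 0xFF, 0xFF, 0x00, 0x00 };

        // Both trailing zeros are dropped: prints FF-7F
        Console.WriteLine(BitConverter.ToString(RoundTrip(signBitClear)));
        // One zero survives as the sign byte: prints FF-FF-00
        Console.WriteLine(BitConverter.ToString(RoundTrip(signBitSet)));
    }
}
```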

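How could a .NET programmer use BigInteger to create a reasonably efficient, radix-agnostic encoder that also supports decoding, handles endianness, and can work around the loss of ending zero bytes?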

Edit [2016/04/19]: If you're fond of exceptions, you may wish to change some of the Decode implementation code to throw InvalidDataException instead of just returning null.
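As a rough sketch of that change, the digit loop could throw as soon as it meets a character that is not in the digit string. StrictDigitDecoder and ParseDigits are hypothetical names, not part of the RadixEncoding class, and the loop only handles most-significant-first input.

```csharp
using System;
using System.IO;
using System.Numerics;

public static class StrictDigitDecoder
{
    // Hypothetical helper: accumulates most-significant-first digits into a
    // BigInteger and throws instead of returning null on an invalid character.
    public static BigInteger ParseDigits(string chars, string digits)
    {
        if (chars == null) throw new ArgumentNullException("chars");
        if (digits == null) throw new ArgumentNullException("digits");

        var radix = new BigInteger(digits.Length);
        var value = BigInteger.Zero;
        for (int x = 0; x < chars.Length; x++)
        {
            int i = digits.IndexOf(chars[x]);
            if (i < 0)
                throw new InvalidDataException(
                    string.Format("Invalid character '{0}' at index {1}", chars[x], x));
            value = value * radix + i;
        }
        return value;
    }

    public static void Main()
    {
        const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
        Console.WriteLine(ParseDigits("zz", k_base36_digits)); // 35 * 36 + 35
        try
        {
            ParseDigits("z!", k_base36_digits);
        }
        catch (InvalidDataException e)
        {
            Console.WriteLine(e.Message);
        }
    }
}
```

Callers then distinguish bad input from a legitimately empty result by catching the exception rather than checking for null.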

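Edit [2014/09/14]: I've added a 'HACK' to Encode() to handle the case where the last byte of the input is signed (i.e. it would be negative if converted to sbyte). The only sane solution I can think of right now is to Resize() the array by one. Additional unit tests for this case pass, but I didn't rerun the perf code to account for it. If you can help it, always have your input to Encode() include a dummy 0 byte at the end to avoid the additional allocation.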

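Usage

I've created a RadixEncoding class (found in the "Code" section) which is initialized with three parameters:

the radix digits as a string (the length of course determines the actual radix),
the assumed byte ordering (endianness) of input byte arrays,
and whether the user wants the encode/decode logic to acknowledge ending zero bytes.

To create a Base-36 encoding, with little-endian input, that does not acknowledge ending zero bytes: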

    const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
    var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);

And then to actually perform the encoding/decoding:

    const string k_input = "A test 1234";
    byte[] input_bytes = System.Text.Encoding.UTF8.GetBytes(k_input);
    string encoded_string = base36_no_zeros.Encode(input_bytes);
    byte[] decoded_bytes = base36_no_zeros.Decode(encoded_string);

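Performance

Timed with Diagnostics.Stopwatch, run on an i7 860 @ 2.80GHz. The timing EXE was run by itself, not under a debugger.

The encoding was initialized with the same k_base36_digits string as above, EndianFormat.Little, and with ending zero bytes acknowledged (even though the UTF8 bytes don't have any extra ending zero bytes).

Encoding the UTF8 bytes of "A test 1234" 1,000,000 times takes 2.6567905 secs
Decoding the same string the same number of times takes 3.3916248 secs

Encoding the UTF8 bytes of "A test 1234. Made slightly larger!" 100,000 times takes 1.1577325 secs
Decoding the same string the same number of times takes 1.244326 secs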
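The timing program itself isn't shown, so the following is only a sketch of how such numbers could be collected with Stopwatch. EncodeTiming and Time are made-up names, the warm-up call is an assumption of this sketch, and Convert.ToBase64String is just a stand-in workload so the block compiles on its own.

```csharp
using System;
using System.Diagnostics;

public static class EncodeTiming
{
    // Runs the action the given number of times and returns the total elapsed time.
    public static TimeSpan Time(Action action, int iterations)
    {
        action(); // warm-up call, so JIT compilation is not part of the measurement

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        byte[] input = System.Text.Encoding.UTF8.GetBytes("A test 1234");

        // Stand-in workload; replace with a call to RadixEncoding.Encode(input)
        // to time the class from this post.
        TimeSpan elapsed = Time(() => Convert.ToBase64String(input), 1000000);
        Console.WriteLine("{0} secs", elapsed.TotalSeconds);
    }
}
```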

Code

If you don't have a CodeContracts generator, you will have to reimplement the contracts with if/throw code.
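For instance, a Contract.Requires(x != null) precondition maps onto a plain null check at the top of the member. A minimal sketch, using a hypothetical GuardedDigits class rather than the real RadixEncoding constructor; the length check is an extra assumption that the original contracts do not state.

```csharp
using System;

public sealed class GuardedDigits
{
    readonly string digits;

    public GuardedDigits(string digits)
    {
        // Replaces: Contract.Requires(digits != null);
        if (digits == null)
            throw new ArgumentNullException("digits");

        // Assumption of this sketch (not in the original contracts):
        // a positional radix needs at least two distinct digits.
        if (digits.Length < 2)
            throw new ArgumentException("At least two digits are required.", "digits");

        this.digits = digits;
    }

    public int Radix { get { return digits.Length; } }

    public static void Main()
    {
        Console.WriteLine(new GuardedDigits("0123456789abcdef").Radix);
        try
        {
            new GuardedDigits(null);
        }
        catch (ArgumentNullException e)
        {
            Console.WriteLine(e.ParamName);
        }
    }
}
```

A Contract.Ensures postcondition has no direct if/throw equivalent at the top of a method; it would have to be checked just before each return, or dropped.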

    using System;
    using System.Collections.Generic;
    using System.Numerics;
    using Contract = System.Diagnostics.Contracts.Contract;

    public enum EndianFormat
    {
        /// <summary>Least Significant Bit order (lsb)</summary>
        /// <remarks>Right-to-Left</remarks>
        Little,
        /// <summary>Most Significant Bit order (msb)</summary>
        /// <remarks>Left-to-Right</remarks>
        Big,
    };

    /// <summary>Encodes/decodes bytes to/from a string</summary>
    /// <remarks>
    /// Encoded string is always in big-endian ordering
    ///
    /// The constructor takes an includeProceedingZeros parameter, honored by Encode and Decode,
    /// which acts as a work-around for an edge case with our BigInteger implementation.
    /// MSDN says BigInteger byte arrays are in LSB->MSB ordering. So a byte buffer with zeros at the
    /// end will have those zeros ignored in the resulting encoded radix string.
    /// If such a loss in precision absolutely cannot occur pass true to includeProceedingZeros
    /// and for a tiny bit of extra processing it will handle the padding of zero digits (encoding)
    /// or bytes (decoding).
    /// Note: doing this for decoding may add an extra byte more than what was originally
    /// given to Encode.
    /// </remarks>
    // Based on the answers from http://codereview.stackexchange.com/questions/14084/base-36-encoding-of-a-byte-array/
    public class RadixEncoding
    {
        const int kByteBitCount = 8;

        readonly string kDigits;
        readonly double kBitsPerDigit;
        readonly BigInteger kRadixBig;
        readonly EndianFormat kEndian;
        readonly bool kIncludeProceedingZeros;

        /// <summary>Numerical base of this encoding</summary>
        public int Radix { get { return kDigits.Length; } }
        /// <summary>Endian ordering of bytes input to Encode and output by Decode</summary>
        public EndianFormat Endian { get { return kEndian; } }
        /// <summary>True if we want ending zero bytes to be encoded</summary>
        public bool IncludeProceedingZeros { get { return kIncludeProceedingZeros; } }

        public override string ToString()
        {
            return string.Format("Base-{0} {1}", Radix.ToString(), kDigits);
        }

        /// <summary>Create a radix encoder using the given characters as the digits in the radix</summary>
        /// <param name="digits">Digits to use for the radix-encoded string</param>
        /// <param name="bytesEndian">Endian ordering of bytes input to Encode and output by Decode</param>
        /// <param name="includeProceedingZeros">True if we want ending zero bytes to be encoded</param>
        public RadixEncoding(string digits,
            EndianFormat bytesEndian = EndianFormat.Little, bool includeProceedingZeros = false)
        {
            Contract.Requires(digits != null);
            int radix = digits.Length;

            kDigits = digits;
            kBitsPerDigit = System.Math.Log(radix, 2);
            kRadixBig = new BigInteger(radix);
            kEndian = bytesEndian;
            kIncludeProceedingZeros = includeProceedingZeros;
        }

        // Number of characters needed for encoding the specified number of bytes
        int EncodingCharsCount(int bytesLength)
        {
            return (int)Math.Ceiling((bytesLength * kByteBitCount) / kBitsPerDigit);
        }
        // Number of bytes needed for decoding the specified number of characters
        int DecodingBytesCount(int charsCount)
        {
            return (int)Math.Ceiling((charsCount * kBitsPerDigit) / kByteBitCount);
        }

        /// <summary>Encode a byte array into a radix-encoded string</summary>
        /// <param name="bytes">byte array to encode</param>
        /// <returns>The bytes encoded into a radix-encoded string</returns>
        /// <remarks>If <paramref name="bytes"/> is zero length, returns an empty string</remarks>
        public string Encode(byte[] bytes)
        {
            Contract.Requires(bytes != null);
            Contract.Ensures(Contract.Result<string>() != null);

            // Don't really have to do this, our code will build this result (empty string),
            // but why not catch the condition before doing work?
            if (bytes.Length == 0) return string.Empty;

            // if the array ends with zeros, having the capacity set to this will help us know how much
            // 'padding' we will need to add
            int result_length = EncodingCharsCount(bytes.Length);
            // List<> has a(n in-place) Reverse method. StringBuilder doesn't. That's why.
            var result = new List<char>(result_length);

            // HACK: BigInteger uses the last byte as the 'sign' byte. If that byte's MSB is set,
            // we need to pad the input with an extra 0 (ie, make it positive)
            if ( (bytes[bytes.Length-1] & 0x80) == 0x80 )
                Array.Resize(ref bytes, bytes.Length+1);

            var dividend = new BigInteger(bytes);
            // IsZero's computation is less complex than evaluating "dividend > 0"
            // which invokes BigInteger.CompareTo(BigInteger)
            while (!dividend.IsZero)
            {
                BigInteger remainder;
                dividend = BigInteger.DivRem(dividend, kRadixBig, out remainder);
                int digit_index = System.Math.Abs((int)remainder);
                result.Add(kDigits[digit_index]);
            }

            if (kIncludeProceedingZeros)
                for (int x = result.Count; x < result.Capacity; x++)
                    result.Add(kDigits[0]); // pad with the character that represents 'zero'

            // orientate the characters in big-endian ordering
            if (kEndian == EndianFormat.Little)
                result.Reverse();
            // If we didn't end up adding padding, ToArray will end up returning a TrimExcess'd array,
            // so nothing wasted
            return new string(result.ToArray());
        }

        void DecodeImplPadResult(ref byte[] result, int padCount)
        {
            if (padCount > 0)
            {
                int new_length = result.Length + DecodingBytesCount(padCount);
                Array.Resize(ref result, new_length); // new bytes will be zero, just the way we want it
            }
        }

        #region Decode (Little Endian)
        byte[] DecodeImpl(string chars, int startIndex = 0)
        {
            var bi = new BigInteger();
            for (int x = startIndex; x < chars.Length; x++)
            {
                int i = kDigits.IndexOf(chars[x]);
                if (i < 0) return null; // invalid character
                bi *= kRadixBig;
                bi += i;
            }

            return bi.ToByteArray();
        }
        byte[] DecodeImplWithPadding(string chars)
        {
            int pad_count = 0;
            for (int x = 0; x < chars.Length; x++, pad_count++)
                if (chars[x] != kDigits[0]) break;

            var result = DecodeImpl(chars, pad_count);
            DecodeImplPadResult(ref result, pad_count);

            return result;
        }
        #endregion

        #region Decode (Big Endian)
        byte[] DecodeImplReversed(string chars, int startIndex = 0)
        {
            var bi = new BigInteger();
            for (int x = (chars.Length-1)-startIndex; x >= 0; x--)
            {
                int i = kDigits.IndexOf(chars[x]);
                if (i < 0) return null; // invalid character
                bi *= kRadixBig;
                bi += i;
            }

            return bi.ToByteArray();
        }
        byte[] DecodeImplReversedWithPadding(string chars)
        {
            int pad_count = 0;
            for (int x = chars.Length - 1; x >= 0; x--, pad_count++)
                if (chars[x] != kDigits[0]) break;

            var result = DecodeImplReversed(chars, pad_count);
            DecodeImplPadResult(ref result, pad_count);

            return result;
        }
        #endregion

        /// <summary>Decode a radix-encoded string into a byte array</summary>
        /// <param name="radixChars">radix string</param>
        /// <returns>The decoded bytes, or null if an invalid character is encountered</returns>
        /// <remarks>
        /// If <paramref name="radixChars"/> is an empty string, returns a zero length array
        ///
        /// Using <see cref="IncludeProceedingZeros"/> has the potential to return a buffer with an
        /// additional zero byte that wasn't in the input. So if a 4 byte buffer was encoded, this could end up
        /// returning a 5 byte buffer, with the extra byte being null.
        /// </remarks>
        public byte[] Decode(string radixChars)
        {
            Contract.Requires(radixChars != null);

            if (kEndian == EndianFormat.Big)
                return kIncludeProceedingZeros ? DecodeImplReversedWithPadding(radixChars) : DecodeImplReversed(radixChars);
            else
                return kIncludeProceedingZeros ? DecodeImplWithPadding(radixChars) : DecodeImpl(radixChars);
        }
    };

Basic unit tests

    using System;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    // The original snippet omitted the enclosing class; the name here is arbitrary.
    [TestClass]
    public class RadixEncodingTests
    {
        // True if output starts with every element of input (output may be longer,
        // since decoding with zero padding can append extra zero bytes)
        static bool ArraysCompareN<T>(T[] input, T[] output)
            where T : IEquatable<T>
        {
            if (output.Length < input.Length) return false;
            for (int x = 0; x < input.Length; x++)
                if (!output[x].Equals(input[x])) return false;

            return true;
        }
        static bool RadixEncodingTest(RadixEncoding encoding, byte[] bytes)
        {
            string encoded = encoding.Encode(bytes);
            byte[] decoded = encoding.Decode(encoded);

            return ArraysCompareN(bytes, decoded);
        }
        [TestMethod]
        public void TestRadixEncoding()
        {
            const string k_base36_digits = "0123456789abcdefghijklmnopqrstuvwxyz";
            var base36 = new RadixEncoding(k_base36_digits, EndianFormat.Little, true);
            // false, to match the variable name (the original passed true here as well)
            var base36_no_zeros = new RadixEncoding(k_base36_digits, EndianFormat.Little, false);

            byte[] ends_with_zero_neg = { 0xFF, 0xFF, 0x00, 0x00 };
            byte[] ends_with_zero_pos = { 0xFF, 0x7F, 0x00, 0x00 };
            byte[] text = System.Text.Encoding.ASCII.GetBytes("A test 1234");

            Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_neg));
            Assert.IsTrue(RadixEncodingTest(base36, ends_with_zero_pos));
            Assert.IsTrue(RadixEncodingTest(base36_no_zeros, text));
        }
    }

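Interestingly, I was able to port Kornman's technique to Java and got the expected output up to and including base36. However, when running his code from C#, compiled with the csc in C:\Windows\Microsoft.NET\Framework\v4.0.30319, the output was not as expected.

For example, when I base16-encoded the MD5 hashBytes obtained below for the string "hello world" with Kornman's RadixEncoding Encode, the two hex characters in each byte's group came out in the wrong order.

Rather than 5eb63bbbe01eeed093cb22bb8f5acdc3

I saw something like e56bb3bb0ee1...

This was on Windows 7.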

    using System;
    using System.Text;

    class Program
    {
        const string input = "hello world";

        public static void Main(string[] args)
        {
            using (System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create())
            {
                byte[] inputBytes = System.Text.Encoding.ASCII.GetBytes(input);
                byte[] hashBytes = md5.ComputeHash(inputBytes);

                // Convert the byte array to hexadecimal string
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < hashBytes.Length; i++)
                {
                    sb.Append(hashBytes[i].ToString("X2"));
                }
                Console.WriteLine(sb.ToString());
            }
        }
    }
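One possible explanation, not confirmed by either poster: .NET's BigInteger reads the byte array as little-endian, so repeated division by 16 produces the low nibble of the first byte first. If those digits are emitted in the order they are produced, each byte's pair of hex characters appears swapped. The sketch below (DigitOrderDemo is a made-up name, not code from RadixEncoding) reproduces the reported prefix from the first six bytes of the expected hash.

```csharp
using System;
using System.Numerics;
using System.Text;

public static class DigitOrderDemo
{
    // Returns the radix digits of the little-endian value of bytes,
    // in the order DivRem produces them (least significant digit first).
    public static string DigitsLeastSignificantFirst(byte[] bytes, string digits)
    {
        // Append a zero byte so the value is never interpreted as negative.
        var padded = new byte[bytes.Length + 1];
        Array.Copy(bytes, padded, bytes.Length);

        var value = new BigInteger(padded);
        var radix = new BigInteger(digits.Length);
        var sb = new StringBuilder();
        while (!value.IsZero)
        {
            BigInteger remainder;
            value = BigInteger.DivRem(value, radix, out remainder);
            sb.Append(digits[(int)remainder]);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // First six bytes of 5eb63bbbe01eeed093cb22bb8f5acdc3
        byte[] firstHashBytes = { 0x5e, 0xb6, 0x3b, 0xbb, 0xe0, 0x1e };

        // prints e56bb3bb0ee1
        Console.WriteLine(DigitsLeastSignificantFirst(firstHashBytes, "0123456789abcdef"));
    }
}
```

Java's BigInteger(byte[]) constructor reads the array as big-endian instead, which may be why the Java port prints the bytes in the familiar order.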

The Java code is below for anyone interested. As mentioned above, it only works up to base 36.

    import java.math.BigInteger;

    // Members of an enclosing class (the class declaration was not part of the original snippet)
    private static final char[] BASE16_CHARS = "0123456789abcdef".toCharArray();
    private static final BigInteger BIGINT_16 = BigInteger.valueOf(16);

    private static final char[] BASE36_CHARS = "0123456789abcdefghijklmnopqrstuvwxyz".toCharArray();
    private static final BigInteger BIGINT_36 = BigInteger.valueOf(36);

    public static String toBaseX(byte[] bytes, BigInteger base, char[] chars)
    {
        if (bytes == null) {
            return null;
        }

        final int bitsPerByte = 8;
        double bitsPerDigit = Math.log(chars.length) / Math.log(2);

        // Number of chars to encode specified bytes
        int size = (int) Math.ceil((bytes.length * bitsPerByte) / bitsPerDigit);

        StringBuilder sb = new StringBuilder(size);

        // Peel off the least significant digit each pass and prepend it,
        // so the finished string reads most significant digit first
        for (BigInteger value = new BigInteger(bytes); !value.equals(BigInteger.ZERO);) {
            BigInteger[] quotientAndRemainder = value.divideAndRemainder(base);
            sb.insert(0, chars[Math.abs(quotientAndRemainder[1].intValue())]);
            value = quotientAndRemainder[0];
        }

        return sb.toString();
    }
