Regex does not work with Unicode character ranges

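Note

This question was marked as a duplicate of "C# Regular Expressions with \Uxxxxxxxx characters in the pattern". It differs in that it is not about how surrogate pairs are calculated, but about how to express Unicode planes higher than 0 in a regex. It should be clear from my question that I already understand why these code points are represented as 2 characters: they are surrogate pairs (which is what the other question asked about). My question is how to convert them generically (since I have no control over the regexes being fed to the program) so they can be consumed by the .NET Regex engine.

Note that I now have a way to do this and would like to add an answer to my question, but since it is now marked as a duplicate, I cannot add my answer.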

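I have some test data that is being passed to a Java library that I am porting to C#. I have isolated one particular problem case as an example. The character class in the original is UTF-32 = \U0001BCA0-\U0001BCA3, which .NET cannot readily consume: we get an "Unrecognized escape sequence \U" error.

I tried converting to UTF-16, and I have confirmed that the results for \U0001BCA0 and \U0001BCA3 are what is expected.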

    UTF-32     | Codepoint | High Surrogate | Low Surrogate | UTF-16
    ---------------------------------------------------------------------------
    0x0001BCA0 | 113824    | 55343          | 56480         | \uD82F\uDCA0
    0x0001BCA3 | 113827    | 55343          | 56483         | \uD82F\uDCA3
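The conversion in the table can be reproduced with `char.ConvertFromUtf32`. The following is only a minimal sketch; the class and method names (`SurrogateSketch`, `ToEscapedUtf16`) are made up for illustration and are not part of the original post.

```csharp
using System;

public static class SurrogateSketch
{
    // Returns the UTF-16 code units of a code point as \uXXXX escapes.
    // (Hypothetical helper, named here for illustration only.)
    public static string ToEscapedUtf16(int codePoint)
    {
        // char.ConvertFromUtf32 yields one char for plane 0 and a
        // surrogate pair (two chars) for planes 1-16.
        string utf16 = char.ConvertFromUtf32(codePoint);
        string escaped = "";
        foreach (char unit in utf16)
        {
            escaped += "\\u" + ((int)unit).ToString("X4");
        }
        return escaped;
    }
}
```

Calling `SurrogateSketch.ToEscapedUtf16(0x1BCA0)` returns the text `\uD82F\uDCA0`, matching the first row of the table.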

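However, when I pass the string "([\uD82F\uDCA0-\uD82F\uDCA3])" to the constructor of the Regex class, I get the exception "[x-y] range in reverse order".

Although it is clear that the characters are specified in the correct order (it works in Java), I tried them the other way around and got the same error message.

I also tried changing the UTF-32 characters from \U0001BCA0-\U0001BCA3 to \x01BCA0-\x01BCA3, but I still get the "[x-y] range in reverse order" exception.

So, how do I get the .NET Regex class to parse this character range successfully?

Note: I tried changing the code to generate a regex character class that lists every character instead of a range, and it seems to work, but that turns my regex of a few dozen characters into one of several thousand characters, which surely won't do wonders for performance.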

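Actual regex example

Again, the above is an isolated example of the failure from a larger string. What I am looking for is a general way to convert regexes like this one so they can be parsed by the .NET Regex class.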

    "([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
    "\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
    "\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
    "\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
    "| [\\u000D] | [\\u000A]) ()"

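You are assuming that Regex recognizes "\uD82F\uDCA0" as one compound character. That is not the case, because the internal representation of strings in .NET is 16-bit Unicode (UTF-16).

Unicode has the concept of a code point, an abstraction that is independent of the physical representation. Depending on the encoding actually used, not every code point fits into a single code unit. In UTF-8 this is very apparent, since every code point above 127 needs two or more bytes. In .NET, a char is a 16-bit UTF-16 code unit, which means that for planes above 0 you need to combine two chars (a surrogate pair). The regex engine still treats these as individual characters.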
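To make the difference between code points and code units concrete, here is a small sketch that counts how many UTF-16 chars and UTF-8 bytes a single code point occupies. The class and method names are illustrative assumptions, not taken from the answer.

```csharp
using System;
using System.Text;

public static class CodeUnitSketch
{
    // Number of 16-bit chars .NET needs to store the code point.
    public static int Utf16Units(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint).Length;
    }

    // Number of bytes the same code point needs in UTF-8.
    public static int Utf8Bytes(int codePoint)
    {
        return Encoding.UTF8.GetByteCount(char.ConvertFromUtf32(codePoint));
    }
}
```

For U+0041 both counts are 1; for U+1BCA0 the string holds 2 chars (a surrogate pair) and encodes to 4 bytes in UTF-8.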

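Long story short: don't think of the character combinations as code points; treat them as individual characters. So in your case, the regex would be: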

    using System;
    using System.Text.RegularExpressions;

    public class Program
    {
        public static void Main()
        {
            // The high surrogate is matched literally; only the low surrogate is a range.
            var regex = new Regex("(\uD82F[\uDCA0-\uDCA3])");
            Console.WriteLine(regex.Match("\uD82F\uDCA2").Success);
        }
    }


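Strings in C# are UTF-16 encoded. That is why this regex is interpreted as:

  • the symbol '\uD82F'
  • the range \uDCA0-\uD82F
  • the symbol '\uDCA3'

The range \uDCA0-\uD82F is obviously incorrect (its start is greater than its end) and causes the "[x-y] range in reverse order" exception.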
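The behavior described above can be checked with a short sketch. The helper name `Compiles` is hypothetical; it relies only on the fact that the `Regex` constructor throws an `ArgumentException` (or a type derived from it) for a pattern it cannot parse.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RangeParseSketch
{
    // Returns true if the .NET Regex constructor accepts the pattern.
    public static bool Compiles(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Invalid patterns, including reversed ranges, end up here.
            return false;
        }
    }
}
```

Under this sketch, `Compiles("[\uD82F\uDCA0-\uD82F\uDCA3]")` is false because the class contains the reversed range \uDCA0-\uD82F, while `Compiles("\uD82F[\uDCA0-\uDCA3]")` is true.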

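Unfortunately, there is no simple solution, because this is caused by the nature of C# strings. You cannot put a UTF-32 symbol into a single C# char, and you cannot use a multi-char string as a range boundary.

A possible workaround is a half-regex solution: extract such symbols from the string and do the comparison in plain C# code. It looks ugly, of course, but I don't see another way to achieve this with raw regexes in C#.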
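One way to read the "plain C# comparison" idea is to walk the string by code point and test each one against the range numerically. This is only a sketch under that assumption; `ContainsCodePointInRange` is a hypothetical helper, not something the answer itself provides.

```csharp
using System;

public static class CodePointRangeSketch
{
    // True if the text contains any code point in [first, last] (inclusive).
    public static bool ContainsCodePointInRange(string text, int first, int last)
    {
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint;
            if (char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // Combine the surrogate pair into one code point.
                codePoint = char.ConvertToUtf32(text[i], text[i + 1]);
                i++; // skip the low surrogate
            }
            else
            {
                // BMP character (or an unpaired surrogate, taken as-is).
                codePoint = text[i];
            }

            if (codePoint >= first && codePoint <= last)
            {
                return true;
            }
        }
        return false;
    }
}
```

For example, `ContainsCodePointInRange("a\uD82F\uDCA2b", 0x1BCA0, 0x1BCA3)` is true, while the same call on `"abc"` is false. The trade-off is the one the answer mentions: the range logic lives in code rather than in the pattern.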

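While the other contributors to this question provided some clues, I needed an answer. My test is a rules engine driven by a regex that is built from file input, so hard-coding the logic into C# is not an option.

However, I did learn here that

  1. the .NET Regex class doesn't support surrogate pairs, and
  2. you can fake support for surrogate pair ranges by using regex alternation

But of course, in my data-driven case I can't manually change the regexes into a format that .NET will accept; I need to automate it. So I created the Utf32Regex class below, which accepts UTF32 characters directly in the constructor and internally converts them into regexes that .NET understands.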

For example, it will convert the regex

    "[abc\\U00011DEF-\\U00013E07]"

to

    "(?:[abc]|\\uD807[\\uDDEF-\\uDFFF]|[\\uD808-\\uD80E][\\uDC00-\\uDFFF]|\\uD80F[\\uDC00-\\uDE07])"

Or

    "([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
    "\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
    "\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
    "\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
    "| [\\u000D] | [\\u000A]) ()"

to

    "((?:[\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD\\u061C\\u180E" +
    "\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-\\uDFFF\\uFEFF\\uFFF0-\\uFFFB]|" +
    "\\uD82F[\\uDCA0-\\uDCA3]|\\uD834[\\uDD73-\\uDD7A]|\\uDB40[\\uDC00-\\uDC1F]|" +
    "\\uDB40[\\uDC80-\\uDCFF]|\\uDB40[\\uDDF0-\\uDFFF]|[\\uDB41-\\uDB42][\\uDC00-\\uDFFF]|" +
    "\\uDB43[\\uDC00-\\uDFFF]) | [\\u000D] | [\\u000A]) ()"

Utf32Regex.cs

    using System;
    using System.Globalization;
    using System.Text;
    using System.Text.RegularExpressions;

    /// <summary>
    /// Patches the <see cref="Regex"/> class so it will automatically convert and interpret
    /// UTF32 characters expressed like \U00010000 or UTF32 ranges expressed
    /// like \U00010000-\U00010001.
    /// </summary>
    public class Utf32Regex : Regex
    {
        private const char MinLowSurrogate = '\uDC00';
        private const char MaxLowSurrogate = '\uDFFF';
        private const char MinHighSurrogate = '\uD800';
        private const char MaxHighSurrogate = '\uDBFF';

        // Match any character class such as [A-z]: an unescaped '[' up to the
        // next unescaped ']'. (Pattern reconstructed from its usage below,
        // where Groups[1] holds the whole class.)
        private static readonly Regex characterClass = new Regex(
            "(?<!\\\\)(\\[.*?(?<!\\\\)\\])",
            RegexOptions.Compiled);

        // Match a UTF32 range such as \U000E01F0-\U000E0FFF
        // or an individual character such as \U000E0FFF
        private static readonly Regex utf32Range = new Regex(
            "(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})-(?<end>\\\\U(?:00)?[0-9A-Fa-f]{6})|(?<begin>\\\\U(?:00)?[0-9A-Fa-f]{6})",
            RegexOptions.Compiled);

        public Utf32Regex()
            : base()
        {
        }

        public Utf32Regex(string pattern)
            : base(ConvertUTF32Characters(pattern))
        {
        }

        public Utf32Regex(string pattern, RegexOptions options)
            : base(ConvertUTF32Characters(pattern), options)
        {
        }

        public Utf32Regex(string pattern, RegexOptions options, TimeSpan matchTimeout)
            : base(ConvertUTF32Characters(pattern), options, matchTimeout)
        {
        }

        private static string ConvertUTF32Characters(string regexString)
        {
            StringBuilder result = new StringBuilder();

            // Convert any UTF32 character ranges \U00000000-\U00FFFFFF to their
            // equivalent UTF16 characters
            ConvertUTF32CharacterClassesToUTF16Characters(regexString, result);

            // Now find all of the individual characters that were not in ranges and
            // fix those as well.
            ConvertUTF32CharactersToUTF16(result);

            return result.ToString();
        }

        private static void ConvertUTF32CharacterClassesToUTF16Characters(string regexString, StringBuilder result)
        {
            Match match = characterClass.Match(regexString); // Reset
            int lastEnd = 0;
            if (match.Success)
            {
                do
                {
                    // Renamed from "characterClass" so the local no longer shadows the field.
                    string classText = match.Groups[1].Value;
                    string convertedCharacterClass = ConvertUTF32CharacterRangesToUTF16Characters(classText);

                    result.Append(regexString.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                    result.Append(convertedCharacterClass); // Append replacement

                    lastEnd = match.Index + match.Length;
                } while ((match = match.NextMatch()).Success);
            }
            result.Append(regexString.Substring(lastEnd)); // Append tail
        }

        private static string ConvertUTF32CharacterRangesToUTF16Characters(string characterClass)
        {
            StringBuilder result = new StringBuilder();
            StringBuilder chars = new StringBuilder();

            Match match = utf32Range.Match(characterClass); // Reset
            int lastEnd = 0;
            if (match.Success)
            {
                do
                {
                    string utf16Chars;
                    string rangeBegin = match.Groups["begin"].Value.Substring(2);

                    if (!string.IsNullOrEmpty(match.Groups["end"].Value))
                    {
                        string rangeEnd = match.Groups["end"].Value.Substring(2);
                        utf16Chars = UTF32RangeToUTF16Chars(rangeBegin, rangeEnd);
                    }
                    else
                    {
                        utf16Chars = UTF32ToUTF16Chars(rangeBegin);
                    }

                    result.Append(characterClass.Substring(lastEnd, match.Index - lastEnd)); // Remove the match
                    chars.Append(utf16Chars); // Append replacement

                    lastEnd = match.Index + match.Length;
                } while ((match = match.NextMatch()).Success);
            }
            result.Append(characterClass.Substring(lastEnd)); // Append tail of character class

            // Special case - if we have removed all of the contents of the
            // character class, we need to remove the square brackets and the
            // alternation character |
            int emptyCharClass = result.IndexOf("[]");
            if (emptyCharClass >= 0)
            {
                result.Remove(emptyCharClass, 2);
                // Append replacement ranges (exclude beginning |)
                result.Append(chars.ToString(1, chars.Length - 1));
            }
            else
            {
                // Append replacement ranges
                result.Append(chars.ToString());
            }

            if (chars.Length > 0)
            {
                // Wrap both the character class and any UTF16 character alternation into
                // a non-capturing group.
                return "(?:" + result.ToString() + ")";
            }
            return result.ToString();
        }

        private static void ConvertUTF32CharactersToUTF16(StringBuilder result)
        {
            while (true)
            {
                int where = result.IndexOf("\\U00");
                if (where < 0)
                {
                    break;
                }
                int codePoint = int.Parse(result.ToString(where + 2, 8),
                    NumberStyles.HexNumber, CultureInfo.InvariantCulture);

                // Outside a character class the pair is emitted as a plain
                // non-capturing group. The original prepended "|" here, which
                // would have turned the character into an unintended alternation.
                StringBuilder cp = new StringBuilder("(?:");
                AppendUTF16CodePoint(cp, codePoint);
                cp.Append(')');
                result.Replace(where, where + 10, cp.ToString());
            }
        }

        private static string UTF32RangeToUTF16Chars(string hexBegin, string hexEnd)
        {
            var result = new StringBuilder();
            int beginCodePoint = int.Parse(hexBegin, NumberStyles.HexNumber);
            int endCodePoint = int.Parse(hexEnd, NumberStyles.HexNumber);

            // Note: both ends are assumed to be above plane 0 (surrogate pairs).
            var beginChars = char.ConvertFromUtf32(beginCodePoint);
            var endChars = char.ConvertFromUtf32(endCodePoint);
            int beginDiff = endChars[0] - beginChars[0];

            if (beginDiff == 0)
            {
                // If the begin character is the same, we can just use the syntax \uD807[\uDDEF-\uDFFF]
                result.Append("|");
                AppendUTF16Character(result, beginChars[0]);
                result.Append('[');
                AppendUTF16Character(result, beginChars[1]);
                result.Append('-');
                AppendUTF16Character(result, endChars[1]);
                result.Append(']');
            }
            else
            {
                // If the begin character is not the same, create 3 ranges
                // 1. The remainder of the first
                // 2. A range of all of the middle characters
                // 3. The beginning of the last

                result.Append("|");
                AppendUTF16Character(result, beginChars[0]);
                result.Append('[');
                AppendUTF16Character(result, beginChars[1]);
                result.Append('-');
                AppendUTF16Character(result, MaxLowSurrogate);
                result.Append(']');

                // We only need a middle range if the ranges are not adjacent
                if (beginDiff > 1)
                {
                    result.Append("|");
                    // We only need a character class if there are more than 1
                    // characters in the middle range
                    if (beginDiff > 2)
                    {
                        result.Append('[');
                    }
                    AppendUTF16Character(result, (char)(Math.Min(beginChars[0] + 1, MaxHighSurrogate)));
                    if (beginDiff > 2)
                    {
                        result.Append('-');
                        AppendUTF16Character(result, (char)(Math.Max(endChars[0] - 1, MinHighSurrogate)));
                        result.Append(']');
                    }
                    result.Append('[');
                    AppendUTF16Character(result, MinLowSurrogate);
                    result.Append('-');
                    AppendUTF16Character(result, MaxLowSurrogate);
                    result.Append(']');
                }

                result.Append("|");
                AppendUTF16Character(result, endChars[0]);
                result.Append('[');
                AppendUTF16Character(result, MinLowSurrogate);
                result.Append('-');
                AppendUTF16Character(result, endChars[1]);
                result.Append(']');
            }
            return result.ToString();
        }

        private static string UTF32ToUTF16Chars(string hex)
        {
            int codePoint = int.Parse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            return UTF32ToUTF16Chars(codePoint);
        }

        private static string UTF32ToUTF16Chars(int codePoint)
        {
            StringBuilder result = new StringBuilder();
            UTF32ToUTF16Chars(codePoint, result);
            return result.ToString();
        }

        private static void UTF32ToUTF16Chars(int codePoint, StringBuilder result)
        {
            // Use regex alternation so that each UTF32 code point taken out of
            // a character class is treated as a group of its own.
            result.Append("|");
            AppendUTF16CodePoint(result, codePoint);
        }

        private static void AppendUTF16CodePoint(StringBuilder text, int cp)
        {
            var chars = char.ConvertFromUtf32(cp);
            AppendUTF16Character(text, chars[0]);
            if (chars.Length == 2)
            {
                AppendUTF16Character(text, chars[1]);
            }
        }

        private static void AppendUTF16Character(StringBuilder text, char c)
        {
            text.Append(@"\u");
            // Always emit 4 hex digits; \u requires exactly four.
            text.Append(((int)c).ToString("X4", CultureInfo.InvariantCulture));
        }
    }

StringBuilderExtensions.cs

    using System;
    using System.Text;

    public static class StringBuilderExtensions
    {
        /// <summary>
        /// Searches for the first index of the specified character. The search for
        /// the character starts at the beginning and moves towards the end.
        /// </summary>
        /// <param name="text">This <see cref="StringBuilder"/>.</param>
        /// <param name="value">The string to find.</param>
        /// <returns>The index of the specified character, or -1 if the character isn't found.</returns>
        public static int IndexOf(this StringBuilder text, string value)
        {
            return IndexOf(text, value, 0);
        }

        /// <summary>
        /// Searches for the index of the specified character. The search for the
        /// character starts at the specified offset and moves towards the end.
        /// </summary>
        /// <param name="text">This <see cref="StringBuilder"/>.</param>
        /// <param name="value">The string to find.</param>
        /// <param name="startIndex">The starting offset.</param>
        /// <returns>The index of the specified character, or -1 if the character isn't found.</returns>
        public static int IndexOf(this StringBuilder text, string value, int startIndex)
        {
            if (text == null)
                throw new ArgumentNullException("text");
            if (value == null)
                throw new ArgumentNullException("value");

            int index;
            int length = value.Length;
            int maxSearchLength = (text.Length - length) + 1;

            for (int i = startIndex; i < maxSearchLength; ++i)
            {
                if (text[i] == value[0])
                {
                    index = 1;
                    while ((index < length) && (text[i + index] == value[index]))
                        ++index;

                    if (index == length)
                        return i;
                }
            }

            return -1;
        }

        /// <summary>
        /// Replaces the specified subsequence in this builder with the specified
        /// string.
        /// </summary>
        /// <param name="text">this builder.</param>
        /// <param name="start">the inclusive begin index.</param>
        /// <param name="end">the exclusive end index.</param>
        /// <param name="str">the replacement string.</param>
        /// <returns>this builder.</returns>
        /// <exception cref="IndexOutOfRangeException">
        /// if <paramref name="start"/> is negative, greater than the current
        /// <see cref="StringBuilder.Length"/> or greater than <paramref name="end"/>.
        /// </exception>
        /// <exception cref="ArgumentNullException">if <paramref name="str"/> is null.</exception>
        public static StringBuilder Replace(this StringBuilder text, int start, int end, string str)
        {
            if (str == null)
            {
                throw new ArgumentNullException(nameof(str));
            }
            if (start >= 0)
            {
                if (end > text.Length)
                {
                    end = text.Length;
                }
                if (end > start)
                {
                    int stringLength = str.Length;
                    int diff = end - start - stringLength;
                    if (diff > 0)
                    {
                        // replacing with fewer characters
                        text.Remove(start, diff);
                    }
                    else if (diff < 0)
                    {
                        // replacing with more characters...need some room
                        text.Insert(start, new char[-diff]);
                    }
                    // copy the chars based on the new length
                    for (int i = 0; i < stringLength; i++)
                    {
                        text[i + start] = str[i];
                    }
                    return text;
                }
                if (start == end)
                {
                    text.Insert(start, str);
                    return text;
                }
            }
            throw new IndexOutOfRangeException();
        }
    }

Note that this is not very well tested and may not be very robust, but for testing purposes it should be fine.