UTF-16 safe substring in C# .NET

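I want to get a substring of a given length, say 150. However, I want to make sure I don't cut the string off in the middle of a Unicode character.

For example, see the following code: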

    var str = "Hello😀 world!";
    var substr = str.Substring(0, 6);

Here substr is an invalid string, because the smiley character is cut in half.
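To make the breakage concrete, here is a minimal sketch of a check for it. The class and method names (`SurrogateCheck`, `EndsWithLoneHighSurrogate`) are made up for illustration; only `char.IsHighSurrogate` is a real framework call.

```csharp
using System;

public static class SurrogateCheck
{
    // Hypothetical helper: true when the last UTF-16 code unit of s is a
    // high surrogate, i.e. the first half of a pair whose second half is missing.
    public static bool EndsWithLoneHighSurrogate(string s)
    {
        if (s == null)
        {
            throw new ArgumentNullException(nameof(s));
        }
        return s.Length > 0 && char.IsHighSurrogate(s[s.Length - 1]);
    }
}
```

For the string above, `SurrogateCheck.EndsWithLoneHighSurrogate(str.Substring(0, 6))` is true, because index 5 holds only the first half of the smiley, while `str.Substring(0, 7)` keeps both halves and the check is false.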

Instead, I want a function like the following:

    var str = "Hello😀 world!";
    var substr = str.UnicodeSafeSubstring(0, 6);

where substr contains "Hello😀".

For reference, here is how I would do it in Objective-C using rangeOfComposedCharacterSequencesForRange:

    NSString* str = @"Hello😀 world!";
    NSRange range = [str rangeOfComposedCharacterSequencesForRange:NSMakeRange(0, 6)];
    NSString* substr = [str substringWithRange:range];

What is the equivalent code in C#?

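This should return the longest substring that starts at index startIndex, is at most length UTF-16 code units long, and is made up of "complete" graphemes only. So an initial or final "split" surrogate pair is removed, initial combining marks are removed, and a final character that would be missing its combining marks is removed.

Note that this is probably not what you asked for. You seem to want to use graphemes as the unit of measure (or perhaps you want to include the last grapheme even if that makes the result longer than the length parameter).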

    using System;
    using System.Globalization;
    using System.Text;

    public static class StringEx
    {
        public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
        {
            if (str == null)
            {
                throw new ArgumentNullException(nameof(str));
            }
            if (startIndex < 0 || startIndex > str.Length)
            {
                throw new ArgumentOutOfRangeException(nameof(startIndex));
            }
            if (length < 0)
            {
                throw new ArgumentOutOfRangeException(nameof(length));
            }
            if (startIndex + length > str.Length)
            {
                throw new ArgumentOutOfRangeException(nameof(length));
            }
            if (length == 0)
            {
                return string.Empty;
            }

            var sb = new StringBuilder(length);

            // Exclusive upper bound of the requested range, in UTF-16 code units.
            int end = startIndex + length;

            var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

            while (enumerator.MoveNext())
            {
                string grapheme = enumerator.GetTextElement();
                startIndex += grapheme.Length;

                // Stop before a grapheme that would run past the requested range.
                if (startIndex > end)
                {
                    break;
                }

                // Skip initial Low Surrogates/Combining Marks
                if (sb.Length == 0)
                {
                    if (char.IsLowSurrogate(grapheme[0]))
                    {
                        continue;
                    }

                    UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                    if (cat == UnicodeCategory.NonSpacingMark ||
                        cat == UnicodeCategory.SpacingCombiningMark ||
                        cat == UnicodeCategory.EnclosingMark)
                    {
                        continue;
                    }
                }

                sb.Append(grapheme);

                if (startIndex == end)
                {
                    break;
                }
            }

            return sb.ToString();
        }
    }

This variant simply includes the "extra" characters at the end of the substring when they are needed to complete a grapheme:

    using System;
    using System.Globalization;
    using System.Text;

    public static class StringEx
    {
        public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
        {
            if (str == null)
            {
                throw new ArgumentNullException(nameof(str));
            }
            if (startIndex < 0 || startIndex > str.Length)
            {
                throw new ArgumentOutOfRangeException(nameof(startIndex));
            }
            if (length < 0)
            {
                throw new ArgumentOutOfRangeException(nameof(length));
            }
            if (startIndex + length > str.Length)
            {
                throw new ArgumentOutOfRangeException(nameof(length));
            }
            if (length == 0)
            {
                return string.Empty;
            }

            var sb = new StringBuilder(length);

            // Exclusive upper bound of the requested range, in UTF-16 code units.
            int end = startIndex + length;

            var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

            while (enumerator.MoveNext())
            {
                // Stop only once the requested range is fully covered; a grapheme
                // that starts inside the range is kept whole even if it ends past it.
                if (startIndex >= end)
                {
                    break;
                }

                string grapheme = enumerator.GetTextElement();
                startIndex += grapheme.Length;

                // Skip initial Low Surrogates/Combining Marks
                if (sb.Length == 0)
                {
                    if (char.IsLowSurrogate(grapheme[0]))
                    {
                        continue;
                    }

                    UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                    if (cat == UnicodeCategory.NonSpacingMark ||
                        cat == UnicodeCategory.SpacingCombiningMark ||
                        cat == UnicodeCategory.EnclosingMark)
                    {
                        continue;
                    }
                }

                sb.Append(grapheme);
            }

            return sb.ToString();
        }
    }

This returns what you asked for: "Hello😀 world!".UnicodeSafeSubstring(0, 6) == "Hello😀"
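Both versions lean on StringInfo.GetTextElementEnumerator to decide where one "complete" grapheme ends and the next begins. As a rough sketch of what that enumerator yields, the helper below (the names `TextElementLister` and `TextElements` are invented for this example) collects the text elements of a string into a list:

```csharp
using System.Collections.Generic;
using System.Globalization;

public static class TextElementLister
{
    // Hypothetical helper: splits str into the text elements reported by
    // StringInfo, keeping surrogate pairs and base+combining-mark runs together.
    public static List<string> TextElements(string str)
    {
        var result = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(str);
        while (enumerator.MoveNext())
        {
            result.Add(enumerator.GetTextElement());
        }
        return result;
    }
}
```

For "e\u0301x😀" this gives three elements: "e" plus its combining acute accent (2 chars), "x" (1 char), and the smiley (2 chars). That is why the code above advances startIndex by grapheme.Length rather than by 1.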

It looks like you want to split the string on graphemes, that is, on single displayed characters.

In that case, there is a handy method for it: StringInfo.SubstringByTextElements

    var str = "Hello😀 world!";
    var substr = new StringInfo(str).SubstringByTextElements(0, 6);
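Two things to keep in mind with this approach: the length argument counts text elements rather than UTF-16 code units, and SubstringByTextElements throws ArgumentOutOfRangeException when the requested range runs past the end of the string. A possible wrapper that clamps the count is sketched below; `TextElementDemo` and `TakeTextElements` are names made up for the example, not framework APIs.

```csharp
using System;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: returns the first maxElements text elements of str,
    // or the whole string when it has fewer text elements than that.
    public static string TakeTextElements(string str, int maxElements)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }
        if (maxElements < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxElements));
        }

        var info = new StringInfo(str);

        // Clamp so the call below never asks for more elements than exist.
        int count = Math.Min(maxElements, info.LengthInTextElements);

        return count == 0 ? string.Empty : info.SubstringByTextElements(0, count);
    }
}
```

With this, `TextElementDemo.TakeTextElements("Hello😀 world!", 6)` gives "Hello😀", which is 7 chars long in UTF-16, and `TakeTextElements("Hi", 10)` gives "Hi" instead of throwing. As far as I know, how closely a text element matches a user-perceived character (for example with multi-codepoint emoji sequences) depends on the runtime version, so test with the inputs you care about.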