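How to detect whether a character belongs to a right-to-left language?

What is a good way to tell whether a string contains text in a right-to-left language?

I found this question, which suggests the following approach: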

    public bool IsArabic(string strCompare)
    {
        char[] chars = strCompare.ToCharArray();
        foreach (char ch in chars)
            if (ch >= '\u0627' && ch <= '\u0649') return true;
        return false;
    }

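While this may work for Arabic, it does not seem to cover other RTL languages such as Hebrew. Is there a generic way to know that a particular character belongs to an RTL language?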

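Unicode characters have different properties associated with them. These properties cannot be derived from the code point; you need a table that tells you whether a character has a certain property.

You are interested in characters with bidirectional property "R" or "AL" (RandALCat).

A RandALCat character is a character with unambiguously right-to-left directionality.

Here is the complete list as of Unicode 3.2 (from RFC 3454):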

 D. Bidirectional tables

 D.1 Characters with bidirectional property "R" or "AL"

 ----- Start Table D.1 -----
 05BE
 05C0
 05C3
 05D0-05EA
 05F0-05F4
 061B
 061F
 0621-063A
 0640-064A
 066D-066F
 0671-06D5
 06DD
 06E5-06E6
 06FA-06FE
 0700-070D
 0710
 0712-072C
 0780-07A5
 07B1
 200F
 FB1D
 FB1F-FB28
 FB2A-FB36
 FB38-FB3C
 FB3E
 FB40-FB41
 FB43-FB44
 FB46-FBB1
 FBD3-FD3D
 FD50-FD8F
 FD92-FDC7
 FDF0-FDFC
 FE70-FE74
 FE76-FEFC
 ----- End Table D.1 -----

Here is some code to get the complete list as of Unicode 6.0:

    using System;
    using System.Globalization;
    using System.Linq;
    using System.Net;

    var url = "http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt";

    // Field 0 is the code point in hex; field 4 is the bidirectional class.
    var query =
        from record in new WebClient().DownloadString(url).Split('\n')
        where !string.IsNullOrEmpty(record)
        let properties = record.Split(';')
        where properties[4] == "R" || properties[4] == "AL"
        select int.Parse(properties[0], NumberStyles.AllowHexSpecifier);

    foreach (var codepoint in query)
    {
        Console.WriteLine(codepoint.ToString("X4"));
    }

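Note that these values are Unicode code points. Strings in C#/.NET are UTF-16 encoded and need to be converted to Unicode code points first (see Char.ConvertToUtf32). Here is a method that checks whether a string contains at least one RandALCat character: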

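    // IsRandALCat is assumed to be a lookup against a table such as the one above.
    static bool IsAnyCharacterRightToLeft(string s)
    {
        // Advance by 2 UTF-16 code units when a surrogate pair starts at i.
        for (var i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
        {
            var codepoint = char.ConvertToUtf32(s, i);
            if (IsRandALCat(codepoint))
            {
                return true;
            }
        }
        return false;
    }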
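The method above relies on an IsRandALCat lookup that this answer does not define. A minimal sketch of one, assuming the Unicode 3.2 ranges from RFC 3454 Table D.1 listed earlier are sufficient for your text (the class name Bidi and the binary-search layout are illustrative choices of mine, not part of the answer):

```csharp
using System;

static class Bidi
{
    // Inclusive [first, last] code point ranges copied from RFC 3454 Table D.1
    // (Unicode 3.2), in ascending order. Later Unicode versions add more ranges.
    private static readonly int[,] Ranges =
    {
        {0x05BE, 0x05BE}, {0x05C0, 0x05C0}, {0x05C3, 0x05C3}, {0x05D0, 0x05EA},
        {0x05F0, 0x05F4}, {0x061B, 0x061B}, {0x061F, 0x061F}, {0x0621, 0x063A},
        {0x0640, 0x064A}, {0x066D, 0x066F}, {0x0671, 0x06D5}, {0x06DD, 0x06DD},
        {0x06E5, 0x06E6}, {0x06FA, 0x06FE}, {0x0700, 0x070D}, {0x0710, 0x0710},
        {0x0712, 0x072C}, {0x0780, 0x07A5}, {0x07B1, 0x07B1}, {0x200F, 0x200F},
        {0xFB1D, 0xFB1D}, {0xFB1F, 0xFB28}, {0xFB2A, 0xFB36}, {0xFB38, 0xFB3C},
        {0xFB3E, 0xFB3E}, {0xFB40, 0xFB41}, {0xFB43, 0xFB44}, {0xFB46, 0xFBB1},
        {0xFBD3, 0xFD3D}, {0xFD50, 0xFD8F}, {0xFD92, 0xFDC7}, {0xFDF0, 0xFDFC},
        {0xFE70, 0xFE74}, {0xFE76, 0xFEFC}
    };

    // Binary search over the sorted, non-overlapping ranges.
    public static bool IsRandALCat(int codepoint)
    {
        int lo = 0, hi = Ranges.GetLength(0) - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (codepoint < Ranges[mid, 0]) hi = mid - 1;
            else if (codepoint > Ranges[mid, 1]) lo = mid + 1;
            else return true;
        }
        return false;
    }
}
```

If you need newer scripts, regenerate the range table from a more recent UnicodeData.txt rather than extending this one by hand.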

You could try using "named blocks" in regular expressions. Just pick out the blocks that are right-to-left and form the regex. For example:

 \p{IsArabic}|\p{IsHebrew} 

If that regex matches, the string contains at least one Hebrew or Arabic character.
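A minimal sketch of how that pattern might be used from C#, assuming the .NET regex engine and its `\p{IsArabic}` and `\p{IsHebrew}` named blocks (the class and method names here are made up for illustration):

```csharp
using System.Text.RegularExpressions;

static class RtlRegex
{
    // Matches any character in the Unicode "Arabic" or "Hebrew" named blocks.
    private static readonly Regex HebrewOrArabic =
        new Regex(@"\p{IsArabic}|\p{IsHebrew}");

    public static bool ContainsHebrewOrArabic(string s)
    {
        return HebrewOrArabic.IsMatch(s);
    }
}
```

Keep in mind that block membership is not the same as the bidirectional class: the Arabic block also holds characters such as the Arabic-Indic digits (U+0660 to U+0669), which are absent from the "R"/"AL" list above, so this test is somewhat broader than a RandALCat check.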

All "AL" or "R" characters as of Unicode 6.0 (from http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt):

    bool hasRandALCat = 0;
    if(c >= 0x5BE && c <= 0x10B7F)
    {
        if(c <= 0x85E)
        {
            if(c == 0x5BE) hasRandALCat = 1;
            else if(c == 0x5C0) hasRandALCat = 1;
            else if(c == 0x5C3) hasRandALCat = 1;
            else if(c == 0x5C6) hasRandALCat = 1;
            else if(0x5D0 <= c && c <= 0x5EA) hasRandALCat = 1;
            else if(0x5F0 <= c && c <= 0x5F4) hasRandALCat = 1;
            else if(c == 0x608) hasRandALCat = 1;
            else if(c == 0x60B) hasRandALCat = 1;
            else if(c == 0x60D) hasRandALCat = 1;
            else if(c == 0x61B) hasRandALCat = 1;
            else if(0x61E <= c && c <= 0x64A) hasRandALCat = 1;
            else if(0x66D <= c && c <= 0x66F) hasRandALCat = 1;
            else if(0x671 <= c && c <= 0x6D5) hasRandALCat = 1;
            else if(0x6E5 <= c && c <= 0x6E6) hasRandALCat = 1;
            else if(0x6EE <= c && c <= 0x6EF) hasRandALCat = 1;
            else if(0x6FA <= c && c <= 0x70D) hasRandALCat = 1;
            else if(c == 0x710) hasRandALCat = 1;
            else if(0x712 <= c && c <= 0x72F) hasRandALCat = 1;
            else if(0x74D <= c && c <= 0x7A5) hasRandALCat = 1;
            else if(c == 0x7B1) hasRandALCat = 1;
            else if(0x7C0 <= c && c <= 0x7EA) hasRandALCat = 1;
            else if(0x7F4 <= c && c <= 0x7F5) hasRandALCat = 1;
            else if(c == 0x7FA) hasRandALCat = 1;
            else if(0x800 <= c && c <= 0x815) hasRandALCat = 1;
            else if(c == 0x81A) hasRandALCat = 1;
            else if(c == 0x824) hasRandALCat = 1;
            else if(c == 0x828) hasRandALCat = 1;
            else if(0x830 <= c && c <= 0x83E) hasRandALCat = 1;
            else if(0x840 <= c && c <= 0x858) hasRandALCat = 1;
            else if(c == 0x85E) hasRandALCat = 1;
        }
        else if(c == 0x200F) hasRandALCat = 1;
        else if(c >= 0xFB1D)
        {
            if(c == 0xFB1D) hasRandALCat = 1;
            else if(0xFB1F <= c && c <= 0xFB28) hasRandALCat = 1;
            else if(0xFB2A <= c && c <= 0xFB36) hasRandALCat = 1;
            else if(0xFB38 <= c && c <= 0xFB3C) hasRandALCat = 1;
            else if(c == 0xFB3E) hasRandALCat = 1;
            else if(0xFB40 <= c && c <= 0xFB41) hasRandALCat = 1;
            else if(0xFB43 <= c && c <= 0xFB44) hasRandALCat = 1;
            else if(0xFB46 <= c && c <= 0xFBC1) hasRandALCat = 1;
            else if(0xFBD3 <= c && c <= 0xFD3D) hasRandALCat = 1;
            else if(0xFD50 <= c && c <= 0xFD8F) hasRandALCat = 1;
            else if(0xFD92 <= c && c <= 0xFDC7) hasRandALCat = 1;
            else if(0xFDF0 <= c && c <= 0xFDFC) hasRandALCat = 1;
            else if(0xFE70 <= c && c <= 0xFE74) hasRandALCat = 1;
            else if(0xFE76 <= c && c <= 0xFEFC) hasRandALCat = 1;
            else if(0x10800 <= c && c <= 0x10805) hasRandALCat = 1;
            else if(c == 0x10808) hasRandALCat = 1;
            else if(0x1080A <= c && c <= 0x10835) hasRandALCat = 1;
            else if(0x10837 <= c && c <= 0x10838) hasRandALCat = 1;
            else if(c == 0x1083C) hasRandALCat = 1;
            else if(0x1083F <= c && c <= 0x10855) hasRandALCat = 1;
            else if(0x10857 <= c && c <= 0x1085F) hasRandALCat = 1;
            else if(0x10900 <= c && c <= 0x1091B) hasRandALCat = 1;
            else if(0x10920 <= c && c <= 0x10939) hasRandALCat = 1;
            else if(c == 0x1093F) hasRandALCat = 1;
            else if(c == 0x10A00) hasRandALCat = 1;
            else if(0x10A10 <= c && c <= 0x10A13) hasRandALCat = 1;
            else if(0x10A15 <= c && c <= 0x10A17) hasRandALCat = 1;
            else if(0x10A19 <= c && c <= 0x10A33) hasRandALCat = 1;
            else if(0x10A40 <= c && c <= 0x10A47) hasRandALCat = 1;
            else if(0x10A50 <= c && c <= 0x10A58) hasRandALCat = 1;
            else if(0x10A60 <= c && c <= 0x10A7F) hasRandALCat = 1;
            else if(0x10B00 <= c && c <= 0x10B35) hasRandALCat = 1;
            else if(0x10B40 <= c && c <= 0x10B55) hasRandALCat = 1;
            else if(0x10B58 <= c && c <= 0x10B72) hasRandALCat = 1;
            else if(0x10B78 <= c && c <= 0x10B7F) hasRandALCat = 1;
        }
    }

Edit:

This is what I use now. It includes the vowelization characters and everything else in Hebrew and Arabic:

 [\u0591-\u07FF] 

Old answer:

If you need to detect an RTL language in a sentence, this simplified regex will probably suffice:

 [א-ת؀-ۿ] 

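To write anything in Hebrew you have to use at least one of these characters, and the case is similar for Arabic.

It does not include vowelization characters, so if you need to catch whole words or absolutely all RTL characters, you are better off using one of the other answers. Vowelization characters in Hebrew are very rare in non-poetry texts. I don't know about Arabic texts.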
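To make the difference between the two patterns in this answer concrete, here is a small sketch assuming the .NET regex engine. The old pattern is rewritten with escapes as I read it (Hebrew alef to tav, plus U+0600 to U+06FF), and the class and field names are invented for the example:

```csharp
using System.Text.RegularExpressions;

static class RtlRangeRegex
{
    // Pattern from the edit: everything from U+0591 through U+07FF.
    public static readonly Regex Wide = new Regex(@"[\u0591-\u07FF]");

    // Old pattern, with escapes: Hebrew letters U+05D0..U+05EA plus U+0600..U+06FF.
    public static readonly Regex LettersOnly =
        new Regex(@"[\u05D0-\u05EA\u0600-\u06FF]");
}
```

For example, `RtlRangeRegex.Wide.IsMatch("\u05B8")` succeeds for a lone Hebrew vowel point (qamats), while `RtlRangeRegex.LettersOnly.IsMatch("\u05B8")` does not; both match a string containing the letter alef (`"\u05D0"`).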