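What is a regular expression for parsing out individual sentences?

I am looking for a good .NET regular expression that I can use to parse individual sentences out of a body of text.

It should be able to parse the following block of text into exactly six sentences: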

    Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
    Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.

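This is proving a little more challenging than I initially thought.

Any help would be greatly appreciated. I am going to use this to train the system on known bodies of text.

Try this: @"(\S.+?[.!?])(?=\s+|$)"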

    string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
    Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";
    Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
    foreach (Match match in rx.Matches(str))
    {
        int i = match.Index; // start offset of the sentence within str
        Console.WriteLine(match.Value);
    }

Results:

    Hello world!
    How are you?
    I am fine.
    This is a difficult sentence because I use I.D.
    Newlines should also be accepted.
    Numbers should not cause sentence breaks, like 1.23.

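Of course, for anything complex you will need a real parser such as SharpNLP or NLTK. Mine is just a quick and dirty one.

Here is the SharpNLP info, with its features:

SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:

a sentence splitter
a tokenizer
a part-of-speech tagger
a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
a parser
a name finder
a coreference tool
an interface to the WordNet lexical database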

    var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
    Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";
    Regex.Split(str, @"(?<=[.?!])\s+").Dump(); // Dump() is a LINQPad extension method

I tested this in LINQPad.

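Using regular expressions to parse natural language is impossible. What is the end of a sentence? A period can occur in many places (e.g. in "e.g."). You should use a natural language parsing toolkit such as OpenNLP or NLTK. Unfortunately there are very few offerings in C#, if any. You may therefore have to create a web service or otherwise link into C#.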
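To make the point about stray periods concrete, here is a minimal sketch that runs the lookbehind split pattern from the earlier answer over a sentence containing "e.g.". The class and method names (`NaiveSplitDemo`, `Split`, `Print`) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NaiveSplitDemo
{
    // Splits on any whitespace that directly follows '.', '?' or '!'.
    // This is the naive rule: every such punctuation mark is treated as a sentence end.
    public static string[] Split(string text) =>
        Regex.Split(text, @"(?<=[.?!])\s+")
             .Where(s => s.Length > 0)
             .ToArray();

    // Prints each fragment in brackets so the break points are easy to see.
    public static void Print(string text)
    {
        foreach (var fragment in Split(text))
            Console.WriteLine("[" + fragment + "]");
    }
}
```

Calling `NaiveSplitDemo.Print("Use a toolkit, e.g. NLTK. It works.")` yields three fragments instead of two sentences, because the period that closes "e.g." is followed by a space and is therefore taken as a sentence end.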

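Note that relying on the exact whitespace in "I.D." will cause problems in the future. You will soon find examples that break your regex. For example, most people put spaces after their initials.

There is a good summary of open and commercial offerings on Wikipedia ( http://en.wikipedia.org/wiki/Natural_language_processing_toolkits ). We have used several of them. It is worth the effort.

[You use the word "train". This is normally associated with machine learning (which is one approach to NLP and has been used for sentence splitting). Indeed, the toolkits I mentioned include machine learning. I suspect that is not what you meant, and that you would instead evolve your expression through heuristics. Don't!]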

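This is not possible with regular expressions alone unless you know exactly which "difficult" tokens you have, such as "I.D.", "Mr.", etc. For example, how many sentences is "Please show your I.D., Mr. Bond."? I am not familiar with any C# implementations, but I have used NLTK's Punkt tokenizer. It probably should not be too hard to re-implement.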
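As a rough illustration of the "known difficult tokens" idea, the sketch below first splits naively and then glues a fragment back onto the next one when it ends in a listed abbreviation. This is not Punkt and not taken from any library: `AbbreviationAwareSplitter` and its tiny `KnownAbbreviations` list are assumptions made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AbbreviationAwareSplitter
{
    // Hypothetical, deliberately tiny list; a usable one would be far longer.
    private static readonly HashSet<string> KnownAbbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Mr.", "Mrs.", "Dr.", "e.g.", "i.e." };

    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        string pending = "";

        // Step 1: naive split on whitespace that follows '.', '?' or '!'.
        foreach (var fragment in Regex.Split(text.Trim(), @"(?<=[.?!])\s+"))
        {
            if (fragment.Length == 0) continue;

            // Rejoin with a single space (the original whitespace is not preserved).
            pending = pending.Length == 0 ? fragment : pending + " " + fragment;

            // Step 2: if the last token is a known abbreviation, keep accumulating.
            string lastToken = pending.Substring(pending.LastIndexOfAny(Whitespace) + 1);
            if (!KnownAbbreviations.Contains(lastToken))
            {
                sentences.Add(pending);
                pending = "";
            }
        }

        // Flush whatever is left, e.g. text that ends in an abbreviation.
        if (pending.Length > 0) sentences.Add(pending);
        return sentences;
    }
}
```

With this list, `Split("Please show your I.D., Mr. Bond. Thank you.")` returns two sentences. The trade-off is the one described above: it only works for tokens you list in advance, and a sentence that really does end in a listed abbreviation gets merged with the one after it.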

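I used the suggestions posted here and came up with a regular expression that seems to achieve what I want to do:

    (?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)

I used Expresso to come up with it: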

    // using System.Text.RegularExpressions;

    /// <summary>
    /// Regular expression built for C# on: Sun, Dec 27, 2009, 03:05:24 PM
    /// Using Expresso Version: 3.0.3276, http://www.ultrapico.com
    ///
    /// A description of the regular expression:
    ///
    /// [Sentence]: A named capture group. [\S.+?(?<Terminator>[.!?]|\Z)]
    ///     \S.+?(?<Terminator>[.!?]|\Z)
    ///         Anything other than whitespace
    ///         Any character, one or more repetitions, as few as possible
    ///         [Terminator]: A named capture group. [[.!?]|\Z]
    ///             Select from 2 alternatives
    ///                 Any character in this class: [.!?]
    ///                 End of string or before new line at end of string
    /// Match a suffix but exclude it from the capture. [\s+|\Z]
    ///     Select from 2 alternatives
    ///         Whitespace, one or more repetitions
    ///         End of string or before new line at end of string
    ///
    /// </summary>
    public static Regex regex = new Regex(
        "(?<Sentence>\\S.+?(?<Terminator>[.!?]|\\Z))(?=\\s+|\\Z)",
        RegexOptions.CultureInvariant
        | RegexOptions.IgnorePatternWhitespace
        | RegexOptions.Compiled
        );

    // This is the replacement string
    public static string regexReplace = "$& [${Day}-${Month}-${Year}]";

    //// Replace the matched text in the InputText using the replacement pattern
    // string result = regex.Replace(InputText,regexReplace);

    //// Split the InputText wherever the regex matches
    // string[] results = regex.Split(InputText);

    //// Capture the first Match, if any, in the InputText
    // Match m = regex.Match(InputText);

    //// Capture all Matches in the InputText
    // MatchCollection ms = regex.Matches(InputText);

    //// Test to see if there is a match in the InputText
    // bool IsMatch = regex.IsMatch(InputText);

    //// Get the names of all the named and numbered capture groups
    // string[] GroupNames = regex.GetGroupNames();

    //// Get the numbers of all the named and numbered capture groups
    // int[] GroupNumbers = regex.GetGroupNumbers();
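For anyone wanting to read the two named groups back out, here is a small sketch of how the `Sentence` and `Terminator` groups from the pattern above might be consumed. The wrapper class `SentenceGroupsDemo` and its `Parse` method are hypothetical names for this example; only the pattern comes from the post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceGroupsDemo
{
    private static readonly Regex SentenceRegex = new Regex(
        @"(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)",
        RegexOptions.CultureInvariant);

    // Returns (sentence, terminator) pairs. The terminator is an empty string
    // when the last sentence runs to the end of the text without punctuation,
    // because the \Z alternative matches without consuming any character.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in SentenceRegex.Matches(text))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["Sentence"].Value,
                m.Groups["Terminator"].Value));
        }
        return result;
    }
}
```

The `\Z` alternative is what distinguishes this pattern from the first suggestion: trailing text with no closing punctuation, as in "Hello world! How are you? No final period", is still returned as a sentence, with an empty terminator.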

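Most people have recommended SharpNLP, and you should go with that unless you want your QA department to be busy with bugs.

But since you are probably under some sort of pressure, here is another attempt that handles words like "Dr." and "X.". It will, however, fail on a sentence that ends with "it".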

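    Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
    Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23. See Dr. B or Mr. FooBar for an evaluation of H. pylori of the cardia.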

    var result = new Regex(@"(\S.+?[.!?])(?=\s+|$)(? !String.IsNullOrWhiteSpace(s)).ToArray();
    foreach (var match in result)
    {
        Console.WriteLine(match);
    }
