Best way to parse space-separated text

I have a string like this:

/c SomeText\MoreText "Some Text\More Text\Lol" SomeText

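I want to tokenize it, but I can't just split on the spaces. I've come up with a somewhat ugly parser that works, but I'm wondering if anyone has a more elegant design.

This is in C#, by the way.

EDIT: My ugly version, while ugly, is O(N) and may actually be faster than using a RegEx.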

    // requires: using System.Collections.Generic;
    private string[] tokenize(string input)
    {
        string[] tokens = input.Split(' ');
        List<string> output = new List<string>();

        for (int i = 0; i < tokens.Length; i++)
        {
            if (tokens[i].StartsWith("\""))
            {
                // Rejoin the pieces of a quoted token that Split broke apart.
                // A one-word token such as "abc" already ends with its closing quote.
                string temp = tokens[i];
                int k = i;
                while (!(temp.Length > 1 && temp.EndsWith("\"")) && ++k < tokens.Length)
                {
                    temp += " " + tokens[k];
                }
                output.Add(temp);
                i = k; // the loop's i++ then moves past the closing piece
            }
            else
            {
                output.Add(tokens[i]);
            }
        }
        return output.ToArray();
    }

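The computer term for what you're doing is lexical analysis; read up on it for a good summary of this common task.

Based on your example, I'm guessing that you want whitespace to separate your words, but text inside quotation marks should be treated as a single "word", without the quotes.

The simplest way to do this is to define a word as a regular expression: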

 ([^"^\s]+)\s*|"([^"]+)"\s*

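This expression states that a "word" is either (1) non-quote, non-whitespace text followed by optional whitespace, or (2) non-quote text surrounded by quotes (followed by optional whitespace). Note the use of capturing parentheses to highlight the desired text.

Armed with that regex, your algorithm is simple: search your text for the next "word" as defined by the capturing parentheses, and return it. Repeat until you run out of "words".

Here's the simplest bit of working code I could come up with, in VB.NET. Note that we have to check both groups for data, since there are two sets of capturing parentheses.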

    ' requires: Imports System.Text.RegularExpressions
    Dim token As String
    Dim r As Regex = New Regex("([^""^\s]+)\s*|""([^""]+)""\s*")
    Dim m As Match = r.Match("this is a ""test string""")

    While m.Success
        token = m.Groups(1).ToString
        ' Group 1 is empty when the quoted alternative matched, so fall back to group 2.
        If token.Length = 0 And m.Groups.Count > 1 Then
            token = m.Groups(2).ToString
        End If
        m = m.NextMatch
    End While
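Since the question is about C#, here is a rough C# equivalent of the VB.NET loop above, using the same pattern. The class and method names (`RegexTokenizer`, `Tokenize`) are made up for illustration; treat it as a sketch rather than a drop-in implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokenizer
{
    // Same pattern as above: group 1 is a bare word,
    // group 2 is the text inside a pair of quotes.
    private static readonly Regex Word =
        new Regex(@"([^""^\s]+)\s*|""([^""]+)""\s*");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        for (Match m = Word.Match(input); m.Success; m = m.NextMatch())
        {
            // Only one of the two groups takes part in any given match.
            tokens.Add(m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value);
        }
        return tokens;
    }

    public static void Main()
    {
        string input = "/c SomeText\\MoreText \"Some Text\\More Text\\Lol\" SomeText";
        foreach (string token in Tokenize(input))
        {
            Console.WriteLine(token);
        }
    }
}
```

Checking `Groups[1].Success` instead of the token length avoids relying on an empty string to signal "this alternative did not match".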

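Note 1: The answer above has the same idea as this one. Hopefully this answer explains the details behind the scenes a little better :)

The Microsoft.VisualBasic.FileIO namespace (in Microsoft.VisualBasic.dll) has a TextFieldParser you can use to split on space-delimited text. It handles strings within quotes (i.e., "this is one token" thisistokentwo) well.

Note that just because the DLL says VisualBasic doesn't mean you can only use it in a VB project. It's part of the framework as a whole.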
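A minimal sketch of that approach from C#, assuming the project references Microsoft.VisualBasic.dll. The wrapper class and method names are hypothetical; `TextFieldParser`, `SetDelimiters`, `HasFieldsEnclosedInQuotes` and `ReadFields` are the actual members of the class.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserDemo
{
    public static string[] Tokenize(string input)
    {
        // TextFieldParser reads from a TextReader, so wrap the string.
        using (var parser = new TextFieldParser(new StringReader(input)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(" ");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields parses one line; it returns null at end of input.
            return parser.ReadFields() ?? new string[0];
        }
    }

    public static void Main()
    {
        string input = "/c SomeText\\MoreText \"Some Text\\More Text\\Lol\" SomeText";
        foreach (string field in Tokenize(input))
        {
            Console.WriteLine(field);
        }
    }
}
```

One thing to check against your own input: this is a delimited-file parser, so as far as I know a run of several spaces is read as empty fields rather than as one separator.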

There is also the state machine approach.

    // requires: using System.Collections.Generic; using System.Text;
    private enum State { None = 0, InTokin, InQuote }

    private static IEnumerable<string> Tokinize(string input)
    {
        input += ' '; // ensure we end on whitespace
        State state = State.None;
        State? next = null; // setting the next state implies that we have found a tokin
        StringBuilder sb = new StringBuilder();

        foreach (char c in input)
        {
            switch (state)
            {
                default:
                case State.None:
                    if (char.IsWhiteSpace(c))
                        continue;
                    else if (c == '"')
                    {
                        state = State.InQuote;
                        continue;
                    }
                    else
                        state = State.InTokin;
                    break;
                case State.InTokin:
                    if (char.IsWhiteSpace(c))
                        next = State.None;
                    else if (c == '"')
                        next = State.InQuote;
                    break;
                case State.InQuote:
                    if (c == '"')
                        next = State.None;
                    break;
            }

            if (next.HasValue)
            {
                // A state change marks the end of the current tokin.
                yield return sb.ToString();
                sb = new StringBuilder();
                state = next.Value;
                next = null;
            }
            else
                sb.Append(c);
        }
    }

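It can easily be extended for things like nested quotes and escaping. Returning IEnumerable allows your code to parse only as much as you need. There aren't any real downsides to that kind of lazy approach: strings are immutable, so you know that input isn't going to change before you have parsed the whole thing.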
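To make the "escaping" and "lazy" points concrete, here is one possible variant of the state machine. It is an illustration, not the code above: the class name is invented, and the escape rule (inside quotes, `\"` means a literal quote while every other backslash is kept) is an assumption chosen so that paths like `Some Text\More Text` still survive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class EscapeAwareTokenizer
{
    public static IEnumerable<string> Tokenize(string input)
    {
        var sb = new StringBuilder();
        bool inToken = false, inQuote = false;

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (inQuote)
            {
                if (c == '\\' && i + 1 < input.Length && input[i + 1] == '"')
                {
                    sb.Append('"'); // assumed escape rule: \" is a literal quote
                    i++;            // skip the escaped quote character
                }
                else if (c == '"')
                {
                    yield return sb.ToString();
                    sb.Clear();
                    inQuote = false;
                }
                else
                {
                    sb.Append(c);
                }
            }
            else if (char.IsWhiteSpace(c) || c == '"')
            {
                // Whitespace or an opening quote ends any bare token in progress.
                if (inToken)
                {
                    yield return sb.ToString();
                    sb.Clear();
                    inToken = false;
                }
                inQuote = (c == '"');
            }
            else
            {
                sb.Append(c);
                inToken = true;
            }
        }

        // Flush whatever is left, including an unterminated quoted token.
        if (inToken || inQuote)
            yield return sb.ToString();
    }

    public static void Main()
    {
        string input = "/c SomeText\\MoreText \"Some Text\\More Text\\Lol\" SomeText";

        // Lazy: only the first token is parsed here.
        Console.WriteLine(Tokenize(input).First());

        foreach (string token in Tokenize(input))
            Console.WriteLine(token);
    }
}
```

The caveat with this escape rule is a quoted path that ends in a backslash, such as `"C:\dir\"`: the final `\"` would be read as an escaped quote, so a real command-line parser needs a more careful rule.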

See: http://en.wikipedia.org/wiki/Automata-Based_Programming

You also might want to look into regular expressions. That might help you out. Here is a sample lifted from MSDN...

    using System;
    using System.Text.RegularExpressions;

    public class Test
    {
        public static void Main()
        {
            // Define a regular expression for repeated words.
            Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
                RegexOptions.Compiled | RegexOptions.IgnoreCase);

            // Define a test string.
            string text = "The the quick brown fox fox jumped over the lazy dog dog.";

            // Find matches.
            MatchCollection matches = rx.Matches(text);

            // Report the number of matches found.
            Console.WriteLine("{0} matches found in:\n {1}", matches.Count, text);

            // Report on each match.
            foreach (Match match in matches)
            {
                GroupCollection groups = match.Groups;
                Console.WriteLine("'{0}' repeated at positions {1} and {2}",
                    groups["word"].Value, groups[0].Index, groups[1].Index);
            }
        }
    }
    // The example produces the following output to the console:
    // 3 matches found in:
    // The the quick brown fox fox jumped over the lazy dog dog.
    // 'The' repeated at positions 0 and 4
    // 'fox' repeated at positions 20 and 25
    // 'dog' repeated at positions 50 and 54

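Craig is right: use regular expressions. Regex.Split may be more concise for your needs.

    [^\t]+\t|"[^"]+"\t

Using a regex definitely looks like the best bet; however, this one just returns the whole string. I'm trying to tweak it, but not much luck so far.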

 string[] tokens = System.Text.RegularExpressions.Regex.Split(this.BuildArgs, @"[^\t]+\t|""[^""]+""\t");
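A likely reason the Regex.Split call above returns the whole string: the pattern requires tab characters, while the example input is separated by spaces, so nothing matches, and Regex.Split returns the input as a single element when there is no match. Even with a matching pattern, Split returns the text between matches, so Regex.Matches is probably the better fit here. The sketch below shows both; the class and method names and the replacement pattern are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitVersusMatches
{
    // The original attempt: split on a tab-based pattern.
    public static string[] ViaSplit(string input)
    {
        return Regex.Split(input, @"[^\t]+\t|""[^""]+""\t");
    }

    // Alternative: collect the matches themselves.
    // A token is either a quoted run or a run of non-space, non-quote characters.
    public static string[] ViaMatches(string input)
    {
        return Regex.Matches(input, @"""[^""]*""|[^\s""]+")
            .Cast<Match>()
            .Select(m => m.Value.Trim('"')) // drop the surrounding quotes
            .ToArray();
    }

    public static void Main()
    {
        string input = "/c SomeText\\MoreText \"Some Text\\More Text\\Lol\" SomeText";

        // No tabs in the input, so Split finds nothing to split on.
        Console.WriteLine(ViaSplit(input).Length);

        foreach (string token in ViaMatches(input))
            Console.WriteLine(token);
    }
}
```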
