How do I capture with balancing groups?

Suppose I have this text input:

tes{}tR{R{abc}aD{mnoR{xyz}}}

I want to extract the following output:

  R{abc} R{xyz} D{mnoR{xyz}} R{R{abc}aD{mnoR{xyz}}}

Currently, I can only extract what is inside the {} groups, using the balancing group approach from MSDN. Here is the pattern:

  ^[^{}]*(((?'Open'{)[^{}]*)+((?'Target-Open'})[^{}]*)+)*(?(Open)(?!))$

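Does anyone know how to include the R{} and D{} in the output?

I think a different approach is needed here. Once you match the first, larger group R{R{abc}aD{mnoR{xyz}}} (see my comment about the possible typo), you cannot get at the subgroups inside it, because the regex has already consumed them and cannot capture the individual R{ ... } groups.

So there has to be some way to capture without consuming, and the obvious way to do that is a positive lookahead. From there you can reuse the expression you had, with a few changes to fit the new focus. I came up with: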

 (?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))

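[I also renamed 'Open' to 'O' and removed the named capture for the closing brace, to make the pattern shorter and avoid noise in the matches.]

On regexhero.net (the only free .NET regex tester I know of so far), I got the following capture groups: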

    1: R{R{abc}aD{mnoR{xyz}}}
    1: R{abc}
    1: D{mnoR{xyz}}
    1: R{xyz}

Breakdown of the regex:

    (?=                         # Opening positive lookahead
        ([A-Z]                  # Opening capture group and any uppercase letter (to match R & D)
            (?:                 # First non-capture group opening
                (?:             # Second non-capture group opening
                    (?'O'{)     # Get the named opening brace
                    [^{}]*      # Any non-brace
                )+              # Close of second non-capture group and repeat over as many times as necessary
                (?:             # Third non-capture group opening
                    (?'-O'})    # Removal of named opening brace when encountered
                    [^{}]*?     # Any other non-brace characters in case there are more nested braces
                )+              # Close of third non-capture group and repeat over as many times as necessary
            )+                  # Close of first non-capture group and repeat as many times as necessary for multiple side by side nested braces
            (?(O)(?!))          # Condition to prevent unbalanced braces
        )                       # Close capture group
    )                           # Close positive lookahead
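For readers without a .NET engine at hand, the idea behind the lookahead (try every start position and capture without consuming) can be imitated with a plain depth counter. The sketch below is in Java, whose java.util.regex has neither balancing groups nor recursion. The class and method names are made up for illustration, and it follows the idea rather than translating the regex: it stops at the first brace that brings the depth back to zero, so it does not extend over side-by-side groups the way the pattern's outer `+` can.

```java
import java.util.ArrayList;
import java.util.List;

class LookaheadSketch {
    // Hypothetical helper: returns every substring that starts with an
    // upper-case ASCII letter followed by '{' and runs to the matching '}'.
    static List<String> balancedGroups(String input) {
        List<String> found = new ArrayList<>();
        // Trying every start position is what the lookahead achieves:
        // an outer group does not "use up" the groups nested inside it.
        for (int start = 0; start + 1 < input.length(); start++) {
            char c = input.charAt(start);
            if (c < 'A' || c > 'Z' || input.charAt(start + 1) != '{') {
                continue;
            }
            int depth = 0; // stands in for the size of the O stack
            for (int i = start + 1; i < input.length(); i++) {
                char ch = input.charAt(i);
                if (ch == '{') {
                    depth++;
                } else if (ch == '}') {
                    depth--;
                    if (depth == 0) {
                        found.add(input.substring(start, i + 1));
                        break;
                    }
                }
            }
            // If the loop ends with depth > 0 the braces are unbalanced
            // and nothing is recorded for this start position.
        }
        return found;
    }

    public static void main(String[] args) {
        for (String s : balancedGroups("tes{}tR{R{abc}aD{mnoR{xyz}}}")) {
            System.out.println(s);
        }
    }
}
```

On the sample input this prints the four groups in order of their start position, outermost first.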

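The following will not work in C#

I actually wanted to try how this would work on the PCRE engine, since it has the option of recursive regex. I found that easier because I'm more familiar with it, and it yielded a shorter regex :)

 (?=([A-Z]{(?:[^{}]|(?1))+}))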

regex101 demo

    (?=                     # Opening positive lookahead
        ([A-Z]              # Opening capture group and any uppercase letter (to match R & D)
            {               # Opening brace
            (?:             # Opening non-capture group
                [^{}]       # Matches non braces
                |           # OR
                (?1)        # Recurse first capture group
            )+              # Close non-capture group and repeat as many times as necessary
            }               # Closing brace
        )                   # Close of capture group
    )                       # Close of positive lookahead
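To make the role of `(?1)` more concrete, here is a rough Java sketch of the same grammar as a recursive function: a group is an upper-case letter, `{`, one or more items (a nested group or a single non-brace character), then `}`. The names `groupEnd` and `allGroups` are hypothetical, and this is only an illustration of the recursion, not a description of how PCRE works internally.

```java
import java.util.ArrayList;
import java.util.List;

class RecursionSketch {
    // Returns the index just past a group that starts at pos, or -1 if no group starts there.
    static int groupEnd(String s, int pos) {
        if (pos + 1 >= s.length()) {
            return -1;
        }
        char c = s.charAt(pos);
        if (c < 'A' || c > 'Z' || s.charAt(pos + 1) != '{') {
            return -1;
        }
        int i = pos + 2;
        int items = 0;
        while (i < s.length()) {
            int nested = groupEnd(s, i);       // plays the role of (?1)
            if (nested >= 0) {
                i = nested;
            } else if (s.charAt(i) != '{' && s.charAt(i) != '}') {
                i++;                           // plays the role of [^{}]
            } else {
                break;
            }
            items++;
        }
        boolean closed = i < s.length() && s.charAt(i) == '}';
        // The regex uses +, so at least one item is required before the closing brace.
        return (items > 0 && closed) ? i + 1 : -1;
    }

    // Tries every start position, like the surrounding lookahead does.
    static List<String> allGroups(String s) {
        List<String> found = new ArrayList<>();
        for (int pos = 0; pos < s.length(); pos++) {
            int end = groupEnd(s, pos);
            if (end >= 0) {
                found.add(s.substring(pos, end));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(allGroups("tes{}tR{R{abc}aD{mnoR{xyz}}}"));
    }
}
```

The recursive call replaces the explicit stack: each nested group is handled by its own invocation, which is why the recursive pattern is so much shorter than the balancing-group one.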

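I'm not sure a single regex will be able to fit your needs: these nested substrings always mess it up.

One solution could be the following algorithm (written in Java, but I guess the translation to C# won't be that hard):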

    import java.util.LinkedHashMap;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Finds all matches (i.e. including sub/nested matches) of the regex in the input string.
     *
     * @param input
     *          The input string.
     * @param regex
     *          The regex pattern. It has to target the most nested substrings. For example, given the
     *          input string A{01B{23}45C{67}89}, if you want to catch every X{*} substring (where X is
     *          a capital letter), you have to use [A-Z][{][^{]+?[}] or [A-Z][{][^{}]+[}] instead of
     *          [A-Z][{].+?[}].
     * @param format
     *          The format string for the temporary placeholders; it is passed to String.format with a
     *          single integer counter. (The rest of this description and the method header were lost
     *          from the original post; the signature and the default below are reconstructed from the
     *          call site and are assumptions.)
     */
    public static List<String> findAllMatches(String input, String regex, String format) {
        // ASSUMPTION: the original default was lost; any placeholder works as long as
        // it contains no braces and is not itself matched by the regex.
        if (format == null) {
            format = "<<%d>>";
        }
        int counter = 0;
        Map<String, String> matches = new LinkedHashMap<String, String>();
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);
        // if a substring has been found
        while (matcher.find()) {
            // create a unique replacement string using the counter
            String replace = String.format(format, counter++);
            // store the relation "replacement string --> initial substring" in a queue
            matches.put(replace, matcher.group());
            String end = input.substring(matcher.end(), input.length());
            String start = input.substring(0, matcher.start());
            // replace the found substring by the created unique replacement string
            input = start + replace + end;
            // reiterate on the new input string (faking the original matcher.find() implementation)
            matcher = pattern.matcher(input);
        }
        List<Entry<String, String>> entries = new LinkedList<Entry<String, String>>(matches.entrySet());
        // for each relation "replacement string --> initial substring" of the queue
        for (int i = 0; i < entries.size(); i++) {
            Entry<String, String> current = entries.get(i);
            // for each relation that could have been found before the current one (i.e. more nested)
            for (int j = 0; j < i; j++) {
                Entry<String, String> previous = entries.get(j);
                // if the current initial substring contains the previous replacement string
                if (current.getValue().contains(previous.getKey())) {
                    // replace the previous replacement string by the previous initial substring
                    // in the current initial substring
                    current.setValue(current.getValue().replace(previous.getKey(), previous.getValue()));
                }
            }
        }
        return new LinkedList<String>(matches.values());
    }

So, in your case:

    String input = "tes{}tR{R{abc}aD{mnoR{xyz}}}";
    String regex = "[A-Z][{][^{}]+[}]";
    findAllMatches(input, regex, null);

 R{abc} R{xyz} D{mnoR{xyz}} R{R{abc}aD{mnoR{xyz}}}

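Balancing groups in .NET regular expressions give you control over exactly what gets captured, and the .NET regex engine keeps a full history of all captures of a group (unlike most other flavors, which only keep the last occurrence of each group).

The MSDN example is a little too complicated. A simpler way to match nested structures is: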

    (?>
        (?<O>)\p{Lu}\{   # Push to the O stack, and match an upper-case letter and {
        |                # OR
        \}(?<-O>)        # Match } and pop from the stack
        |                # OR
        \p{Ll}           # Match a lower-case letter
    )+
    (?(O)(?!))           # Make sure the stack is empty

Or on a single line:

 (?>(?<O>)\p{Lu}\{|\}(?<-O>)|\p{Ll})+(?(O)(?!))

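Working example on Regex Storm

In your example it also matches the "tes" at the start of the string, but don't worry about that, we're not done yet.

With a small correction we can also capture the occurrences between the R{ … } pairs: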

 (?>(?<O>)\p{Lu}\{|\}(?<Target-O>)|\p{Ll})+(?(O)(?!))

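Each Match will have a Group called "Target", and each such Group will have one Capture per occurrence. Those captures are the only thing you care about.

Working example on Regex Storm: click the Table tab and examine the 4 captures of ${Target}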
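The push/pop bookkeeping that the O stack and the Target group perform can be spelled out by hand. The following is a minimal Java sketch of that mechanism only (Java's regex engine has no balancing groups, so this is not the regex itself). The names are invented for illustration. As a simplifying assumption, a `{` that is not preceded by an upper-case letter is pushed as an unlabelled marker and produces no capture, whereas the pattern above simply refuses to match across it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class StackSketch {
    static List<String> targets(String input) {
        Deque<Integer> open = new ArrayDeque<>();   // plays the role of the O stack
        List<String> captures = new ArrayList<>();  // plays the role of Target's captures
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '{') {
                // Push the position of the upper-case letter, as (?<O>)\p{Lu}\{ does;
                // -1 marks a brace without such a letter (assumption, see lead-in).
                boolean labelled = i > 0 && Character.isUpperCase(input.charAt(i - 1));
                open.push(labelled ? i - 1 : -1);
            } else if (c == '}' && !open.isEmpty()) {
                // Pop and capture everything from the pushed position to this brace,
                // which is what \}(?<Target-O>) records.
                int start = open.pop();
                if (start >= 0) {
                    captures.add(input.substring(start, i + 1));
                }
            }
        }
        return captures;
    }

    public static void main(String[] args) {
        for (String s : targets("tes{}tR{R{abc}aD{mnoR{xyz}}}")) {
            System.out.println(s);
        }
    }
}
```

Because a capture is recorded when a group closes, the sketch lists inner groups before the groups that contain them: R{abc}, R{xyz}, D{mnoR{xyz}}, R{R{abc}aD{mnoR{xyz}}}.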

See also:

What are regular expression Balancing Groups?