How do I capture with balancing groups?

Let's say I have this text input.

tes{}tR{R{abc}aD{mnoR{xyz}}} 

I want to extract the following outputs:

  R{abc} R{xyz} D{mnoR{xyz}} R{R{abc}aD{mnoR{xyz}}} 

Currently, I can only extract what is inside the {} groups, using the balancing-group approach from MSDN. Here is the pattern:

  ^[^{}]*(((?'Open'{)[^{}]*)+((?'Target-Open'})[^{}]*)+)*(?(Open)(?!))$ 

Does anyone know how to include the R{} and D{} in the output?

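I think a different approach is needed here. Once you match the first, larger group R{R{abc}aD{mnoR{xyz}}} (see my comment about the possible typo), you can no longer get at the subgroups inside it: the outer match has already consumed those characters, so the regex cannot capture the individual R{ ... } groups.

So there has to be some way to capture without consuming, and the obvious way is a positive lookahead. From there, you can reuse the expression you had, with a few changes to fit the new focus. I came up with: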

 (?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!)))) 

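[I also renamed 'Open' to 'O' and removed the named capture for the closing brace, to make the pattern shorter and to avoid noise in the matches.]

On regexhero.net (the only free .NET regex tester I know of so far), I got the following capture groups: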

    1: R{R{abc}aD{mnoR{xyz}}}
    1: R{abc}
    1: D{mnoR{xyz}}
    1: R{xyz}

Breakdown of the regex:

    (?=                         # Opening positive lookahead
        ([A-Z]                  # Opening capture group and any uppercase letter (to match R & D)
            (?:                 # First non-capture group opening
                (?:             # Second non-capture group opening
                    (?'O'{)     # Get the named opening brace
                    [^{}]*      # Any non-brace
                )+              # Close of second non-capture group and repeat over as many times as necessary
                (?:             # Third non-capture group opening
                    (?'-O'})    # Removal of named opening brace when encountered
                    [^{}]*?     # Any other non-brace characters in case there are more nested braces
                )+              # Close of third non-capture group and repeat over as many times as necessary
            )+                  # Close of first non-capture group and repeat as many times as necessary for multiple side by side nested braces
            (?(O)(?!))          # Condition to prevent unbalanced braces
        )                       # Close capture group
    )                           # Close positive lookahead
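If you want to sanity-check those four captures without a .NET engine at hand, here is a minimal Java sketch of what the lookahead is doing: it tries every start position, and at each uppercase letter followed by `{` it walks forward with a depth counter until the braces balance. The class and method names (`LookaheadScan`, `findGroups`) are made up for illustration, and this only approximates the regex: it stops as soon as the depth returns to zero, whereas the pattern above would also swallow a directly adjacent `{...}` block.

```java
import java.util.ArrayList;
import java.util.List;

public class LookaheadScan {
    // Hypothetical helper: emulates "try the lookahead at every index".
    public static List<String> findGroups(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < s.length(); i++) {
            // A group must start with an ASCII uppercase letter followed by '{'.
            if (!isAsciiUpper(s.charAt(i)) || s.charAt(i + 1) != '{') {
                continue;
            }
            int depth = 0; // plays the role of the 'O' stack height
            for (int j = i + 1; j < s.length(); j++) {
                char c = s.charAt(j);
                if (c == '{') {
                    depth++;
                } else if (c == '}' && --depth == 0) {
                    out.add(s.substring(i, j + 1)); // balanced: record and stop
                    break;
                }
            }
            // If the inner loop runs off the end, the braces never balanced
            // and nothing is recorded for this start index.
        }
        return out;
    }

    private static boolean isAsciiUpper(char c) {
        return c >= 'A' && c <= 'Z';
    }

    public static void main(String[] args) {
        findGroups("tes{}tR{R{abc}aD{mnoR{xyz}}}").forEach(System.out::println);
    }
}
```

Because every start index is tried independently, overlapping and nested groups all show up, ordered by where they start, which is the same order as the capture list above.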

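The following does not work in C#

I actually wanted to see how this would work on the PCRE engine, since it supports recursive regex. I find that easier because I am more familiar with it, and it yields a shorter regex :)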

 (?=([A-Z]{(?:[^{}]|(?1))+})) 

regex101 demo

    (?=                     # Opening positive lookahead
        ([A-Z]              # Opening capture group and any uppercase letter (to match R & D)
            {               # Opening brace
            (?:             # Opening non-capture group
                [^{}]       # Matches non braces
                |           # OR
                (?1)        # Recurse first capture group
            )+              # Close non-capture group and repeat as many times as necessary
            }               # Closing brace
        )                   # Close of capture group
    )                       # Close of positive lookahead
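For readers without a PCRE engine, the recursion can be pictured as an ordinary recursive function. The Java sketch below is only an illustration of that idea, not a port of PCRE: `matchGroup` plays the role of capture group 1 and calls itself where the pattern says `(?1)`. The names `RecursiveGroups`, `matchGroup` and `findAll` are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class RecursiveGroups {
    // Mirrors [A-Z]{(?:[^{}]|(?1))+}: returns the index just past a group
    // starting at i, or -1 if no group starts there.
    public static int matchGroup(String s, int i) {
        if (i + 1 >= s.length() || !isAsciiUpper(s.charAt(i)) || s.charAt(i + 1) != '{') {
            return -1;
        }
        int p = i + 2;
        int items = 0; // the '+' quantifier needs at least one item
        while (p < s.length()) {
            char c = s.charAt(p);
            if (c == '}') {
                return items > 0 ? p + 1 : -1;
            }
            int nested = matchGroup(s, p); // the (?1) branch
            if (nested != -1) {
                p = nested;
            } else if (c != '{') {
                p++;                       // the [^{}] branch
            } else {
                return -1;                 // a bare '{' matches neither branch
            }
            items++;
        }
        return -1; // ran out of input before the closing brace
    }

    // Emulates the surrounding lookahead: try every start index.
    public static List<String> findAll(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            int end = matchGroup(s, i);
            if (end != -1) {
                out.add(s.substring(i, end));
            }
        }
        return out;
    }

    private static boolean isAsciiUpper(char c) {
        return c >= 'A' && c <= 'Z';
    }

    public static void main(String[] args) {
        findAll("tes{}tR{R{abc}aD{mnoR{xyz}}}").forEach(System.out::println);
    }
}
```

Like the recursive pattern, this rejects an empty group such as `R{}` and any brace that is not preceded by an uppercase letter, which is stricter than the balancing-group version.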

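I'm not sure a single regex can do what you need: these nested substrings always mess it up.

One solution could be the following algorithm (written in Java, but I guess translating it to C# won't be that hard):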

    import java.util.LinkedHashMap;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Finds all matches (i.e. including sub/nested matches) of the regex in the input string.
     *
     * @param input
     *            The input string.
     * @param regex
     *            The regex pattern. It has to target the most nested substrings. For example, given the following input string
     *            A{01B{23}45C{67}89}, if you want to catch every X{*} substring (where X is a capital letter),
     *            you have to use [A-Z][{][^{]+?[}] or [A-Z][{][^{}]+[}] instead of [A-Z][{].+?[}].
     * @param format
     *            The format must follow the String.format syntax and take a single integer argument
     *            (it is used as String.format(format, counter++)). The resulting placeholders must not
     *            occur in the input string.
     */
    // NOTE: the signature and the first lines of the body were lost from the original post.
    // They are reconstructed here from the call site and from how the variables are used.
    public static List<String> findAllMatches(String input, String regex, String format) {
        if (format == null) {
            format = "<<%d>>"; // assumed default placeholder; the original value is unknown
        }
        int counter = 0;
        Map<String, String> matches = new LinkedHashMap<String, String>();
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);
        // if a substring has been found
        while (matcher.find()) {
            // create a unique replacement string using the counter
            String replace = String.format(format, counter++);
            // store the relation "replacement string --> initial substring" in a queue
            matches.put(replace, matcher.group());
            String end = input.substring(matcher.end(), input.length());
            String start = input.substring(0, matcher.start());
            // replace the found substring by the created unique replacement string
            input = start + replace + end;
            // reiterate on the new input string (faking the original matcher.find() implementation)
            matcher = pattern.matcher(input);
        }
        List<Entry<String, String>> entries = new LinkedList<Entry<String, String>>(matches.entrySet());
        // for each relation "replacement string --> initial substring" of the queue
        for (int i = 0; i < entries.size(); i++) {
            Entry<String, String> current = entries.get(i);
            // for each relation that could have been found before the current one (i.e. more nested)
            for (int j = 0; j < i; j++) {
                Entry<String, String> previous = entries.get(j);
                // if the current initial substring contains the previous replacement string
                if (current.getValue().contains(previous.getKey())) {
                    // replace the previous replacement string by the previous initial substring in the current initial substring
                    current.setValue(current.getValue().replace(previous.getKey(), previous.getValue()));
                }
            }
        }
        return new LinkedList<String>(matches.values());
    }

So, in your case:

    String input = "tes{}tR{R{abc}aD{mnoR{xyz}}}";
    String regex = "[A-Z][{][^{}]+[}]";
    findAllMatches(input, regex, null);

Returns:

 R{abc} R{xyz} D{mnoR{xyz}} R{R{abc}aD{mnoR{xyz}}} 

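In .NET regex, balancing groups let you control exactly what gets captured, and the .NET regex engine keeps a full history of all captures of a group (unlike most other flavors, which keep only the last capture of each group).

The MSDN example is a bit too complicated. A simpler way to match nested structures is: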

    (?>
        (?<O>)\p{Lu}\{   # Push to the O stack, and match an upper-case letter and {
        |                # OR
        \}(?<-O>)        # Match } and pop from the stack
        |                # OR
        \p{Ll}           # Match a lower-case letter
    )+
    (?(O)(?!))           # Make sure the stack is empty

Or, on one line:

 (?>(?<O>)\p{Lu}\{|\}(?<-O>)|\p{Ll})+(?(O)(?!)) 

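Working example on Regex Storm

In your example it also matches the "tes" at the start of the string, but don't worry, we're not done yet.

With a small correction, we can also capture the occurrences between the R{...} pairs: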

 (?>(?<O>)\p{Lu}\{|\}(?<Target-O>)|\p{Ll})+(?(O)(?!)) 

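Each Match will have a Group called "Target", and each such Group will have one Capture per occurrence. Those captures are the only thing you care about.

Working example on Regex Storm: click the Table tab and examine the 4 captures of ${Target}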
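To make the push/pop mechanics concrete, here is a rough Java sketch that does by hand what the O stack and the Target group do: it remembers where each "uppercase letter plus `{`" started, and when the matching `}` arrives it records everything from that remembered position up to and including the `}`. Java's regex engine has no balancing groups, so this is an emulation with an explicit stack, not the same mechanism. `TargetCaptures` and `captures` are illustrative names, and pushing a `-1` marker for a `{` that has no uppercase letter before it is my own assumption about how to treat such braces.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TargetCaptures {
    // Hypothetical helper: returns the substrings a "Target" group would collect.
    public static List<String> captures(String s) {
        Deque<Integer> open = new ArrayDeque<>(); // start indexes of open groups
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                // Push the index of the uppercase letter, or -1 for a bare brace.
                boolean named = i > 0 && Character.isUpperCase(s.charAt(i - 1));
                open.push(named ? i - 1 : -1);
            } else if (c == '}' && !open.isEmpty()) {
                // Pop, and capture from the remembered start through this brace.
                int start = open.pop();
                if (start >= 0) {
                    out.add(s.substring(start, i + 1));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        captures("tes{}tR{R{abc}aD{mnoR{xyz}}}").forEach(System.out::println);
    }
}
```

Captures come out in the order their closing braces appear, innermost first, which is the order requested in the question.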

See also:

  • What are regular expression balancing groups?