Word wrapping with a regular expression

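EDIT FOR CLARITY - I know there are ways to do this in multiple steps, or using LINQ or vanilla C# string manipulation. The reason I am using a single regex call is that I wanted practice with complex regex patterns. - END EDIT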

I am trying to write a regular expression that performs word wrap. It is very close to the desired output, but I can't quite get it to work.

Regex.Replace(text, @"(?<=^|\G)(.{1,20}(\s|$))", "$1\r\n", RegexOptions.Multiline) 

This correctly wraps lines that are too long, but it also adds a line break where one already exists.

Input

 "This string is really long. There are a lot of words in it.\r\nHere's another line in the string that's also very long." 

Expected output

 "This string is \r\nreally long. There \r\nare a lot of words \r\nin it.\r\nHere's another line \r\nin the string that's \r\nalso very long." 

Actual output

 "This string is \r\nreally long. There \r\nare a lot of words \r\nin it.\r\n\r\nHere's another line \r\nin the string that's \r\nalso very long.\r\n" 

Note the double "\r\n" between the sentences where the input already had a line break, and the extra "\r\n" appended at the end.

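Perhaps there is a way to apply different replacement patterns conditionally? I.e., if the match ends in "\r\n", use the replacement pattern "$1"; otherwise, use the replacement pattern "$1\r\n".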
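One way to get that conditional behaviour is to pass a callback instead of a replacement string (in .NET, the counterpart is the `Regex.Replace` overload that takes a `MatchEvaluator`). The sketch below is JavaScript rather than C#, and both the name `wrapKeepingBreaks` and the simplified pattern are assumptions made for illustration. JavaScript has no `\G`, so the sketch relies on the global flag to scan left to right.

```javascript
// Sketch only: `wrapKeepingBreaks` is a hypothetical helper, not code from the question.
function wrapKeepingBreaks(text, width) {
  // Up to `width` non-linebreak characters, then an existing "\r\n",
  // a single whitespace character, or the end of the input.
  const chunk = new RegExp(`([^\\r\\n]{1,${width}})(\\r\\n|\\s|$)`, "g");
  return text.replace(chunk, (match, body, sep, offset) => {
    const atEnd = offset + match.length === text.length;
    // Keep the match unchanged when it already ends in a line break
    // or when it is the last chunk; otherwise append a new break.
    return sep === "\r\n" || atEnd ? match : match + "\r\n";
  });
}
```

Called with the input from the question and a width of 20, this should produce the expected output shown above, with no doubled break and no trailing "\r\n".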

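Here is a link to a similar question about wrapping strings that contain no whitespace, which I used as a starting point: Regular expression to find unbroken text and insert space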

This was quickly tested in Perl.

Edit - This regex code simulates the word wrap (good or bad) used in MS-Windows Notepad.exe

    # MS-Windows "Notepad.exe Word Wrap" simulation
    # ( N = 16 )
    # ============================
    # Find:     @"(?:((?>.{1,16}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,16})(?:\r?\n)?|(?:\r?\n|$))"
    # Replace:  @"$1\r\n"
    # Flags:    Global

    # Note - Through trial and error discovery, it appears Notepad accepts an extra whitespace
    # (possibly in the N+1 position) to help alignment. This does not matter because its viewport hides it.
    # There is no trimming of any whitespace, so the wrapped buffer could be reconstituted by inserting/detecting a
    # wrap point code which is different from a linebreak.
    # This regex works on un-wrapped source, but could probably be adjusted to produce/work on wrapped buffer text.
    # To reconstitute the source all that is needed is to remove the wrap code, which is probably just an extra "\r".

    (?:
         # -- Words/Characters
         (                                # (1 start)
              (?>                         # Atomic Group - Match words with valid breaks
                   .{1,16}                #  1-N characters
                                          #  Followed by one of 4 prioritized, non-linebreak whitespace
                   (?:                    #  break types:
                        (?<= [^\S\r\n] )  # 1. - Behind a non-linebreak whitespace
                        [^\S\r\n]?        #      ( optionally accept an extra non-linebreak whitespace )
                     |  (?= \r? \n )      # 2. - Ahead a linebreak
                     |  $                 # 3. - EOS
                     |  [^\S\r\n]         # 4. - Accept an extra non-linebreak whitespace
                   )
              )                           # End atomic group
           |
              .{1,16}                     # No valid word breaks, just break on the N'th character
         )                                # (1 end)
         (?: \r? \n )?                    # Optional linebreak after Words/Characters
      |
         # -- Or, Linebreak
         (?: \r? \n | $ )                 # Stand alone linebreak or at EOS
    )

Test case: the wrap width N is 16. The output matches Notepad's, and it does so across various widths.

    $/ = undef;

    $string1 = <DATA>;

    $string1 =~ s/(?:((?>.{1,16}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,16})(?:\r?\n)?|(?:\r?\n|$))/$1\r\n/g;

    print $string1;

    __DATA__
    hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
    bbbbbbbbbbbbbbbbEDIT FOR CLARITY - I know there are ways to do this in multiple steps, or using LINQ or vanilla C# string manipulation. The reason I am using a single regex call, is because I wanted practice. with complex regex patterns. - END EDIT
    pppppppppppppppppppUf

Output >>

  hhhhhhhhhhhhhhhh hhhhhhhhhhhhhhh bbbbbbbbbbbbbbbb EDIT FOR CLARITY - I know there are ways to do this in multiple steps, or using LINQ or vanilla C# string manipulation. The reason I am using a single regex call, is because I wanted practice. with complex regex patterns. - END EDIT pppppppppppppppp pppUf 

I would write an extension method like this.

    var input = "This string is really long. There are a lot of words in it.\r\nHere's another line in the string that's also very long.";
    var lines = input.SplitByLength(20).ToList();

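    public static partial class MyExtensions
    {
        // Splits on "up to maxLen characters followed by whitespace or end";
        // the capturing group makes Regex.Split return the chunks themselves.
        public static IEnumerable<string> SplitByLength(this string input, int maxLen)
        {
            return Regex.Split(input, @"(.{1," + maxLen + @"})(?:\s|$)")
                        .Where(x => x.Length > 0)
                        .Select(x => x.Trim());
        }
    }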

OUTPUT

    This string is
    really long. There
    are a lot of words
    in it.
    Here's another line
    in the string that's
    also very long.
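For comparison, the split-based idea can be sketched in JavaScript as well. The function name `splitByLength` is an assumption, and this is a rough port rather than the author's code. One detail differs from .NET: in JavaScript `.` does not match "\r" or "\n", so the sketch trims before it filters out empty pieces.

```javascript
// Sketch only: `splitByLength` is a hypothetical JavaScript port, not the author's code.
function splitByLength(input, maxLen) {
  // Because of the capturing group, split() keeps each wrapped chunk in the result.
  const re = new RegExp(`(.{1,${maxLen}})(?:\\s|$)`);
  return input
    .split(re)
    .map((piece) => piece.trim())
    // Filter after trimming: a stray "\n" left between two chunks
    // would otherwise survive as an empty line.
    .filter((piece) => piece.length > 0);
}
```

With the question's input and a maximum length of 20, this should yield the same seven lines as the output above.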

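In the first pass, add a placeholder at each wrap point instead of "\r\n". In a second pass, replace every "\r\n" that is followed by the placeholder with just "\r\n". Finally, make a third pass and replace the remaining placeholders with "\r\n".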

For example, using \u0000 as the placeholder.

This of course only works if the original string does not contain nulls.

    string text = "This string is really long. There are a lot of words in it.\r\nHere's another line in the string that's also very long.";
    Console.WriteLine(text);
    text = Regex.Replace(text, @"(?<=^|\G)(.{1,20}(\s|$))", "$1\u0000", RegexOptions.Multiline); // break added after original
    text = Regex.Replace(text, "\r\n\u0000", "\r\n", RegexOptions.Multiline);
    text = Regex.Replace(text, "\u0000", "\r\n", RegexOptions.Multiline);
    Console.WriteLine(text);
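The same three passes can be sketched in JavaScript, with an explicit check for the "no nulls in the input" precondition. The name `wrapWithPlaceholder`, the simplified pattern, and the extra step that drops a marker at the very end of the text are all assumptions added for illustration and are not part of the C# code above.

```javascript
// Sketch only: `wrapWithPlaceholder` and its pattern are assumptions for illustration.
function wrapWithPlaceholder(text, width) {
  const MARK = "\u0000";
  // Guard the stated precondition: the marker must not occur in the input.
  if (text.includes(MARK)) {
    throw new Error("input already contains the placeholder character");
  }
  const chunk = new RegExp(`[^\\r\\n]{1,${width}}(?:\\r\\n|\\s|$)`, "g");
  return text
    .replace(chunk, `$&${MARK}`)        // pass 1: mark every wrap point
    .split("\r\n" + MARK).join("\r\n")  // pass 2: drop marks that follow an existing break
    .replace(/\u0000$/, "")             // extra step: drop a mark at the very end
    .split(MARK).join("\r\n");          // pass 3: turn the remaining marks into breaks
}
```

Without the extra step, the last marker would become a trailing "\r\n", which is the second symptom described in the question.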

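You did not say what should happen if a single word is longer than the wrap width. If a word is longer than 20 characters, I chose to split it at the maximum width (20 in this case):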

    resultString = Regex.Replace(subjectString, @"(.{1,19}\S)(?:\s+|$)|(.{20})", @"$1$2
    ", RegexOptions.Multiline);

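There is an LF after the $1$2; I am not sure how it will show up here. You can insert \r\n there instead, but for some reason that does not work in my simulator (probably because inside a verbatim @"..." string, \r\n stays as literal backslash characters rather than CR LF; a regular string literal "$1$2\r\n" would avoid that):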

 resultString = Regex.Replace(subjectString, @"(.{1,19}\S)(?:\s+|$)|(.{20})", @"$1$2\r\n", RegexOptions.Multiline); 
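To see the two alternatives at work, here is a small JavaScript sketch of the same pattern with the width as a parameter. The name `wrapHard` is hypothetical, and the sketch assumes a width of at least 2. The replacement is an ordinary JavaScript string, so "\r\n" really is CR LF here.

```javascript
// Sketch only: `wrapHard` is a hypothetical name; the pattern follows the answer above.
function wrapHard(text, width) {
  // Alternative 1: up to `width` characters ending in a non-space, then whitespace or end.
  // Alternative 2: no usable break, so take exactly `width` characters.
  const re = new RegExp(`(.{1,${width - 1}}\\S)(?:\\s+|$)|(.{${width}})`, "g");
  // A group that did not take part in the match is replaced by an empty string,
  // so "$1$2" yields whichever alternative matched.
  return text.replace(re, "$1$2\r\n");
}
```

For example, `wrapHard("supercalifragilistic word", 10)` should give "supercalif\r\nragilistic\r\nword\r\n": the long word is cut at the width, and a break is also appended after the last line.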

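Here is a solution that combines some of the good ideas above. I wrote a regex from scratch and found that it is very similar to the one sln provided, but it is a bit shorter and probably does less backtracking: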

    # assuming a max line length of 16
    (?:
        [^\r\n]{1,16}(?=\s|$)  # non-linebreaking characters followed by a space
                               # or end-of-string, up to the max line length
        |[^\r\n]{16}           # Or for really long words: a sequence of non-breaking
                               # characters exactly the line length
        |(?<=\n)\r?\n          # Or blank lines: a line break following another line break.
                               # This works for \n or \r\n styles.
    )

Like L.B, I put the regex in an extension method, WordWrap:

    void Main()
    {
        var lineLen = 25;
        var test1 = "Some random words like calendar boat and breathe.\nAnd an extra line.\n\n\nAnd here's one that has to break in the middle because there are no spaces:\n" +
                    String.Join("", Enumerable.Range(1, lineLen + 5).Select(i => (i % 10).ToString()));
        var test2 = test1.Replace("\n", "\r\n");

        StringHelper.StringRuler(lineLen).Dump("ruler");
        String.Join("\n", test1.WordWrap(lineLen)).Dump("test 1");
        String.Join("\r\n", test2.WordWrap(lineLen)).Dump("test 2");
    }

    public static class StringHelper
    {
        // Returns each wrapped line (trimmed); blank lines come back as empty strings.
        public static IEnumerable<string> WordWrap(this string source, int lineLength)
        {
            return new Regex(
                    @"(?:[^\r\n]{1,lineLength}(?=\s|$)|[^\r\n]{lineLength}|(?<=\n)\r?\n)"
                        .Replace("lineLength", lineLength.ToString()))
                .Matches(source)
                .Cast<Match>() // http://stackoverflow.com/a/7274451/555142
                .Select(m => m.Value.Trim());
        }

        // Builds a three-line ruler: tens digits, units digits, and a row of dashes.
        public static string StringRuler(int lineLength)
        {
            return String.Join("", Enumerable.Range(1, lineLength)
                    .Select(i => ((i % 10) == 0 ? (i / 10).ToString() : " "))) + "\n" +
                String.Join("", Enumerable.Range(1, lineLength)
                    .Select(i => (i % 10).ToString())) + "\n" +
                String.Join("", Enumerable.Range(1, lineLength)
                    .Select(i => "-"));
        }
    }

Tested with LinqPad (instant share). There are two tests: the first uses \n line breaks, the second uses \r\n line breaks.

    ruler
             1         2
    1234567890123456789012345
    -------------------------

    test 1
    Some random words like
    calendar boat and
    breathe.
    And an extra line.


    And here's one that has
    to break in the middle
    because there are no
    spaces:
    1234567890123456789012345
    67890

    test 2
    Some random words like
    calendar boat and
    breathe.
    And an extra line.


    And here's one that has
    to break in the middle
    because there are no
    spaces:
    1234567890123456789012345
    67890
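The match-instead-of-replace approach carries over to other regex engines. Below is a JavaScript sketch of the same three-way pattern; `wordWrapLines` is a hypothetical name, and the sketch assumes an engine that supports lookbehind and `String.prototype.matchAll`.

```javascript
// Sketch only: `wordWrapLines` is a hypothetical JavaScript port of the pattern above.
function wordWrapLines(source, lineLength) {
  const re = new RegExp(
    `[^\\r\\n]{1,${lineLength}}(?=\\s|$)` + // text that fits, ending before whitespace or end of input
      `|[^\\r\\n]{${lineLength}}` +         // over-long word: cut at exactly lineLength characters
      `|(?<=\\n)\\r?\\n`,                   // a line break that follows another one (blank line)
    "g"
  );
  // Each match is one output line; trimming removes the leading space left by the previous break.
  return Array.from(source.matchAll(re), (m) => m[0].trim());
}
```

Because a blank line in the input is matched by the third alternative and then trimmed, it comes back as an empty string, so joining the result with "\n" or "\r\n" preserves paragraph gaps.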

My solution in JS:

    function wordWrap(s, width) {
        var r = '(?:(.{1,' + width + '})[ \\r\\t]+|(.{' + width + '}))(?!$)';
        r = new RegExp(r, 'g');
        // console.log(r);
        return s.replace(r, '$1$2\n');
    }