Automatically generating a Regex from a set of strings residing in a DB, using C#

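I have about 100,000 strings in a database, and I would like to know whether there is a way to automatically generate a regex pattern from these strings. All of them are alphabetic strings that use only a subset of the English alphabet; for example, X, W and V are not used. Is there any function or library that can help me achieve this in C#? Example strings are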

KHTK
RAZ

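Given these two strings, my goal is to generate a regex that allows patterns like (k, kh, kht, khtk, r, ra, raz), case-insensitively of course. I have downloaded and used some C# applications that help in generating regexes, but they are not useful in my scenario, because I want a process in which I read strings sequentially from the db and add rules to the regex, so that this regex can be reused later in the application or saved to disk.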

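I am new to regex patterns and don't know whether what I am asking is even possible. If it is not possible, please suggest some alternative approach.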

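A simple (some might say naive) approach would be to create a regex pattern that concatenates all the search strings, separated by the alternation operator |:

For your example strings, that would give KHTK|RAZ.
To have the regex capture prefixes as well, we include those prefixes in the pattern, e.g. K|KH|KHT|KHTK|R|RA|RAZ.
Finally, to make sure these strings are matched only as whole strings, and not as part of larger strings, we anchor each alternative with the beginning-of-line and end-of-line operators: ^K$|^KH$|^KHT$|^KHTK$|^R$|^RA$|^RAZ$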
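The three steps above can be sketched as a small helper. This is only a minimal sketch: the names `PrefixPattern` and `Build` are made up for illustration, and the call to `Regex.Escape` is an added precaution (it leaves plain letters unchanged) in case the data ever contains regex metacharacters.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class PrefixPattern   // hypothetical helper, not from the original answer
{
    // Builds "^K$|^KH$|..." from the given words: one anchored alternative
    // per distinct prefix, in first-seen order.
    public static string Build(IEnumerable<string> words)
    {
        var seen = new HashSet<string>();
        var pattern = new StringBuilder();
        foreach (string word in words)
        {
            for (int len = 1; len <= word.Length; ++len)
            {
                string prefix = word.Substring(0, len);
                if (!seen.Add(prefix)) continue;          // prefix is already in the pattern
                if (pattern.Length > 0) pattern.Append('|');
                pattern.Append('^').Append(Regex.Escape(prefix)).Append('$');
            }
        }
        return pattern.ToString();
    }
}
```

Since the question asks for case-insensitive matching, the resulting string would presumably be passed to the constructor together with `RegexOptions.IgnoreCase`, e.g. `new Regex(PrefixPattern.Build(new[] { "KHTK", "RAZ" }), RegexOptions.IgnoreCase)`.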

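We would expect the Regex class implementation to do the heavy lifting of converting the long regex pattern string into an efficient matcher.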
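If relying on the Regex class to optimize a long flat alternation turns out not to be good enough, a different technique is to fold shared prefixes by hand: insert the strings into a trie and emit nested optional groups, so that KHTK and RAZ become ^(?:K(?:H(?:T(?:K)?)?)?|R(?:A(?:Z)?)?)$. The sketch below is an alternative to the flat pattern, not the approach described in this answer; the class name `TriePattern` is made up, and whether it actually matches faster is not something measured here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class TriePattern   // hypothetical helper, not from the original answer
{
    private sealed class Node
    {
        // Sorted so that the generated pattern is deterministic.
        public readonly SortedDictionary<char, Node> Children =
            new SortedDictionary<char, Node>();
    }

    public static string Build(IEnumerable<string> words)
    {
        // Insert every word into the trie, one character per level.
        var root = new Node();
        foreach (string word in words)
        {
            Node node = root;
            foreach (char c in word)
            {
                Node next;
                if (!node.Children.TryGetValue(c, out next))
                {
                    next = new Node();
                    node.Children.Add(c, next);
                }
                node = next;
            }
        }

        var sb = new StringBuilder("^");
        AppendChildren(root, sb, false);   // at least one letter is required
        sb.Append('$');
        return sb.ToString();
    }

    // Every prefix is a valid match, so each level below the root is optional.
    private static void AppendChildren(Node node, StringBuilder sb, bool optional)
    {
        if (node.Children.Count == 0) return;
        sb.Append("(?:");
        bool first = true;
        foreach (KeyValuePair<char, Node> child in node.Children)
        {
            if (!first) sb.Append('|');
            first = false;
            sb.Append(Regex.Escape(child.Key.ToString()));
            AppendChildren(child.Value, sb, true);
        }
        sb.Append(optional ? ")?" : ")");
    }
}
```

One edge case to be aware of: with an empty input set this sketch produces ^$, which matches the empty string.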

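The sample program here generates 10,000 random strings, and a regular expression that matches exactly those strings and all of their prefixes. The program then verifies that the regex does indeed match only those strings, and times how long it all takes.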

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace ConsoleApplication
    {
        class Program
        {
            private static Random r = new Random();

            // Create a string with randomly chosen letters, of a randomly chosen
            // length between the given min (inclusive) and max (exclusive).
            private static string RandomString(int minLength, int maxLength)
            {
                StringBuilder b = new StringBuilder();
                int length = r.Next(minLength, maxLength);
                for (int i = 0; i < length; ++i)
                {
                    b.Append(Convert.ToChar(65 + r.Next(26)));
                }
                return b.ToString();
            }

            static void Main(string[] args)
            {
                int stringCount = 10000;                     // number of random strings to generate
                StringBuilder pattern = new StringBuilder(); // our regular expression under construction
                HashSet<string> strings = new HashSet<string>(); // a set of the random strings (and their
                                                                 // prefixes) we created, for verifying the
                                                                 // regex correctness

                // generate random strings, track their prefixes in the set,
                // and add their prefixes to our regular expression
                for (int i = 0; i < stringCount; ++i)
                {
                    // make a random string, 2-4 chars long
                    // (the upper bound of Random.Next is exclusive)
                    string nextString = RandomString(2, 5);

                    // for each prefix of the random string...
                    for (int prefixLength = 1; prefixLength <= nextString.Length; ++prefixLength)
                    {
                        string prefix = nextString.Substring(0, prefixLength);

                        // ...add it to both the set and our regular expression pattern
                        if (!strings.Contains(prefix))
                        {
                            strings.Add(prefix);
                            pattern.Append(((pattern.Length > 0) ? "|" : "") + "^" + prefix + "$");
                        }
                    }
                }

                // create a regex from the pattern (and time how long that takes)
                DateTime regexCreationStartTime = DateTime.Now;
                Regex regex = new Regex(pattern.ToString());
                DateTime regexCreationEndTime = DateTime.Now;

                // make sure our regex correctly matches all the strings, and their
                // prefixes (and time how long that takes as well)
                DateTime matchStartTime = DateTime.Now;
                foreach (string s in strings)
                {
                    if (!regex.IsMatch(s))
                    {
                        Console.WriteLine("uh oh!");
                    }
                }
                DateTime matchEndTime = DateTime.Now;

                // generate some new random strings, and verify that the regex
                // indeed does not match the ones it's not supposed to.
                for (int i = 0; i < 1000; ++i)
                {
                    string s = RandomString(2, 5);
                    if (!strings.Contains(s) && regex.IsMatch(s))
                    {
                        Console.WriteLine("uh oh!");
                    }
                }

                Console.WriteLine("Regex create time: {0} millisec",
                    (regexCreationEndTime - regexCreationStartTime).TotalMilliseconds);
                // note: this divides by stringCount, although strings.Count
                // matches (strings plus prefixes) were actually performed
                Console.WriteLine("Average match time: {0} millisec",
                    (matchEndTime - matchStartTime).TotalMilliseconds / stringCount);

                Console.ReadLine();
            }
        }
    }

On an Intel Core2 box I get the following numbers for 10,000 strings:

    Regex create time: 46 millisec
    Average match time: 0.3222 millisec

When the number of strings is increased 10-fold (to 100,000), I get:

    Regex create time: 288 millisec
    Average match time: 1.25577 millisec

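This is higher, but the growth is less than linear: for 10 times as many strings, creation time grows roughly 6-fold and average match time roughly 4-fold.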

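The app's memory consumption (at 10,000 strings) starts at ~9MB, peaks at ~23MB, which must include both the regex and the string set, and drops towards the end to ~16MB (garbage collection kicking in?). Draw your own conclusions from that; the program is not set up to separate the regex's memory consumption from that of the other data structures.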
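One rough way to look at the regex's share in isolation is to compare the managed heap size just before and just after constructing it. The sketch below assumes `GC.GetTotalMemory(true)`, which forces a collection before reporting, is a good enough approximation; the helper name `RegexMemory.Measure` is made up. It only sees managed memory, and anything the Regex allocates lazily on its first match would not be counted, so treat the result as an estimate at best.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexMemory   // hypothetical helper, not from the original answer
{
    // Returns the approximate change in managed heap size (in bytes)
    // observed while constructing a Regex from the given pattern.
    public static long Measure(string pattern, out Regex regex)
    {
        long before = GC.GetTotalMemory(true);   // true = collect garbage first
        regex = new Regex(pattern);
        long after = GC.GetTotalMemory(true);
        return after - before;
    }
}
```

In the sample program this would replace the plain `new Regex(pattern.ToString())` call, with the returned byte count printed next to the timings.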
