搜索C＃中名称列表的最快方法

在我的应用程序中，我的内存中可能包含100,000个字符串。我需要找到包含特定关键字的前20个字符串（不区分大小写）。这很容易，我只是运行以下LINQ。

from s in stringList where s.ToLower().Contains(searchWord.ToLower()) select s

但是，我有一种明显的感觉，我可以更快地做到这一点，我需要找到相应的方法，因为我需要每秒多次查看此列表。

另一种选择，虽然需要相当大的内存，但是要预先计算类似后缀数组（字符串中的位置列表，按照它们指向的字符串排序）。

http://en.wikipedia.org/wiki/Suffix_array

如果您要搜索的字符串列表相对静态，这将是最可行的。整个字符串索引列表可以存储在单个元组数组（indexOfString，positionInString）中，您可以使用String.Compare(keyword, 0, target, targetPos, keyword.Length)执行二进制搜索。

因此，如果你有100,000个平均20长度的字符串，那么你需要100,000 * 20 * 2 * sizeof（int）的结构内存。您可以通过将indexOfString和positionInString打包到单个32位int中来减少一半，例如，最低12位中的positionInString和其余高位中的indexOfString。你只需要做一点点摆弄就可以恢复两个值。重要的是要注意结构本身不包含字符串或子字符串。您搜索的字符串仅存在一次。

这基本上会给你一个完整的索引，并允许非常快速地找到任何子字符串（对后缀数组所代表的索引进行二进制搜索），并进行最少的实际字符串比较。

如果内存是亲爱的，原始powershell算法的简单优化将是预先计算唯一字符的字典，并分配序号以表示每个字符。然后为每个字符串预先计算一个位数组，并为字符串中包含的每个唯一字符设置位。由于您的字符串相对较短，因此应该存在相当大量的重新调整BitArrays的可变性（如果您的字符串很长，它将无法正常工作）。然后，您只需计算BitArray或搜索关键字，并仅搜索其中keywordBits & targetBits == keywordBits字符串中的关键字。如果您的字符串被预转换为小写字母，并且只是英文字母，则BitArray可能适合单个int。因此，这需要最少的额外内存，易于实现，并允许您快速筛选出您肯定无法找到关键字的字符串。这可能是一个有用的优化，因为字符串搜索速度很快，但是您使用powershell搜索可以做很多事情。

编辑对于那些感兴趣的人，这是我提出的初始解决方案的基本实现。我使用OP描述的100,000个随机生成的长度字符串运行测试。虽然构建和排序索引花了大约30秒，但是一旦制作完成，搜索关键字3000次的速度对于暴力来说是49,805毫秒，使用索引搜索的速度是18毫秒，所以快了几千倍。如果您很少构建列表，那么我最初构建后缀数组的简单但相对较慢的方法就足够了。有更聪明的方法来构建它更快，但需要比我下面的基本实现更多的编码。

 // little test console app static void Main(string[] args) { var list = new SearchStringList(true); list.Add("Now is the time"); list.Add("for all good men"); list.Add("Time now for something"); list.Add("something completely different"); while (true) { string keyword = Console.ReadLine(); if (keyword.Length == 0) break; foreach (var pos in list.FindAll(keyword)) { Console.WriteLine(pos.ToString() + " =>" + list[pos.ListIndex]); } } } ~~~~~~~~~~~~~~~~~~ // file for the class that implements a simple suffix array using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Collections; namespace ConsoleApplication1 { public class SearchStringList { private List strings = new List(); private List positions = new List(); private bool dirty = false; private readonly bool ignoreCase = true; public SearchStringList(bool ignoreCase) { this.ignoreCase = ignoreCase; } public void Add(string s) { if (s.Length > 255) throw new ArgumentOutOfRangeException("string too big."); this.strings.Add(s); this.dirty = true; for (byte i = 0; i < s.Length; i++) this.positions.Add(new StringPosition(strings.Count-1, i)); } public string this[int index] { get { return this.strings[index]; } } public void EnsureSorted() { if (dirty) { this.positions.Sort(Compare); this.dirty = false; } } public IEnumerable FindAll(string keyword) { var idx = IndexOf(keyword); while ((idx >= 0) && (idx < this.positions.Count) && (Compare(keyword, this.positions[idx]) == 0)) { yield return this.positions[idx]; idx++; } } private int IndexOf(string keyword) { EnsureSorted(); // binary search // When the keyword appears multiple times, this should // point to the first match in positions. The following // positions could be examined for additional matches int minP = 0; int maxP = this.positions.Count - 1; while (maxP > minP) { int midP = minP + ((maxP - minP) / 2); if (Compare(keyword, this.positions[midP]) > 0) { minP = midP + 1; } else { maxP = midP; } } if ((maxP == minP) && (Compare(keyword, this.positions[minP]) == 0)) { return minP; } else { return -1; } } private int Compare(StringPosition pos1, StringPosition pos2) { int len = Math.Max(this.strings[pos1.ListIndex].Length - pos1.StringIndex, this.strings[pos2.ListIndex].Length - pos2.StringIndex); return String.Compare(strings[pos1.ListIndex], pos1.StringIndex, this.strings[pos2.ListIndex], pos2.StringIndex, len, ignoreCase); } private int Compare(string keyword, StringPosition pos2) { return String.Compare(keyword, 0, this.strings[pos2.ListIndex], pos2.StringIndex, keyword.Length, this.ignoreCase); } // Packs index of string, and position within string into a single int. This is // set up for strings no greater than 255 bytes. If longer strings are desired, // the code for the constructor, and extracting ListIndex and StringIndex would // need to be modified accordingly, taking bits from ListIndex and using them // for StringIndex. public struct StringPosition { public static StringPosition NotFound = new StringPosition(-1, 0); private readonly int position; public StringPosition(int listIndex, byte stringIndex) { this.position = (listIndex < 0) ? -1 : this.position = (listIndex << 8) | stringIndex; } public int ListIndex { get { return (this.position >= 0) ? (this.position >> 8) : -1; } } public byte StringIndex { get { return (byte) (this.position & 0xFF); } } public override string ToString() { return ListIndex.ToString() + ":" + StringIndex; } } } }

寻找子串（不完全匹配）是非常困难的。内置任何东西都无法帮助您解决这个问题。我建议您查看后缀树数据结构，可用于有效地查找子字符串。

您可以将searchWord.ToLower()拉出到本地变量以节省大量字符串操作，顺便说一句。您还可以预先计算stringList的小写版本。如果你不能预先计算，至少使用s.IndexOf(searchWord, StringComparison.InvariantCultureIgnoreCase) != -1 。这节省了昂贵的ToLower呼叫。

您还可以在查询上拍一个.AsParallel。

有一种方法会快得多。但这意味着要寻找精确的单词匹配，而不是使用Containsfunction。

基本上，如果你有内存，你可以创建一个单词词典，它也引用了找到单词的字符串的某种ID（或ID）。

所以Dictionary可能是> 。这里的好处当然是你将很多单词合并到一个较小的集合中。而且，由于字典是在散列表上构建的，所以字典的查找速度非常快。

现在，如果这不是您正在寻找的，您可以搜索内存中的全文搜索库。 SQL Server支持使用索引进行全文搜索，以加快传统通配符搜索之外的过程。但纯粹的内存解决方案肯定会更快。但是，这仍然无法为您提供通配符搜索的确切function。

在这种情况下，您需要的是反向索引。

如果您希望支付多少费用，则可以使用特定于数据库的全文搜索索引，并将索引调整为对每个单词子集进行索引。

或者，您可以使用非常成功的开源项目来实现相同的目标。

您需要使用tokenizer对字符串进行预索引，并构建反向索引文件。我们在Java中有类似的用例，我们必须在一大组数据中拥有非常快速的自动完成function。

您可以查看Lucene.NET ，这是Apache Lucene的端口（Java）。

如果您愿意放弃LINQ，可以使用NHibernate Search。（眨眼）。

另一个选择是在内存中实现预索引，预处理和绕过扫描不需要，看看Knuth-Morris-Pratt算法。

搜索C＃中名称列表的最快方法

如何突出listview中的项目？

枚举为字典键

在XSLT中最后一次出现字符后获取子字符串

如何使用HttpClient发布身份validation

NUnit的Is.Equal对于从generics类派生的类不能可靠地工作吗？

在MailSystem.Net中使用String的参数太多

Unity3D – 获取组件

C＃GUI入门教程

如何使用winforms中的自定义纸张尺寸进行打印

测验游戏应用程序