Fuzzy matching multiple words in a string

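I'm trying to use Levenshtein distance to find a fuzzy keyword (static text) on an OCR'd page.
To do this, I want to allow a certain percentage of errors (say, 15%).

    string Keyword = "past due electric service";

Since the keyword is 25 characters long, I want to allow 4 errors (25 * .15, rounded up).
I need to be able to compare it to...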

    string Entire_OCR_Page = "previous bill amount payment received on 12/26/13 thank you! current electric service total balances unpaid 7 days after the total due date are subject to a late charge of 7.5% of the amount due or $2.00, whichever/5 greater. ";

This is how I'm doing it now:

    int LevenshteinDistance = LevenshteinAlgorithm(Keyword, Entire_OCR_Page); // = 202
    int NumberOfErrorsAllowed = 4;
    int Allowance = (Entire_OCR_Page.Length - Keyword.Length) + NumberOfErrorsAllowed; // = 205

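Clearly, Keyword is not found in the OCR text (and it shouldn't be). However, using Levenshtein distance, the number of errors (202) is within the allowance (205), so my logic says it was found.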
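To see why this check passes so easily: the Levenshtein distance between two strings can never be smaller than the difference in their lengths, and it equals that difference whenever the keyword's characters merely appear in order somewhere in the page. The sketch below reproduces the check from the question under that reading. The class and method names (`WholePageCheck`, `Levenshtein`, `NaiveFound`) and the sample strings are made up for illustration.

```csharp
using System;

public static class WholePageCheck
{
    // Standard Levenshtein distance, keeping only two rows of the table.
    public static int Levenshtein(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // newest row is now in prev
        }
        return prev[t.Length];
    }

    // The check from the question: distance <= (length difference + allowed errors).
    public static bool NaiveFound(string keyword, string page, int errorsAllowed)
    {
        int allowance = (page.Length - keyword.Length) + errorsAllowed;
        return Levenshtein(keyword, page) <= allowance;
    }
}
```

For example, `WholePageCheck.NaiveFound("past due", "payments are still due", 0)` returns true even with zero errors allowed: the letters of "past due" occur in order inside the longer string, so the distance is exactly the length difference (14), although the phrase itself is not there.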

Does anyone know of a better way to do this?

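I answered my own question by using substrings. Posting in case anyone else runs into the same type of problem. It's a bit unorthodox, but it works great for me.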

    decimal PossibleStringLength = PossibleString.Length; // length of the string to search
    decimal StaticTextLength = StaticText.Length;         // length of the text to search for
    int TextLengthBuffer = (int)StaticTextLength - 1;     // start with one character fewer than the text should have
    int LowestLevenshteinNumber = 999999;                 // initialize to an insanely high maximum

    // Number of errors allowed for the given ErrorAllowance percentage.
    // (100m keeps the division in decimal even if ErrorAllowance is an int.)
    decimal NumberOfErrorsAllowed = Math.Round(StaticTextLength * (ErrorAllowance / 100m), MidpointRounding.AwayFromZero);

    // Look for the best match with one character fewer than it should have, then with the
    // correct number of characters, and last with one character more. (This is because one
    // letter can be recognized as two (W -> VV) and vice versa.)
    for (int i = 0; i < 3; i++)
    {
        for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
        {
            string possibleResult = PossibleString.Substring(e - TextLengthBuffer, TextLengthBuffer);
            int lAllowance = (int)Math.Round((possibleResult.Length - StaticTextLength) + NumberOfErrorsAllowed, MidpointRounding.AwayFromZero);
            int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);

            // Keep the candidate if it is within the allowance and better than the best so far
            // (ties are accepted only when the window has exactly the expected length).
            if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
            {
                PossibleResult = new StaticTextResult { text = possibleResult, errors = lNumber };
                LowestLevenshteinNumber = lNumber;
            }
        }
        TextLengthBuffer++;
    }

    // Levenshtein algorithm
    public static int LevenshteinAlgorithm(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        if (n == 0) { return m; }
        if (m == 0) { return n; }

        for (int i = 0; i <= n; i++) { d[i, 0] = i; }
        for (int j = 0; j <= m; j++) { d[0, j] = j; }

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        return d[n, m];
    }
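For comparison, here is a different technique from the sliding windows above, not what this answer does: approximate substring matching (often attributed to Sellers). It is the same dynamic programming table as Levenshtein, except that a match is allowed to start at any position of the text for free, and the result is the minimum over all end positions. That gives the smallest distance between the keyword and any substring in a single pass, without trying several window lengths. This is only a minimal sketch; `ApproximateSearch` and `BestSubstringDistance` are hypothetical names.

```csharp
using System;

public static class ApproximateSearch
{
    // Smallest edit distance between `pattern` and any substring of `text`.
    public static int BestSubstringDistance(string pattern, string text)
    {
        int n = pattern.Length;
        int[] prev = new int[n + 1]; // column for the previous text position
        int[] curr = new int[n + 1]; // column for the current text position

        // Before any text is read, matching i pattern characters costs i deletions.
        for (int i = 0; i <= n; i++) prev[i] = i;
        int best = prev[n];

        for (int j = 1; j <= text.Length; j++)
        {
            curr[0] = 0; // a match may begin at any text position at no cost
            for (int i = 1; i <= n; i++)
            {
                int cost = pattern[i - 1] == text[j - 1] ? 0 : 1;
                curr[i] = Math.Min(Math.Min(prev[i] + 1, curr[i - 1] + 1),
                                   prev[i - 1] + cost);
            }
            best = Math.Min(best, curr[n]); // best match ending at position j
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return best;
    }
}
```

The keyword would then count as found when `BestSubstringDistance(StaticText, PossibleString)` is at most the number of errors allowed. For instance, searching for "past due electric service" in "your past due e1ectric servlce balance" gives a distance of 2 (two misread characters).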

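I think it's not working because a good chunk of your string is a match. So what I would do is try splitting your keyword into separate words.

Then find all the places in OCR_TEXT where those words match.

Then look at all the places where they matched and see whether any 4 of those places are consecutive and match the original phrase.

I'm not sure if my explanation is clear?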
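The steps above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: it splits both strings on spaces, and for each run of consecutive page words it checks every keyword word against the page word in the same position, with a per-word error allowance. The names `WordSequenceMatcher` and `FindPhrase`, and the choice to round each word's allowance up, are assumptions. Rounding per word is looser than 15% of the whole phrase, and the sketch assumes OCR has not merged or split words.

```csharp
using System;
using System.Collections.Generic;

public static class WordSequenceMatcher
{
    // Standard Levenshtein distance, keeping only two rows of the table.
    static int Levenshtein(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.Length];
    }

    // Returns the word index in `page` of every place where all keyword words
    // match consecutive page words within the per-word error allowance.
    public static List<int> FindPhrase(string keyword, string page, double errorRate)
    {
        char[] sep = { ' ' };
        string[] kw = keyword.Split(sep, StringSplitOptions.RemoveEmptyEntries);
        string[] words = page.Split(sep, StringSplitOptions.RemoveEmptyEntries);
        var hits = new List<int>();
        if (kw.Length == 0) return hits;

        for (int start = 0; start + kw.Length <= words.Length; start++)
        {
            bool allMatch = true;
            for (int k = 0; k < kw.Length && allMatch; k++)
            {
                // Assumption: each word gets its own allowance, rounded up.
                int allowed = (int)Math.Ceiling(kw[k].Length * errorRate);
                allMatch = Levenshtein(kw[k], words[start + k]) <= allowed;
            }
            if (allMatch) hits.Add(start);
        }
        return hits;
    }
}
```

With a 15% rate, "your past due e1ectric servlce balance" yields a single hit at word index 1, while "thank you! current electric service total balances unpaid" yields none, because no two-word run before "electric service" resembles "past due".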
