Damerau – Levenshtein距离,增加一个门槛

我有以下实现,但我想添加一个阈值,所以如果结果将大于它,只需停止计算并返回。

我该怎么办呢?

编辑:这是我目前的代码, threshold尚未使用…目标是它被使用

  public static int DamerauLevenshteinDistance(string string1, string string2, int threshold) { // Return trivial case - where they are equal if (string1.Equals(string2)) return 0; // Return trivial case - where one is empty if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2)) return (string1 ?? "").Length + (string2 ?? "").Length; // Ensure string2 (inner cycle) is longer if (string1.Length > string2.Length) { var tmp = string1; string1 = string2; string2 = tmp; } // Return trivial case - where string1 is contained within string2 if (string2.Contains(string1)) return string2.Length - string1.Length; var length1 = string1.Length; var length2 = string2.Length; var d = new int[length1 + 1, length2 + 1]; for (var i = 0; i <= d.GetUpperBound(0); i++) d[i, 0] = i; for (var i = 0; i <= d.GetUpperBound(1); i++) d[0, i] = i; for (var i = 1; i <= d.GetUpperBound(0); i++) { for (var j = 1; j  1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1]) d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + cost); } } return d[d.GetUpperBound(0), d.GetUpperBound(1)]; } } 

这是关于你的答案: Damerau – Levenshtein距离,增加一个门槛 (抱歉无法评论,因为我还没有50个代表)

我想你在这里犯了一个错误。 你初始化了:

 var minDistance = threshold; 

你的更新规则是:

 if (d[i, j] < minDistance) minDistance = d[i, j]; 

此外,您提前退出的标准是:

 if (minDistance > threshold) return int.MaxValue; 

现在,观察上面的if条件永远不会成立! 您应该将minDistance初始化为int.MaxValue

这是我能想到的最优雅的方式。 设置d的每个索引后,查看它是否超过您的阈值。 评估是恒定时间,因此与整体算法的理论N ^ 2复杂度相比,它是一个下降:

 public static int DamerauLevenshteinDistance(string string1, string string2, int threshold) { ... for (var i = 1; i <= d.GetUpperBound(0); i++) { for (var j = 1; j <= d.GetUpperBound(1); j++) { ... var temp = d[i,j] = Math.Min(del, Math.Min(ins, sub)); if (i > 1 && j > 1 && string1[i - 1] == string2[j - 2] && string1[i - 2] == string2[j - 1]) temp = d[i,j] = Math.Min(temp, d[i - 2, j - 2] + cost); //Does this value exceed your threshold? if so, get out now if(temp > threshold) return temp; } } return d[d.GetUpperBound(0), d.GetUpperBound(1)]; } 

您还将此问题称为SQL CLR UDF问题,因此我将在该特定上下文中回答:您最好的选择不会来自优化Levenshtein距离,而是来自减少您比较的对数。 是的,一个更快的Levenshtein算法可以改善一些事情,但几乎不会减少从N平方(数百万行中的N)到N *某个因子的比较次数。 我的建议是只比较长度差异在可容忍的delta内的元素。 在大表上,在LEN(Data)上添加一个持久计算列,然后使用include Data在其上创建索引:

 ALTER TABLE Table ADD LenData AS LEN(Data) PERSISTED; CREATE INDEX ndxTableLenData on Table(LenData) INCLUDE (Data); 

现在, 如果您的数据的LEN(Data)变化很大 ,您可以通过加入长度上的最大差异来限制纯粹的问题空间(例如说5):

 SELECT a.Data, b.Data, dbo.Levenshtein(a.Data, b.Data) FROM Table A JOIN Table B ON B.DataLen BETWEEN A.DataLen - 5 AND A.DataLen+5 

终于明白了……虽然它没有我希望的那样有益

  public static int DamerauLevenshteinDistance(string string1, string string2, int threshold) { // Return trivial case - where they are equal if (string1.Equals(string2)) return 0; // Return trivial case - where one is empty if (String.IsNullOrEmpty(string1) || String.IsNullOrEmpty(string2)) return (string1 ?? "").Length + (string2 ?? "").Length; // Ensure string2 (inner cycle) is longer if (string1.Length > string2.Length) { var tmp = string1; string1 = string2; string2 = tmp; } // Return trivial case - where string1 is contained within string2 if (string2.Contains(string1)) return string2.Length - string1.Length; var length1 = string1.Length; var length2 = string2.Length; var d = new int[length1 + 1, length2 + 1]; for (var i = 0; i <= d.GetUpperBound(0); i++) d[i, 0] = i; for (var i = 0; i <= d.GetUpperBound(1); i++) d[0, i] = i; for (var i = 1; i <= d.GetUpperBound(0); i++) { var im1 = i - 1; var im2 = i - 2; var minDistance = threshold; for (var j = 1; j <= d.GetUpperBound(1); j++) { var jm1 = j - 1; var jm2 = j - 2; var cost = string1[im1] == string2[jm1] ? 0 : 1; var del = d[im1, j] + 1; var ins = d[i, jm1] + 1; var sub = d[im1, jm1] + cost; //Math.Min is slower than native code //d[i, j] = Math.Min(del, Math.Min(ins, sub)); d[i, j] = del <= ins && del <= sub ? del : ins <= sub ? ins : sub; if (i > 1 && j > 1 && string1[im1] == string2[jm2] && string1[im2] == string2[jm1]) d[i, j] = Math.Min(d[i, j], d[im2, jm2] + cost); if (d[i, j] < minDistance) minDistance = d[i, j]; } if (minDistance > threshold) return int.MaxValue; } return d[d.GetUpperBound(0), d.GetUpperBound(1)] > threshold ? int.MaxValue : d[d.GetUpperBound(0), d.GetUpperBound(1)]; }