What is the best way to remove duplicates from a DataTable?

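I have checked the whole site and searched the web, but could not find a simple solution to this problem.

I have a DataTable with about 20 columns and 10K rows. I need to remove the duplicate rows in this DataTable based on 4 key columns. Doesn't .NET have a function that does this? The closest thing I found is datatable.DefaultView.ToTable(true, array of columns to display), but this function does a distinct on all the columns.

It would be great if someone could help me with this.

EDIT: Sorry for not being clear. This DataTable is created by reading a CSV file, not from a DB, so using an SQL query is not an option.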

You can use LINQ to DataSet. Check this out. Something like this:

    // Requires: using System; using System.Collections.Generic; using System.Data;
    //           using System.Globalization; using System.Linq;

    // Fill the DataSet (FillDataSet is not shown here).
    DataSet ds = new DataSet();
    ds.Locale = CultureInfo.InvariantCulture;
    FillDataSet(ds);

    List<DataRow> rows = new List<DataRow>();
    DataTable contact = ds.Tables["Contact"];

    // Get 100 rows from the Contact table.
    IEnumerable<DataRow> query = (from c in contact.AsEnumerable() select c).Take(100);
    DataTable contactsTableWith100Rows = query.CopyToDataTable();

    // Add 100 rows to the list.
    foreach (DataRow row in contactsTableWith100Rows.Rows)
        rows.Add(row);

    // Create duplicate rows by adding the same 100 rows to the list.
    foreach (DataRow row in contactsTableWith100Rows.Rows)
        rows.Add(row);

    DataTable table = System.Data.DataTableExtensions.CopyToDataTable<DataRow>(rows);

    // Find the unique contacts in the table.
    IEnumerable<DataRow> uniqueContacts = table.AsEnumerable().Distinct(DataRowComparer.Default);

    Console.WriteLine("Unique contacts:");
    foreach (DataRow uniqueContact in uniqueContacts)
    {
        Console.WriteLine(uniqueContact.Field<Int32>("ContactID"));
    }

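See also: How can I remove duplicate rows? (Adjust the query there to join on your 4 key columns.)

EDIT: With your new information, I believe the easiest way is to implement IEqualityComparer<T> and use Distinct on your data rows. Otherwise, if you are working with IEnumerable/IList instead of DataTable/DataRow, it is certainly possible with some LINQ-to-objects kung-fu.

EDIT: example IEqualityComparer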

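    // Assumes "ID" is an int column and "Name" is a string column.
    public class MyRowComparer : IEqualityComparer<DataRow>
    {
        public bool Equals(DataRow x, DataRow y)
        {
            return (x.Field<int>("ID") == y.Field<int>("ID")) &&
                string.Equals(x.Field<string>("Name"), y.Field<string>("Name"), StringComparison.OrdinalIgnoreCase) &&
                ... // extend this to include all your 4 keys...
        }

        public int GetHashCode(DataRow obj)
        {
            // The hash must ignore case too, so that rows Equals treats as equal get the same hash.
            return obj.Field<int>("ID").GetHashCode() ^
                StringComparer.OrdinalIgnoreCase.GetHashCode(obj.Field<string>("Name")); // etc. for the remaining keys
        }
    }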

You can use it like this:

    var uniqueRows = myTable.AsEnumerable().Distinct(new MyRowComparer());
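To make the comparer idea concrete, here is a minimal sketch that compares rows on a configurable list of key columns. The class name KeyColumnsComparer and the column names Key1 to Key4 are hypothetical, chosen only for illustration; it assumes the key values compare correctly with object.Equals (true for strings, numbers, and DBNull).

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical comparer: two rows are equal when every listed key column matches.
public sealed class KeyColumnsComparer : IEqualityComparer<DataRow>
{
    private readonly string[] _keys;

    public KeyColumnsComparer(params string[] keys)
    {
        _keys = keys;
    }

    public bool Equals(DataRow x, DataRow y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        // object.Equals compares boxed values (and DBNull) by value.
        return _keys.All(k => object.Equals(x[k], y[k]));
    }

    public int GetHashCode(DataRow row)
    {
        // Combine the hashes of the key values only, so the hash agrees with Equals.
        unchecked
        {
            int hash = 17;
            foreach (string k in _keys)
                hash = hash * 31 + row[k].GetHashCode();
            return hash;
        }
    }
}

public static class ComparerDemo
{
    public static DataTable BuildSample()
    {
        var table = new DataTable();
        foreach (string name in new[] { "Key1", "Key2", "Key3", "Key4", "Other" })
            table.Columns.Add(name, typeof(string));
        table.Rows.Add("a", "b", "c", "d", "first");
        table.Rows.Add("a", "b", "c", "d", "second"); // duplicate on the 4 keys
        table.Rows.Add("a", "b", "c", "x", "third");
        return table;
    }

    public static void Main()
    {
        DataTable unique = BuildSample().AsEnumerable()
            .Distinct(new KeyColumnsComparer("Key1", "Key2", "Key3", "Key4"))
            .CopyToDataTable();

        Console.WriteLine(unique.Rows.Count); // 2
    }
}
```

Note that CopyToDataTable throws an InvalidOperationException when the sequence is empty, so an empty source table needs a separate check.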

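If you have access to LINQ, I think you should be able to use the built-in group function on the in-memory collection and pick out the duplicate rows.

Search Google for LINQ group by examples.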
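The grouping approach might look like the sketch below. It groups rows by an anonymous type built from the key columns and keeps the first row of each group. The column names Key1 to Key4 and the helper names are assumptions for illustration, and the key columns are assumed to be strings.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class GroupByDemo
{
    // Groups rows on the four (hypothetical) key columns.
    private static IEnumerable<IGrouping<object, DataRow>> GroupOnKeys(DataTable table)
    {
        return table.AsEnumerable().GroupBy(r => (object)new
        {
            K1 = r.Field<string>("Key1"),
            K2 = r.Field<string>("Key2"),
            K3 = r.Field<string>("Key3"),
            K4 = r.Field<string>("Key4")
        });
    }

    // Keeps the first row of every key combination.
    public static DataTable KeepFirstPerKey(DataTable table)
    {
        return GroupOnKeys(table).Select(g => g.First()).CopyToDataTable();
    }

    // Counts the rows that would be dropped as duplicates.
    public static int CountDuplicates(DataTable table)
    {
        return GroupOnKeys(table).Sum(g => g.Count() - 1);
    }

    public static DataTable BuildSample()
    {
        var table = new DataTable();
        foreach (string name in new[] { "Key1", "Key2", "Key3", "Key4", "Other" })
            table.Columns.Add(name, typeof(string));
        table.Rows.Add("a", "b", "c", "d", "first");
        table.Rows.Add("a", "b", "c", "d", "second");
        table.Rows.Add("a", "b", "c", "x", "third");
        return table;
    }

    public static void Main()
    {
        DataTable table = BuildSample();
        Console.WriteLine(CountDuplicates(table));            // 1
        Console.WriteLine(KeepFirstPerKey(table).Rows.Count); // 2
    }
}
```

Anonymous types implement value-based Equals and GetHashCode, which is what lets them serve as a composite grouping key without a custom comparer.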

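Keep in mind that Table.AcceptChanges() must be called to complete a deletion. Otherwise the deleted rows are still present in the DataTable with their RowState set to Deleted, and Table.Rows.Count does not change after the delete.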
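A small sketch of that behaviour, using a made-up single-column table. The first AcceptChanges call matters for the demonstration: rows that are still in the Added state are removed immediately by Delete(), so the effect described above only shows up on rows that have already been accepted.

```csharp
using System;
using System.Data;

public static class AcceptChangesDemo
{
    public static DataTable BuildSample()
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        table.Rows.Add(1);
        table.Rows.Add(2);
        table.AcceptChanges(); // rows move from Added to Unchanged
        return table;
    }

    public static void Main()
    {
        DataTable table = BuildSample();

        table.Rows[0].Delete();
        Console.WriteLine(table.Rows.Count);        // 2: the row is only marked
        Console.WriteLine(table.Rows[0].RowState);  // Deleted

        table.AcceptChanges();
        Console.WriteLine(table.Rows.Count);        // 1: the row is now gone
    }
}
```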

I think this must be the best way to remove duplicates from a DataTable, using LINQ and MoreLINQ code:

LINQ

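    // The method returns a new table, so keep the result.
    DataTable result = RemoveDuplicatesRecords(yourDataTable);

    private DataTable RemoveDuplicatesRecords(DataTable dt)
    {
        // DataRowComparer.Default compares the values of every column,
        // so rows count as duplicates only when all columns match.
        var UniqueRows = dt.AsEnumerable().Distinct(DataRowComparer.Default);
        DataTable dt2 = UniqueRows.CopyToDataTable();
        return dt2;
    }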

Blog post: Remove duplicate row records from DataTable in ASP.NET C#

MoreLinq

    // DistinctBy column name "Id"
    var valueDistinctByIdColumn = yourTable.AsEnumerable().DistinctBy(row => new { Id = row["Id"] });
    DataTable dtDistinctByIdColumn = valueDistinctByIdColumn.CopyToDataTable();

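Note: MoreLINQ is a separate library that has to be added to the project.

In MoreLINQ you can use a function called DistinctBy, in which you specify the property by which you want to find distinct objects.

Blog post: Using the MoreLINQ DistinctBy method to remove duplicate records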

Liggett78's answer is much better, especially as I had an error in mine! Correction as follows...

    DELETE TableWithDuplicates
    FROM TableWithDuplicates
    LEFT OUTER JOIN
    (
        SELECT PK_ID = Min(PK_ID), --Decide your method for deciding which rows to keep
               KeyColumn1,
               KeyColumn2,
               KeyColumn3,
               KeyColumn4
        FROM TableWithDuplicates
        GROUP BY KeyColumn1, KeyColumn2, KeyColumn3, KeyColumn4
    ) AS RowsToKeep
        ON TableWithDuplicates.PK_ID = RowsToKeep.PK_ID
    WHERE RowsToKeep.PK_ID IS NULL

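Found this on bytes.com:

You can use the JET 4.0 OLE DB provider with the classes in the System.Data.OleDb namespace to access a comma-delimited text file (using a DataSet/DataTable).

Or you can use the Microsoft Text Driver for ODBC with the classes in the System.Data.Odbc namespace to access the file through the ODBC driver.

That would allow you to access your data via SQL queries, as others have proposed.

"This DataTable is created by reading a CSV file, not from a DB."

So put a unique constraint on the four columns in the database, and the duplicate inserts under your design won't go in. Unless the import decides to fail instead of continuing when this happens, but that is surely configurable in your CSV import script.

Use a query instead of functions: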

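    -- Note: as written, this deletes every row whose id occurs more than once,
    -- including the first occurrence. The exact DELETE ... JOIN syntax varies by DBMS.
    DELETE FROM table1 AS tb1
    INNER JOIN
    (
        SELECT id, COUNT(id) AS cntr
        FROM table1
        GROUP BY id
    ) AS tb2
        ON tb1.id = tb2.id
    WHERE tb2.cntr > 1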

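Here is a very simple piece of code that needs neither LINQ nor individual columns to do the filtering. If all the column values in a row are null, the row is removed.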

    public DataSet duplicateRemoval(DataSet dSet)
    {
        bool flag;
        int ccount = dSet.Tables[0].Columns.Count;
        string[] colst = new string[ccount];
        int p = 0;

        // Build an output table with the same column names, all typed as string.
        DataSet dsTemp = new DataSet();
        DataTable Tables = new DataTable();
        dsTemp.Tables.Add(Tables);
        for (int i = 0; i < ccount; i++)
        {
            dsTemp.Tables[0].Columns.Add(dSet.Tables[0].Columns[i].ColumnName, typeof(string));
        }

        foreach (System.Data.DataRow row in dSet.Tables[0].Rows)
        {
            flag = false;
            p = 0;
            foreach (System.Data.DataColumn col in dSet.Tables[0].Columns)
            {
                colst[p++] = row[col].ToString();
                if (!string.IsNullOrEmpty(row[col].ToString()))
                {
                    // Keep the row only if at least one column holds data.
                    flag = true;
                }
            }

            if (flag == true)
            {
                DataRow myRow = dsTemp.Tables[0].NewRow();
                for (int kk = 0; kk < ccount; kk++)
                {
                    myRow[kk] = colst[kk];
                }
                dsTemp.Tables[0].Rows.Add(myRow);
            }
        }
        return dsTemp;
    }

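This can even be used to remove empty data from an Excel sheet.

Try this.

Let us consider dtInput to be the DataTable with the duplicate records.

I have a new DataTable, dtFinal, in which I want the duplicate rows filtered out.

So my code will look something like the following.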

    DataTable dtFinal = dtInput.DefaultView.ToTable(true, new string[] {"Col1Name","Col2Name","Col3Name",...,"ColnName"});
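One thing worth checking with this approach is which columns come back. The sketch below, using a made-up three-column table, shows that ToTable(true, ...) applies the distinct to the listed columns and returns only those columns, so any non-key columns are dropped from the result.

```csharp
using System;
using System.Data;

public static class ToTableDemo
{
    public static DataTable BuildSample()
    {
        var dtInput = new DataTable();
        dtInput.Columns.Add("Key1", typeof(string));
        dtInput.Columns.Add("Key2", typeof(string));
        dtInput.Columns.Add("Other", typeof(string));
        dtInput.Rows.Add("a", "b", "first");
        dtInput.Rows.Add("a", "b", "second"); // same keys, different "Other"
        dtInput.Rows.Add("a", "c", "third");
        return dtInput;
    }

    public static void Main()
    {
        DataTable dtInput = BuildSample();

        // Distinct over Key1 and Key2 only; "Other" is not part of the result.
        DataTable dtFinal = dtInput.DefaultView.ToTable(true, "Key1", "Key2");

        Console.WriteLine(dtFinal.Rows.Count);    // 2
        Console.WriteLine(dtFinal.Columns.Count); // 2
    }
}
```

If the remaining columns are needed, listing them all brings back the behaviour described in the question (distinct over every listed column), which is why the comparer or grouping approaches fit the four-key case better.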

I wasn't keen on using the LINQ solution above, so I wrote this:

    /// <summary>
    /// Takes a datatable and a column index, and returns a datatable without duplicates
    /// </summary>
    /// <param name="dt">The datatable containing duplicate records</param>
    /// <param name="ComparisonFieldIndex">The column index containing duplicates</param>
    /// <returns>A datatable object without duplicated records</returns>
    public DataTable duplicateRemoval(DataTable dt, int ComparisonFieldIndex)
    {
        try
        {
            //Build the new datatable that will be returned
            DataTable dtReturn = new DataTable();
            for (int i = 0; i < dt.Columns.Count; i++)
            {
                dtReturn.Columns.Add(dt.Columns[i].ColumnName, typeof(string));
            }

            //Loop through each record in the datatable we have been passed
            foreach (DataRow dr in dt.Rows)
            {
                bool Found = false;

                //Loop through each record already present in the datatable being returned
                foreach (DataRow dr2 in dtReturn.Rows)
                {
                    bool Identical = true;

                    //Compare the column specified to see if it matches an existing record
                    if (!(dr2[ComparisonFieldIndex].ToString() == dr[ComparisonFieldIndex].ToString()))
                    {
                        Identical = false;
                    }

                    //If the record found identically matches one we already have, don't add it again
                    if (Identical)
                    {
                        Found = true;
                        break;
                    }
                }

                //If we didn't find a matching record, we'll add this one
                if (!Found)
                {
                    DataRow drAdd = dtReturn.NewRow();
                    for (int i = 0; i < dtReturn.Columns.Count; i++)
                    {
                        drAdd[i] = dr[i];
                    }
                    dtReturn.Rows.Add(drAdd);
                }
            }
            return dtReturn;
        }
        catch (Exception)
        {
            //Return the original datatable if something failed above
            return dt;
        }
    }

Additionally, this one works on all columns rather than a specific column index:

    /// <summary>
    /// Takes a datatable and returns a datatable without duplicates
    /// </summary>
    /// <param name="dt">The datatable containing duplicate records</param>
    /// <returns>A datatable object without duplicated records</returns>
    public DataTable duplicateRemoval(DataTable dt)
    {
        try
        {
            //Build the new datatable that will be returned
            DataTable dtReturn = new DataTable();
            for (int i = 0; i < dt.Columns.Count; i++)
            {
                dtReturn.Columns.Add(dt.Columns[i].ColumnName, typeof(string));
            }

            //Loop through each record in the datatable we have been passed
            foreach (DataRow dr in dt.Rows)
            {
                bool Found = false;

                //Loop through each record already present in the datatable being returned
                foreach (DataRow dr2 in dtReturn.Rows)
                {
                    bool Identical = true;

                    //Compare all columns to see if they match the existing record
                    for (int i = 0; i < dt.Columns.Count; i++)
                    {
                        if (!(dr2[i].ToString() == dr[i].ToString()))
                        {
                            Identical = false;
                            break;
                        }
                    }

                    //If the record found identically matches one we already have, don't add it again
                    if (Identical)
                    {
                        Found = true;
                        break;
                    }
                }

                //If we didn't find a matching record, we'll add this one
                if (!Found)
                {
                    DataRow drAdd = dtReturn.NewRow();
                    for (int i = 0; i < dtReturn.Columns.Count; i++)
                    {
                        drAdd[i] = dr[i];
                    }
                    dtReturn.Rows.Add(drAdd);
                }
            }
            return dtReturn;
        }
        catch (Exception)
        {
            //Return the original datatable if something failed above
            return dt;
        }
    }
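The nested loops above compare each row against every row already kept, which grows quadratically with the row count. As a variation on the same no-LINQ idea (not the code from this answer), the sketch below tracks the key combinations already seen in a HashSet. The helper name, the column names, and the choice of U+001F as a separator are assumptions; the separator is assumed never to occur in the data.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class DataTableDedup
{
    // Hypothetical helper: keeps the first row seen for each combination of key values.
    public static DataTable RemoveDuplicates(DataTable dt, params string[] keyColumns)
    {
        DataTable result = dt.Clone(); // same schema (names and types), no rows
        var seen = new HashSet<string>();

        foreach (DataRow row in dt.Rows)
        {
            // Build a composite string key from the key column values.
            var parts = new string[keyColumns.Length];
            for (int i = 0; i < keyColumns.Length; i++)
                parts[i] = Convert.ToString(row[keyColumns[i]]);
            string key = string.Join("\u001F", parts);

            // HashSet.Add returns false when the key is already present.
            if (seen.Add(key))
                result.ImportRow(row);
        }
        return result;
    }

    public static DataTable BuildSample()
    {
        var table = new DataTable();
        table.Columns.Add("Key1", typeof(string));
        table.Columns.Add("Key2", typeof(int));
        table.Columns.Add("Other", typeof(string));
        table.Rows.Add("a", 1, "first");
        table.Rows.Add("a", 1, "second"); // duplicate on Key1 + Key2
        table.Rows.Add("a", 2, "third");
        return table;
    }

    public static void Main()
    {
        DataTable unique = RemoveDuplicates(BuildSample(), "Key1", "Key2");
        Console.WriteLine(unique.Rows.Count); // 2
    }
}
```

Unlike the versions above, Clone() keeps the original column types instead of converting everything to string, and all columns of the surviving rows are preserved.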
