Remove all elements between two elements

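I have about 2,500 HTML files written to varying standards, and I need to remove their footer sections. The HTML below is the footer from one of my files; I need to remove the two hr elements and everything between them.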

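So far I have only tried using XPath (with HTML Agility Pack) to locate the hr elements with DocumentNode.SelectNodes("//hr"); and then iterating over them with foreach. But I am still too clumsy with XPath and do not know how to select the nodes and their siblings (?) in order to remove them.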

This is what I have so far, with the help of the community. 🙂

    // Requires: using System; using System.Collections.Generic; using System.IO; using HtmlAgilityPack;
    private static void RemoveHR(IEnumerable<string> files)
    {
        var errors = new List<string>();
        int nrOfHr = 0;

        foreach (var file in files)
        {
            try
            {
                // Load each file into a fresh document.
                var document = new HtmlDocument();
                document.Load(file);

                // SelectNodes returns null when nothing matches.
                var hrs = document.DocumentNode.SelectNodes("//hr");
                if (hrs == null)
                {
                    errors.Add(file + "|no hr element found");
                    continue;
                }

                // This removes only the hr elements themselves, not what lies between them.
                foreach (var hr in hrs)
                {
                    hr.Remove();
                    nrOfHr++;
                }
                document.Save(file);
            }
            catch (Exception ex)
            {
                errors.Add(file + "|" + ex.Message);
            }
        }

        using (StreamWriter logger = File.CreateText(@"D:\websites\dev.openjournal.tld\public\arkivet\ErrorLogs\hr_error_log.txt"))
        {
            foreach (var error in errors)
            {
                logger.WriteLine(error);
            }
        }

        Console.WriteLine("Number of hr elements removed: {0}", nrOfHr);
        Console.WriteLine("Number of files that failed or had no hr element: {0}", errors.Count);
    }

HTML code:

 
//start element

How to cite this paper:

Ekmekçioglu, F. Çuna, Lynch, Michael F. & Willett, Peter (1996)  "Stemming and N-gram matching for term conflation in Turkish texts" Information Research, 1(1) Available at: http://informationr.net/ir/2-2/paper13.html

© the authors, 1996.


Check for citations, using Google Scholar

Contents


Web Counter
Counting only since 13 December 2002

Home


//end element

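Edit: I have experimented with preceding-sibling and following-sibling to target the nodes. Unfortunately, the resulting list does not include the target nodes themselves.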

 var footerTags = document.DocumentNode.SelectNodes("//*[preceding-sibling::p[contains(text(),'How to cite this')] and following-sibling::hr[@color = '#ff00ff']]"); 

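It finds the paragraph containing the text "How to cite this", then selects every node between that paragraph and the hr whose color is "#ff00ff". However, the two boundary nodes are not part of the selected list, and they need to be removed together with the nodes between them.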

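Assuming the start and end nodes really are identical (same tag name, attributes, and attribute values), as you mentioned in the comments above, it is not too hard:

  1. Select the start node.
  2. Iterate over its following siblings and remove each one, up to and including the end node.
  3. Remove the start node.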

Sample HTML:

 var html = @"   
DO NOT DELETE

//start element

How to cite this paper:

Ekmekçioglu, F. Çuna, Lynch, Michael F. & Willett, Peter (1996)  "Stemming and N-gram matching for term conflation in Turkish texts" Information Research, 1(1) Available at: http://informationr.net/ir/2-2/paper13.html

© the authors, 1996.


Check for citations, using Google Scholar

Contents


Web Counter
Counting only since 13 December 2002

Home


//end element
DO NOT DELETE
";

Parse it:

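    // Requires: using System.Text.RegularExpressions; using HtmlAgilityPack;
    var document = new HtmlDocument();
    document.LoadHtml(html);

    // SelectSingleNode returns null if no such hr exists; check for that on real files.
    var startNode = document.DocumentNode.SelectSingleNode("//hr[@size='3'][@color='#ff00ff']");

    // account for mismatched quotes in HTML source
    var quotesRegex = new Regex("[\"']");
    var startNodeNoQuotes = quotesRegex.Replace(startNode.OuterHtml, "");

    // Removing a sibling makes the next one become startNode.NextSibling,
    // so the loop walks forward until it has removed the end node.
    HtmlNode siblingNode;
    while ((siblingNode = startNode.NextSibling) != null)
    {
        siblingNode.Remove();
        if (quotesRegex.Replace(siblingNode.OuterHtml, "") == startNodeNoQuotes)
        {
            break; // end node
        }
    }
    startNode.Remove();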

Resulting output:

    
DO NOT DELETE
//end element
DO NOT DELETE
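Since the question involves thousands of files of uneven quality, the same steps could be wrapped in a helper that tolerates files without a footer. The following is only a sketch: the FooterRemover class and the RemoveFooter and IsMarker names are made up for illustration, and it assumes the marker is always an hr with size 3 and color #ff00ff. It compares tag name and attribute values instead of OuterHtml, which is a swapped-in variation on the quote-stripping comparison above.

```csharp
using System;
using HtmlAgilityPack;

public static class FooterRemover
{
    // Hypothetical helper: removes the first marker hr, every following sibling
    // up to and including the next marker hr, and reports whether anything was removed.
    public static bool RemoveFooter(HtmlDocument document)
    {
        var start = document.DocumentNode.SelectSingleNode("//hr[@size='3'][@color='#ff00ff']");
        if (start == null)
        {
            return false; // no footer marker in this file
        }

        HtmlNode sibling;
        while ((sibling = start.NextSibling) != null)
        {
            bool isEnd = IsMarker(sibling);
            sibling.Remove();
            if (isEnd)
            {
                break; // the end marker has just been removed
            }
        }
        start.Remove();
        return true;
    }

    // Assumption: a marker is any hr element with size=3 and color=#ff00ff.
    private static bool IsMarker(HtmlNode node)
    {
        return node.NodeType == HtmlNodeType.Element
            && node.Name == "hr"
            && node.GetAttributeValue("size", "") == "3"
            && string.Equals(node.GetAttributeValue("color", ""), "#ff00ff",
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

Note that if a file contains a start marker but no matching end marker, this removes everything after the start marker within the same parent, just like the loop above, so it may be worth logging such files and checking them by hand.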

I think this is what you are looking for:

    string content = System.IO.File.ReadAllText(@"D:\New Text Document.txt");
    string html = Regex.Replace(content, "", "", RegexOptions.Singleline);

Result:

 //start element 

How to cite this paper:

Ekmekçioglu, F. Çuna, Lynch, Michael F. & Willett, Peter (1996)  "Stemming and N-gram matching for term conflation in Turkish texts" Information Research, 1(1) Available at: http://informationr.net/ir/2-2/paper13.html

© the authors, 1996.

Check for citations, using Google Scholar

Contents


Web Counter
Counting only since 13 December 2002

Home

//end element
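For the regex approach, a rough sketch of how the call might be filled in is shown here. The pattern is an assumption for illustration, not the one from the answer above: it matches the first hr tag, the shortest possible run of anything, and the next hr tag. RegexFooterRemover and RemoveFooter are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexFooterRemover
{
    // Assumed pattern: an <hr ...> tag, then as little as possible, then the next <hr ...> tag.
    // Singleline lets '.' match newlines; IgnoreCase also accepts <HR>.
    private static readonly Regex Footer = new Regex(
        @"<hr\b[^>]*>.*?<hr\b[^>]*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string RemoveFooter(string content)
    {
        // Replace only the first match so that later hr pairs are left alone.
        return Footer.Replace(content, "", 1);
    }
}
```

This treats the file as plain text, so it will pick the wrong span if an unrelated hr appears before the footer; tightening the pattern to the footer's specific attributes, or using the parser-based approach, would be safer for files of mixed quality.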