Parsing dl with HtmlAgilityPack

This is the sample HTML that I am trying to parse using Html Agility Pack in ASP.Net (C#).

1
First Entry
2
Second Entry
3
Third Entry

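The values I want are:

  • Hyperlink -> https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/1.html
  • Anchor text -> 1
  • Inner text of dd -> First Entry

(I have used the first entry as the example here, but I want the values of these elements for every entry in the list.)

This is the code I am currently using: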

    var webGet = new HtmlWeb();
    var document = webGet.Load(url2);

    var parsedValues =
        from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
        from content in info.SelectNodes("dl//dd")
        from link in info.SelectNodes("dl//dt/b/a")
            .Where(x => x.Attributes.Contains("href"))
        select new
        {
            Text = content.InnerText,
            Url = link.Attributes["href"].Value,
            AnchorText = link.InnerText,
        };

    GridView1.DataSource = parsedValues;
    GridView1.DataBind();

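The problem is that I get the link and anchor text values correctly, but for the inner text it takes the first entry's value and repeats it once for every link, then does the same with the second entry's value, and so on. My explanation may not be very clear, so here is the sample output I get with this code: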

    First Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/1.html 1
    First Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/2.html 2
    First Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/3.html 3
    Second Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/1.html 1
    Second Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/2.html 2
    Second Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/3.html 3
    Third Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/1.html 1
    Third Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/2.html 2
    Third Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/3.html 3

whereas I want to get

    First Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/1.html 1
    Second Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/2.html 2
    Third Entry https://stackoverflow.com/questions/8942595/parsing-dl-with-htmlagilitypack/3.html 3

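I am very new to HAP and know very little about XPath, so I am sure I am doing something wrong here, but even after spending several hours on it I still cannot get it to work. Any help would be greatly appreciated.

Solution 1

I have defined a function that, given a dt node, returns the next dd node after it: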

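    private static HtmlNode GetNextDDSibling(HtmlNode dtElement)
    {
        // Start at the node right after the <dt> and walk forward through its siblings.
        var currentNode = dtElement?.NextSibling;
        while (currentNode != null)
        {
            // Skip text and comment nodes; stop at the first <dd> element.
            if (currentNode.NodeType == HtmlNodeType.Element && currentNode.Name == "dd")
                return currentNode;
            currentNode = currentNode.NextSibling;
        }
        return null; // no <dd> follows this <dt>
    }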

Now the LINQ code can be rewritten as:

    var parsedValues =
        from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
        from dtElement in info.SelectNodes("dl/dt")
        let link = dtElement.SelectSingleNode("b/a[@href]")
        let ddElement = GetNextDDSibling(dtElement)
        where link != null && ddElement != null
        select new
        {
            Text = ddElement.InnerHtml,
            Url = link.GetAttributeValue("href", ""),
            AnchorText = link.InnerText
        };
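As a side note on why the query in the question repeated rows: two consecutive `from` clauses over independent collections combine every element of the first with every element of the second, which matches the 3 x 3 = 9 rows shown in the question. The sketch below reproduces that with plain arrays instead of HAP nodes; `CrossJoinDemo`, `Texts`, and `Anchors` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CrossJoinDemo
{
    // Hypothetical stand-ins for the dd texts and the anchor texts from the question.
    private static readonly string[] Texts = { "First Entry", "Second Entry", "Third Entry" };
    private static readonly string[] Anchors = { "1", "2", "3" };

    // Two independent `from` clauses: every text is combined with every anchor.
    public static List<string> CrossJoin() =>
        (from text in Texts
         from anchor in Anchors
         select text + " / " + anchor).ToList();

    // A single `from` clause, with the related value derived from the current
    // element via `let`: one row per entry.
    public static List<string> Paired() =>
        (from i in Enumerable.Range(0, Anchors.Length)
         let anchor = Anchors[i]
         let text = Texts[i]
         select text + " / " + anchor).ToList();
}
```

`CrossJoinDemo.CrossJoin()` yields nine rows, while `CrossJoinDemo.Paired()` yields three. The rewritten query above follows the second shape: it iterates the dt elements once and derives the link and the dd from each dt.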

Solution 2

Without an additional function:

    var infoNode = document.DocumentNode.SelectSingleNode("//div[@class='content-div']");
    var dts = infoNode.SelectNodes("dl/dt");
    var dds = infoNode.SelectNodes("dl/dd");

    var parsedValues = dts.Zip(dds, (dt, dd) => new
    {
        Text = dd.InnerHtml,
        Url = dt.SelectSingleNode("b/a[@href]").GetAttributeValue("href", ""),
        AnchorText = dt.SelectSingleNode("b/a[@href]").InnerText
    });
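One thing to keep in mind with this approach: `Enumerable.Zip` pairs elements purely by position and stops at the shorter sequence, so it relies on every dt having exactly one dd. The sketch below shows what happens otherwise, using plain strings rather than HAP nodes; `ZipDemo` and `PairByPosition` are hypothetical names for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZipDemo
{
    // Pairs terms with descriptions by position, the same way Enumerable.Zip
    // pairs the dt and dd collections.
    public static List<string> PairByPosition(
        IEnumerable<string> terms, IEnumerable<string> descriptions) =>
        terms.Zip(descriptions, (t, d) => t + " -> " + d).ToList();
}
```

For example, `ZipDemo.PairByPosition(new[] { "1", "2", "3" }, new[] { "First Entry", "Third Entry" })` pairs "2" with "Third Entry" and drops "3" entirely. If the markup might be irregular, the sibling-based lookup from Solution 1 is probably the safer choice.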

For example, here is how you can parse some elements using Html Agility Pack:

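    public string ParseHtml()
    {
        string output = null;
        HtmlDocument htmldocument = new HtmlDocument();
        htmldocument.LoadHtml(YourHTML); // YourHTML: placeholder for the HTML string to parse
        HtmlNode node = htmldocument.DocumentNode;

        HtmlNodeCollection dds = node.SelectNodes("//dd");             // select all dd tags
        HtmlNodeCollection anchors = node.SelectNodes("//b/a[@href]"); // select all 'a' tags that contain an href attribute

        // SelectNodes returns null when nothing matches, so guard before indexing.
        if (dds == null || anchors == null)
            return output;

        // Pair by index; stop at the shorter collection to avoid going out of range.
        for (int i = 0; i < Math.Min(dds.Count, anchors.Count); i++)
        {
            string attributeValue = null; // default returned if href is missing
            string text = dds[i].InnerText;
            string url = anchors[i].GetAttributeValue("href", attributeValue);
            string anchorText = anchors[i].InnerText;
            // Your code...
        }
        return output;
    }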