HtmlAgilityPack - remove script and style?

I'm using the following method to extract text from HTML:

    public string getAllText(string _html)
    {
        string _allText = "";
        try
        {
            HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
            document.LoadHtml(_html);

            var root = document.DocumentNode;
            var sb = new StringBuilder();
            foreach (var node in root.DescendantNodesAndSelf())
            {
                if (!node.HasChildNodes)
                {
                    string text = node.InnerText;
                    if (!string.IsNullOrEmpty(text))
                        sb.AppendLine(text.Trim());
                }
            }
            _allText = sb.ToString();
        }
        catch (Exception)
        {
        }

        _allText = System.Web.HttpUtility.HtmlDecode(_allText);
        return _allText;
    }

The problem is that I also get the contents of the script and style tags.

How can I exclude them?

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    doc.DocumentNode.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style")
        .ToList()
        .ForEach(n => n.Remove());

You can do this using the HtmlDocument class:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(input);

    // SelectNodes returns null when nothing matches, hence the ?. guard.
    doc.DocumentNode.SelectNodes("//style|//script")?.ToList().ForEach(n => n.Remove());

Some excellent answers; System.Linq is handy!

For a non-LINQ approach:

    private HtmlAgilityPack.HtmlDocument RemoveScripts(HtmlAgilityPack.HtmlDocument webDocument)
    {
        // Get all Nodes: script
        HtmlAgilityPack.HtmlNodeCollection Nodes = webDocument.DocumentNode.SelectNodes("//script");

        // Make sure not Null:
        if (Nodes == null)
            return webDocument;

        // Remove all Nodes:
        foreach (HtmlNode node in Nodes)
            node.Remove();

        return webDocument;
    }
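
To tie the answers back to the original question, here is a rough sketch that removes the unwanted nodes first and then collects the remaining text. The class name `HtmlTextExtractor` and the method name `GetVisibleText` are made up for illustration and are not part of HtmlAgilityPack. Dropping comment nodes and decoding entities with `HtmlEntity.DeEntitize` are my own additions, so treat this as one possible approach rather than the definitive one.

```csharp
using System;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class HtmlTextExtractor
{
    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // Copy the matches into a list first, so removing nodes
        // does not disturb the enumeration of the tree.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script"
                     || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment) // assumption: comments are unwanted too
            .ToList();

        foreach (var node in unwanted)
            node.Remove();

        // Collect what is left, one trimmed text node per line.
        var sb = new StringBuilder();
        foreach (var node in doc.DocumentNode.Descendants()
                                .Where(n => n.NodeType == HtmlNodeType.Text))
        {
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
                sb.AppendLine(text);
        }

        return sb.ToString();
    }
}
```

Calling `HtmlTextExtractor.GetVisibleText(html)` in place of `getAllText(html)` should then leave out the script and style contents. Because the entities are decoded per text node, the separate `System.Web.HttpUtility.HtmlDecode` call from the question should no longer be needed in this variant.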