How to scrape only the <body> tag from a website

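I'm working on a web crawler. At the moment I scrape the whole content and then use regular expressions to remove tags such as <script> and <style>, among others, and keep the content of the body.

However, I'm trying to optimise performance, and I was wondering whether there is a way to scrape only the <body> of the page?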

    using System.IO;
    using System.Net;
    using System.Text.RegularExpressions;

    namespace WebScrapper
    {
        public static class KrioScraper
        {
            public static string scrapeIt(string siteToScrape)
            {
                string HTML = getHTML(siteToScrape);
                string text = stripCode(HTML);
                return text;
            }

            public static string getHTML(string siteToScrape)
            {
                string response = "";
                HttpWebRequest objRequest = (HttpWebRequest) WebRequest.Create(siteToScrape);
                objRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; " +
                    "Windows NT 5.1; .NET CLR 1.0.3705)";
                // Dispose the response and the reader once the whole page has been read
                using (HttpWebResponse objResponse = (HttpWebResponse) objRequest.GetResponse())
                using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
                {
                    response = sr.ReadToEnd();
                }
                return response;
            }

            public static string stripCode(string the_html)
            {
                // Remove google analytics code and other JS
                the_html = Regex.Replace(the_html, "<script.*?</script>", "",
                    RegexOptions.Singleline | RegexOptions.IgnoreCase);
                // Remove inline stylesheets
                the_html = Regex.Replace(the_html, "<style.*?</style>", "",
                    RegexOptions.Singleline | RegexOptions.IgnoreCase);
                // Remove HTML tags
                the_html = Regex.Replace(the_html, "</?[a-z][a-z0-9]*[^<>]*>", "",
                    RegexOptions.IgnoreCase);
                // Remove HTML comments
                the_html = Regex.Replace(the_html, "<!--.*?-->", "", RegexOptions.Singleline);
                // Remove Doctype
                the_html = Regex.Replace(the_html, "<!DOCTYPE.*?>", "",
                    RegexOptions.Singleline | RegexOptions.IgnoreCase);
                // Replace tabs and line breaks with spaces
                the_html = Regex.Replace(the_html, "[\t\r\n]", " ");
                return the_html;
            }
        }
    }

From Page_Load I call the scrapeIt() method, passing it the string I get from a textbox on the page.

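I think your best option is to use a lightweight HTML parser (something like Majestic 12, which in my tests is roughly 50-100% faster than HTML Agility Pack) and only process the nodes you are interested in (anything between <body> and </body>). Majestic 12 is a little harder to use than HTML Agility Pack, but if you are looking for performance it will definitely help you!

This will get you close to what you are asking for, but you will still have to download the entire page. I don't think there is a way around that. What you will save on is generating the DOM nodes for all the other content (everything aside from the body). You will still have to parse those nodes, but you can skip the entire content of any node you don't want to process.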

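Here is a good example of how to use the M12 parser.

I don't have a ready-made example of how to grab the body, but I do have one that grabs only the links, and with a little modification it will get you there. Here is the rough version: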

    GrabBody(ParserTools.OpenM12Parser(_response.BodyBytes));

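You need to open the M12 parser (the example project that comes with M12 has comments that detail exactly how each of these options affects performance, and they do!):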

    public static HTMLparser OpenM12Parser(byte[] buffer)
    {
        HTMLparser parser = new HTMLparser();
        parser.SetChunkHashMode(false);
        parser.bKeepRawHTML = false;
        parser.bDecodeEntities = true;
        parser.bDecodeMiniEntities = true;

        if (!parser.bDecodeEntities && parser.bDecodeMiniEntities)
            parser.InitMiniEntities();

        parser.bAutoExtractBetweenTagsOnly = true;
        parser.bAutoKeepScripts = true;
        parser.bAutoMarkClosedTagsWithParamsAsOpen = true;
        parser.CleanUp();
        parser.Init(buffer);
        return parser;
    }

Parse the body:

    public void GrabBody(HTMLparser parser)
    {
        // parser will return us tokens called HTMLchunk -- warning DO NOT destroy it until end of parsing
        // because HTMLparser re-uses this object
        HTMLchunk chunk = null;

        // we parse until the returned chunk is null, indicating we reached end of parsing
        while ((chunk = parser.ParseNext()) != null)
        {
            switch (chunk.oType)
            {
                // matched an open tag
                case HTMLchunkType.OpenTag:
                    if (chunk.sTag == "body")
                    {
                        // Start generating the DOM node (as shown in the previous example link)
                    }
                    break;

                // matched a close tag
                case HTMLchunkType.CloseTag:
                    break;

                // matched normal text
                case HTMLchunkType.Text:
                    break;

                // matched an HTML comment, that's stuff between <!-- and -->
                case HTMLchunkType.Comment:
                    break;
            }
        }
    }

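Generating the DOM nodes is tricky, but the Majestic12ToXml class will help you do that. Like I said, this is by no means equivalent to the 3-liner you saw with HTML Agility Pack, but once you get the tools down you will be able to get exactly what you need for a fraction of the performance cost, and probably in just as many lines of code.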

I'd suggest taking advantage of the HTML Agility Pack to do the HTML parsing/manipulation.

You can easily select the body like this:

    var webGet = new HtmlWeb();
    var document = webGet.Load(url);
    var body = document.DocumentNode.SelectSingleNode("//body");
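If the page has already been downloaded as a string (as in the question's getHTML), something along these lines might turn the selected node into plain text. This is only a sketch: it assumes the HtmlAgilityPack NuGet package is referenced, and the class and method names (BodyTextExample, GetBodyText) are made up for illustration. Because InnerText also returns the contents of script and style elements, the sketch removes those first.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class BodyTextExample
{
    // Hypothetical helper: returns the text inside <body>, or "" when no body is found.
    public static string GetBodyText(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode body = document.DocumentNode.SelectSingleNode("//body");
        if (body == null)
            return string.Empty;

        // Remove <script> and <style> elements so their contents are not returned as text.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection unwanted = body.SelectNodes(".//script|.//style");
        if (unwanted != null)
        {
            foreach (HtmlNode node in unwanted)
                node.Remove();
        }

        // Decode entities such as &amp; and trim surrounding whitespace.
        return HtmlEntity.DeEntitize(body.InnerText).Trim();
    }

    public static void Main()
    {
        string html = "<html><head><title>t</title></head>" +
                      "<body><p>Hello</p><script>var x = 1;</script></body></html>";
        Console.WriteLine(GetBodyText(html));
    }
}
```

The whole document is still parsed here, so this buys convenience rather than speed.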

Still, this is the simplest/fastest (and least accurate) method:

 int start = response.IndexOf("


Obviously, if there is javascript in the HEAD tag like...

    document.write("<body>");

then you'll end up with a little more than you wanted.
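A minimal sketch of the index-based idea, assuming the downloaded page is already held in a string. The helper name SliceBody, the null return for a missing tag, and the use of StringComparison.OrdinalIgnoreCase are illustrative assumptions rather than details taken from the answer.

```csharp
using System;

public static class BodySliceExample
{
    // Hypothetical helper: returns the substring from the first "<body" to the
    // last "</body>" (inclusive), or null when either marker is missing.
    public static string SliceBody(string response)
    {
        const string closeTag = "</body>";
        int start = response.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        int end = response.LastIndexOf(closeTag, StringComparison.OrdinalIgnoreCase);

        // Covers a missing open tag, a missing close tag, and a close tag before the open tag.
        if (start < 0 || end < start)
            return null;

        return response.Substring(start, end - start + closeTag.Length);
    }

    public static void Main()
    {
        string page = "<html><head><title>t</title></head>" +
                      "<BODY class=\"x\"><p>Hi</p></BODY></html>";
        Console.WriteLine(SliceBody(page));
    }
}
```

Because this is a plain string search, the caveat above applies: the first "<body" found may sit inside a script in the head, in which case the slice starts too early.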


