C#: crawler project

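Could I get some easy-to-follow code examples for the following:

Use a browser control to issue a request to a target website.
Capture the response from the target website.
Convert the response into a DOM object.
Iterate over the DOM object and capture things like "FirstName", "LastName", etc. if they are part of the response.

Thanks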

Here is code that uses a WebRequest object to retrieve the data and capture the response as a stream.

    using System.IO;
    using System.Net;
    using System.Net.Security;
    using System.Security.Cryptography.X509Certificates;

    public static Stream GetExternalData( string url, string postData, int timeout )
    {
        // Note: this callback is process-wide, and += registers another copy on every call.
        ServicePointManager.ServerCertificateValidationCallback += delegate(
            object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors )
        {
            // if we trust the callee implicitly, return true...
            // otherwise, perform validation logic
            return [bool]; // placeholder from the original answer: substitute your own result
        };

        WebRequest request = null;
        HttpWebResponse response = null;
        try
        {
            request = WebRequest.Create( url );
            request.Timeout = timeout; // force a quick timeout
            if( postData != null )
            {
                request.Method = "POST";
                request.ContentType = "application/x-www-form-urlencoded";
                // character count equals byte count for ASCII data
                request.ContentLength = postData.Length;
                using( StreamWriter requestStream = new StreamWriter( request.GetRequestStream(), System.Text.Encoding.ASCII ) )
                {
                    requestStream.Write( postData );
                }
            }
            response = (HttpWebResponse)request.GetResponse();
        }
        catch( WebException ex )
        {
            Log.LogException( ex ); // Log is the author's own logging helper
        }
        finally
        {
            request = null;
        }

        // Anything other than 200 OK is treated as a failure.
        if( response == null || response.StatusCode != HttpStatusCode.OK )
        {
            if( response != null )
            {
                response.Close();
                response = null;
            }
            return null;
        }
        return response.GetResponseStream();
    }

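To process the response, I use a custom XHTML parser, but it runs to thousands of lines of code. Several publicly available parsers exist (see Darin's comment).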
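As a rough illustration of what handling the response could look like without a custom parser, here is a minimal sketch using only the .NET base class library. It assumes the response is well-formed XML/XHTML with no DOCTYPE, which real-world HTML often is not (in that case one of the public parsers is the better choice). The `ResponseReader` class and its method names are hypothetical, and the field names come from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class ResponseReader
{
    // Reads the whole response stream into a string and disposes the stream.
    // UTF-8 is an assumption; use the charset reported by the response if known.
    public static string ReadAll(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns name -> value for every <input> element whose name attribute
    // is one of fieldNames. Fields missing from the markup are skipped.
    // Assumes well-formed markup and trusted field names (no quote characters).
    public static Dictionary<string, string> ExtractFields(string xhtml, params string[] fieldNames)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);

        var result = new Dictionary<string, string>();
        foreach (string name in fieldNames)
        {
            // local-name() makes the match work with or without the XHTML namespace.
            string xpath = "//*[local-name()='input' and @name='" + name + "']";
            var element = doc.SelectSingleNode(xpath) as XmlElement;
            if (element != null)
            {
                result[name] = element.GetAttribute("value");
            }
        }
        return result;
    }
}
```

With the stream returned by `GetExternalData`, usage would be along the lines of `ResponseReader.ExtractFields(ResponseReader.ReadAll(stream), "FirstName", "LastName")`, after checking that the stream is not null.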

Edit: Per the OP's question, headers can be added to the request to emulate a user agent. For example:

    // request must be declared as HttpWebRequest for Accept and UserAgent to be available
    request = (HttpWebRequest)WebRequest.Create( url );
    request.Accept = "application/x-ms-application, image/jpeg, application/xaml+xml, image/gif, image/pjpeg, application/x-ms-xbap, application/x-shockwave-flash, */*";
    request.Timeout = timeout;
    request.Headers.Add( "Cookie", cookies );

    // manifest as a standard user agent
    request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US)";

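Here you can find a four-part tutorial that covers what you want.

This is the first part; there are four parts in total (How to write a search engine).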

You could take a look at Html Agility Pack and/or SgmlReader. Here is an example using SgmlReader that selects all nodes in the DOM containing some text:

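    using System;
    using System.Xml;
    using Sgml; // namespace of the SgmlReader library

    class Program
    {
        static void Main()
        {
            using (var reader = new SgmlReader())
            {
                // SgmlReader fetches the page and exposes it as an XML stream
                reader.Href = "http://www.microsoft.com";
                var doc = new XmlDocument();
                doc.Load(reader);

                // Select every element with a text node containing "Products"
                var nodes = doc.SelectNodes("//*[contains(text(), 'Products')]");
                foreach (XmlNode node in nodes)
                {
                    Console.WriteLine(node.OuterXml);
                }
            }
        }
    }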

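If you want a pure C# way to traverse web pages, a good place to look is WatiN. It lets you easily open a web browser and browse (and manipulate) web pages from C# code.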

Here is an example that uses the API to search Google (taken from their documentation):

    using System;
    using WatiN.Core;

    namespace WatiNGettingStarted
    {
        class WatiNConsoleExample
        {
            [STAThread]
            static void Main(string[] args)
            {
                // Open a new Internet Explorer window and
                // go to the Google website.
                IE ie = new IE("http://www.google.com");

                // Find the search text field and type WatiN in it.
                ie.TextField(Find.ByName("q")).TypeText("WatiN");

                // Click the Google search button.
                ie.Button(Find.ByValue("Google Search")).Click();

                // Uncomment the following line if you want to close
                // Internet Explorer and the console window immediately.
                //ie.Close();
            }
        }
    }

You can also use Selenium to easily traverse the DOM and get the values of fields. It will also open the browser for you automatically.
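A minimal sketch of the Selenium approach, assuming the Selenium.WebDriver NuGet package and a matching Chrome driver are installed. The URL is a hypothetical placeholder, and the field names are taken from the question; the page is assumed to expose them as elements with a `name` attribute.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class SeleniumFieldReader
{
    static void Main()
    {
        // ChromeDriver launches a browser window; disposing it closes the browser.
        using (IWebDriver driver = new ChromeDriver())
        {
            // Hypothetical target page; replace with the site being crawled.
            driver.Navigate().GoToUrl("http://www.example.com/");

            foreach (string name in new[] { "FirstName", "LastName" })
            {
                // FindElements returns an empty collection instead of throwing
                // when nothing matches, so missing fields are simply skipped.
                var matches = driver.FindElements(By.Name(name));
                if (matches.Count > 0)
                {
                    Console.WriteLine("{0}: {1}", name, matches[0].GetAttribute("value"));
                }
            }
        }
    }
}
```

Using `FindElements` rather than `FindElement` fits the "if they are part of the response" requirement, since absent fields do not raise an exception.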
