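Extracting URLs with regex in .NET

I took inspiration from the sample program at the following URL, csharp-online, and I want to retrieve all the URLs from this page: alexa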

    using System;
    using System.Collections;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Net;
    using System.Text.RegularExpressions;

    namespace ExtractingUrls
    {
        class Program
        {
            static void Main(string[] args)
            {
                WebClient client = new WebClient();
                const string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
                string source = client.DownloadString(url);
                //Console.WriteLine(Getvals(source));

                // NOTE: the names of the two named groups were lost when this page was
                // extracted; "url" and "text" below are placeholders, not the original names.
                string matchPattern = @"<a.rel=""nofollow"".style=""font-size:0.8em;"".href=[""'](?<url>[^""^']+[.]*)[""'].class=""offsite"".*>(?<text>[^<]+[.]*)";

                foreach (Hashtable grouping in ExtractGroupings(source, matchPattern, true))
                {
                    foreach (DictionaryEntry DE in grouping)
                    {
                        Console.WriteLine("Value = " + DE.Value);
                        Console.WriteLine("");
                    }
                }

                // End.
                Console.ReadLine();
            }

            public static ArrayList ExtractGroupings(string source, string matchPattern, bool wantInitialMatch)
            {
                ArrayList keyedMatches = new ArrayList();
                int startingElement = 1;
                if (wantInitialMatch)
                {
                    startingElement = 0;
                }

                Regex RE = new Regex(matchPattern, RegexOptions.Multiline);
                MatchCollection theMatches = RE.Matches(source);

                foreach (Match m in theMatches)
                {
                    Hashtable groupings = new Hashtable();
                    for (int counter = startingElement; counter < m.Groups.Count; counter++)
                    {
                        // If we had just returned the MatchCollection directly, the
                        // GroupNameFromNumber method would not be available to use
                        groupings.Add(RE.GroupNameFromNumber(counter), m.Groups[counter]);
                    }
                    keyedMatches.Add(groupings);
                }

                return (keyedMatches);
            }
        }
    }

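But I have run into a problem: when I execute this, every URL is displayed three times. First the whole anchor tag is displayed, and then the URL is displayed twice. Can anyone suggest what I should correct so that each URL is displayed only once?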

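In your regular expression you have two groups, plus the entire match. If I am reading it correctly, you only want the URL part of the match, which is the second of the three groups (index 1, right after the entire match at index 0).

Instead of this: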

    for (int counter = startingElement; counter < m.Groups.Count; counter++)
    {
        // If we had just returned the MatchCollection directly, the
        // GroupNameFromNumber method would not be available to use
        groupings.Add(RE.GroupNameFromNumber(counter), m.Groups[counter]);
    }

Don't you want this?

    groupings.Add(RE.GroupNameFromNumber(1), m.Groups[1]);
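As a minimal sketch of the same idea, the loop over all groups can be dropped entirely by reading one named group from each match. The `UrlExtractor` class, the simplified pattern, and the group name `url` are assumptions made for illustration; this is not the asker's original pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // Simplified, hypothetical pattern: captures the href value of an anchor
    // tag in a named group called "url".
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\bhref=[""'](?<url>[^""']+)[""']",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in AnchorHref.Matches(html))
        {
            // Read only the named group; Groups[0] (the entire match) is ignored,
            // so each URL is collected exactly once per anchor.
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Calling `UrlExtractor.ExtractUrls(source)` and writing each item to the console would then print one line per matched anchor.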

Use the HTML Agility Pack to parse the HTML. I think it will make your problem much easier to solve.

Here is one way to do it:

    WebClient client = new WebClient();
    string url = "http://www.alexa.com/topsites/category/Top/Society/History/By_Topic/Science/Engineering_and_Technology";
    string source = client.DownloadString(url);

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(source);

    // SelectNodes returns null when nothing matches, so guard before iterating.
    HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href and @rel='nofollow']");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            Console.WriteLine(link.Attributes["href"].Value);
        }
    }

    int startingElement = 1;
    if (wantInitialMatch)
    {
        startingElement = 0;
    }

…

    for (int counter = startingElement; counter < m.Groups.Count; counter++)
    {
        // If we had just returned the MatchCollection directly, the
        // GroupNameFromNumber method would not be available to use
        groupings.Add(RE.GroupNameFromNumber(counter), m.Groups[counter]);
    }

You are passing wantInitialMatch = true, so the loop starts at index 0 and returns all three groups for every match:

    m.Groups[0] // entire match (the whole anchor tag)
    m.Groups[1] // first capture group, [^""^']+[.]*: the href part
    m.Groups[2] // second capture group, [^<]+[.]*: the link text
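The numbering can be checked on a small sample. The sketch below is an illustration only: the `GroupDump` class, the simplified pattern, and the group names `url` and `text` are assumptions, not the asker's code. It lists every group of the first match together with the name that `GroupNameFromNumber` reports for it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupDump
{
    public static List<string> Describe(string html)
    {
        // Hypothetical pattern with two named groups, "url" and "text".
        var re = new Regex(@"<a\s+href=""(?<url>[^""]+)"">(?<text>[^<]+)</a>");
        var lines = new List<string>();

        Match m = re.Match(html);
        if (!m.Success)
        {
            return lines; // no anchor found: nothing to describe
        }

        // Index 0 is always the entire match; capture groups start at index 1.
        for (int i = 0; i < m.Groups.Count; i++)
        {
            lines.Add(i + " (" + re.GroupNameFromNumber(i) + "): " + m.Groups[i].Value);
        }
        return lines;
    }
}
```

For the input `<a href="http://example.com/">example.com</a>` this yields three entries: the whole tag at index 0, the href at index 1, and the link text at index 2. Starting the loop at 1 (wantInitialMatch = false) drops the first entry, and reading only index 1 leaves just the URL.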

Take a look at this: http://bouncetadiss.blogspot.com/2008/02/parsing-uri-url-in-c-and-vbnet.html
