How do I remove HTML tags from a string using a regex?
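I am fetching data from MySQL, but the problem is that HTML tags, i.e.
<p>LARGE</p><p>Lamb;
;li;ul;
are also fetched along with my data. I only need "LARGE" and "Lamb" from the above. How can I separate/remove the HTML tags from a string?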
Try this:

    // Erase HTML tags from a string.
    // Requires: using System.Text.RegularExpressions;
    public static string StripHtml(string target)
    {
        // Regular expression for HTML tags
        Regex StripHTMLExpression = new Regex("<\\S[^><]*>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline |
            RegexOptions.CultureInvariant | RegexOptions.Compiled);

        return StripHTMLExpression.Replace(target, string.Empty);
    }
Call it like this:

    string htmlString = "hello world!";
    string strippedString = StripHtml(htmlString);
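To make the behavior of that pattern concrete, here is a minimal, self-contained sketch. The class name `StripHtmlDemo` and the sample inputs are my own assumptions, not part of the answer. It shows that the pattern strips literal tags, leaves a bare `<` followed by whitespace alone, and does nothing to entity-encoded markup such as `&lt;p&gt;`, which would have to be decoded first.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the pattern itself comes from the answer above.
public static class StripHtmlDemo
{
    // '<', one non-whitespace character, then anything except '<' or '>',
    // up to the closing '>'.
    private static readonly Regex TagPattern =
        new Regex("<\\S[^><]*>", RegexOptions.Compiled);

    public static string StripHtml(string target)
    {
        return TagPattern.Replace(target, string.Empty);
    }

    public static void Main()
    {
        // Literal tags are removed.
        Console.WriteLine(StripHtml("<p>LARGE</p><p>Lamb</p>"));   // LARGELamb

        // A '<' followed by a space is not treated as a tag.
        Console.WriteLine(StripHtml("1 < 2 and <b>bold</b>"));     // 1 < 2 and bold

        // Entity-encoded markup contains no '<', so it is left unchanged.
        Console.WriteLine(StripHtml("&lt;p&gt;LARGE&lt;/p&gt;"));
    }
}
```

Note that the stripped text runs together ("LARGELamb"), so this approach alone does not separate the two values.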
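I am going to assume that the HTML is intact, perhaps something like this:

    <p>LARGE</p><p>Lamb</p>&nbsp;

In that case, I would use HtmlAgilityPack to get the content without resorting to regular expressions.

    // Requires: using System.Linq; using HtmlAgilityPack;
    var html = "<p>LARGE</p><p>Lamb</p>&nbsp;";
    var hap = new HtmlDocument();
    hap.LoadHtml(html);

    string text = HtmlEntity.DeEntitize(hap.DocumentNode.InnerText);
    // text is now "LARGELamb "

    string[] lines = hap.DocumentNode.SelectNodes("//text()")
        .Select(h => HtmlEntity.DeEntitize(h.InnerText)).ToArray();
    // lines is { "LARGE", "Lamb", " " }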
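If we assume that you are going to fix your HTML elements:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Net;
    using HtmlAgilityPack;

    class Program
    {
        static void Main(string[] args)
        {
            // Decode any HTML entities into literal markup before parsing.
            string html = WebUtility.HtmlDecode("<p>LARGE</p><p>Lamb</p>");

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Collect every <p> element in the document.
            List<HtmlNode> paragraphNodes = doc.DocumentNode.Descendants()
                .Where(x => x.Name == "p").ToList();

            foreach (HtmlNode node in paragraphNodes)
            {
                Console.WriteLine(node.InnerHtml);
            }
        }
    }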
You need to use the HTML Agility Pack. You can add a reference to it like this:
Install-Package HtmlAgilityPack
Assuming that:
- the original string is always in that specific format, and
- you cannot add HtmlAgilityPack,
here is a quick and dirty way to get what you want:
    static void Main(string[] args)
    {
        // Split the original string on the 'separator' string.
        string originalString = "<p>LARGE</p><p>Lamb;\n;li;ul; ";
        string[] sSeparator = new string[] { "</p><p>" };
        string[] splitString = originalString.Split(sSeparator, StringSplitOptions.None);

        // Prepare to filter out the 'prefix' and 'postfix' strings.
        string prefix = "<p>";
        string postfix = ";\n;li;ul; ";
        int prefixLength = prefix.Length;
        int postfixLength = postfix.Length;

        // Iterate over the split string and clean up each piece.
        string s = string.Empty;
        for (int i = 0; i < splitString.Length; i++)
        {
            s = splitString[i];
            if (s.Contains(prefix))
            {
                s = s.Remove(s.IndexOf(prefix), prefixLength);
            }
            if (s.Contains(postfix))
            {
                s = s.Remove(s.IndexOf(postfix), postfixLength);
            }
            splitString[i] = s;
            Console.WriteLine(splitString[i]);
        }
        Console.ReadLine();
    }
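    // Decode entities such as &lt; and &gt; into literal < and > characters.
    // (HttpUtility lives in System.Web; Regex in System.Text.RegularExpressions.)
    String sResult = HttpUtility.HtmlDecode(sData);

    // Remove HTML tags delimited by <>
    String result = Regex.Replace(sResult, @"<[^>]*>", String.Empty);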
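The decode-then-strip snippet above yields a single concatenated string, whereas the question asks for "LARGE" and "Lamb" as separate values. One possible variation, sketched here under my own assumptions, replaces each tag with a newline and then splits on it. The class name `TagSplitDemo`, the method `ExtractTextPieces`, and the entity-encoded sample input are hypothetical, and `WebUtility.HtmlDecode` from `System.Net` is swapped in for `HttpUtility.HtmlDecode` so the example does not depend on `System.Web`.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from any of the answers.
public static class TagSplitDemo
{
    public static string[] ExtractTextPieces(string encoded)
    {
        // Turn entities such as &lt; and &gt; back into literal markup.
        string decoded = WebUtility.HtmlDecode(encoded);

        // Replace every tag with a newline so adjacent text runs stay apart.
        string separated = Regex.Replace(decoded, "<[^>]*>", "\n");

        // Split on the newlines and drop the empty entries left between tags.
        return separated.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Assumed sample: the question's data in entity-encoded form.
        string[] pieces = ExtractTextPieces(
            "&lt;p&gt;LARGE&lt;/p&gt;&lt;p&gt;Lamb&lt;/p&gt;");

        foreach (string piece in pieces)
        {
            Console.WriteLine(piece);
        }
    }
}
```

Because the newline doubles as the separator, text that already contains line breaks would also be split at those points; a different separator may be a better fit for such data.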