.NET: remove JavaScript and CSS code blocks from an HTML page

I have HTML as a string that contains JavaScript and CSS code blocks.

Something like this:

    <script>
      alert('hello world');
    </script>

    <style>
      A:link {text-decoration: none}
      A:visited {text-decoration: none}
      A:active {text-decoration: none}
      A:hover {text-decoration: underline; color: red;}
    </style>

But I don't need them. How can I remove those blocks with a regular expression?

A quick 'n' dirty method would be a regex like this:

    var regex = new Regex(
        "(\\<script(.+?)\\</script\\>)|(\\<style(.+?)\\</style\\>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase
    );

    string output = regex.Replace(input, "");

A better* (but possibly slower) option would be to use HtmlAgilityPack:

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlInput);

    var nodes = doc.DocumentNode.SelectNodes("//script|//style");

    // SelectNodes returns null (not an empty list) when nothing matches
    if (nodes != null)
    {
        foreach (var node in nodes)
            node.ParentNode.RemoveChild(node);
    }

    string htmlOutput = doc.DocumentNode.OuterHtml;

*) For a discussion of why it is better, see this thread.

Use HtmlAgilityPack for better results,

or try this function:

    public string RemoveScriptAndStyle(string HTML)
    {
        // \1 refers back to whichever tag name (script or style) was opened
        string Pat = "<(script|style)\\b[^>]*?>.*?</\\1>";
        return Regex.Replace(HTML, Pat, "", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
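To make the behaviour of this kind of pattern easier to check, here is a small self-contained sketch. It assumes the pattern is meant to end with a backreference (`</\1>`) to the tag name that was opened; the class name `RegexStripDemo` is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Assumption: the closing tag is matched with the backreference \1,
    // so <script> is only closed by </script> and <style> by </style>.
    private const string Pattern = @"<(script|style)\b[^>]*?>.*?</\1>";

    public static string RemoveScriptAndStyle(string html)
    {
        // Singleline lets '.' match newlines, so multi-line blocks are removed too.
        // IgnoreCase covers <SCRIPT>, <Style>, and so on.
        return Regex.Replace(html, Pattern, "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Only the blocks themselves are removed: whitespace or newlines that surrounded a block stay in the output, so a final cleanup pass may still be needed.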

Just look for an opening <script tag, then remove everything between it and the closing /script> tag.

Do the same for style. See Google for string manipulation tips.
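The approach described above can be sketched with plain string searches. This is only one possible reading of the suggestion, not code from the answer: the names `BlockStripper` and `RemoveBlocks` are hypothetical, and the sketch assumes that an unclosed block should simply be dropped to the end of the input.

```csharp
using System;
using System.Text;

public static class BlockStripper
{
    // Hypothetical helper: removes every <tagName ...> ... </tagName> block.
    public static string RemoveBlocks(string html, string tagName)
    {
        string open = "<" + tagName;
        string close = "</" + tagName;
        var result = new StringBuilder(html.Length);
        int pos = 0;

        while (pos < html.Length)
        {
            int start = html.IndexOf(open, pos, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break; // no more opening tags

            // Skip longer tag names that merely start with tagName.
            int afterName = start + open.Length;
            bool isTag = afterName >= html.Length
                || html[afterName] == '>'
                || html[afterName] == '/'
                || char.IsWhiteSpace(html[afterName]);
            if (!isTag)
            {
                result.Append(html, pos, afterName - pos);
                pos = afterName;
                continue;
            }

            // Keep the text before the block, then jump past its closing tag.
            result.Append(html, pos, start - pos);
            int end = html.IndexOf(close, afterName, StringComparison.OrdinalIgnoreCase);
            if (end < 0) { pos = html.Length; break; } // assumption: drop an unclosed block

            int gt = html.IndexOf('>', end);
            pos = gt < 0 ? html.Length : gt + 1;
        }

        // Copy whatever follows the last removed block.
        if (pos < html.Length) result.Append(html, pos, html.Length - pos);
        return result.ToString();
    }
}
```

Calling `BlockStripper.RemoveBlocks(BlockStripper.RemoveBlocks(html, "script"), "style")` handles both kinds of block. Like the regex versions, this does not understand HTML comments or a closing tag that appears inside a JavaScript string literal.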

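I reinvented the wheel here :) It may not be as correct as HtmlAgilityPack, but on a 400 KB page it is about 5-6 times faster. It also strips all remaining tags, optionally lowercases characters, and removes digits (it was written for a tokenizer).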

    // Tags whose whole content (not just the tag itself) is dropped
    private static readonly List<byte[]> SPECIAL_TAGS = new List<byte[]>
    {
        Encoding.ASCII.GetBytes("script"),
        Encoding.ASCII.GetBytes("style"),
        Encoding.ASCII.GetBytes("noscript")
    };

    private static readonly List<byte[]> SPECIAL_TAGS_CLOSE = new List<byte[]>
    {
        Encoding.ASCII.GetBytes("/script"),
        Encoding.ASCII.GetBytes("/style"),
        Encoding.ASCII.GetBytes("/noscript")
    };

    public static string StripTagsCharArray(string source, bool toLowerCase)
    {
        var array = new char[source.Length];
        var arrayIndex = 0;
        var inside = false;            // currently between '<' and '>'
        var haveSpecialTags = false;   // currently inside a script/style/noscript block
        var compareIndex = -1;         // position within the current tag name
        var singleQouteMode = false;
        var doubleQouteMode = false;
        var matchMemory = SetDefaultMemory(SPECIAL_TAGS);

        for (int i = 0; i < source.Length; i++)
        {
            var let = source[i];

            if (inside && !singleQouteMode && !doubleQouteMode)
            {
                compareIndex++;
                if (haveSpecialTags)
                {
                    var endTag = CheckSpecialTags(let, compareIndex, SPECIAL_TAGS_CLOSE, ref matchMemory);
                    if (endTag) haveSpecialTags = false;
                }
                if (!haveSpecialTags)
                {
                    haveSpecialTags = CheckSpecialTags(let, compareIndex, SPECIAL_TAGS, ref matchMemory);
                }
            }

            // Track quoted strings inside special blocks so that tags in literals are ignored
            if (haveSpecialTags && let == '"')
            {
                doubleQouteMode = !doubleQouteMode;
            }
            if (haveSpecialTags && let == '\'')
            {
                singleQouteMode = !singleQouteMode;
            }

            if (let == '<')
            {
                matchMemory = SetDefaultMemory(SPECIAL_TAGS);
                compareIndex = -1;
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }

            if (inside) continue;
            if (char.IsDigit(let)) continue;
            if (haveSpecialTags) continue;

            array[arrayIndex] = toLowerCase ? Char.ToLowerInvariant(let) : let;
            arrayIndex++;
        }

        return new string(array, 0, arrayIndex);
    }

    // Every tag starts out as a possible match
    private static bool[] SetDefaultMemory(List<byte[]> specialTags)
    {
        var memory = new bool[specialTags.Count];
        for (int i = 0; i < memory.Length; i++)
        {
            memory[i] = true;
        }
        return memory;
    }