Handling invalid XML hex characters

I am trying to send an XML document over the wire, but I receive the following exception:

    "MY LONG EMAIL STRING" was specified for the 'Body' element. ---> System.ArgumentException: '', hexadecimal value 0x02, is an invalid character.
       at System.Xml.XmlUtf8RawTextWriter.InvalidXmlChar(Int32 ch, Byte* pDst, Boolean entitize)
       at System.Xml.XmlUtf8RawTextWriter.WriteElementTextBlock(Char* pSrc, Char* pSrcEnd)
       at System.Xml.XmlUtf8RawTextWriter.WriteString(String text)
       at System.Xml.XmlUtf8RawTextWriterIndent.WriteString(String text)
       at System.Xml.XmlRawWriter.WriteValue(String value)
       at System.Xml.XmlWellFormedWriter.WriteValue(String value)
       at Microsoft.Exchange.WebServices.Data.EwsServiceXmlWriter.WriteValue(String value, String name)
       --- End of inner exception stack trace ---

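I have no control over the content I am trying to send, because the string is gathered from an email. How can I encode my string so that it is valid XML while still retaining the illegal characters?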

I would like to preserve the original characters one way or another.

    // UTF-8 rather than ASCII, so that non-ASCII characters are not replaced with '?'
    byte[] toEncodeAsBytes = System.Text.Encoding.UTF8.GetBytes(toEncode);
    string returnValue = System.Convert.ToBase64String(toEncodeAsBytes);

is one way of doing this.
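Since the question asks to keep the original characters, the receiving side has to be able to undo the encoding. Below is a minimal round-trip sketch. The class name `Base64Text` and its two methods are made up for illustration, and it assumes both ends agree on UTF-8.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of any library): Base64 round trip for text
// that may contain characters XML 1.0 does not allow.
public static class Base64Text
{
    // The Base64 alphabet (A-Z, a-z, 0-9, '+', '/', '=') contains only
    // characters that are legal in XML, so the result is safe as element text.
    public static string Encode(string text) =>
        Convert.ToBase64String(Encoding.UTF8.GetBytes(text));

    // Reverses Encode on the receiving side.
    public static string Decode(string base64) =>
        Encoding.UTF8.GetString(Convert.FromBase64String(base64));
}
```

The trade-off is that the element content is no longer human-readable, and the consumer has to know that it must decode the value.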

The following code removes XML-invalid characters from a string and returns a new string without them:

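    // requires: using System.Text.RegularExpressions;
    public static string CleanInvalidXmlChars(string text)
    {
        // From the XML spec, valid chars are:
        // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        // i.e. any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
        // .NET regexes work on UTF-16 code units; \x takes exactly two hex digits and
        // \u exactly four, so match the invalid units and any unpaired surrogates instead.
        string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]"
                  + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
                  + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
        return Regex.Replace(text, re, "");
    }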

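The following solution removes any invalid XML characters, but it does so, I think, about as efficiently as it can be done. In particular, it does not allocate a new StringBuilder or a new string unless it has already determined that the string contains an invalid character. The hot path is therefore just a single for loop over the characters, and the check usually costs no more than two greater-than/less-than numeric comparisons per char. If no invalid character is found, the original string is simply returned. This is useful when the vast majority of strings are fine to begin with: they pass through as quickly as possible, with no wasted allocations.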

-- Update --

See further below for how to directly write an XElement that contains these invalid characters; that approach still uses this code.

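Some of this code was influenced by Mr. Tom Bogle's solution. See also the helpful information that superlogical provided in the same post. However, all of those always instantiate a new StringBuilder and a new string.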

Usage:

  string xmlStrBack = XML.ToValidXmlCharactersString("any string"); 

Test:

    public static void TestXmlCleanser()
    {
        // The bad char is invisible in most editors, so it is written as an escape here.
        // It sits where the X is in 'MontXoya'; U+0002 is used as the example.
        string badString = "My name is Inigo Mont\u0002oya";
        string goodString = "My name is Inigo Montoya!";

        string back1 = XML.ToValidXmlCharactersString(badString);  // fixes it
        string back2 = XML.ToValidXmlCharactersString(goodString); // returns same string

        XElement x1 = new XElement("test", back1);
        XElement x2 = new XElement("test", back2);
        XElement x3WithBadString = new XElement("test", badString);

        string xml1 = x1.ToString();
        string xml2 = x2.ToString();
        string xmlShouldFail = x3WithBadString.ToString(); // throws because of the bad char
    }

// -- CODE -- (I have these methods in a static utility class named XML)

    /// <summary>
    /// Determines if any invalid XML 1.0 characters exist within the string,
    /// and if so it returns a new string with the invalid chars removed, else
    /// the same string is returned (with no wasted StringBuilder allocated, etc).
    /// </summary>
    /// <param name="s">Xml string.</param>
    /// <param name="startIndex">The index to begin checking at.</param>
    public static string ToValidXmlCharactersString(string s, int startIndex = 0)
    {
        int firstInvalidChar = IndexOfFirstInvalidXMLChar(s, startIndex);
        if (firstInvalidChar < 0)
            return s;

        startIndex = firstInvalidChar;

        int len = s.Length;
        var sb = new StringBuilder(len);

        if (startIndex > 0)
            sb.Append(s, 0, startIndex);

        for (int i = startIndex; i < len; i++)
            if (IsLegalXmlChar(s[i]))
                sb.Append(s[i]);

        return sb.ToString();
    }

    /// <summary>
    /// Gets the index of the first invalid XML 1.0 character in this string, else returns -1.
    /// </summary>
    /// <param name="s">Xml string.</param>
    /// <param name="startIndex">Start index.</param>
    public static int IndexOfFirstInvalidXMLChar(string s, int startIndex = 0)
    {
        if (s != null && s.Length > 0 && startIndex < s.Length)
        {
            if (startIndex < 0) startIndex = 0;
            int len = s.Length;

            for (int i = startIndex; i < len; i++)
                if (!IsLegalXmlChar(s[i]))
                    return i;
        }
        return -1;
    }

    /// <summary>
    /// Indicates whether a given character is valid according to the XML 1.0 spec.
    /// This code represents an optimized version of Tom Bogle's on SO:
    /// https://stackoverflow.com/a/13039301/264031.
    /// </summary>
    public static bool IsLegalXmlChar(char c)
    {
        if (c > 31 && c <= 55295)
            return true;
        if (c < 32)
            return c == 9 || c == 10 || c == 13;
        // The final comparison (c > 65535) is useful only for an integral comparison,
        // i.e. if char c becomes int c (useful for UTF-32, I suppose). No upper bound
        // check (c <= 1114111) is needed: it is impossible to get a code point bigger
        // than 1114111, because Char.ConvertToUtf32 would have thrown an exception.
        // Note: each UTF-16 code unit is tested on its own, so surrogates are rejected.
        return (c >= 57344 && c <= 65533) || c > 65535;
    }

======== ======== ========

Writing XElement.ToString directly

======== ======== ========

First, the usage of this extension method:

 string result = xelem.ToStringIgnoreInvalidChars(); 

-- Fuller test --

    public static void TestXmlCleanser()
    {
        // The bad char is invisible in most editors, so it is written as an escape here.
        // It sits where the X is in 'MontXoya'; U+0002 is used as the example.
        string badString = "My name is Inigo Mont\u0002oya";

        XElement x = new XElement("test", badString);

        string xml1 = x.ToStringIgnoreInvalidChars();
        // result: <test>My name is Inigo Montoya</test>

        string xml2 = x.ToStringIgnoreInvalidChars(deleteInvalidChars: false);
        // result: the same element, with the bad char written as a character entity instead of removed
    }

--- Code ---

    /// <summary>
    /// Writes this XML to string while allowing invalid XML chars to either be
    /// simply removed during the write process, or else encoded into entities,
    /// instead of having an exception occur, as the standard XmlWriter.Create
    /// XmlWriter does (which is the default writer used by XElement).
    /// </summary>
    /// <param name="xml">XElement.</param>
    /// <param name="deleteInvalidChars">True to have any invalid chars deleted, else they will be entity encoded.</param>
    /// <param name="indent">Indent setting.</param>
    /// <param name="indentChar">Indent char (leave null to use default)</param>
    public static string ToStringIgnoreInvalidChars(this XElement xml, bool deleteInvalidChars = true, bool indent = true, char? indentChar = null)
    {
        if (xml == null) return null;

        StringWriter swriter = new StringWriter();
        using (XmlTextWriterIgnoreInvalidChars writer = new XmlTextWriterIgnoreInvalidChars(swriter, deleteInvalidChars))
        {
            // -- settings --
            // unfortunately writer.Settings cannot be set, is null, so we can't specify:
            // bool newLineOnAttributes, bool omitXmlDeclaration
            writer.Formatting = indent ? Formatting.Indented : Formatting.None;

            if (indentChar != null)
                writer.IndentChar = (char)indentChar;

            // -- write --
            xml.WriteTo(writer);
        }

        return swriter.ToString();
    }

-- This uses the following XmlTextWriter --

    public class XmlTextWriterIgnoreInvalidChars : XmlTextWriter
    {
        public bool DeleteInvalidChars { get; set; }

        public XmlTextWriterIgnoreInvalidChars(TextWriter w, bool deleteInvalidChars = true) : base(w)
        {
            DeleteInvalidChars = deleteInvalidChars;
        }

        public override void WriteString(string text)
        {
            // strip the invalid chars only when asked to; otherwise leave them to the base writer
            if (text != null && DeleteInvalidChars)
                text = XML.ToValidXmlCharactersString(text);
            base.WriteString(text);
        }
    }

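I am on the receiving end of @parapurarajkumar's solution, where the illegal characters are properly loaded into an XmlDocument but break XmlWriter when I try to save the output.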

My context

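I am using Elmah to view exception/error logs from a website. Elmah returns the state of the server at the time of the exception, in the form of a large XML document. For our reporting engine, I pretty-print the XML with XmlWriter.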

During a website attack, I noticed that some of the XML documents were not parsing and were producing this exception: '.', hexadecimal value 0x00, is an invalid character.

Non-resolution: I converted the document to a byte[] and sanitized it of 0x00, but it found none.

When I scanned the XML document, I found the following:

 ... 
... ...

This was the nul byte encoded as an HTML entity.

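Resolution: To fix the encoding, I replaced the &#x0; value before loading it into my XmlDocument, because loading it creates the nul byte and it is then difficult to sanitize it out of the object. Here is my entire process: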

    XmlDocument xml = new XmlDocument();
    details.Xml = details.Xml.Replace("&#x0;", "[0x00]"); // in my case I wanted to see it, otherwise just replace with ""
    xml.LoadXml(details.Xml);

    string formattedXml = null;

    // I stuff this all in a helper function, but put it in-line for this example
    StringBuilder sb = new StringBuilder();
    XmlWriterSettings settings = new XmlWriterSettings
    {
        OmitXmlDeclaration = true,
        Indent = true,
        IndentChars = "\t",
        NewLineHandling = NewLineHandling.None,
    };
    using (XmlWriter writer = XmlWriter.Create(sb, settings))
    {
        xml.Save(writer);
    }
    // read the result once the writer has been disposed (and therefore flushed)
    formattedXml = sb.ToString();

Lesson learned: if your incoming data is HTML-encoded on entry, sanitize for illegal bytes using the associated HTML entity.
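The same idea can be generalized beyond the single nul entity. The sketch below scans for numeric character references and drops those that point at a code point XML 1.0 does not allow. `XmlEntityCleaner` and `RemoveInvalidCharReferences` are hypothetical names invented for this example. It works on the raw text only, so it would also touch a reference that appears literally inside a CDATA section or a comment.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: removes numeric character references (&#0;, &#x1F;, ...)
// whose code point is not a legal XML 1.0 character, before the text is parsed.
public static class XmlEntityCleaner
{
    // Group 1 captures hexadecimal digits, group 2 captures decimal digits.
    private static readonly Regex NumericReference =
        new Regex(@"&#(?:x([0-9A-Fa-f]{1,8})|([0-9]{1,10}));", RegexOptions.Compiled);

    public static string RemoveInvalidCharReferences(string xml, string replacement = "")
    {
        return NumericReference.Replace(xml, m =>
        {
            long codePoint = m.Groups[1].Success
                ? long.Parse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture)
                : long.Parse(m.Groups[2].Value, CultureInfo.InvariantCulture);

            // Keep references to legal characters untouched.
            return IsValidXmlCodePoint(codePoint) ? m.Value : replacement;
        });
    }

    // The Char production of the XML 1.0 specification.
    private static bool IsValidXmlCodePoint(long cp) =>
        cp == 0x9 || cp == 0xA || cp == 0xD ||
        (cp >= 0x20 && cp <= 0xD7FF) ||
        (cp >= 0xE000 && cp <= 0xFFFD) ||
        (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

Passing a visible marker such as "[0x00]" as the replacement keeps a trace of where the illegal reference was.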

This worked for me:

 XmlWriterSettings xmlWriterSettings = new XmlWriterSettings { Encoding = Encoding.UTF8, CheckCharacters = false }; 
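For context, here is a small sketch of how those settings might be used. The class and method names are invented for this example. As far as I can tell, with `CheckCharacters = false` the writer no longer throws and instead emits the offending character in element text as a numeric character reference. The output is then still not strictly valid XML 1.0, so a strict parser on the other end may reject it unless it is configured just as leniently (for .NET, `XmlReaderSettings.CheckCharacters = false`).

```csharp
using System.Text;
using System.Xml;

// Hypothetical demo class: writes a single element without character checking.
public static class LenientXmlWriterDemo
{
    public static string WriteBody(string body)
    {
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            CheckCharacters = false // do not throw on characters outside the XML 1.0 range
        };

        var sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteElementString("Body", body);
        }
        // The writer has been disposed (and flushed) at this point.
        return sb.ToString();
    }
}
```

Calling `LenientXmlWriterDemo.WriteBody("Mont\u0002oya")` completes without the ArgumentException from the question.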

Can't the string be cleaned with the following?

 System.Net.WebUtility.HtmlDecode() 

Another way to remove incorrect XML characters in C# is to use the XmlConvert.IsXmlChar method (available since .NET Framework 4.0):

    // requires: using System.Linq;
    public static string RemoveInvalidXmlChars(string content)
    {
        return new string(content.Where(ch => System.Xml.XmlConvert.IsXmlChar(ch)).ToArray());
    }
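One caveat: `XmlConvert.IsXmlChar` looks at a single UTF-16 code unit, and it reports surrogates as invalid, so the one-liner above also strips characters above U+FFFF (emoji, for instance). Below is a sketch of a variant that keeps well-formed surrogate pairs. The names `XmlText` and `RemoveInvalidXmlCharsKeepPairs` are made up for this example.

```csharp
using System.Text;
using System.Xml;

// Hypothetical helper: like RemoveInvalidXmlChars, but keeps valid surrogate pairs.
public static class XmlText
{
    public static string RemoveInvalidXmlCharsKeepPairs(string content)
    {
        if (string.IsNullOrEmpty(content)) return content;

        var sb = new StringBuilder(content.Length);
        for (int i = 0; i < content.Length; i++)
        {
            char ch = content[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                sb.Append(ch);
            }
            else if (i + 1 < content.Length && char.IsSurrogatePair(ch, content[i + 1]))
            {
                // A high surrogate followed by a low surrogate encodes a code point
                // in [#x10000-#x10FFFF], which XML 1.0 allows: keep both units.
                sb.Append(ch).Append(content[i + 1]);
                i++;
            }
            // Anything else (control chars, FFFE/FFFF, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

Unlike the loop-with-early-exit answer earlier in this thread, this always allocates a StringBuilder, so it trades some speed for keeping supplementary characters.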

.NET Fiddle - https://dotnetfiddle.net/v1TNus

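For example, the vertical tab character (\v) is not valid in XML: it is valid UTF-8 but not valid XML 1.0, and even many libraries (including libxml2) miss it and silently output invalid XML.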
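To see which of the ASCII control characters pass, a quick check with `XmlConvert.IsXmlChar` is enough. The class name `ControlCharCheck` below is just for illustration.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper: lists the control characters below U+0020 that XML 1.0 accepts.
public static class ControlCharCheck
{
    public static List<int> ValidControlChars()
    {
        var valid = new List<int>();
        for (int c = 0; c < 0x20; c++)
        {
            if (XmlConvert.IsXmlChar((char)c))
                valid.Add(c);
        }
        return valid; // 9 (tab), 10 (line feed) and 13 (carriage return)
    }
}
```

The vertical tab (11) and form feed (12) are not in the list, even though they are ordinary whitespace in many other contexts.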