C#: "hexadecimal value 0x12, is an invalid character"

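I'm loading a large number of XML documents, and some of them fail with errors like "hexadecimal value 0x12, is an invalid character", with different characters in different files. How can I remove them?

I did a little research on this.

The ASCII table has 128 symbols. Here is a small piece of test code that substitutes each ASCII symbol into a test file and tries to load the result as an XML document.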

    static public void RegexTry()
    {
        StreamReader stream = new StreamReader(@"test.xml");
        string xmlfile = stream.ReadToEnd();
        stream.Close();
        string text = "";
        for (int i = 0; i < 128; i++)
        {
            // Substitute the current ASCII character for the placeholder 'П'.
            char t = (char)i;
            text = xmlfile.Replace('П', t);
            XmlDocument xml = new XmlDocument();
            try
            {
                xml.LoadXml(text);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Char(" + i.ToString() + "): " + t + " => error! " + ex.Message);
                continue;
            }
            Console.WriteLine("Char(" + i.ToString() + "): " + t + " => fine!");
        }
        Console.ReadKey();
    }

It produces this output:

    Char(0): => error! '.', hexadecimal value 0x00, is an invalid character. Line 5, position 7.
    Char(1): => error! '', hexadecimal value 0x01, is an invalid character. Line 5, position 7.
    Char(2): => error! '', hexadecimal value 0x02, is an invalid character. Line 5, position 7.
    Char(3): => error! '', hexadecimal value 0x03, is an invalid character. Line 5, position 7.
    Char(4): => error! '', hexadecimal value 0x04, is an invalid character. Line 5, position 7.
    Char(5): => error! '', hexadecimal value 0x05, is an invalid character. Line 5, position 7.
    Char(6): => error! '', hexadecimal value 0x06, is an invalid character. Line 5, position 7.
    Char(7): => error! '', hexadecimal value 0x07, is an invalid character. Line 5, position 7.
    Char(8): => error! '', hexadecimal value 0x08, is an invalid character. Line 5, position 7.
    Char(9): => fine!
    Char(10): => fine!
    Char(11): => error! '', hexadecimal value 0x0B, is an invalid character. Line 5, position 7.
    Char(12): => error! '', hexadecimal value 0x0C, is an invalid character. Line 5, position 7.
    Char(13): => fine!
    Char(14): => error! '', hexadecimal value 0x0E, is an invalid character. Line 5, position 7.
    Char(15): => error! '', hexadecimal value 0x0F, is an invalid character. Line 5, position 7.
    Char(16): => error! '', hexadecimal value 0x10, is an invalid character. Line 5, position 7.
    Char(17): => error! '', hexadecimal value 0x11, is an invalid character. Line 5, position 7.
    Char(18): => error! '', hexadecimal value 0x12, is an invalid character. Line 5, position 7.
    Char(19): => error! '', hexadecimal value 0x13, is an invalid character. Line 5, position 7.
    Char(20): => error! '', hexadecimal value 0x14, is an invalid character. Line 5, position 7.
    Char(21): => error! '', hexadecimal value 0x15, is an invalid character. Line 5, position 7.
    Char(22): => error! '', hexadecimal value 0x16, is an invalid character. Line 5, position 7.
    Char(23): => error! '', hexadecimal value 0x17, is an invalid character. Line 5, position 7.
    Char(24): => error! '', hexadecimal value 0x18, is an invalid character. Line 5, position 7.
    Char(25): => error! '', hexadecimal value 0x19, is an invalid character. Line 5, position 7.
    Char(26): => error! '', hexadecimal value 0x1A, is an invalid character. Line 5, position 7.
    Char(27): => error! '', hexadecimal value 0x1B, is an invalid character. Line 5, position 7.
    Char(28): => error! '', hexadecimal value 0x1C, is an invalid character. Line 5, position 7.
    Char(29): => error! '', hexadecimal value 0x1D, is an invalid character. Line 5, position 7.
    Char(30): => error! '', hexadecimal value 0x1E, is an invalid character. Line 5, position 7.
    Char(31): => error! '', hexadecimal value 0x1F, is an invalid character. Line 5, position 7.
    Char(32): => fine!
    Char(33): ! => fine!
    Char(34): " => fine!
    Char(35): # => fine!
    Char(36): $ => fine!
    Char(37): % => fine!
    Char(38): & => error! An error occurred while parsing EntityName. Line 5, position 8.
    Char(39): ' => fine!
    Char(40): ( => fine!
    Char(41): ) => fine!
    Char(42): * => fine!
    Char(43): + => fine!
    Char(44): , => fine!
    Char(45): - => fine!
    Char(46): . => fine!
    Char(47): / => fine!
    Char(48): 0 => fine!
    Char(49): 1 => fine!
    Char(50): 2 => fine!
    Char(51): 3 => fine!
    Char(52): 4 => fine!
    Char(53): 5 => fine!
    Char(54): 6 => fine!
    Char(55): 7 => fine!
    Char(56): 8 => fine!
    Char(57): 9 => fine!
    Char(58): : => fine!
    Char(59): ; => fine!
    Char(60): < => error! The '<' character, hexadecimal value 0x3C, cannot be included in a name. Line 5, position 13.
    Char(61): = => fine!
    Char(62): > => fine!
    Char(63): ? => fine!
    Char(64): @ => fine!
    Char(65): A => fine!
    Char(66): B => fine!
    Char(67): C => fine!
    Char(68): D => fine!
    Char(69): E => fine!
    Char(70): F => fine!
    Char(71): G => fine!
    Char(72): H => fine!
    Char(73): I => fine!
    Char(74): J => fine!
    Char(75): K => fine!
    Char(76): L => fine!
    Char(77): M => fine!
    Char(78): N => fine!
    Char(79): O => fine!
    Char(80): P => fine!
    Char(81): Q => fine!
    Char(82): R => fine!
    Char(83): S => fine!
    Char(84): T => fine!
    Char(85): U => fine!
    Char(86): V => fine!
    Char(87): W => fine!
    Char(88): X => fine!
    Char(89): Y => fine!
    Char(90): Z => fine!
    Char(91): [ => fine!
    Char(92): \ => fine!
    Char(93): ] => fine!
    Char(94): ^ => fine!
    Char(95): _ => fine!
    Char(96): ` => fine!
    Char(97): a => fine!
    Char(98): b => fine!
    Char(99): c => fine!
    Char(100): d => fine!
    Char(101): e => fine!
    Char(102): f => fine!
    Char(103): g => fine!
    Char(104): h => fine!
    Char(105): i => fine!
    Char(106): j => fine!
    Char(107): k => fine!
    Char(108): l => fine!
    Char(109): m => fine!
    Char(110): n => fine!
    Char(111): o => fine!
    Char(112): p => fine!
    Char(113): q => fine!
    Char(114): r => fine!
    Char(115): s => fine!
    Char(116): t => fine!
    Char(117): u => fine!
    Char(118): v => fine!
    Char(119): w => fine!
    Char(120): x => fine!
    Char(121): y => fine!
    Char(122): z => fine!
    Char(123): { => fine!
    Char(124): | => fine!
    Char(125): } => fine!
    Char(126): ~ => fine!
    Char(127): => fine!

As you can see, many of these symbols cannot be used in XML. To strip them out, we can use Regex.Replace:

    static string ReplaceHexadecimalSymbols(string txt)
    {
        // Matches the control characters that failed above, plus '&' (0x26).
        string r = "[\x00-\x08\x0B\x0C\x0E-\x1F\x26]";
        return Regex.Replace(txt, r, "", RegexOptions.Compiled);
    }

P.S. Sorry if everyone already knows this.

The XML specification defines the valid characters like this:

 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] 

As you can see, #x12 is not a valid character in an XML document.
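To make the production above concrete, here is a minimal sketch that scans a string and reports the position of the first character outside the `Char` range. The class and method names (`XmlCharCheck`, `IndexOfInvalidXmlChar`) are made up for this illustration; the only library calls are `XmlConvert.IsXmlChar` and `XmlConvert.IsXmlSurrogatePair` from `System.Xml`, available in .NET Framework 4.0 and later.

```csharp
using System.Xml;

public static class XmlCharCheck
{
    // Hypothetical helper: returns the index of the first character that is
    // not allowed by the XML 1.0 Char production, or -1 if all are allowed.
    public static int IndexOfInvalidXmlChar(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];

            // Covers #x9, #xA, #xD, #x20-#xD7FF and #xE000-#xFFFD.
            if (XmlConvert.IsXmlChar(c))
            {
                continue;
            }

            // A high surrogate followed by a low surrogate encodes
            // #x10000-#x10FFFF; note the (lowChar, highChar) argument order.
            if (i + 1 < text.Length && XmlConvert.IsXmlSurrogatePair(text[i + 1], c))
            {
                i++; // skip the low surrogate as well
                continue;
            }

            return i;
        }
        return -1;
    }
}
```

For example, `XmlCharCheck.IndexOfInvalidXmlChar("ab\u0012c")` returns 2, the position of the 0x12 character.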

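You ask how to remove them, but I don't think that's the question you should be asking. They shouldn't be there in the first place. You should reject any document that isn't well-formed. Simply stripping the invalid characters can hide the real problem.

If you are the one creating the offending documents, fix the code that generates them so that it produces valid XML.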
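If you go the "reject it" route, one possible shape for that is sketched below. The names `XmlGate` and `TryLoad` are invented for this example; it relies on `XmlReaderSettings.CheckCharacters` (which is already `true` by default and is set explicitly here only for clarity) and on the `LineNumber` and `LinePosition` properties of `XmlException`.

```csharp
using System.IO;
using System.Xml;

public static class XmlGate
{
    // Hypothetical helper: loads the text as XML, or reports why it was
    // rejected instead of silently repairing it.
    public static bool TryLoad(string xmlText, out XmlDocument document, out string error)
    {
        document = null;
        error = null;

        var settings = new XmlReaderSettings { CheckCharacters = true };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xmlText), settings))
            {
                var doc = new XmlDocument();
                doc.Load(reader);
                document = doc;
                return true;
            }
        }
        catch (XmlException ex)
        {
            // Keep the position so the producer of the file can be told
            // exactly where the bad character is.
            error = "Line " + ex.LineNumber + ", position " + ex.LinePosition + ": " + ex.Message;
            return false;
        }
    }
}
```

The caller can then log `error` and skip the file, rather than loading a document whose content has been quietly altered.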

I think \x26 ("&") is a valid character, and XML deserialization handles it fine.
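A small sketch of why "&" doesn't need to be stripped: it is a legal character that only has to be escaped as `&amp;` in text content, and the XML APIs do that for you when you build the document through them. The `AmpersandDemo` class and `BuildNote` method are made up for this illustration; `XElement` is from `System.Xml.Linq`.

```csharp
using System.Xml.Linq;

public static class AmpersandDemo
{
    // Hypothetical helper: wraps arbitrary text in a <note> element.
    // XElement escapes '&', '<' and '>' in the text when serializing.
    public static string BuildNote(string body)
    {
        var note = new XElement("note", body);
        return note.ToString();
    }

    // Reads the text back; the parser turns &amp; into '&' again.
    public static string ReadNote(string xml)
    {
        return XElement.Parse(xml).Value;
    }
}
```

`AmpersandDemo.BuildNote("Tom & Jerry")` produces `<note>Tom &amp; Jerry</note>`, and `ReadNote` on that string gives back `Tom & Jerry`. The parse error in the test output above comes from a bare "&" that doesn't start an entity reference, not from the character itself being forbidden.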

So to strip the illegal characters, we should use:

    // Remove illegal characters from XML documents.
    // See http://www.w3.org/TR/xml/#charsets for reference.
    var regex = "[\x00-\x08\x0B\x0C\x0E-\x1F]";
    xml = Regex.Replace(xml, regex, String.Empty, RegexOptions.Compiled);

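This is basically a special case of this question. I'd suggest using one of the answers there.

Just update these functions with the fix jhon provided above, and call them wherever your code needs it. It will work for you; I've tested it.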

    private static void WriteDataTableToExcelWorksheet(DataTable dt, WorksheetPart worksheetPart)
    {
        var worksheet = worksheetPart.Worksheet;
        var sheetData = worksheet.GetFirstChild<SheetData>();

        string cellValue = "";

        // Create a Header Row in our Excel file, containing one header for each Column of data in our DataTable.
        //
        // We'll also create an array, showing which type each column of data is (Text or Numeric), so when we come to write the actual
        // cells of data, we'll know if to write Text values or Numeric cell values.
        int numberOfColumns = dt.Columns.Count;
        bool[] IsNumericColumn = new bool[numberOfColumns];

        string[] excelColumnNames = new string[numberOfColumns];
        for (int n = 0; n < numberOfColumns; n++)
            excelColumnNames[n] = GetExcelColumnName(n);

        //
        // Create the Header row in our Excel Worksheet
        //
        uint rowIndex = 1;

        var headerRow = new Row { RowIndex = rowIndex }; // add a row at the top of spreadsheet
        sheetData.Append(headerRow);

        for (int colInx = 0; colInx < numberOfColumns; colInx++)
        {
            DataColumn col = dt.Columns[colInx];
            AppendTextCell(excelColumnNames[colInx] + "1", col.ColumnName, headerRow);
            IsNumericColumn[colInx] = (col.DataType.FullName == "System.Decimal") || (col.DataType.FullName == "System.Int32");
        }

        //
        // Now, step through each row of data in our DataTable...
        //
        double cellNumericValue = 0;
        foreach (DataRow dr in dt.Rows)
        {
            // ...create a new row, and append a set of this row's data to it.
            ++rowIndex;
            var newExcelRow = new Row { RowIndex = rowIndex }; // add the next row to the spreadsheet
            sheetData.Append(newExcelRow);

            for (int colInx = 0; colInx < numberOfColumns; colInx++)
            {
                cellValue = dr.ItemArray[colInx].ToString();

                // Create cell with data
                if (IsNumericColumn[colInx])
                {
                    // For numeric cells, make sure our input data IS a number, then write it out to the Excel file.
                    // If this numeric value is NULL, then don't write anything to the Excel file.
                    cellNumericValue = 0;
                    if (double.TryParse(cellValue, out cellNumericValue))
                    {
                        cellValue = ReplaceHexadecimalSymbols(cellNumericValue.ToString());
                        AppendNumericCell(excelColumnNames[colInx] + rowIndex.ToString(), cellValue, newExcelRow);
                    }
                }
                else
                {
                    // For text cells, just write the input data straight out to the Excel file.
                    AppendTextCell(excelColumnNames[colInx] + rowIndex.ToString(), cellValue, newExcelRow);
                }
            }
        }
    }

    static string ReplaceHexadecimalSymbols(string txt)
    {
        string r = "[\x00-\x08\x0B\x0C\x0E-\x1F\x26]";
        return Regex.Replace(txt, r, "", RegexOptions.Compiled);
    }

    private static void AppendTextCell(string cellReference, string cellStringValue, Row excelRow)
    {
        // Add a new Excel Cell to our Row
        Cell cell = new Cell() { CellReference = cellReference, DataType = CellValues.String };
        CellValue cellValue = new CellValue();
        cellValue.Text = ReplaceHexadecimalSymbols(cellStringValue);
        cell.Append(cellValue);
        excelRow.Append(cell);
    }

    private static void AppendNumericCell(string cellReference, string cellStringValue, Row excelRow)
    {
        // Add a new Excel Cell to our Row
        Cell cell = new Cell() { CellReference = cellReference };
        CellValue cellValue = new CellValue();
        cellValue.Text = ReplaceHexadecimalSymbols(cellStringValue);
        cell.Append(cellValue);
        excelRow.Append(cell);
    }

Let me know if you need any further help.

The Regex solution runs very fast, even on a 100 MB XML document.

The following expression string does the job.

 "[\x00-\x08\x0B\x0C\x0E-\x1F]"
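One way to package that expression for repeated use over many files is sketched below. This is only an assumption about how you might wire it up: the `XmlSanitizer` class and `Clean` method are hypothetical names. The pattern is written as a verbatim string so the regex engine interprets the `\x` escapes itself, and the `Regex` instance is built once and reused instead of being looked up on every call.

```csharp
using System.Text.RegularExpressions;

public static class XmlSanitizer
{
    // Control characters that XML 1.0 forbids; tab (0x09), LF (0x0A)
    // and CR (0x0D) are deliberately left out of the class.
    private static readonly Regex InvalidControlChars =
        new Regex(@"[\x00-\x08\x0B\x0C\x0E-\x1F]", RegexOptions.Compiled);

    // Hypothetical helper: returns the input with those characters removed.
    public static string Clean(string xml)
    {
        return InvalidControlChars.Replace(xml, string.Empty);
    }
}
```

For example, `XmlSanitizer.Clean("<a>bad\u0012</a>")` returns `<a>bad</a>`, while tabs and line breaks are left untouched.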