Parsing MS Word-generated XML files in C#

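So I have a client (this could only come from the government) with a bunch of MS Word documents they want entered into a database. Short of manual entry, I figured converting them to XML and parsing them with a utility would be the best course of action.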

I have a utility that does this, using code I found on Stack Overflow:

    Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
    object oMissing = System.Reflection.Missing.Value;
    DirectoryInfo dirInfo = new DirectoryInfo(Server.MapPath("\\testfiles"));
    FileInfo[] wordFiles = dirInfo.GetFiles("*.doc");
    word.Visible = false;
    word.ScreenUpdating = false;
    XmlDocument xmlDoc = new XmlDocument();

    foreach (FileInfo wordFile in wordFiles)
    {
        Object filename = (Object)wordFile.FullName;
        Document doc = word.Documents.Open(ref filename,
            ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing,
            ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing,
            ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing);
        doc.Activate();

        object outputFileName = wordFile.FullName.Replace(".doc", ".xml");
        object fileFormat = WdSaveFormat.wdFormatXML;
        doc.SaveAs(ref outputFileName, ref fileFormat,
            ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing,
            ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing,
            ref oMissing, ref oMissing, ref oMissing, ref oMissing);

        object saveChanges = WdSaveOptions.wdDoNotSaveChanges;
        ((_Document)doc).Close(ref saveChanges, ref oMissing, ref oMissing);
        doc = null;

        xmlDoc.Load(outputFileName.ToString());
        XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDoc.NameTable);
        nsmgr.AddNamespace("w", "http://schemas.microsoft.com/office/word/2003/wordml");
        XmlNodeList node = xmlDoc.SelectNodes("//w:document/descendant::w:t|//w:document/descendant::w:p|//w:document/descendant::w:tab", nsmgr);
    }

    ((_Application)word).Quit(ref oMissing, ref oMissing, ref oMissing);
    word = null;

Now, my XML file looks like this:

    ... ... ... ... ... ...
    blah blah blach this is sample text
    More sample text
    Sample Header
    Sample Body text.......

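I'm no pro, but I thought that by declaring the namespace manager correctly I was following the letter of the law here. So why do I get an empty result for the nodes I'm trying to select?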

    XmlNodeList node = xmlDoc.SelectNodes("//w:document/descendant::w:t|//w:document/descendant::w:p|//w:document/descendant::w:tab", nsmgr);

Am I missing something?

It looks to me like your XPath expression has the wrong node name. Replace every occurrence of w:document with w:wordDocument. So it should be:

    XmlNodeList node = xmlDoc.SelectNodes("//w:wordDocument/descendant::w:t|//w:wordDocument/descendant::w:p|//w:wordDocument/descendant::w:tab", nsmgr);
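
If you want to check this without running Word, here is a minimal sketch that runs both versions of the query against an in-memory document. The `SampleXml` string is a hypothetical, heavily trimmed stand-in for what Word 2003 saves (I am assuming the root element is `w:wordDocument`, as above), and `WordMlXPathDemo` and `CountNodes` are names made up for illustration. The point is that a wrong element name does not throw; `SelectNodes` just returns an empty list.

```csharp
using System;
using System.Xml;

public static class WordMlXPathDemo
{
    // WordprocessingML 2003 namespace, copied from the question.
    private const string WordMlNs = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Hypothetical, trimmed-down sample; a real file contains far more markup.
    private const string SampleXml =
        "<w:wordDocument xmlns:w=\"" + WordMlNs + "\">" +
        "<w:body>" +
        "<w:p><w:r><w:t>Sample Header</w:t></w:r></w:p>" +
        "<w:p><w:r><w:tab/><w:t>Sample Body text</w:t></w:r></w:p>" +
        "</w:body>" +
        "</w:wordDocument>";

    // Counts the w:t, w:p and w:tab descendants of the element named rootName.
    public static int CountNodes(string rootName)
    {
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(SampleXml);

        XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDoc.NameTable);
        nsmgr.AddNamespace("w", WordMlNs);

        // Same union query as in the question, with the root name made a parameter.
        string xpath =
            "//w:" + rootName + "/descendant::w:t|" +
            "//w:" + rootName + "/descendant::w:p|" +
            "//w:" + rootName + "/descendant::w:tab";

        return xmlDoc.SelectNodes(xpath, nsmgr).Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountNodes("document"));     // prints 0: no such element, so nothing matches
        Console.WriteLine(CountNodes("wordDocument")); // prints 5: two w:p, two w:t, one w:tab
    }
}
```

So the namespace manager in the question was set up fine; only the element name in the path was off.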