Parsing an MS Word-generated XML file in C#

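So I have a client (this could only come from the government) with a bunch of MS Word documents they want entered into a database. Short of manual entry, I feel that converting them to XML and parsing them with a utility would be the best course of action.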

I have a utility that does this, using code found on stackoverflow:

    Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
    object oMissing = System.Reflection.Missing.Value;

    DirectoryInfo dirInfo = new DirectoryInfo(Server.MapPath("\\testfiles"));
    FileInfo[] wordFiles = dirInfo.GetFiles("*.doc");

    // Run Word invisibly while converting.
    word.Visible = false;
    word.ScreenUpdating = false;

    XmlDocument xmlDoc = new XmlDocument();
    foreach (FileInfo wordFile in wordFiles)
    {
        Object filename = (Object)wordFile.FullName;
        Document doc = word.Documents.Open(ref filename, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing);
        doc.Activate();

        // Save a copy of the document next to the original, as XML.
        object outputFileName = wordFile.FullName.Replace(".doc", ".xml");
        object fileFormat = WdSaveFormat.wdFormatXML;
        doc.SaveAs(ref outputFileName, ref fileFormat, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing, ref oMissing);

        object saveChanges = WdSaveOptions.wdDoNotSaveChanges;
        ((_Document)doc).Close(ref saveChanges, ref oMissing, ref oMissing);
        doc = null;

        // Load the generated XML and select the text, paragraph and tab nodes.
        xmlDoc.Load(outputFileName.ToString());
        XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDoc.NameTable);
        nsmgr.AddNamespace("w", "http://schemas.microsoft.com/office/word/2003/wordml");
        XmlNodeList node = xmlDoc.SelectNodes("//w:document/descendant::w:t|//w:document/descendant::w:p|//w:document/descendant::w:tab", nsmgr);
    }

    ((_Application)word).Quit(ref oMissing, ref oMissing, ref oMissing);
    word = null;

Now, my XML file looks like this:

      ...   ...   ...   ...  ... ...                      blah blah blach this is sample text           More sample text         Sample Header            Sample Body text.......

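I'm no expert, but I think I'm following the letter of the law here by declaring the namespace manager properly, so why do I get an empty result for the nodes I'm trying to select?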

 XmlNodeList node = xmlDoc.SelectNodes("//w:document/descendant::w:t|//w:document/descendant::w:p|//w:document/descendant::w:tab", nsmgr);

Am I missing something?

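It looks like you have the wrong node name in your XPath expression. A file saved with wdFormatXML uses the Word 2003 XML format, whose root element is w:wordDocument, so //w:document matches nothing. Replace all occurrences of w:document with w:wordDocument. So it should be: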

 XmlNodeList node = xmlDoc.SelectNodes("//w:wordDocument/descendant::w:t|//w:wordDocument/descendant::w:p|//w:wordDocument/descendant::w:tab", nsmgr);
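
To see the difference without running Word, here is a minimal sketch that runs both XPath expressions against a tiny hand-written document. The sample XML, the class name WordMlXPathDemo and the Select helper are made up for illustration; a real file saved by Word contains far more markup, but the root element name and namespace are the parts that matter here.

```csharp
using System;
using System.Xml;

public static class WordMlXPathDemo
{
    // WordprocessingML 2003 namespace, the same URI the question registers.
    private const string WordMlNs = "http://schemas.microsoft.com/office/word/2003/wordml";

    // Hypothetical stand-in for a file saved as wdFormatXML (heavily simplified).
    public const string SampleXml =
        "<w:wordDocument xmlns:w=\"" + WordMlNs + "\">" +
        "<w:body>" +
        "<w:p><w:r><w:t>Sample Header</w:t></w:r></w:p>" +
        "<w:p><w:r><w:tab/><w:t>Sample Body text</w:t></w:r></w:p>" +
        "</w:body>" +
        "</w:wordDocument>";

    // Builds the question's XPath with the given root element name and runs it.
    public static XmlNodeList Select(string xml, string rootName)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("w", WordMlNs);

        string root = "//w:" + rootName;
        string xpath = root + "/descendant::w:t|" +
                       root + "/descendant::w:p|" +
                       root + "/descendant::w:tab";
        return doc.SelectNodes(xpath, nsmgr);
    }

    public static void Main()
    {
        // No element is named w:document in this format, so nothing matches.
        Console.WriteLine(Select(SampleXml, "document").Count);     // prints 0
        // With the real root name: 2 w:p + 2 w:t + 1 w:tab.
        Console.WriteLine(Select(SampleXml, "wordDocument").Count); // prints 5
    }
}
```

If the corrected expression still returns nothing on a real file, checking xmlDoc.DocumentElement.Name and xmlDoc.DocumentElement.NamespaceURI shows the root element and namespace the file actually uses.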
