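How can I access OpenXML content by page number?

Using OpenXML, can I read a document's content by page number?

`wordDocument.MainDocumentPart.Document.Body` gives the content of the whole document.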

    using System;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";

        // Open a WordprocessingDocument based on a filepath (read-only).
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;

            // Page count as stored in the extended properties; the part or the value may be missing.
            int pageCount = 0;
            string pagesText = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (pagesText != null)
            {
                pageCount = Convert.ToInt32(pagesText);
            }

            for (int i = 1; i <= pageCount; i++)
            {
                // Read the content by page number
            }
        }
    }

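MSDN reference

Update 1:

It looks like a page break is stored in the XML as a break element of type page, that is, `<w:br w:type="page" />`.

So now I need to split the XML at that marker and take the InnerText of each piece, which will give me the text of each page.

The question now becomes: how do I split the XML at that marker?

Update 2:

The page break element is only written where an explicit page break was inserted. If text simply flows from one page onto the next, no page break element exists in the XML, so I am back to the same challenge of identifying where one page ends and the next begins.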

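You cannot reference OOXML content by page number at the OOXML data level alone.

Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. They are computed by line-breaking and pagination algorithms that are implementation dependent; they are not an intrinsic property of the OOXML data. There is nothing to count.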
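
To illustrate the part that can be counted, here is a minimal sketch that counts explicit page breaks with LINQ to XML. It assumes `body` is the `w:body` element of the main document part, already loaded as an `XElement`; the class and method names (`HardPageBreaks`, `Count`) are made up for this example. It only looks at `w:br` elements with `w:type="page"` and ignores other ways of forcing a new page, such as section breaks or a paragraph's "page break before" setting.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class HardPageBreaks
{
    // WordprocessingML main namespace (the usual "w:" prefix).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts explicit page breaks, i.e. <w:br w:type="page"/> elements.
    // Plain line breaks (<w:br/> without a type) are not counted.
    public static int Count(XElement body)
    {
        return body.Descendants(W + "br")
                   .Count(br => (string)br.Attribute(W + "type") == "page");
    }
}
```

The result says nothing about the real page count: a document with no explicit breaks returns 0 however many pages it renders to.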

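What about `w:lastRenderedPageBreak`, which records where the soft page breaks fell the last time the document was rendered? No, `w:lastRenderedPageBreak` does not help in general, because:

By definition, a `w:lastRenderedPageBreak` position is stale once the content has been changed since it was last opened by a program that paginates its content.
In MS Word's implementation, `w:lastRenderedPageBreak` is unreliable in various circumstances, including
1. when a table spans two pages
2. when the next page starts with an empty paragraph
3. for multi-column layouts with text boxes starting a new column
4. for large images or long sequences of blank lines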
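
If an approximate, possibly stale split is acceptable despite these caveats, the markers can still be read. The following is only a sketch under that assumption: it walks the body in document order and starts a new chunk at every `w:lastRenderedPageBreak`. The names `RenderedPages` and `SplitText` are hypothetical, `body` is assumed to be the `w:body` element loaded as an `XElement`, and paragraph boundaries are not preserved in the output.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class RenderedPages
{
    // WordprocessingML main namespace (the usual "w:" prefix).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Splits the body text at each <w:lastRenderedPageBreak/> marker.
    // The chunks reflect the last saved rendering, not the current layout.
    public static List<string> SplitText(XElement body)
    {
        var pages = new List<string>();
        var current = new StringBuilder();

        // Descendants() yields elements in document order.
        foreach (XElement el in body.Descendants())
        {
            if (el.Name == W + "lastRenderedPageBreak")
            {
                // Close the chunk collected so far and start a new one.
                pages.Add(current.ToString());
                current.Clear();
            }
            else if (el.Name == W + "t")
            {
                current.Append(el.Value);
            }
        }

        pages.Add(current.ToString()); // text after the last marker
        return pages;
    }
}
```

A document that was never rendered by a paginating program contains no markers, in which case the whole text comes back as a single chunk.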

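If you are willing to accept a dependence on Word Automation, with its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numbers, page counts, and so on.

Otherwise, the only real answer is to move beyond page-based referencing frameworks, which depend on proprietary, implementation-specific pagination algorithms.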

This is how I ended up doing it.

    using System;
    using System.Collections.Generic;
    using System.Text;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";

        // Page number -> text collected for that page.
        Dictionary<int, string> pageviseContent = new Dictionary<int, string>();
        int pageCount = 0;

        // Open a WordprocessingDocument based on a filepath (read-only).
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;

            string pagesText = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (pagesText != null)
            {
                pageCount = Convert.ToInt32(pagesText);
            }

            int i = 1;
            StringBuilder pageContentBuilder = new StringBuilder();
            foreach (var element in body.ChildElements)
            {
                // Assumption: the search string is the hard page break marker
                // described in the question (<w:br w:type="page" />).
                if (element.InnerXml.IndexOf("<w:br w:type=\"page\"", StringComparison.OrdinalIgnoreCase) < 0)
                {
                    pageContentBuilder.Append(element.InnerText);
                }
                else
                {
                    // This element holds a page break: close the current page.
                    // Any text inside the same element is skipped.
                    pageviseContent.Add(i, pageContentBuilder.ToString());
                    i++;
                    pageContentBuilder = new StringBuilder();
                }
            }

            // Store whatever follows the last page break.
            if (pageContentBuilder.Length > 0)
            {
                pageviseContent.Add(i, pageContentBuilder.ToString());
            }
        }
    }

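Drawback: this does not work in all cases. It works only where there is an explicit page break; if text runs on from page 1 to page 2, there is no marker to tell you that you are on the second page.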

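    // Assumption: the generic type arguments are Paragraph and LastRenderedPageBreak,
    // inferred from the discussion of w:lastRenderedPageBreak above.
    List<Paragraph> Allparagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();

    // Paragraphs that contain exactly one w:lastRenderedPageBreak marker.
    List<Paragraph> PageParagraphs = Allparagraphs
        .Where(x => x.Descendants<LastRenderedPageBreak>().Count() == 1)
        .ToList();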
