iTextSharp – How to get the position of words on a page

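I am using iTextSharp and the reader.GetPageContent method to extract text from a PDF. I need to find the rectangle/position of every word found in the document. Is there a way to get the rectangle/position of the words in a PDF using iTextSharp?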

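Yes, there is. Check out the text.pdf.parser package, specifically LocationTextExtractionStrategy. Actually, that might not do the job either. You will probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor: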

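    import com.itextpdf.text.pdf.parser.*;

    MyTexExStrat strat = new MyTexExStrat();
    PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
    // get the strings-n-rects from strat.

    public class MyTexExStrat implements TextExtractionStrategy {
        public void beginTextBlock() {}
        public void endTextBlock() {}
        public void renderImage(ImageRenderInfo info) {}
        public void renderText(TextRenderInfo info) {
            // track text and location here.
        }
        // Required by the interface; return "" if you only need positions.
        public String getResultantText() { return ""; }
    }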

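You may want to look at the source of LocationTextExtractionStrategy to see how it combines text that shares a baseline. You could even modify LTES to store parallel arrays of strings and rects.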
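As a rough sketch of that idea (this is not the actual LTES code), the grouping step could look something like the following. Chunk and WordGrouper are hypothetical names, not iText types: a Chunk stands in for whatever text and rectangle you record inside renderText. The sketch assumes chunks arrive in reading order, contain no inner whitespace, and that the bottom edge of the rectangle is a good enough proxy for the baseline.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one piece of rendered text and its rectangle.
// In a real strategy you would fill this in from TextRenderInfo.
final class Chunk {
    final String text;
    final float left, bottom, right, top;

    Chunk(String text, float left, float bottom, float right, float top) {
        this.text = text;
        this.left = left;
        this.bottom = bottom;
        this.right = right;
        this.top = top;
    }
}

final class WordGrouper {
    // Merges neighbouring chunks into words. Two chunks are joined when their
    // bottom edges differ by at most baselineTolerance and the horizontal gap
    // between them is at most maxGap. A whitespace-only chunk ends a word.
    static List<Chunk> group(List<Chunk> chunks, float baselineTolerance, float maxGap) {
        List<Chunk> words = new ArrayList<>();
        Chunk current = null;
        for (Chunk c : chunks) {
            if (c.text.trim().isEmpty()) {
                if (current != null) {
                    words.add(current);
                    current = null;
                }
                continue;
            }
            boolean sameLine = current != null
                    && Math.abs(c.bottom - current.bottom) <= baselineTolerance;
            boolean close = current != null && (c.left - current.right) <= maxGap;
            if (sameLine && close) {
                // Extend the current word and grow its rectangle.
                current = new Chunk(current.text + c.text,
                        current.left,
                        Math.min(current.bottom, c.bottom),
                        Math.max(current.right, c.right),
                        Math.max(current.top, c.top));
            } else {
                if (current != null) {
                    words.add(current);
                }
                current = c;
            }
        }
        if (current != null) {
            words.add(current);
        }
        return words;
    }
}
```

The tolerance and gap values are something you would have to tune per document; a fraction of the font size is a plausible starting point, but that is a guess on my part.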

PS: To build the rects, you can get the AscentLine and DescentLine and use their coordinates as the top and bottom corners:

    Vector bottomLeft = info.getDescentLine().getStartPoint();
    Vector topRight = info.getAscentLine().getEndPoint();
    Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                                   bottomLeft.get(Vector.I2),
                                   topRight.get(Vector.I1),
                                   topRight.get(Vector.I2));

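Warning: The code above assumes the text is horizontal and runs left to right. Rotated text will mess it up, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications the above should be fine, but know its limitations.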
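If rotated text matters to you, one possible workaround is to take all four endpoints (start and end of both the descent line and the ascent line) and use their min/max instead of trusting that "descent start" is the bottom-left corner. Below is a minimal sketch in plain Java; BoundingBox and enclose are hypothetical helpers, and the float pairs stand in for the x/y values you would read from the Vector objects. Note that it yields an axis-aligned box, which will be larger than the glyphs for text at an odd angle, and it does nothing about right-to-left ordering.

```java
final class BoundingBox {
    // Returns {minX, minY, maxX, maxY} for the given (x, y) points.
    // Intended input: the start and end points of the descent line and of the
    // ascent line, in any order.
    static float[] enclose(float[][] corners) {
        float minX = Float.POSITIVE_INFINITY;
        float minY = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY;
        float maxY = Float.NEGATIVE_INFINITY;
        for (float[] p : corners) {
            minX = Math.min(minX, p[0]);
            minY = Math.min(minY, p[1]);
            maxX = Math.max(maxX, p[0]);
            maxY = Math.max(maxY, p[1]);
        }
        return new float[] {minX, minY, maxX, maxY};
    }
}
```

For example, for text rotated 90 degrees with a descent line running from (10, 0) to (10, 20) and an ascent line from (0, 0) to (0, 20), the two-corner approach would give a left edge of 10 and a right edge of 0, while enclose returns a box from (0, 0) to (10, 20).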

Good hunting.
