iTextSharp text extraction

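I'm using iTextSharp in VB.NET to get the text content of a PDF file. The solution works for some files but not for others, even quite simple ones. The problem is that the token's string value comes back as null (a run of empty square boxes).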

    token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)
    While token.NextToken()
        tknType = token.TokenType()
        tknValue = token.StringValue
    End While

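I can verify the length of the content, but I cannot get the actual string content.

I realized that this depends on the font used in the PDF. If I create a PDF with Acrobat or PdfCreator using Courier (which, by the way, is the default font in my Visual Studio editor), I can get all the text content. If the same PDF is built with a different font, I get the empty square boxes.

So the question is: how can I extract the text regardless of the font setting?

Thanks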

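To supplement Mark's answer, which helped me a lot: the namespaces and classes in the iTextSharp implementation differ slightly from the Java version.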

    // Requires: using System.IO; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    public static string GetTextFromAllPages(String pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        StringWriter output = new StringWriter();

        // Page numbers are 1-based.
        for (int i = 1; i <= reader.NumberOfPages; i++)
            output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));

        reader.Close();
        return output.ToString();
    }

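Take a look at PdfTextExtractor.

    String pageText = PdfTextExtractor.getTextFromPage(myReader, pageNum);

or

    String pageText = PdfTextExtractor.getTextFromPage(myReader, pageNum, new LocationTextExtractionStrategy());

Both require a fairly recent version of iText[Sharp]. Parsing the content stream yourself is just reinventing the wheel. Spare yourself some pain and let iText do it for you.

PdfTextExtractor will handle all the different font/encoding issues for you... all the ones that can be handled, anyway. If you can't copy/paste accurately from Reader, then the PDF does not contain enough information to recover character information from the content stream.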
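Since the question is written in VB.NET, the same approach could be sketched there roughly as follows. This is only a sketch: it assumes an iTextSharp 5.x release in which PdfTextExtractor lives in the iTextSharp.text.pdf.parser namespace, and the module name PdfTextDemo and function name GetTextFromAllPages are made up for illustration.

```vbnet
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

' Hypothetical module name, used only for this illustration.
Module PdfTextDemo

    ' Returns the text of every page, one page after another.
    ' Assumes iTextSharp 5.x (PdfTextExtractor in iTextSharp.text.pdf.parser).
    Function GetTextFromAllPages(pdfPath As String) As String
        Dim reader As New PdfReader(pdfPath)
        Dim output As New StringBuilder()
        Try
            ' Page numbers are 1-based.
            For pageNum As Integer = 1 To reader.NumberOfPages
                ' A fresh strategy per page, so text from one page
                ' does not carry over into the next.
                Dim strategy As New LocationTextExtractionStrategy()
                output.AppendLine(PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy))
            Next
        Finally
            reader.Close()
        End Try
        Return output.ToString()
    End Function

End Module
```

This replaces the PRTokeniser loop from the question, so the font and encoding lookups are left to the library rather than done by hand.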

Here is a variant that uses iTextSharp.text.pdf.PdfName.ANNOTS and iTextSharp.text.pdf.PdfName.CONTENTS, in case anyone needs it.

    string strFile = @"C:\my\path\tothefile.pdf";
    iTextSharp.text.pdf.PdfReader pdfRida = new iTextSharp.text.pdf.PdfReader(strFile);
    int pageFrom = 1;
    int pageTo = pdfRida.NumberOfPages;

    for (int i = pageFrom; i <= pageTo; i++)
    {
        iTextSharp.text.pdf.PdfDictionary cpage = pdfRida.GetPageN(i);

        // The page's annotation array; null when the page has no annotations.
        iTextSharp.text.pdf.PdfArray cannots = cpage.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

        if (cannots != null)
            foreach (iTextSharp.text.pdf.PdfObject oAnnot in cannots.ArrayList)
            {
                // Resolve the indirect reference to the annotation dictionary.
                iTextSharp.text.pdf.PdfDictionary cAnnotationDictionary = (iTextSharp.text.pdf.PdfDictionary)pdfRida.GetPdfObject(((iTextSharp.text.pdf.PRIndirectReference)oAnnot).Number);

                // The /Contents entry holds the annotation's text, if any.
                iTextSharp.text.pdf.PdfObject contents = cAnnotationDictionary.Get(iTextSharp.text.pdf.PdfName.CONTENTS);
                if (contents != null && contents.GetType() == typeof(iTextSharp.text.pdf.PdfString))
                {
                    string cStringVal = ((iTextSharp.text.pdf.PdfString)contents).ToString();
                    if (cStringVal.ToUpper().Contains("LOS 8"))
                    {
                        // DO SOMETHING FUN
                    }
                }
            }
    }

    pdfRida.Close();
