How to read a PDF file line by line in C#/.NET with iTextSharp while keeping the (actual) whitespace

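I am using iText (for .NET) to read a PDF file. It reads the document, but wherever there is a run of whitespace it returns only a single space.

This makes it impossible to extract the data by taking substrings. I want to read the data line by line with its whitespace intact, so that I know the actual position of the text, because I want to write the data to a database.

The file is a bank statement, which I want to dump into a database in order to build a reconciliation system.

Here is a screenshot of the file:

Below is the code I am using: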

    For page As Integer = 1 To pdfReader.NumberOfPages
        ' Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
        Dim Strategy As ITextExtractionStrategy = New iTextSharp.text.pdf.parser.LocationTextExtractionStrategy()
        Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)
        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.[Default], Encoding.UTF8, Encoding.[Default].GetBytes(currentText)))

        ' Split the page text into lines
        Dim delimiterChars As Char() = {ControlChars.Lf}
        Dim lines As String() = currentText.Split(delimiterChars)

        ' Flags tracking which field is expected next
        Dim Bnk_Name As Boolean = True
        Dim Br_Name As Boolean = False
        Dim Name_acc As Boolean = False
        Dim statment As Boolean = False
        Dim Curr As Boolean = False
        Dim Open As Boolean = False

        ' Extracted values
        Dim BankName = ""
        Dim Branch = ""
        Dim AccountNo = ""
        Dim CompName = ""
        Dim Currency = ""
        Dim Statement_from = ""
        Dim Statement_to = ""
        Dim Opening_Balance = ""
        Dim Closing_Balance = ""
        Dim Narration As String = ""

        For Each line As String In lines
            line.Trim() ' the result is discarded; Trim() does not modify line

            'BANK NAME
            If Bnk_Name Then
                If line.Trim() <> "" Then
                    BankName = line.Substring(0, 21)
                    Bnk_Name = False
                Else
                    Bnk_Name = False
                End If
            End If

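This picture shows an example of how the code reads the file:

But I want the whitespace to be read where it is actually positioned.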

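(Without having seen your PDF, this explanation is the best I can come up with.)

Your document does not contain any space characters. That is, the content stream of the document contains no spaces. Instead, the instructions that render the characters simply position them so that the gaps appear where spaces need to be.

In that case iText has to "guess" where the spaces are. Whenever two characters are farther apart than the width of the space character in the font being used, it estimates that one space should be inserted.

Possibly this is where things go wrong.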
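To make the effect concrete, here is a minimal sketch of such a single-space heuristic. It is not iText's actual code: the `PositionedChunk` type, the `GapHeuristic` class and the `gapThreshold` parameter are hypothetical names used for illustration only, and the real library derives its threshold from the font's space width in its own way.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for one positioned piece of text on a line.
public class PositionedChunk
{
    public string Text;
    public float StartX; // left edge, in user space units
    public float EndX;   // right edge, in user space units

    public PositionedChunk(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class GapHeuristic
{
    // Joins chunks that are already sorted left to right.
    // Any gap wider than gapThreshold becomes exactly one space,
    // no matter how wide the gap really is.
    public static string JoinLine(IEnumerable<PositionedChunk> chunks, float gapThreshold)
    {
        var sb = new StringBuilder();
        PositionedChunk previous = null;
        foreach (PositionedChunk chunk in chunks)
        {
            if (previous != null && chunk.StartX - previous.EndX > gapThreshold)
            {
                sb.Append(' ');
            }
            sb.Append(chunk.Text);
            previous = chunk;
        }
        return sb.ToString();
    }
}
```

With chunks "12/03/2015" (x 30 to 80), "DEPOSIT" (x 200 to 240) and "1,500.00" (x 243 to 280) and a threshold of 2.5, the 120-unit gap and the 3-unit gap each produce a single space, so the column layout of the statement is lost in the resulting string.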

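Equally important, you should not rely on text positions to extract data. That approach is too error-prone.

Try using regular expressions combined with a better ITextExtractionStrategy. There is an ITextExtractionStrategy implementation that lets you specify a Rectangle. If you do that, you can fetch content from the document much more precisely.

Since you are dealing with bank statements, a combination of rectangle-based searching and regular expressions (for example, looking for something that matches a bank account number) should make it easy to extract the content.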
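A rough sketch of that combination with iTextSharp 5 could look like the following. The rectangle coordinates and the account-number pattern are pure assumptions (I have not seen the statement), and `StatementRegionReader` with its two methods is a hypothetical helper, so treat this as a starting point rather than a finished solution.

```csharp
using System;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class StatementRegionReader
{
    // Assumed pattern: a standalone run of 10 to 16 digits. Adjust to the bank's format.
    static readonly Regex AccountPattern = new Regex(@"\b\d{10,16}\b");

    // Returns the first account-number-like match in the text, or null if there is none.
    public static string MatchAccountNumber(string text)
    {
        Match m = AccountPattern.Match(text);
        return m.Success ? m.Value : null;
    }

    // Extracts only the text inside one region of a page, then applies the regex.
    public static string FindAccountNumber(string pdfPath, int pageNumber)
    {
        // Assumed region: lower-left x/y and upper-right x/y in user space units.
        var region = new Rectangle(30f, 700f, 300f, 780f);

        var reader = new PdfReader(pdfPath);
        try
        {
            // Only text chunks inside the region reach the wrapped strategy.
            var filter = new RegionTextRenderFilter(region);
            ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                new LocationTextExtractionStrategy(), filter);

            string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
            return MatchAccountNumber(text);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Restricting the extraction to a region keeps unrelated text out of the string, and the regular expression then does not depend on how many spaces were guessed between the fields.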

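You use the LocationTextExtractionStrategy. As @Joris has already answered, this strategy adds at most one space character for a horizontal gap. You, on the other hand, want an amount of whitespace for each gap such that the result represents the horizontal layout of the text lines in the PDF.

In this answer I once outlined how to build such a text extraction strategy. Since that answer was for iText / Java and the LocationTextExtractionStrategy has changed considerably since then, though, I do not consider the current question a duplicate.

Using reflection instead of copying the class, a C# adaptation of the idea from that old answer to the current iTextSharp LocationTextExtractionStrategy would look like this: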

    using System;
    using System.Collections.Generic;
    using System.Reflection;
    using System.Text;
    using iTextSharp.text.pdf.parser;

    class LayoutTextExtractionStrategy : LocationTextExtractionStrategy
    {
        public LayoutTextExtractionStrategy(float fixedCharWidth)
        {
            this.fixedCharWidth = fixedCharWidth;
        }

        // Non-public members of the base class, accessed via reflection
        MethodInfo DumpStateMethod = typeof(LocationTextExtractionStrategy).GetMethod("DumpState", BindingFlags.NonPublic | BindingFlags.Instance);
        MethodInfo FilterTextChunksMethod = typeof(LocationTextExtractionStrategy).GetMethod("filterTextChunks", BindingFlags.NonPublic | BindingFlags.Instance);
        FieldInfo LocationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", BindingFlags.NonPublic | BindingFlags.Instance);

        public override string GetResultantText(ITextChunkFilter chunkFilter)
        {
            if (DUMP_STATE)
            {
                //DumpState();
                DumpStateMethod.Invoke(this, null);
            }

            // List<TextChunk> filteredTextChunks = filterTextChunks(locationalResult, chunkFilter);
            object locationalResult = LocationalResultField.GetValue(this);
            List<TextChunk> filteredTextChunks = (List<TextChunk>)FilterTextChunksMethod.Invoke(this, new object[] { locationalResult, chunkFilter });
            filteredTextChunks.Sort();

            int startOfLinePosition = 0;
            StringBuilder sb = new StringBuilder();
            TextChunk lastChunk = null;
            foreach (TextChunk chunk in filteredTextChunks)
            {
                if (lastChunk == null)
                {
                    InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, false);
                    sb.Append(chunk.Text);
                }
                else
                {
                    if (chunk.SameLine(lastChunk))
                    {
                        // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
                        if (IsChunkAtWordBoundary(chunk, lastChunk)/* && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text)*/)
                        {
                            //sb.Append(' ');
                            InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text));
                        }

                        sb.Append(chunk.Text);
                    }
                    else
                    {
                        // New line: remember where it starts in the result string
                        sb.Append('\n');
                        startOfLinePosition = sb.Length;
                        InsertSpaces(sb, startOfLinePosition, chunk.Location.DistParallelStart, false);
                        sb.Append(chunk.Text);
                    }
                }
                lastChunk = chunk;
            }

            return sb.ToString();
        }

        private bool StartsWithSpace(String str)
        {
            if (str.Length == 0) return false;
            return str[0] == ' ';
        }

        private bool EndsWithSpace(String str)
        {
            if (str.Length == 0) return false;
            return str[str.Length - 1] == ' ';
        }

        // Pads the current line with spaces so the next chunk starts at the
        // character column that corresponds to its horizontal page position
        void InsertSpaces(StringBuilder sb, int startOfLinePosition, float chunkStart, bool spaceRequired)
        {
            int indexNow = sb.Length - startOfLinePosition;
            int indexToBe = (int)((chunkStart - pageLeft) / fixedCharWidth);
            int spacesToInsert = indexToBe - indexNow;
            if (spacesToInsert < 1 && spaceRequired)
                spacesToInsert = 1;
            for (; spacesToInsert > 0; spacesToInsert--)
            {
                sb.Append(' ');
            }
        }

        public float pageLeft = 0;
        public float fixedCharWidth = 6;
    }

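As you can see, it requires a float constructor argument fixedCharWidth. This parameter represents the width on the PDF page that one character in the result string should correspond to. It is given in PDF default user space units (such a unit usually is 1/72 inch). In the case of the catalogue PDF that the aforementioned question was about (very small font sizes), a value of 3 was appropriate; for most common PDFs, which use larger font sizes, a value of 6 seems appropriate.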
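To get a feeling for what fixedCharWidth does, the arithmetic of InsertSpaces can be isolated as in the sketch below. The `LayoutColumns` class and its method names are hypothetical and exist only for this illustration; the formulas are the ones used in the strategy above.

```csharp
using System;

public static class LayoutColumns
{
    // Character column that a chunk starting at chunkStart (user space units) maps to.
    public static int ColumnFor(float chunkStart, float pageLeft, float fixedCharWidth)
    {
        return (int)((chunkStart - pageLeft) / fixedCharWidth);
    }

    // Number of spaces appended before the chunk, given how many characters
    // the current line already holds. Never negative; at least one space
    // if spaceRequired is set.
    public static int SpacesToInsert(int charsOnLineSoFar, int targetColumn, bool spaceRequired)
    {
        int spaces = targetColumn - charsOnLineSoFar;
        if (spaces < 1 && spaceRequired)
        {
            spaces = 1;
        }
        return Math.Max(spaces, 0);
    }
}
```

For a chunk starting 72 units (one inch) from the left edge, a fixedCharWidth of 6 maps it to column 12, while a value of 3 maps it to column 24. A chunk starting at 300 units with a width of 6 lands in column 50, so if the line already holds 22 characters, 28 spaces are inserted. If the value is chosen too large for the font, text overruns its column and only the single required space separates the chunks. To apply the strategy itself, an instance such as new LayoutTextExtractionStrategy(6f) can be passed as the third argument of PdfTextExtractor.GetTextFromPage.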
