如何使用iTextSharp从PDF中正确提取下标/上标?

iTextSharp很好地从PDF文档中提取纯文本,但是我遇到了下标/上标文本的问题,这在技术文档中很常见。

TextChunk.SameLine()需要两个块具有相同的垂直定位才能“在”同一行上,而上标或下标文本不是这种情况。 例如,在本文档的第11页上,在“燃烧效率”下:

http://www.mass.gov/courts/docs/lawlib/300-399cmr/310cmr7.pdf

预期案文:

 monoxide (CO) in flue gas in accordance with the following formula: CE = [CO2 /(CO + CO2)] 

结果文字:

 monoxide (CO) in flue gas in accordance with the following formula: CE = [CO /(CO + CO )] 2 2 

我将SameLine()移动到LocationTextExtractionStrategy并为它读取的私有TextChunk属性创建了公共getter。 这允许我在我自己的子类中动态调整容差,如下所示:

 public class SubSuperStrategy : LocationTextExtractionStrategy { public int SameLineOrientationTolerance { get; set; } public int SameLineDistanceTolerance { get; set; } public override bool SameLine(TextChunk chunk1, TextChunk chunk2) { var orientationDelta = Math.Abs(chunk1.OrientationMagnitude - chunk2.OrientationMagnitude); if(orientationDelta > SameLineOrientationTolerance) return false; var distDelta = Math.Abs(chunk1.DistPerpendicular - chunk2.DistPerpendicular); return (distDelta <= SameLineDistanceTolerance); } } 

使用SameLineDistanceTolerance3 ,这将更正子/超级块分配给哪一 ,但文本的相对位置是关闭的:

 monoxide (CO) in flue gas in accordance with the following formula: CE = [CO /(CO + CO )] 2 2 

有时块会插入文本中间的某个位置,有时(如本示例所示)最后插入。 无论哪种方式,他们都不会在正确的地方结束。 我怀疑这可能与字体大小有关,但我很难理解这段代码的大小。

有没有人找到另一种方法来解决这个问题?

(如果有帮助,我很乐意提交我的更改请求。)

要正确地提取这些下标和上标,需要一种不同的方法来检查两个文本块是否在同一行。 以下类代表一种这样的方法。

我更喜欢Java / iText; 因此,我首先在Java中实现了这种方法,然后才将其转换为C#/ iTextSharp。

使用Java和iText的方法

我正在使用当前开发分支iText 5.5.8-SNAPSHOT。

一种识别线条的方法

假设文本行是水平的并且不同行上的字形的边界框的垂直延伸不重叠,可以尝试使用RenderListener来识别行,如下所示:

 public class TextLineFinder implements RenderListener { @Override public void beginTextBlock() { } @Override public void endTextBlock() { } @Override public void renderImage(ImageRenderInfo renderInfo) { } /* * @see RenderListener#renderText(TextRenderInfo) */ @Override public void renderText(TextRenderInfo renderInfo) { LineSegment ascentLine = renderInfo.getAscentLine(); LineSegment descentLine = renderInfo.getDescentLine(); float[] yCoords = new float[]{ ascentLine.getStartPoint().get(Vector.I2), ascentLine.getEndPoint().get(Vector.I2), descentLine.getStartPoint().get(Vector.I2), descentLine.getEndPoint().get(Vector.I2) }; Arrays.sort(yCoords); addVerticalUseSection(yCoords[0], yCoords[3]); } /** * This method marks the given interval as used. */ void addVerticalUseSection(float from, float to) { if (to < from) { float temp = to; to = from; from = temp; } int i=0, j=0; for (; i i) verticalFlips.remove(j); if (toOutsideInterval) verticalFlips.add(i, to); if (fromOutsideInterval) verticalFlips.add(i, from); } final List verticalFlips = new ArrayList(); } 

( TextLineFinder.java )

RenderListener尝试通过将文本边界框投影到y轴上来识别水平文本行。 它假设这些投影对于来自不同行的文本不重叠,即使在下标和上标的情况下也是如此。

这个类本质上是本答案中使用的PageVerticalAnalyzer的简化forms。

按这些行对文本块进行排序

确定了上面的行后,可以调整iText的LocationTextExtractionStrategy来排序这样的行:

 public class HorizontalTextExtractionStrategy extends LocationTextExtractionStrategy { public class HorizontalTextChunk extends TextChunk { public HorizontalTextChunk(String string, Vector startLocation, Vector endLocation, float charSpaceWidth) { super(string, startLocation, endLocation, charSpaceWidth); } @Override public int compareTo(TextChunk rhs) { if (rhs instanceof HorizontalTextChunk) { HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs; int rslt = Integer.compare(getLineNumber(), horRhs.getLineNumber()); if (rslt != 0) return rslt; return Float.compare(getStartLocation().get(Vector.I1), rhs.getStartLocation().get(Vector.I1)); } else return super.compareTo(rhs); } @Override public boolean sameLine(TextChunk as) { if (as instanceof HorizontalTextChunk) { HorizontalTextChunk horAs = (HorizontalTextChunk) as; return getLineNumber() == horAs.getLineNumber(); } else return super.sameLine(as); } public int getLineNumber() { Vector startLocation = getStartLocation(); float y = startLocation.get(Vector.I2); List flips = textLineFinder.verticalFlips; if (flips == null || flips.isEmpty()) return 0; if (y < flips.get(0)) return flips.size() / 2 + 1; for (int i = 1; i < flips.size(); i+=2) { if (y < flips.get(i)) { return (1 + flips.size() - i) / 2; } } return 0; } } @Override public void renderText(TextRenderInfo renderInfo) { textLineFinder.renderText(renderInfo); LineSegment segment = renderInfo.getBaseline(); if (renderInfo.getRise() != 0){ // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to Matrix riseOffsetTransform = new Matrix(0, -renderInfo.getRise()); segment = segment.transformBy(riseOffsetTransform); } TextChunk location = new HorizontalTextChunk(renderInfo.getText(), segment.getStartPoint(), segment.getEndPoint(), renderInfo.getSingleSpaceWidth()); getLocationalResult().add(location); } public HorizontalTextExtractionStrategy() throws NoSuchFieldException, SecurityException { locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult"); locationalResultField.setAccessible(true); textLineFinder = new TextLineFinder(); } @SuppressWarnings("unchecked") List getLocationalResult() { try { return (List) locationalResultField.get(this); } catch (IllegalArgumentException | IllegalAccessException e) { e.printStackTrace(); throw new RuntimeException(e); } } final Field locationalResultField; final TextLineFinder textLineFinder; } 

( Horizo​​ntalTextExtractionStrategy.java )

TextExtractionStrategy使用TextLineFinder来标识水平文本行,然后使用这些信息对文本块进行排序。

请注意,此代码使用reflection来访问私有父类成员。 在所有环境中可能都不允许这样做。 在这种情况下,只需复制LocationTextExtractionStrategy并直接插入代码即可。

提取文本

现在可以使用此文本提取策略来提取带有内联上标和下标的文本,如下所示:

 String extract(PdfReader reader, int pageNo) throws IOException, NoSuchFieldException, SecurityException { return PdfTextExtractor.getTextFromPage(reader, pageNo, new HorizontalTextExtractionStrategy()); } 

(来自ExtractSuperAndSubInLine.java )

OP的文档第11页上的示例文本,在“COMBUSTION EFFICIENCY”下,现在提取如下:

 monoxide (CO) in flue gas in accordance with the following formula: CE = [CO 2/(CO + CO 2 )] 

使用C#和iTextSharp的方法相同

以Java为中心的部分的解释,警告和示例结果仍然适用,这里是代码:

我正在使用iTextSharp 5.5.7。

一种识别线条的方法

 public class TextLineFinder : IRenderListener { public void BeginTextBlock() { } public void EndTextBlock() { } public void RenderImage(ImageRenderInfo renderInfo) { } public void RenderText(TextRenderInfo renderInfo) { LineSegment ascentLine = renderInfo.GetAscentLine(); LineSegment descentLine = renderInfo.GetDescentLine(); float[] yCoords = new float[]{ ascentLine.GetStartPoint()[Vector.I2], ascentLine.GetEndPoint()[Vector.I2], descentLine.GetStartPoint()[Vector.I2], descentLine.GetEndPoint()[Vector.I2] }; Array.Sort(yCoords); addVerticalUseSection(yCoords[0], yCoords[3]); } void addVerticalUseSection(float from, float to) { if (to < from) { float temp = to; to = from; from = temp; } int i=0, j=0; for (; i i) verticalFlips.RemoveAt(j); if (toOutsideInterval) verticalFlips.Insert(i, to); if (fromOutsideInterval) verticalFlips.Insert(i, from); } public List verticalFlips = new List(); } 

按这些行对文本块进行排序

 public class HorizontalTextExtractionStrategy : LocationTextExtractionStrategy { public class HorizontalTextChunk : TextChunk { public HorizontalTextChunk(String stringValue, Vector startLocation, Vector endLocation, float charSpaceWidth, TextLineFinder textLineFinder) : base(stringValue, startLocation, endLocation, charSpaceWidth) { this.textLineFinder = textLineFinder; } override public int CompareTo(TextChunk rhs) { if (rhs is HorizontalTextChunk) { HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs; int rslt = CompareInts(getLineNumber(), horRhs.getLineNumber()); if (rslt != 0) return rslt; return CompareFloats(StartLocation[Vector.I1], rhs.StartLocation[Vector.I1]); } else return base.CompareTo(rhs); } public override bool SameLine(TextChunk a) { if (a is HorizontalTextChunk) { HorizontalTextChunk horAs = (HorizontalTextChunk) a; return getLineNumber() == horAs.getLineNumber(); } else return base.SameLine(a); } public int getLineNumber() { Vector startLocation = StartLocation; float y = startLocation[Vector.I2]; List flips = textLineFinder.verticalFlips; if (flips == null || flips.Count == 0) return 0; if (y < flips[0]) return flips.Count / 2 + 1; for (int i = 1; i < flips.Count; i+=2) { if (y < flips[i]) { return (1 + flips.Count - i) / 2; } } return 0; } private static int CompareInts(int int1, int int2){ return int1 == int2 ? 0 : int1 < int2 ? -1 : 1; } private static int CompareFloats(float float1, float float2) { return float1 == float2 ? 0 : float1 < float2 ? -1 : 1; } TextLineFinder textLineFinder; } public override void RenderText(TextRenderInfo renderInfo) { textLineFinder.RenderText(renderInfo); LineSegment segment = renderInfo.GetBaseline(); if (renderInfo.GetRise() != 0){ // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise()); segment = segment.TransformBy(riseOffsetTransform); } TextChunk location = new HorizontalTextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), textLineFinder); getLocationalResult().Add(location); } public HorizontalTextExtractionStrategy() { locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance); textLineFinder = new TextLineFinder(); } List getLocationalResult() { return (List) locationalResultField.GetValue(this); } System.Reflection.FieldInfo locationalResultField; TextLineFinder textLineFinder; } 

提取文本

  string extract(PdfReader reader, int pageNo) { return PdfTextExtractor.GetTextFromPage(reader, pageNo, new HorizontalTextExtractionStrategy()); } 

更新: LocationTextExtractionStrategy更改

在iText 5.5.9-SNAPSHOT中通过1ab350beae148be2a4bef5e663b3d67a004ff9f8(“使TextChunkLocation成为可比较的<>类……”)提交53526e4854fcb80c86cbc2e113f7a07401dc9a67(“Refactor LocationTextExtractionStrategy …”), LocationTextExtractionStrategy体系结构已更改为允许这样的自定义,而无需reflection。

不幸的是,这个更改打破了上面提到的Horizo​​ntalTextExtractionStrategy。 对于提交后的iText版本,可以使用以下策略:

 public class HorizontalTextExtractionStrategy2 extends LocationTextExtractionStrategy { public static class HorizontalTextChunkLocationStrategy implements TextChunkLocationStrategy { public HorizontalTextChunkLocationStrategy(TextLineFinder textLineFinder) { this.textLineFinder = textLineFinder; } @Override public TextChunkLocation createLocation(TextRenderInfo renderInfo, LineSegment baseline) { return new HorizontalTextChunkLocation(baseline.getStartPoint(), baseline.getEndPoint(), renderInfo.getSingleSpaceWidth()); } final TextLineFinder textLineFinder; public class HorizontalTextChunkLocation implements TextChunkLocation { /** the starting location of the chunk */ private final Vector startLocation; /** the ending location of the chunk */ private final Vector endLocation; /** unit vector in the orientation of the chunk */ private final Vector orientationVector; /** the orientation as a scalar for quick sorting */ private final int orientationMagnitude; /** perpendicular distance to the orientation unit vector (ie the Y position in an unrotated coordinate system) * we round to the nearest integer to handle the fuzziness of comparing floats */ private final int distPerpendicular; /** distance of the start of the chunk parallel to the orientation unit vector (ie the X position in an unrotated coordinate system) */ private final float distParallelStart; /** distance of the end of the chunk parallel to the orientation unit vector (ie the X position in an unrotated coordinate system) */ private final float distParallelEnd; /** the width of a single space character in the font of the chunk */ private final float charSpaceWidth; public HorizontalTextChunkLocation(Vector startLocation, Vector endLocation, float charSpaceWidth) { this.startLocation = startLocation; this.endLocation = endLocation; this.charSpaceWidth = charSpaceWidth; Vector oVector = endLocation.subtract(startLocation); if (oVector.length() == 0) { oVector = new Vector(1, 0, 0); } orientationVector = oVector.normalize(); orientationMagnitude = (int)(Math.atan2(orientationVector.get(Vector.I2), orientationVector.get(Vector.I1))*1000); // see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html // the two vectors we are crossing are in the same plane, so the result will be purely // in the z-axis (out of plane) direction, so we just take the I3 component of the result Vector origin = new Vector(0,0,1); distPerpendicular = (int)(startLocation.subtract(origin)).cross(orientationVector).get(Vector.I3); distParallelStart = orientationVector.dot(startLocation); distParallelEnd = orientationVector.dot(endLocation); } public int orientationMagnitude() { return orientationMagnitude; } public int distPerpendicular() { return distPerpendicular; } public float distParallelStart() { return distParallelStart; } public float distParallelEnd() { return distParallelEnd; } public Vector getStartLocation() { return startLocation; } public Vector getEndLocation() { return endLocation; } public float getCharSpaceWidth() { return charSpaceWidth; } /** * @param as the location to compare to * @return true is this location is on the the same line as the other */ public boolean sameLine(TextChunkLocation as) { if (as instanceof HorizontalTextChunkLocation) { HorizontalTextChunkLocation horAs = (HorizontalTextChunkLocation) as; return getLineNumber() == horAs.getLineNumber(); } else return orientationMagnitude() == as.orientationMagnitude() && distPerpendicular() == as.distPerpendicular(); } /** * Computes the distance between the end of 'other' and the beginning of this chunk * in the direction of this chunk's orientation vector. Note that it's a bad idea * to call this for chunks that aren't on the same line and orientation, but we don't * explicitly check for that condition for performance reasons. * @param other * @return the number of spaces between the end of 'other' and the beginning of this chunk */ public float distanceFromEndOf(TextChunkLocation other) { float distance = distParallelStart() - other.distParallelEnd(); return distance; } public boolean isAtWordBoundary(TextChunkLocation previous) { /** * Here we handle a very specific case which in PDF may look like: * -.232 Tc [( P)-226.2(r)-231.8(e)-230.8(f)-238(a)-238.9(c)-228.9(e)]TJ * The font's charSpace width is 0.232 and it's compensated with charSpacing of 0.232. * And a resultant TextChunk.charSpaceWidth comes to TextChunk constructor as 0. * In this case every chunk is considered as a word boundary and space is added. * We should consider charSpaceWidth equal (or close) to zero as a no-space. */ if (getCharSpaceWidth() < 0.1f) return false; float dist = distanceFromEndOf(previous); return dist < -getCharSpaceWidth() || dist > getCharSpaceWidth()/2.0f; } public int getLineNumber() { Vector startLocation = getStartLocation(); float y = startLocation.get(Vector.I2); List flips = textLineFinder.verticalFlips; if (flips == null || flips.isEmpty()) return 0; if (y < flips.get(0)) return flips.size() / 2 + 1; for (int i = 1; i < flips.size(); i+=2) { if (y < flips.get(i)) { return (1 + flips.size() - i) / 2; } } return 0; } @Override public int compareTo(TextChunkLocation rhs) { if (rhs instanceof HorizontalTextChunkLocation) { HorizontalTextChunkLocation horRhs = (HorizontalTextChunkLocation) rhs; int rslt = Integer.compare(getLineNumber(), horRhs.getLineNumber()); if (rslt != 0) return rslt; return Float.compare(getStartLocation().get(Vector.I1), rhs.getStartLocation().get(Vector.I1)); } else { int rslt; rslt = Integer.compare(orientationMagnitude(), rhs.orientationMagnitude()); if (rslt != 0) return rslt; rslt = Integer.compare(distPerpendicular(), rhs.distPerpendicular()); if (rslt != 0) return rslt; return Float.compare(distParallelStart(), rhs.distParallelStart()); } } } } @Override public void renderText(TextRenderInfo renderInfo) { textLineFinder.renderText(renderInfo); super.renderText(renderInfo); } public HorizontalTextExtractionStrategy2() throws NoSuchFieldException, SecurityException { this(new TextLineFinder()); } public HorizontalTextExtractionStrategy2(TextLineFinder textLineFinder) throws NoSuchFieldException, SecurityException { super(new HorizontalTextChunkLocationStrategy(textLineFinder)); this.textLineFinder = textLineFinder; } final TextLineFinder textLineFinder; } 

( Horizo​​ntalTextExtractionStrategy2.java )

我刚刚解决了类似的问题,请看我的问题 。 我将下标检测为在前一文本的升序和降序行之间具有基线的文本。 这段代码可能很有用:

  Vector thisFacade = this.ascentLine.GetStartPoint().Subtract(this.descentLine.GetStartPoint()); Vector infoFacade = renderInfo.GetAscentLine().GetStartPoint().Subtract(renderInfo.GetDescentLine().GetStartPoint()); if (baseVector.Cross(ascent2base).Dot(baseVector.Cross(descent2base)) < 0 && infoFacade.LengthSquared < thisFacade.LengthSquared - sameHeightThreshols) 

Chistmass之后的更多细节。