

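TextChunk.SameLine() requires two chunks to have the same vertical position to be "on" the same line, which is not the case for superscript or subscript text. For example, on page 11 of this document, under "COMBUSTION EFFICIENCY":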



 monoxide (CO) in flue gas in accordance with the following formula: CE = [CO2 /(CO + CO2)] 


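I moved SameLine() to LocationTextExtractionStrategy and created public getters for the private TextChunk properties it reads. This allows me to adjust the tolerances on the fly in my own subclass, as shown here: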

    public class SubSuperStrategy : LocationTextExtractionStrategy
    {
        public int SameLineOrientationTolerance { get; set; }
        public int SameLineDistanceTolerance { get; set; }

        public override bool SameLine(TextChunk chunk1, TextChunk chunk2)
        {
            var orientationDelta = Math.Abs(chunk1.OrientationMagnitude - chunk2.OrientationMagnitude);
            if (orientationDelta > SameLineOrientationTolerance)
                return false;

            var distDelta = Math.Abs(chunk1.DistPerpendicular - chunk2.DistPerpendicular);
            return distDelta <= SameLineDistanceTolerance;
        }
    }

Using a SameLineDistanceTolerance of 3, this corrects which line the sub/superscript chunks get assigned to, but the relative position of the text is way off.

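Sometimes the chunks get inserted somewhere in the middle of the text, and sometimes (as in this example) at the end. Either way, they do not end up in the right place. I suspect this might have something to do with font sizes, but I am at the limit of my understanding of the internals of this code.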
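One possible explanation, sketched below in plain Java without iText: a relaxed same-line test only decides where line breaks go, while the order of the chunks still comes from the chunk comparison. The sketch assumes that chunks are sorted by their vertical position first and by their horizontal position second, which is how the stock TextChunk comparison in iText 5.5.x appears to work. The `Chunk` class, the coordinates, and `joinInSortOrder` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortOrderSketch {
    /** Hypothetical stand-in for a text chunk: text, start x, and baseline y. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Sorts top-down by baseline first, then left to right (assumed ordering), and joins the texts. */
    static String joinInSortOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingDouble((Chunk c) -> -c.y)   // higher on the page sorts first
                .thenComparingDouble(c -> c.x));      // then left to right
        StringBuilder sb = new StringBuilder();
        for (Chunk c : sorted) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> line = new ArrayList<>();
        line.add(new Chunk("CE = [CO", 100, 500));
        line.add(new Chunk("2", 150, 497));          // subscript, 3 units below the baseline
        line.add(new Chunk("/(CO + CO", 155, 500));
        line.add(new Chunk("2", 210, 497));          // subscript
        line.add(new Chunk(")]", 215, 500));
        System.out.println(joinInSortOrder(line));
    }
}
```

Under these assumptions the sketch prints `CE = [CO/(CO + CO)]22`: both subscripts land after the rest of the line, even though a distance tolerance of 3 would accept all five chunks as being on the same line.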



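To extract these subscripts and superscripts properly, one needs a different approach to check whether two text chunks are on the same line. The following classes represent one such approach.

I am more at home in Java / iText; thus, I first implemented this approach in Java and only afterwards translated it to C# / iTextSharp.


I am using the current development branch, iText 5.5.8-SNAPSHOT.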



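    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import com.itextpdf.text.pdf.parser.ImageRenderInfo;
    import com.itextpdf.text.pdf.parser.LineSegment;
    import com.itextpdf.text.pdf.parser.RenderListener;
    import com.itextpdf.text.pdf.parser.TextRenderInfo;
    import com.itextpdf.text.pdf.parser.Vector;

    public class TextLineFinder implements RenderListener
    {
        @Override
        public void beginTextBlock() { }

        @Override
        public void endTextBlock() { }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) { }

        /*
         * @see RenderListener#renderText(TextRenderInfo)
         */
        @Override
        public void renderText(TextRenderInfo renderInfo)
        {
            LineSegment ascentLine = renderInfo.getAscentLine();
            LineSegment descentLine = renderInfo.getDescentLine();
            float[] yCoords = new float[]{
                    ascentLine.getStartPoint().get(Vector.I2),
                    ascentLine.getEndPoint().get(Vector.I2),
                    descentLine.getStartPoint().get(Vector.I2),
                    descentLine.getEndPoint().get(Vector.I2)
            };
            Arrays.sort(yCoords);
            // the lowest and highest y coordinate span the vertical extent of the chunk
            addVerticalUseSection(yCoords[0], yCoords[3]);
        }

        /**
         * This method marks the given interval as used.
         */
        void addVerticalUseSection(float from, float to)
        {
            if (to < from)
            {
                float temp = to;
                to = from;
                from = temp;
            }

            // Locate i, the first flip not below 'from', and j, the first flip
            // not below 'to' (reconstructed from the surrounding logic).
            int i=0, j=0;
            for (; i<verticalFlips.size(); i++)
            {
                float flip = verticalFlips.get(i);
                if (flip < from)
                    continue;

                for (j=i; j<verticalFlips.size(); j++)
                {
                    flip = verticalFlips.get(j);
                    if (flip < to)
                        continue;
                    break;
                }
                break;
            }
            // an even index means the value lies outside all intervals known so far
            boolean fromOutsideInterval = i%2==0;
            boolean toOutsideInterval = j%2==0;

            // drop the flips swallowed by the new interval, then add its own ends
            while (j-- > i)
                verticalFlips.remove(j);
            if (toOutsideInterval)
                verticalFlips.add(i, to);
            if (fromOutsideInterval)
                verticalFlips.add(i, from);
        }

        // sorted y coordinates at which used and unused vertical sections alternate
        final List<Float> verticalFlips = new ArrayList<Float>();
    }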

( TextLineFinder.java )

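This RenderListener tries to identify horizontal text lines by projecting the text bounding boxes onto the y axis. It assumes that these projections do not overlap for text from different lines, even in the case of subscripts and superscripts.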
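The projection idea can be tried out without iText. The following is a minimal sketch, not the answer's class: it keeps a sorted list of y values in which entries 2k and 2k+1 bound the k-th used interval, merges each new interval into that list, and then looks up a line number for a baseline y. The class name `LineIntervalSketch`, the method names, and the sample coordinates are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class LineIntervalSketch {
    /** Sorted y values; entries 2k and 2k+1 bound the k-th used interval. */
    final List<Float> flips = new ArrayList<>();

    /** Marks [from, to] as used, merging it with any intervals it overlaps. */
    void addUsed(float from, float to) {
        if (to < from) {
            float t = to;
            to = from;
            from = t;
        }
        int i = 0;
        while (i < flips.size() && flips.get(i) < from) i++;   // first flip not below 'from'
        int j = i;
        while (j < flips.size() && flips.get(j) < to) j++;     // first flip not below 'to'
        boolean fromOutside = i % 2 == 0;   // even index: 'from' is not inside a known interval
        boolean toOutside = j % 2 == 0;
        while (j-- > i) flips.remove(j);    // remove(int): drop flips covered by the new interval
        if (toOutside) flips.add(i, to);
        if (fromOutside) flips.add(i, from);
    }

    /** Line number counted from the top (1 = topmost interval); 0 if y is above every interval. */
    int lineNumber(float y) {
        if (flips.isEmpty()) return 0;
        if (y < flips.get(0)) return flips.size() / 2 + 1;
        for (int i = 1; i < flips.size(); i += 2) {
            if (y < flips.get(i)) return (1 + flips.size() - i) / 2;
        }
        return 0;
    }

    public static void main(String[] args) {
        LineIntervalSketch sketch = new LineIntervalSketch();
        sketch.addUsed(495f, 510f);   // regular text on the first line
        sketch.addUsed(492f, 500f);   // a subscript overlapping that line
        sketch.addUsed(470f, 485f);   // regular text on the next line
        System.out.println(sketch.flips);
        System.out.println(sketch.lineNumber(500f) + " " + sketch.lineNumber(497f) + " " + sketch.lineNumber(475f));
    }
}
```

In this sample the subscript's interval merges with the interval of its line, so the baseline at y = 500 and the subscript at y = 497 get the same line number, while the text at y = 475 gets the next one. If a subscript hung low enough to overlap the following line, the two lines would merge, which is exactly the case the assumption above excludes.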




    import java.lang.reflect.Field;
    import java.util.List;

    import com.itextpdf.text.pdf.parser.LineSegment;
    import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.Matrix;
    import com.itextpdf.text.pdf.parser.TextRenderInfo;
    import com.itextpdf.text.pdf.parser.Vector;

    public class HorizontalTextExtractionStrategy extends LocationTextExtractionStrategy
    {
        public class HorizontalTextChunk extends TextChunk
        {
            public HorizontalTextChunk(String string, Vector startLocation, Vector endLocation, float charSpaceWidth)
            {
                super(string, startLocation, endLocation, charSpaceWidth);
            }

            @Override
            public int compareTo(TextChunk rhs)
            {
                if (rhs instanceof HorizontalTextChunk)
                {
                    // sort by line number first, then by horizontal start position
                    HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs;
                    int rslt = Integer.compare(getLineNumber(), horRhs.getLineNumber());
                    if (rslt != 0) return rslt;
                    return Float.compare(getStartLocation().get(Vector.I1), rhs.getStartLocation().get(Vector.I1));
                }
                else
                    return super.compareTo(rhs);
            }

            @Override
            public boolean sameLine(TextChunk as)
            {
                if (as instanceof HorizontalTextChunk)
                {
                    HorizontalTextChunk horAs = (HorizontalTextChunk) as;
                    return getLineNumber() == horAs.getLineNumber();
                }
                else
                    return super.sameLine(as);
            }

            public int getLineNumber()
            {
                Vector startLocation = getStartLocation();
                float y = startLocation.get(Vector.I2);
                List<Float> flips = textLineFinder.verticalFlips;
                if (flips == null || flips.isEmpty())
                    return 0;
                if (y < flips.get(0))
                    return flips.size() / 2 + 1;
                for (int i = 1; i < flips.size(); i+=2)
                {
                    if (y < flips.get(i))
                    {
                        return (1 + flips.size() - i) / 2;
                    }
                }
                return 0;
            }
        }

        @Override
        public void renderText(TextRenderInfo renderInfo)
        {
            textLineFinder.renderText(renderInfo);

            LineSegment segment = renderInfo.getBaseline();
            if (renderInfo.getRise() != 0)
            {
                // remove the rise from the baseline - we do this because the text from a
                // super/subscript render operation should probably be considered as part
                // of the baseline of the text the super/sub is relative to
                Matrix riseOffsetTransform = new Matrix(0, -renderInfo.getRise());
                segment = segment.transformBy(riseOffsetTransform);
            }
            TextChunk location = new HorizontalTextChunk(renderInfo.getText(), segment.getStartPoint(), segment.getEndPoint(), renderInfo.getSingleSpaceWidth());
            getLocationalResult().add(location);
        }

        public HorizontalTextExtractionStrategy() throws NoSuchFieldException, SecurityException
        {
            // the parent keeps its chunk list private, so fetch it via reflection
            locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult");
            locationalResultField.setAccessible(true);

            textLineFinder = new TextLineFinder();
        }

        @SuppressWarnings("unchecked")
        List<TextChunk> getLocationalResult()
        {
            try
            {
                return (List<TextChunk>) locationalResultField.get(this);
            }
            catch (IllegalArgumentException | IllegalAccessException e)
            {
                e.printStackTrace();
                throw new RuntimeException(e);
            }
        }

        final Field locationalResultField;
        final TextLineFinder textLineFinder;
    }

( HorizontalTextExtractionStrategy.java )


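Note that this code uses reflection to access private parent class members. This might not be allowed in all environments. In such a case, simply copy LocationTextExtractionStrategy and insert the code directly.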
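For readers unfamiliar with the reflection step, here is a minimal, self-contained sketch of the same pattern using only the JDK. `Parent`, `Child`, and the field name `items` are hypothetical stand-ins for LocationTextExtractionStrategy, the custom strategy, and its private list; nothing here is iText API.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class ReflectionAccessSketch {
    /** Hypothetical parent class that keeps its state private. */
    static class Parent {
        private final List<String> items = new ArrayList<>();
    }

    /** Hypothetical subclass that reaches the parent's private list via reflection. */
    static class Child extends Parent {
        private final Field itemsField;

        Child() {
            try {
                // look the field up on the declaring class, not on the subclass
                itemsField = Parent.class.getDeclaredField("items");
                // may be refused under a restrictive security setup
                itemsField.setAccessible(true);
            } catch (NoSuchFieldException e) {
                throw new IllegalStateException(e);
            }
        }

        @SuppressWarnings("unchecked")
        List<String> items() {
            try {
                return (List<String>) itemsField.get(this);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        Child child = new Child();
        child.items().add("chunk");
        System.out.println(child.items());
    }
}
```

The lookup depends on the private field's name, so a library update that renames or restructures that field will break code written this way.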



    String extract(PdfReader reader, int pageNo) throws IOException, NoSuchFieldException, SecurityException
    {
        return PdfTextExtractor.getTextFromPage(reader, pageNo, new HorizontalTextExtractionStrategy());
    }

(from ExtractSuperAndSubInLine.java )

The example text on page 11 of the OP's document, under "COMBUSTION EFFICIENCY", is now extracted like this:

 monoxide (CO) in flue gas in accordance with the following formula: CE = [CO 2/(CO + CO 2 )] 



For the C# version I am using iTextSharp 5.5.7.


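    using System;
    using System.Collections.Generic;
    using iTextSharp.text.pdf.parser;

    public class TextLineFinder : IRenderListener
    {
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo renderInfo) { }

        public void RenderText(TextRenderInfo renderInfo)
        {
            LineSegment ascentLine = renderInfo.GetAscentLine();
            LineSegment descentLine = renderInfo.GetDescentLine();
            float[] yCoords = new float[]{
                ascentLine.GetStartPoint()[Vector.I2],
                ascentLine.GetEndPoint()[Vector.I2],
                descentLine.GetStartPoint()[Vector.I2],
                descentLine.GetEndPoint()[Vector.I2]
            };
            Array.Sort(yCoords);
            // the lowest and highest y coordinate span the vertical extent of the chunk
            addVerticalUseSection(yCoords[0], yCoords[3]);
        }

        void addVerticalUseSection(float from, float to)
        {
            if (to < from)
            {
                float temp = to;
                to = from;
                from = temp;
            }

            // Locate i, the first flip not below 'from', and j, the first flip
            // not below 'to' (reconstructed from the surrounding logic).
            int i=0, j=0;
            for (; i<verticalFlips.Count; i++)
            {
                float flip = verticalFlips[i];
                if (flip < from)
                    continue;

                for (j=i; j<verticalFlips.Count; j++)
                {
                    flip = verticalFlips[j];
                    if (flip < to)
                        continue;
                    break;
                }
                break;
            }
            // an even index means the value lies outside all intervals known so far
            bool fromOutsideInterval = i%2==0;
            bool toOutsideInterval = j%2==0;

            // drop the flips swallowed by the new interval, then add its own ends
            while (j-- > i)
                verticalFlips.RemoveAt(j);
            if (toOutsideInterval)
                verticalFlips.Insert(i, to);
            if (fromOutsideInterval)
                verticalFlips.Insert(i, from);
        }

        // sorted y coordinates at which used and unused vertical sections alternate
        public List<float> verticalFlips = new List<float>();
    }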


    using System;
    using System.Collections.Generic;
    using iTextSharp.text.pdf.parser;

    public class HorizontalTextExtractionStrategy : LocationTextExtractionStrategy
    {
        public class HorizontalTextChunk : TextChunk
        {
            public HorizontalTextChunk(String stringValue, Vector startLocation, Vector endLocation, float charSpaceWidth, TextLineFinder textLineFinder)
                : base(stringValue, startLocation, endLocation, charSpaceWidth)
            {
                this.textLineFinder = textLineFinder;
            }

            override public int CompareTo(TextChunk rhs)
            {
                if (rhs is HorizontalTextChunk)
                {
                    // sort by line number first, then by horizontal start position
                    HorizontalTextChunk horRhs = (HorizontalTextChunk) rhs;
                    int rslt = CompareInts(getLineNumber(), horRhs.getLineNumber());
                    if (rslt != 0) return rslt;
                    return CompareFloats(StartLocation[Vector.I1], rhs.StartLocation[Vector.I1]);
                }
                else
                    return base.CompareTo(rhs);
            }

            public override bool SameLine(TextChunk a)
            {
                if (a is HorizontalTextChunk)
                {
                    HorizontalTextChunk horAs = (HorizontalTextChunk) a;
                    return getLineNumber() == horAs.getLineNumber();
                }
                else
                    return base.SameLine(a);
            }

            public int getLineNumber()
            {
                Vector startLocation = StartLocation;
                float y = startLocation[Vector.I2];
                List<float> flips = textLineFinder.verticalFlips;
                if (flips == null || flips.Count == 0)
                    return 0;
                if (y < flips[0])
                    return flips.Count / 2 + 1;
                for (int i = 1; i < flips.Count; i+=2)
                {
                    if (y < flips[i])
                    {
                        return (1 + flips.Count - i) / 2;
                    }
                }
                return 0;
            }

            private static int CompareInts(int int1, int int2)
            {
                return int1 == int2 ? 0 : int1 < int2 ? -1 : 1;
            }

            private static int CompareFloats(float float1, float float2)
            {
                return float1 == float2 ? 0 : float1 < float2 ? -1 : 1;
            }

            TextLineFinder textLineFinder;
        }

        public override void RenderText(TextRenderInfo renderInfo)
        {
            textLineFinder.RenderText(renderInfo);

            LineSegment segment = renderInfo.GetBaseline();
            if (renderInfo.GetRise() != 0)
            {
                // remove the rise from the baseline - we do this because the text from a
                // super/subscript render operation should probably be considered as part
                // of the baseline of the text the super/sub is relative to
                Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise());
                segment = segment.TransformBy(riseOffsetTransform);
            }
            TextChunk location = new HorizontalTextChunk(renderInfo.GetText(), segment.GetStartPoint(), segment.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), textLineFinder);
            getLocationalResult().Add(location);
        }

        public HorizontalTextExtractionStrategy()
        {
            // the parent keeps its chunk list private, so fetch it via reflection
            locationalResultField = typeof(LocationTextExtractionStrategy).GetField("locationalResult", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);

            textLineFinder = new TextLineFinder();
        }

        List<TextChunk> getLocationalResult()
        {
            return (List<TextChunk>) locationalResultField.GetValue(this);
        }

        System.Reflection.FieldInfo locationalResultField;
        TextLineFinder textLineFinder;
    }


    string extract(PdfReader reader, int pageNo)
    {
        return PdfTextExtractor.GetTextFromPage(reader, pageNo, new HorizontalTextExtractionStrategy());
    }

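Update: Changes in LocationTextExtractionStrategy

In iText 5.5.9-SNAPSHOT, commits 53526e4854fcb80c86cbc2e113f7a07401dc9a67 ("Refactor LocationTextExtractionStrategy …") through 1ab350beae148be2a4bef5e663b3d67a004ff9f8 ("Make TextChunkLocation a Comparable<> class …") changed the LocationTextExtractionStrategy architecture to allow customizations like this one without the need for reflection.

Unfortunately, this change breaks the HorizontalTextExtractionStrategy presented above. For iText versions after those commits, one can use the following strategy: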

    import java.util.List;

    import com.itextpdf.text.pdf.parser.LineSegment;
    import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy;
    import com.itextpdf.text.pdf.parser.TextRenderInfo;
    import com.itextpdf.text.pdf.parser.Vector;

    public class HorizontalTextExtractionStrategy2 extends LocationTextExtractionStrategy
    {
        public static class HorizontalTextChunkLocationStrategy implements TextChunkLocationStrategy
        {
            public HorizontalTextChunkLocationStrategy(TextLineFinder textLineFinder)
            {
                this.textLineFinder = textLineFinder;
            }

            @Override
            public TextChunkLocation createLocation(TextRenderInfo renderInfo, LineSegment baseline)
            {
                return new HorizontalTextChunkLocation(baseline.getStartPoint(), baseline.getEndPoint(), renderInfo.getSingleSpaceWidth());
            }

            final TextLineFinder textLineFinder;

            public class HorizontalTextChunkLocation implements TextChunkLocation
            {
                /** the starting location of the chunk */
                private final Vector startLocation;
                /** the ending location of the chunk */
                private final Vector endLocation;
                /** unit vector in the orientation of the chunk */
                private final Vector orientationVector;
                /** the orientation as a scalar for quick sorting */
                private final int orientationMagnitude;
                /** perpendicular distance to the orientation unit vector (i.e. the Y position in an unrotated coordinate system);
                 * we round to the nearest integer to handle the fuzziness of comparing floats */
                private final int distPerpendicular;
                /** distance of the start of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system) */
                private final float distParallelStart;
                /** distance of the end of the chunk parallel to the orientation unit vector (i.e. the X position in an unrotated coordinate system) */
                private final float distParallelEnd;
                /** the width of a single space character in the font of the chunk */
                private final float charSpaceWidth;

                public HorizontalTextChunkLocation(Vector startLocation, Vector endLocation, float charSpaceWidth)
                {
                    this.startLocation = startLocation;
                    this.endLocation = endLocation;
                    this.charSpaceWidth = charSpaceWidth;

                    Vector oVector = endLocation.subtract(startLocation);
                    if (oVector.length() == 0)
                    {
                        oVector = new Vector(1, 0, 0);
                    }
                    orientationVector = oVector.normalize();
                    orientationMagnitude = (int)(Math.atan2(orientationVector.get(Vector.I2), orientationVector.get(Vector.I1))*1000);

                    // see http://mathworld.wolfram.com/Point-LineDistance2-Dimensional.html
                    // the two vectors we are crossing are in the same plane, so the result will be purely
                    // in the z-axis (out of plane) direction, so we just take the I3 component of the result
                    Vector origin = new Vector(0,0,1);
                    distPerpendicular = (int)(startLocation.subtract(origin)).cross(orientationVector).get(Vector.I3);

                    distParallelStart = orientationVector.dot(startLocation);
                    distParallelEnd = orientationVector.dot(endLocation);
                }

                public int orientationMagnitude() { return orientationMagnitude; }
                public int distPerpendicular() { return distPerpendicular; }
                public float distParallelStart() { return distParallelStart; }
                public float distParallelEnd() { return distParallelEnd; }
                public Vector getStartLocation() { return startLocation; }
                public Vector getEndLocation() { return endLocation; }
                public float getCharSpaceWidth() { return charSpaceWidth; }

                /**
                 * @param as the location to compare to
                 * @return true if this location is on the same line as the other
                 */
                public boolean sameLine(TextChunkLocation as)
                {
                    if (as instanceof HorizontalTextChunkLocation)
                    {
                        HorizontalTextChunkLocation horAs = (HorizontalTextChunkLocation) as;
                        return getLineNumber() == horAs.getLineNumber();
                    }
                    else
                        return orientationMagnitude() == as.orientationMagnitude() && distPerpendicular() == as.distPerpendicular();
                }

                /**
                 * Computes the distance between the end of 'other' and the beginning of this chunk
                 * in the direction of this chunk's orientation vector. Note that it's a bad idea
                 * to call this for chunks that aren't on the same line and orientation, but we don't
                 * explicitly check for that condition for performance reasons.
                 * @param other
                 * @return the number of spaces between the end of 'other' and the beginning of this chunk
                 */
                public float distanceFromEndOf(TextChunkLocation other)
                {
                    float distance = distParallelStart() - other.distParallelEnd();
                    return distance;
                }

                public boolean isAtWordBoundary(TextChunkLocation previous)
                {
                    /**
                     * Here we handle a very specific case which in PDF may look like:
                     * -.232 Tc [( P)-226.2(r)-231.8(e)-230.8(f)-238(a)-238.9(c)-228.9(e)]TJ
                     * The font's charSpace width is 0.232 and it's compensated with charSpacing of 0.232.
                     * And a resultant TextChunk.charSpaceWidth comes to TextChunk constructor as 0.
                     * In this case every chunk is considered as a word boundary and space is added.
                     * We should consider charSpaceWidth equal (or close) to zero as a no-space.
                     */
                    if (getCharSpaceWidth() < 0.1f)
                        return false;

                    float dist = distanceFromEndOf(previous);

                    return dist < -getCharSpaceWidth() || dist > getCharSpaceWidth()/2.0f;
                }

                public int getLineNumber()
                {
                    Vector startLocation = getStartLocation();
                    float y = startLocation.get(Vector.I2);
                    List<Float> flips = textLineFinder.verticalFlips;
                    if (flips == null || flips.isEmpty())
                        return 0;
                    if (y < flips.get(0))
                        return flips.size() / 2 + 1;
                    for (int i = 1; i < flips.size(); i+=2)
                    {
                        if (y < flips.get(i))
                        {
                            return (1 + flips.size() - i) / 2;
                        }
                    }
                    return 0;
                }

                @Override
                public int compareTo(TextChunkLocation rhs)
                {
                    if (rhs instanceof HorizontalTextChunkLocation)
                    {
                        // sort by line number first, then by horizontal start position
                        HorizontalTextChunkLocation horRhs = (HorizontalTextChunkLocation) rhs;
                        int rslt = Integer.compare(getLineNumber(), horRhs.getLineNumber());
                        if (rslt != 0) return rslt;
                        return Float.compare(getStartLocation().get(Vector.I1), rhs.getStartLocation().get(Vector.I1));
                    }
                    else
                    {
                        int rslt;
                        rslt = Integer.compare(orientationMagnitude(), rhs.orientationMagnitude());
                        if (rslt != 0) return rslt;

                        rslt = Integer.compare(distPerpendicular(), rhs.distPerpendicular());
                        if (rslt != 0) return rslt;

                        return Float.compare(distParallelStart(), rhs.distParallelStart());
                    }
                }
            }
        }

        @Override
        public void renderText(TextRenderInfo renderInfo)
        {
            textLineFinder.renderText(renderInfo);
            super.renderText(renderInfo);
        }

        public HorizontalTextExtractionStrategy2() throws NoSuchFieldException, SecurityException
        {
            this(new TextLineFinder());
        }

        public HorizontalTextExtractionStrategy2(TextLineFinder textLineFinder) throws NoSuchFieldException, SecurityException
        {
            super(new HorizontalTextChunkLocationStrategy(textLineFinder));

            this.textLineFinder = textLineFinder;
        }

        final TextLineFinder textLineFinder;
    }

( HorizontalTextExtractionStrategy2.java )

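I have just solved a similar problem; see my question. I detect subscripts as text whose baseline lies between the ascent and descent lines of the preceding text. This code snippet might be useful: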

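    // height vectors (descent start to ascent start) of the preceding text and of the new chunk
    Vector thisFacade = this.ascentLine.GetStartPoint().Subtract(this.descentLine.GetStartPoint());
    Vector infoFacade = renderInfo.GetAscentLine().GetStartPoint().Subtract(renderInfo.GetDescentLine().GetStartPoint());

    // opposite signs of the two cross products: the new baseline lies between the
    // ascent and descent lines; the second condition requires the new chunk to be smaller
    if (baseVector.Cross(ascent2base).Dot(baseVector.Cross(descent2base)) < 0
        && infoFacade.LengthSquared < thisFacade.LengthSquared - sameHeightThreshols)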
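The snippet above is only a fragment, so here is a rough, self-contained sketch of the same idea in plain Java with 2D coordinates. It is an interpretation rather than the poster's code: it assumes `baseVector` is the baseline direction of the preceding text and that `ascent2base` and `descent2base` point from the preceding ascent and descent lines to the start of the new baseline. It also compares plain heights instead of squared lengths. All names and sample numbers are hypothetical.

```java
public class SubscriptCheckSketch {
    /** z component of the 2D cross product (ax, ay) x (bx, by). */
    static float cross(float ax, float ay, float bx, float by) {
        return ax * by - ay * bx;
    }

    /**
     * Returns true if the new baseline start lies strictly between the ascent and
     * descent lines of the preceding text and the new chunk is smaller than that text.
     * Points and the direction are given as {x, y} arrays.
     */
    static boolean isSubOrSuperscript(float[] dir, float[] prevAscent, float[] prevDescent,
                                      float[] baseline, float height, float heightThreshold) {
        float fromAscent = cross(dir[0], dir[1], baseline[0] - prevAscent[0], baseline[1] - prevAscent[1]);
        float fromDescent = cross(dir[0], dir[1], baseline[0] - prevDescent[0], baseline[1] - prevDescent[1]);
        // opposite signs: the baseline is on different sides of the two lines
        boolean between = fromAscent * fromDescent < 0;
        float prevHeight = (float) Math.hypot(prevAscent[0] - prevDescent[0], prevAscent[1] - prevDescent[1]);
        return between && height < prevHeight - heightThreshold;
    }

    public static void main(String[] args) {
        float[] dir = {1, 0};               // preceding text runs left to right
        float[] prevAscent = {100, 510};    // start of the preceding ascent line
        float[] prevDescent = {100, 495};   // start of the preceding descent line

        // small chunk whose baseline sits inside the preceding line
        System.out.println(isSubOrSuperscript(dir, prevAscent, prevDescent, new float[]{150, 497}, 9, 1));
        // full-size chunk on the same line
        System.out.println(isSubOrSuperscript(dir, prevAscent, prevDescent, new float[]{150, 500}, 15, 1));
        // chunk on the next line
        System.out.println(isSubOrSuperscript(dir, prevAscent, prevDescent, new float[]{100, 475}, 15, 1));
    }
}
```

Only the first of the three sample chunks passes both conditions. The size condition is what keeps ordinary text that continues the same line from being flagged.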
