如何在PDF中查找所有特定文本并在上面插入分页符?

我对PDF有一个棘手的要求

我需要在pdf中搜索特定的字符串 – 属性编号:

每次找到这个,我都需要在上面添加一个分页符

我可以访问IText和Spire.PDF,我首先看IText

我已经从其他post建立了我需要使用PDF压模

下面的逻辑添加了一个可行的新页面

但是,就我而言,我只需要分页而不是空白页

var newFile = @"c:\temp\full.pdf"; var dest = @"c:\temp\dest.pdf"; var reader = new PdfReader(newFile); if (File.Exists(dest)) { File.Delete(dest); } var stamper = new PdfStamper(reader, new FileStream(dest, FileMode.CreateNew)); var total = reader.NumberOfPages + 1; for (var pageNumber = total; pageNumber > 0; pageNumber--) { var pageContent = reader.GetPageContent(pageNumber); stamper.InsertPage(pageNumber, PageSize.A4); } stamper.Close(); reader.Close(); 

下面的图片显示了一个示例,因此这实际上是3页,即现有页面,在第一次出现的属性编号上方插入一个新的分页符:

第二次出现时需要另一个分页符

在此处输入图像描述

这个答案分享了一个概念validation,用于查找PDF中所有特定文本的出现,并使用iText和Java 在上面插入分页符 。 将它移植到iTextSharp和C#应该不会太困难。

此外,对于生产使用,必须添加一些额外的代码,因为当前代码做出一些假设,例如假定非旋转页面。 此外,它根本不处理注释。

该任务实际上是两个任务的组合, 查找插入分页符 ,因此我们需要

  1. 一些自定义文本和位置的提取策略
  2. 切割页面的工具。

SearchTextLocationExtractionStrategy

为了提取自定义文本的位置,我们扩展iText LocationTextExtractionStrategy以允许提取自定义文本文本字符串的位置,实际上是正则表达式的匹配:

 public class SearchTextLocationExtractionStrategy extends LocationTextExtractionStrategy { public SearchTextLocationExtractionStrategy(Pattern pattern) { super(new TextChunkLocationStrategy() { public TextChunkLocation createLocation(TextRenderInfo renderInfo, LineSegment baseline) { // while baseLine has been changed to not neutralize // effects of rise, ascentLine and descentLine explicitly // have not: We want the actual positions. return new AscentDescentTextChunkLocation(baseline, renderInfo.getAscentLine(), renderInfo.getDescentLine(), renderInfo.getSingleSpaceWidth()); } }); this.pattern = pattern; } static Field locationalResultField = null; static Method filterTextChunksMethod = null; static Method startsWithSpaceMethod = null; static Method endsWithSpaceMethod = null; static Field textChunkTextField = null; static Method textChunkSameLineMethod = null; static { try { locationalResultField = LocationTextExtractionStrategy.class.getDeclaredField("locationalResult"); locationalResultField.setAccessible(true); filterTextChunksMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("filterTextChunks", List.class, TextChunkFilter.class); filterTextChunksMethod.setAccessible(true); startsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("startsWithSpace", String.class); startsWithSpaceMethod.setAccessible(true); endsWithSpaceMethod = LocationTextExtractionStrategy.class.getDeclaredMethod("endsWithSpace", String.class); endsWithSpaceMethod.setAccessible(true); textChunkTextField = TextChunk.class.getDeclaredField("text"); textChunkTextField.setAccessible(true); textChunkSameLineMethod = TextChunk.class.getDeclaredMethod("sameLine", TextChunk.class); textChunkSameLineMethod.setAccessible(true); } catch (NoSuchFieldException | SecurityException | NoSuchMethodException e) { // TODO Auto-generated catch block e.printStackTrace(); } } public Collection getLocations(TextChunkFilter chunkFilter) { Collection result = new ArrayList<>(); try { List filteredTextChunks = (List) filterTextChunksMethod.invoke(this, locationalResultField.get(this), chunkFilter); Collections.sort(filteredTextChunks); StringBuilder sb = new StringBuilder(); List locations = new ArrayList<>(); TextChunk lastChunk = null; for (TextChunk chunk : filteredTextChunks) { String chunkText = (String) textChunkTextField.get(chunk); if (lastChunk == null) { // Nothing to compare with at the end } else if ((boolean) textChunkSameLineMethod.invoke(chunk, lastChunk)) { // we only insert a blank space if the trailing character of the previous string // wasn't a space, // and the leading character of the current string isn't a space if (isChunkAtWordBoundary(chunk, lastChunk) && !((boolean) startsWithSpaceMethod.invoke(this, chunkText)) && !((boolean) endsWithSpaceMethod.invoke(this, chunkText))) { sb.append(' '); LineSegment spaceBaseLine = new LineSegment(lastChunk.getEndLocation(), chunk.getStartLocation()); locations.add(new AscentDescentTextChunkLocation(spaceBaseLine, spaceBaseLine, spaceBaseLine, chunk.getCharSpaceWidth())); } } else { assert sb.length() == locations.size(); Matcher matcher = pattern.matcher(sb); while (matcher.find()) { int i = matcher.start(); Vector baseStart = locations.get(i).getStartLocation(); TextRectangle textRectangle = new TextRectangle(matcher.group(), baseStart.get(Vector.I1), baseStart.get(Vector.I2)); for (; i < matcher.end(); i++) { AscentDescentTextChunkLocation location = locations.get(i); textRectangle.add(location.getAscentLine().getBoundingRectange()); textRectangle.add(location.getDescentLine().getBoundingRectange()); } result.add(textRectangle); } sb.setLength(0); locations.clear(); } sb.append(chunkText); locations.add((AscentDescentTextChunkLocation) chunk.getLocation()); lastChunk = chunk; } } catch (IllegalAccessException | IllegalArgumentException | InvocationTargetException e) { // TODO Auto-generated catch block e.printStackTrace(); } return result; } @Override public void renderText(TextRenderInfo renderInfo) { for (TextRenderInfo info : renderInfo.getCharacterRenderInfos()) super.renderText(info); } public static class AscentDescentTextChunkLocation extends TextChunkLocationDefaultImp { public AscentDescentTextChunkLocation(LineSegment baseLine, LineSegment ascentLine, LineSegment descentLine, float charSpaceWidth) { super(baseLine.getStartPoint(), baseLine.getEndPoint(), charSpaceWidth); this.ascentLine = ascentLine; this.descentLine = descentLine; } public LineSegment getAscentLine() { return ascentLine; } public LineSegment getDescentLine() { return descentLine; } final LineSegment ascentLine; final LineSegment descentLine; } public class TextRectangle extends Rectangle2D.Float { public TextRectangle(final String text, final float xStart, final float yStart) { super(xStart, yStart, 0, 0); this.text = text; } public String getText() { return text; } final String text; } final Pattern pattern; } 

( SearchTextLocationExtractionStrategy.java )

由于基类的某些必要成员是私有的或包私有的,我们必须使用reflection来提取它们。

AbstractPdfPageSplittingTool

此工具的页面拆分function已从此答案中的PdfVeryDenseMergeTool提取。 此外,允许分页符的自定义位置是抽象的。

 public abstract class AbstractPdfPageSplittingTool { public AbstractPdfPageSplittingTool(Rectangle size, float top) { this.pageSize = size; this.topMargin = top; } public void split(OutputStream outputStream, PdfReader... inputs) throws DocumentException, IOException { try { openDocument(outputStream); for (PdfReader reader : inputs) { split(reader); } } finally { closeDocument(); } } void openDocument(OutputStream outputStream) throws DocumentException { final Document document = new Document(pageSize, 36, 36, topMargin, 36); final PdfWriter writer = PdfWriter.getInstance(document, outputStream); document.open(); this.document = document; this.writer = writer; newPage(); } void closeDocument() { try { document.close(); } finally { this.document = null; this.writer = null; this.yPosition = 0; } } void newPage() { document.newPage(); yPosition = pageSize.getTop(topMargin); } void split(PdfReader reader) throws IOException { for (int page = 1; page <= reader.getNumberOfPages(); page++) { split(reader, page); } } void split(PdfReader reader, int page) throws IOException { PdfImportedPage importedPage = writer.getImportedPage(reader, page); PdfContentByte directContent = writer.getDirectContent(); yPosition = pageSize.getTop(); Rectangle pageSizeToImport = reader.getPageSize(page); float[] borderPositions = determineSplitPositions(reader, page); if (borderPositions == null || borderPositions.length < 2) return; for (int borderIndex = 0; borderIndex + 1 < borderPositions.length; borderIndex++) { float height = borderPositions[borderIndex] - borderPositions[borderIndex + 1]; if (height <= 0) continue; directContent.saveState(); directContent.rectangle(0, yPosition - height, pageSizeToImport.getWidth(), height); directContent.clip(); directContent.newPath(); writer.getDirectContent().addTemplate(importedPage, 0, yPosition - (borderPositions[borderIndex] - pageSizeToImport.getBottom())); directContent.restoreState(); newPage(); } } protected abstract float[] determineSplitPositions(PdfReader reader, int page); Document document = null; PdfWriter writer = null; float yPosition = 0; final Rectangle pageSize; final float topMargin; } 

( AbstractPdfPageSplittingTool.java )

用法一致

执行OP的任务:

我需要在pdf中搜索特定的字符串 - 属性编号:

每次找到这个,我都需要在上面添加一个分页符

可以像上面这样使用上面的类:

 AbstractPdfPageSplittingTool tool = new AbstractPdfPageSplittingTool(PageSize.A4, 36) { @Override protected float[] determineSplitPositions(PdfReader reader, int page) { Collection locations = Collections.emptyList(); try { PdfReaderContentParser parser = new PdfReaderContentParser(reader); SearchTextLocationExtractionStrategy strategy = new SearchTextLocationExtractionStrategy( Pattern.compile("Property Number")); parser.processContent(page, strategy, Collections.emptyMap()).getResultantText(); locations = strategy.getLocations(null); } catch (IOException e) { // TODO Auto-generated catch block e.printStackTrace(); } List borders = new ArrayList<>(); for (TextRectangle rectangle : locations) { borders.add((float)rectangle.getMaxY()); } Rectangle pageSize = reader.getPageSize(page); borders.add(pageSize.getTop()); borders.add(pageSize.getBottom()); Collections.sort(borders, Collections.reverseOrder()); float[] result = new float[borders.size()]; for (int i=0; i < result.length; i++) result[i] = borders.get(i); return result; } }; tool.split(new FileOutputStream(RESULT), new PdfReader(SOURCE)); 

( SplitPages.java测试方法testSplitDocumentAboveAngestellter