How do I extract attachments from a PDF file?

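I have a large number of PDF documents with XML files attached to them. I would like to extract those attached XML files and read them. How can I do this programmatically using .NET?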

iTextSharp is also capable of extracting attachments, though you may have to work with its low-level objects to do so.

There are two ways to embed a file in a PDF:

In a file attachment annotation
At the document level, under "EmbeddedFiles".

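Once you have a file specification dictionary from either source, the file itself is a stream inside the dictionary stored under the "EF" (embedded file) key.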
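For the document-level case it may help to know that, per the PDF specification, "EmbeddedFiles" is a name tree: a node can carry a flat Names array of alternating keys and values, or a Kids array of child nodes that must be walked recursively. Below is a minimal sketch of that traversal using plain Java collections only. The `Node` class and the `flatten` method are hypothetical stand-ins for illustration and are not part of iText or any other PDF library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NameTreeSketch {
    /** Hypothetical stand-in for a name-tree node (not an iText class). */
    static final class Node {
        // Flat list of alternating entries: key, value, key, value, ...
        final List<Object> names = new ArrayList<>();
        // Child nodes, each of which is a name-tree node itself.
        final List<Node> kids = new ArrayList<>();
    }

    /** Collects every (name, value) pair reachable from the given node. */
    static Map<String, Object> flatten(Node node) {
        Map<String, Object> out = new LinkedHashMap<>();
        collect(node, out);
        return out;
    }

    private static void collect(Node node, Map<String, Object> out) {
        if (node == null) {
            return;
        }
        // Step by two: even indexes are names, odd indexes are values.
        for (int i = 0; i + 1 < node.names.size(); i += 2) {
            out.put((String) node.names.get(i), node.names.get(i + 1));
        }
        // Recurse into intermediate nodes.
        for (Node kid : node.kids) {
            collect(kid, out);
        }
    }

    public static void main(String[] args) {
        Node leafA = new Node();
        leafA.names.add("a.xml");
        leafA.names.add("specA");
        leafA.names.add("b.xml");
        leafA.names.add("specB");

        Node leafB = new Node();
        leafB.names.add("c.xml");
        leafB.names.add("specC");

        Node root = new Node();
        root.kids.add(leafA);
        root.kids.add(leafB);

        System.out.println(flatten(root).keySet()); // prints [a.xml, b.xml, c.xml]
    }
}
```

With a real library, the values would be file specification dictionaries rather than strings; code that reads only the root's Names array will miss attachments stored under Kids.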

So, to list all the files at the document level, you could write code like this (in Java):

    import java.util.HashMap;
    import java.util.Map;
    import com.itextpdf.text.pdf.*; // iText 5 package name; older releases used a different package

    Map<String, byte[]> files = new HashMap<String, byte[]>();
    PdfReader reader = new PdfReader(pdfPath);
    PdfDictionary root = reader.getCatalog();
    PdfDictionary names = root.getAsDict(PdfName.NAMES); // may be null
    PdfDictionary embeddedFilesDict =
        (names == null) ? null : names.getAsDict(PdfName.EMBEDDEDFILES); // may be null
    PdfArray embeddedFiles =
        (embeddedFilesDict == null) ? null : embeddedFilesDict.getAsArray(PdfName.NAMES); // may be null
    int len = (embeddedFiles == null) ? 0 : embeddedFiles.size();
    // The array holds (name, file specification) pairs, so step by two.
    for (int i = 0; i < len; i += 2) {
        PdfString name = embeddedFiles.getAsString(i); // should always be present
        PdfDictionary fileSpec = embeddedFiles.getAsDict(i + 1); // ditto
        PdfDictionary streams = fileSpec.getAsDict(PdfName.EF);
        if (streams == null) {
            continue; // file specification without an embedded file
        }
        PRStream stream;
        if (streams.contains(PdfName.UF)) {
            stream = (PRStream) streams.getAsStream(PdfName.UF);
        } else {
            stream = (PRStream) streams.getAsStream(PdfName.F); // default stream for backwards compatibility
        }
        if (stream != null) {
            files.put(name.toUnicodeString(), PdfReader.getStreamBytes(stream));
        }
    }
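Since the question is about reading attached XML files, each byte array collected above still has to be parsed. The following is a minimal sketch using only the JDK's built-in DOM parser; the class name `AttachmentXmlSketch`, the method `parseXml`, and the sample invoice document are hypothetical and chosen purely for illustration. It assumes the attachment is well-formed XML and deliberately rejects DOCTYPE declarations, because attachments come from outside the application.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class AttachmentXmlSketch {
    /** Parses one extracted attachment (raw bytes) into a DOM document. */
    static Document parseXml(byte[] attachmentBytes) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations to avoid external-entity (XXE) attacks.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // The parser detects the encoding from the XML declaration or byte order mark.
            return builder.parse(new ByteArrayInputStream(attachmentBytes));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Attachment is not parseable XML", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for one value taken from the map of extracted attachments.
        byte[] sample = "<invoice><total>42</total></invoice>".getBytes(StandardCharsets.UTF_8);
        Document doc = parseXml(sample);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints invoice
    }
}
```

In .NET the same step would be done with the platform's own XML classes; the Java version is shown here only to match the snippet above.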

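This is an old question; nonetheless, I think my alternative solution (using PDF Clown) may be of some interest, as it is cleaner than the code snippets proposed so far, and more complete, since it iterates at both the document level and the page level: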

    using org.pdfclown.bytes;
    using org.pdfclown.documents;
    using org.pdfclown.documents.files;
    using org.pdfclown.documents.interaction.annotations;
    using org.pdfclown.objects;

    using System;
    using System.Collections.Generic;

    void ExtractAttachments(string pdfPath)
    {
        // Maps each attachment's path to its raw bytes.
        Dictionary<string, byte[]> attachments = new Dictionary<string, byte[]>();

        using (org.pdfclown.files.File file = new org.pdfclown.files.File(pdfPath))
        {
            Document document = file.Document;

            // 1. Embedded files (document level).
            foreach (var entry in document.Names.EmbeddedFiles)
            { EvaluateDataFile(attachments, entry.Value); }

            // 2. File attachments (page level).
            foreach (Page page in document.Pages)
            {
                foreach (Annotation annotation in page.Annotations)
                {
                    if (annotation is FileAttachment)
                    { EvaluateDataFile(attachments, ((FileAttachment)annotation).DataFile); }
                }
            }
        }
    }

    void EvaluateDataFile(Dictionary<string, byte[]> attachments, FileSpecification dataFile)
    {
        if (dataFile is FullFileSpecification)
        {
            EmbeddedFile embeddedFile = ((FullFileSpecification)dataFile).EmbeddedFile;
            if (embeddedFile != null)
            { attachments[dataFile.Path] = embeddedFile.Data.ToByteArray(); }
        }
    }

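Note that you don't have to worry about null pointer exceptions, as PDF Clown provides all the necessary abstraction and automation to ensure smooth model traversal.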

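PDF Clown is an LGPL 3 library, implemented on both the Java and .NET platforms (I'm its lead developer). If you want to give it a try, I suggest checking out its SVN repository on sourceforge.net, as it keeps evolving.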

Take a look at the ABCpdf library; in my opinion it is very easy and fast.

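What I got working is slightly different from anything else I've seen online.

So, just in case, I thought I'd post it here to help someone else. I had to go through many different iterations to figure out, the hard way, what I needed to get it working.

I am merging two PDFs into a third PDF, where one of the first two may have file attachments that need to be carried over into the third. I am working entirely with streams in ASP.NET, C# 4.0, iTextSharp 5.1.2.0.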

    // Extract files from the submitted PDF
    Dictionary<string, byte[]> files = new Dictionary<string, byte[]>();

    // writeReader is the PdfReader for a PDF input stream
    PdfDictionary names = writeReader.Catalog.GetAsDict(PdfName.NAMES); // may be null
    if (names != null)
    {
        PdfDictionary embeddedFiles = names.GetAsDict(PdfName.EMBEDDEDFILES); // may be null
        if (embeddedFiles != null)
        {
            PdfArray fileSpecs = embeddedFiles.GetAsArray(PdfName.NAMES); // may be null
            if (fileSpecs != null)
            {
                // Objects come in pairs; only the odd ones (1, 3, 5...) are file specifications
                for (int i = 1; i < fileSpecs.Size; i += 2)
                {
                    PdfDictionary fileSpec = fileSpecs.GetAsDict(i); // may be null
                    if (fileSpec == null) continue;

                    PdfDictionary refs = fileSpec.GetAsDict(PdfName.EF); // may be null
                    if (refs == null) continue;

                    foreach (PdfName key in refs.Keys)
                    {
                        PRStream stream = (PRStream)PdfReader.GetPdfObject(refs.GetAsIndirectObject(key));
                        if (stream != null)
                        {
                            // Indexer instead of Add: several keys may map to the same file name
                            files[fileSpec.GetAsString(key).ToString()] = PdfReader.GetStreamBytes(stream);
                        }
                    }
                }
            }
        }
    }

You could try Aspose.Pdf.Kit for .NET. Its PdfExtractor class lets you extract attachments using two methods: ExtractAttachment and GetAttachment. See the attachment extraction example.

Disclosure: I work as a developer evangelist at Aspose.
