Extracting images from a PDF using iTextSharp

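I'm trying to extract all the images from a PDF using iTextSharp, but I can't get past this one hurdle.

The line System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS); throws a "Parameter is not valid" error.

I think it works when the image is a bitmap, but not when it is in any other format.

I have the following code (sorry for the length):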

    private void Form1_Load(object sender, EventArgs e)
    {
        FileStream fs = File.OpenRead(@"reader.pdf");
        byte[] data = new byte[fs.Length];
        fs.Read(data, 0, (int)fs.Length);

        List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

        iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
        iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
        iTextSharp.text.pdf.PdfObject PDFObj = null;
        iTextSharp.text.pdf.PdfStream PDFStremObj = null;

        try
        {
            RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(data);
            PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);

            for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
            {
                PDFObj = PDFReaderObj.GetPdfObject(i);

                if ((PDFObj != null) && PDFObj.IsStream())
                {
                    PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                    iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);

                    if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                    {
                        byte[] bytes = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw((iTextSharp.text.pdf.PRStream)PDFStremObj);

                        if ((bytes != null))
                        {
                            try
                            {
                                System.IO.MemoryStream MS = new System.IO.MemoryStream(bytes);
                                MS.Position = 0;
                                System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS);
                                ImgList.Add(ImgPDF);
                            }
                            catch (Exception)
                            {
                            }
                        }
                    }
                }
            }

            PDFReaderObj.Close();
        }
        catch (Exception ex)
        {
            throw new Exception(ex.Message);
        }
    } //Form1_Load

I have used this library in the past without any problems.

http://www.winnovative-software.com/PdfImgExtractor.aspx

    private void btnExtractImages_Click(object sender, EventArgs e)
    {
        if (pdfFileTextBox.Text.Trim().Equals(String.Empty))
        {
            MessageBox.Show("Please choose a source PDF file", "Choose PDF file", MessageBoxButtons.OK);
            return;
        }

        // the source pdf file
        string pdfFileName = pdfFileTextBox.Text.Trim();

        // start page number
        int startPageNumber = int.Parse(textBoxStartPage.Text.Trim());

        // end page number
        // when it is 0 the extraction will continue up to the end of document
        int endPageNumber = 0;
        if (textBoxEndPage.Text.Trim() != String.Empty)
            endPageNumber = int.Parse(textBoxEndPage.Text.Trim());

        // create the PDF images extractor object
        PdfImagesExtractor pdfImagesExtractor = new PdfImagesExtractor();
        pdfImagesExtractor.LicenseKey = "31FAUEJHUEBQRl5AUENBXkFCXklJSUlQQA==";

        // the demo output directory
        string outputDirectory = Path.Combine(Application.StartupPath, @"DemoFiles\Output");

        Cursor = Cursors.WaitCursor;

        // set the handler to be called when an image was extracted
        pdfImagesExtractor.ImageExtractedEvent += pdfImagesExtractor_ImageExtractedEvent;

        try
        {
            // start images counting
            imageIndex = 0;

            // call the images extractor to raise the ImageExtractedEvent event when an image is extracted from a PDF page
            // the pdfImagesExtractor_ImageExtractedEvent handler below will be executed for each extracted image
            pdfImagesExtractor.ExtractImagesInEvent(pdfFileName, startPageNumber, endPageNumber);

            // Alternatively you can use the ExtractImages() and ExtractImagesToFile() methods
            // to extract the images from a PDF document in memory or to image files in a directory

            // uncomment the line below to extract the images to an array of ExtractedImage objects
            //ExtractedImage[] pdfPageImages = pdfImagesExtractor.ExtractImages(pdfFileName, startPageNumber, endPageNumber);

            // uncomment the lines below to extract the images to image files in a directory
            //string outputDirectory = System.IO.Path.Combine(Application.StartupPath, @"DemoFiles\Output");
            //pdfImagesExtractor.ExtractImagesToFile(pdfFileName, startPageNumber, endPageNumber, outputDirectory, "pdfimage");
        }
        catch (Exception ex)
        {
            // The extraction failed
            MessageBox.Show(String.Format("An error occurred. {0}", ex.Message), "Error");
            return;
        }
        finally
        {
            // uninstall the event handler
            pdfImagesExtractor.ImageExtractedEvent -= pdfImagesExtractor_ImageExtractedEvent;

            Cursor = Cursors.Arrow;
        }

        try
        {
            System.Diagnostics.Process.Start(outputDirectory);
        }
        catch (Exception ex)
        {
            MessageBox.Show(string.Format("Cannot open output folder. {0}", ex.Message));
            return;
        }
    }

    /// <summary>
    /// The ImageExtractedEvent event handler called after an image was extracted from a PDF page.
    /// The event is raised when the ExtractImagesInEvent() method is used
    /// </summary>
    /// <param name="args">The handler argument containing the extracted image and the PDF page number</param>
    void pdfImagesExtractor_ImageExtractedEvent(ImageExtractedEventArgs args)
    {
        // get the image object and page number from the event handler argument
        Image pdfPageImageObj = args.ExtractedImage.ImageObject;
        int pageNumber = args.ExtractedImage.PageNumber;

        // save the extracted image to a PNG file
        string outputPageImage = Path.Combine(Application.StartupPath, @"DemoFiles\Output",
            "pdfimage_" + pageNumber.ToString() + "_" + imageIndex++ + ".png");
        pdfPageImageObj.Save(outputPageImage, ImageFormat.Png);

        args.ExtractedImage.Dispose();
    }

Solved…

I was getting the same "Parameter is not valid" exception, but with the help of the link provided by der_chirurg (http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx) I solved it. Here is the code:

    using System.Drawing;
    using System.Drawing.Imaging;
    using System.IO;
    using iTextSharp.text.pdf.parser;
    using Dotnet = System.Drawing.Image;
    using iTextSharp.text.pdf;

    namespace PDF_Parsing
    {
        partial class PDF_ImgExtraction
        {
            // Output path for the saved image; must be set before ExtractImage is called.
            // Every image is written to this same path, so later images overwrite earlier ones.
            string imgPath;

            private void ExtractImage(string pdfFile)
            {
                // Open the file once and reuse the reader for every page.
                PdfReader pdf = new PdfReader(pdfFile);

                for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
                {
                    PdfDictionary pg = pdf.GetPageN(pageNumber);
                    PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
                    PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));

                    // A page without XObjects has no images to extract.
                    if (xobj == null) continue;

                    foreach (PdfName name in xobj.Keys)
                    {
                        PdfObject obj = xobj.Get(name);
                        if (obj.IsIndirect())
                        {
                            PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                            string width = tg.Get(PdfName.WIDTH).ToString();
                            string height = tg.Get(PdfName.HEIGHT).ToString();
                            ImageRenderInfo imgRI = ImageRenderInfo.CreateForXObject(
                                new Matrix(float.Parse(width), float.Parse(height)),
                                (PRIndirectReference)obj, tg);
                            RenderImage(imgRI);
                        }
                    }
                }

                pdf.Close();
            }

            private void RenderImage(ImageRenderInfo renderInfo)
            {
                PdfImageObject image = renderInfo.GetImage();
                using (Dotnet dotnetImg = image.GetDrawingImage())
                {
                    if (dotnetImg != null)
                    {
                        using (MemoryStream ms = new MemoryStream())
                        {
                            dotnetImg.Save(ms, ImageFormat.Tiff);
                            Bitmap d = new Bitmap(dotnetImg);
                            d.Save(imgPath);
                        }
                    }
                }
            }
        }
    }

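You need to check the stream's /Filter to see which image format a given image uses. It may be a standard image format:

  • DCTDecode (JPEG)
  • JPXDecode (JPEG 2000)
  • JBIG2Decode (JBIG2, a black-and-white-only format)
  • CCITTFaxDecode (fax format; PDF supports Group 3 and Group 4)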
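The filter check described above can be sketched roughly as follows. This is only an illustration: the class name PdfImageFilters and the method FormatFor are hypothetical helpers, not part of iTextSharp, and the lookup accepts the name with or without its leading slash.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an iTextSharp type): maps a /Filter name
// to the standard image format it denotes.
public static class PdfImageFilters
{
    private static readonly Dictionary<string, string> Formats =
        new Dictionary<string, string>(StringComparer.Ordinal)
        {
            { "DCTDecode", "JPEG" },
            { "JPXDecode", "JPEG 2000" },
            { "JBIG2Decode", "JBIG2" },
            { "CCITTFaxDecode", "CCITT fax (Group 3/4)" }
        };

    // Returns the format name, or null when the filter is missing or is
    // not one of the standard image formats listed above.
    public static string FormatFor(string filterName)
    {
        if (string.IsNullOrEmpty(filterName)) return null;

        // PDF names print with a leading slash, e.g. "/DCTDecode".
        string key = filterName.TrimStart('/');

        string format;
        return Formats.TryGetValue(key, out format) ? format : null;
    }
}
```

With the question's code, the name would presumably come from something like PDFStremObj.Get(PdfName.FILTER); keep in mind that /Filter may also be an array of names, which this sketch does not handle.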

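Otherwise, you need to take the raw bytes (as you are doing) and build the image yourself from the image stream's width, height, bits per component, number of color components (which could be CMYK, indexed, RGB, or something weird), and a few other values, as defined in section 8.9 of the ISO PDF specification (freely available).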
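To make the raw-sample case a little more concrete, here is a small sketch of how those values determine the size of the decoded data. RawImageLayout is a hypothetical helper, and the sketch assumes the samples have already been decompressed and that each row starts on a byte boundary, as the PDF specification requires for image sample data.

```csharp
using System;

// Hypothetical helper: computes the layout of decoded PDF image samples.
public static class RawImageLayout
{
    // Bytes per row: width * components * bitsPerComponent bits,
    // rounded up to a whole byte.
    public static int RowStride(int width, int components, int bitsPerComponent)
    {
        long bits = (long)width * components * bitsPerComponent;
        return checked((int)((bits + 7) / 8));
    }

    // Total number of bytes the decoded sample data should contain.
    public static long ExpectedLength(int width, int height, int components, int bitsPerComponent)
    {
        return (long)RowStride(width, components, bitsPerComponent) * height;
    }
}
```

For example, a 100 x 50 RGB image at 8 bits per component should decode to 15000 bytes, while a 10-pixel-wide 1-bit image uses 2 bytes per row. Comparing the decoded length against ExpectedLength is a cheap sanity check before copying the samples into a bitmap.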

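So your code will work in some cases, but in others it will fail with the exception you mentioned.

PS: When you get an exception, please include the stack trace every time. Pretty please with sugar on top?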

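In newer versions of iTextSharp, the first parameter of ImageRenderInfo.CreateForXObject is no longer a Matrix but a GraphicsState. @der_chirurg's approach should work. I tested it myself using the information at the following link, and it works very well: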

http://www.thevalvepage.com/swmonkey/2014/11/26/extract-images-from-pdf-files-using-itextsharp/

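To extract all images on all pages, you don't have to implement the different filters yourself. iTextSharp has an image renderer that can save every image in its original image type.

Just follow this: http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx You don't need to implement the HttpHandler…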
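A minimal sketch of that renderer-based approach, assuming iTextSharp 5.x and its IRenderListener / PdfReaderContentParser API. The names ImageSavingListener and PdfImageExtractor and the output file naming scheme are made up for illustration; this is not the code from the linked page.

```csharp
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical listener: writes every image it is handed to disk.
public class ImageSavingListener : IRenderListener
{
    private readonly string outputDirectory;
    private readonly int pageNumber;
    private int imageIndex;

    public ImageSavingListener(string outputDirectory, int pageNumber)
    {
        this.outputDirectory = outputDirectory;
        this.pageNumber = pageNumber;
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        if (image == null) return;

        // GetFileType() reports the type iTextSharp decoded the image to;
        // it is used here as the file extension.
        string fileName = string.Format("page{0}_img{1}.{2}",
            pageNumber, imageIndex++, image.GetFileType());
        File.WriteAllBytes(Path.Combine(outputDirectory, fileName), image.GetImageAsBytes());
    }

    // Text callbacks are required by the interface but unused here.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }
}

public static class PdfImageExtractor
{
    public static void ExtractAll(string pdfFile, string outputDirectory)
    {
        Directory.CreateDirectory(outputDirectory);
        PdfReader reader = new PdfReader(pdfFile);
        try
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                parser.ProcessContent(page, new ImageSavingListener(outputDirectory, page));
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A call such as PdfImageExtractor.ExtractAll(@"reader.pdf", @"images") would then walk every page. Images that iTextSharp cannot decode may still throw inside GetImage(), so wrapping that call in a try/catch is probably worthwhile for real documents.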

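I've put a library on GitHub that extracts images from a PDF and compresses them.

It may be useful if you plan to work with the very powerful iTextSharp library.

Link: https://github.com/rock-walker/PdfCompression

This worked for me, and I think it's a simple solution:

Write a custom RenderListener and implement its RenderImage method, like this: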

    // Parser is assumed to be an alias for the iTextSharp.text.pdf.parser namespace.
    public void RenderImage(ImageRenderInfo info)
    {
        PdfImageObject image = info.GetImage();
        Parser.Matrix matrix = info.GetImageCTM();
        var fileType = image.GetFileType();
        ImageFormat format;

        switch (fileType)
        {
            // you may add more types here
            case "jpg":
            case "jpeg":
                format = ImageFormat.Jpeg;
                break;
            case "png":
                format = ImageFormat.Png;
                break;
            case "bmp":
                format = ImageFormat.Bmp;
                break;
            case "tif":
            case "tiff":
                format = ImageFormat.Tiff;
                break;
            case "gif":
                format = ImageFormat.Gif;
                break;
            default:
                format = ImageFormat.Jpeg;
                break;
        }

        var pic = image.GetDrawingImage();

        // position and size of the image on the page, taken from the transformation matrix
        var x = matrix[Parser.Matrix.I31];
        var y = matrix[Parser.Matrix.I32];
        var width = matrix[Parser.Matrix.I11];
        var height = matrix[Parser.Matrix.I22];

        // the two threshold values are missing in the source
        if (x <  && y < )
        {
            return; // ignore these images
        }

        // the output path argument is missing in the source
        pic.Save(, format);
    }