C#: Extracting text from a PDF using PdfSharp

Is it possible to extract plain text from a PDF file using PdfSharp? I don't want to use iTextSharp because of its license.

I took Sergio's answer and turned it into extension methods. I also replaced the string accumulation with an iterator.

    using System.Collections.Generic;
    using PdfSharp.Pdf;
    using PdfSharp.Pdf.Content;
    using PdfSharp.Pdf.Content.Objects;

    public static class PdfSharpExtensions
    {
        // Reads the page's content stream and yields every text string in it.
        public static IEnumerable<string> ExtractText(this PdfPage page)
        {
            var content = ContentReader.ReadContent(page);
            var text = content.ExtractText();
            return text;
        }

        // Walks the content tree and yields the string operands of Tj / TJ operators.
        public static IEnumerable<string> ExtractText(this CObject cObject)
        {
            if (cObject is COperator)
            {
                var cOperator = cObject as COperator;
                if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                    cOperator.OpCode.Name == OpCodeName.TJ.ToString())
                {
                    foreach (var cOperand in cOperator.Operands)
                        foreach (var txt in ExtractText(cOperand))
                            yield return txt;
                }
            }
            else if (cObject is CSequence)
            {
                var cSequence = cObject as CSequence;
                foreach (var element in cSequence)
                    foreach (var txt in ExtractText(element))
                        yield return txt;
            }
            else if (cObject is CString)
            {
                var cString = cObject as CString;
                yield return cString.Value;
            }
        }
    }

I implemented it in a way similar to what David did. Here is my code:

    {
        // ....
        var page = document.Pages[1]; // Pages is zero-based, so this is the second page
        CObject content = ContentReader.ReadContent(page);
        var extractedText = ExtractText(content);
        // ...
    }

    // Collects the string operands of all Tj / TJ operators below cObject.
    private IEnumerable<string> ExtractText(CObject cObject)
    {
        var textList = new List<string>();
        if (cObject is COperator)
        {
            var cOperator = cObject as COperator;
            if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
                cOperator.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (var cOperand in cOperator.Operands)
                {
                    textList.AddRange(ExtractText(cOperand));
                }
            }
        }
        else if (cObject is CSequence)
        {
            var cSequence = cObject as CSequence;
            foreach (var element in cSequence)
            {
                textList.AddRange(ExtractText(element));
            }
        }
        else if (cObject is CString)
        {
            var cString = cObject as CString;
            textList.Add(cString.Value);
        }
        return textList;
    }

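PDFsharp provides all the tools needed to extract text from a PDF. Use the ContentReader class to access the commands in each page's content stream, and extract the strings from the TJ / Tj operators.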
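
For illustration, here is a minimal sketch of that approach applied to every page of a document. It assumes the PDFsharp package is referenced; the class name PdfTextDump, the helper Walk, and the path "input.pdf" are made up for this example and are not part of the library or of the answers above.

```csharp
using System;
using System.Collections.Generic;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Content;
using PdfSharp.Pdf.Content.Objects;
using PdfSharp.Pdf.IO;

// Hypothetical example class, not part of PDFsharp.
public static class PdfTextDump
{
    // Recursively yields the string operands of Tj / TJ operators.
    private static IEnumerable<string> Walk(CObject obj)
    {
        if (obj is COperator op)
        {
            // Only the text-showing operators Tj and TJ are of interest here.
            if (op.OpCode.Name == OpCodeName.Tj.ToString() ||
                op.OpCode.Name == OpCodeName.TJ.ToString())
            {
                foreach (CObject operand in op.Operands)
                    foreach (string s in Walk(operand))
                        yield return s;
            }
        }
        else if (obj is CSequence seq)
        {
            // Covers the page content itself and the array operand of TJ.
            foreach (CObject element in seq)
                foreach (string s in Walk(element))
                    yield return s;
        }
        else if (obj is CString str)
        {
            yield return str.Value;
        }
    }

    public static void Main(string[] args)
    {
        // "input.pdf" is a placeholder path (an assumption for this sketch).
        string path = args.Length > 0 ? args[0] : "input.pdf";
        PdfDocument document = PdfReader.Open(path, PdfDocumentOpenMode.ReadOnly);

        for (int i = 0; i < document.PageCount; i++)
        {
            CSequence content = ContentReader.ReadContent(document.Pages[i]);
            Console.WriteLine("--- page " + (i + 1) + " ---");
            Console.WriteLine(string.Join("", Walk(content)));
        }
    }
}
```

Note that this only returns the raw string operands. It does not reconstruct spaces or line breaks from positioning operators, and for fonts with custom encodings the strings may need to be mapped through the font's ToUnicode table before they are readable.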

I have uploaded a simple implementation to GitHub.