<p>I need to extract text and recognize tables in PDFs (something pretty easy with PDFBox or iText in Java). So far I have found these libs:</p>
<ul>
<li><a href="https://github.com/rsc/pdf">https://github.com/rsc/pdf</a></li>
<li><a href="https://github.com/yob/pdfreader">https://github.com/yob/pdfreader</a></li>
<li><a href="https://github.com/hhrutter/pdfcpu">https://github.com/hhrutter/pdfcpu</a></li>
</ul>
<p>But I would like some real users to share their experience or recommend something.</p>
<p>Thanks!</p>
<hr/>**Comments:**<br/><br/>thucle: <pre><p>I forked rsc/pdf to <a href="https://github.com/ledongthuc/pdf" rel="nofollow">https://github.com/ledongthuc/pdf</a> and added some functions for reading data from PDFs. My goal was to get data from LinkedIn profiles. I hope it's useful for you.</p></pre>peterwilliams97: <pre><p>I need to do this too. I couldn't find any Go library that comes close to the considerable work PDFBox has done on text extraction. Exec'ing PDFBox from a Go program works fine, so I am not worried about the lack of a native Go library for this.</p>
<p><a href="https://github.com/unidoc/unidoc" rel="nofollow">https://github.com/unidoc/unidoc</a> is by far the best Go PDF library I have worked with. Its text extractor was just a string extractor the last time I checked: it didn't do the hard work of locating characters on a page and building strings from the characters and their locations, as PDFBox does.</p></pre>jerf: <pre><p>To the questioner: this is most likely the way to go. It's more than just "parsing" the PDF; a PDF is a program whose output describes how to put pixels on a screen or ink on a page. Certain projects are so large that you shouldn't expect every language to have its own solution; this is probably in that category.</p></pre>anacrolix: <pre><p>I gave them a spin. I found that rsc/pdf barfs on some optimised PDFs and gives unexpected page counts. The rest of the packages weren't appealing. I suspect another language is required here for better support.</p></pre>codegladiator: <pre><p>Don't focus on one particular language for this kind of problem, because the problem is really hard. Table/text extraction from PDFs is hard because of the way PDFs are created and rendered: a line you read in the rendered PDF is not necessarily stored as a line in the PDF.</p>
<p>I chose tetpdf (a C executable) plus tabula (a Java lib) and invoke them as OS processes, writing their output to a file, then reading that file in Go/PHP and building my pipeline on top of the extracted file.</p></pre>