Parsing PDF documents in Golang

agolangf · 727 views
<p>I need to extract text and recognize tables, which is pretty easy with PDFBox or iText in Java. So far I have found these libraries:</p> <ul> <li><a href="https://github.com/rsc/pdf">https://github.com/rsc/pdf</a></li> <li><a href="https://github.com/yob/pdfreader">https://github.com/yob/pdfreader</a></li> <li><a href="https://github.com/hhrutter/pdfcpu">https://github.com/hhrutter/pdfcpu</a></li> </ul> <p>I would like real users to share their experience or recommend something.</p> <p>Thanks!</p> <hr/>**Comments:**<br/><br/>
thucle: <pre><p>I forked rsc/pdf to <a href="https://github.com/ledongthuc/pdf" rel="nofollow">https://github.com/ledongthuc/pdf</a> and added some functions for reading data from PDFs. My goal was to extract data from LinkedIn profiles. I hope it's useful for you.</p></pre>
peterwilliams97: <pre><p>I need to do this too. I couldn't find any Go library that comes close to the considerable work PDFBox has done on text extraction. Exec'ing PDFBox from a Go program works fine, so I am not worried about the lack of a native Go library for this.</p> <p><a href="https://github.com/unidoc/unidoc" rel="nofollow">https://github.com/unidoc/unidoc</a> is by far the best Go PDF library I have worked with. Its text extractor was just a string extractor the last time I checked: it didn't do the hard work of locating characters on a page and building strings from the characters and their locations, as PDFBox does.</p></pre>
jerf: <pre><p>To the questioner: this is most likely the way to go. It's more than just "parsing" the PDF; a PDF is a program whose output is how to put pixels on a screen or ink on a page. Certain projects are so large that you don't necessarily expect every language to produce its own solution; this is probably in that category.</p></pre>
anacrolix: <pre><p>I gave them a spin. I found that rsc/pdf barfs on some optimised PDFs and reports unexpected page counts. The rest of the packages weren't appealing. I suspect another language is required here for better support.</p></pre>
codegladiator: <pre><p>Don't focus on one particular language for this kind of problem, because the problem is really hard. Table and text extraction from PDF is hard because of the way PDFs are created and rendered: a line you read in a rendered PDF is not necessarily stored as a line in the PDF.</p> <p>I chose tetpdf (a C executable) plus tabula (a Java library). I invoke them as OS processes, have them write their output to a file, read that file in Golang/PHP, and build my pipeline on top of the extracted file.</p></pre>
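
Two commenters above describe the same pattern: run an external extractor (PDFBox, or tetpdf plus tabula) from Go and read back the file it writes. Below is a minimal sketch of that pattern, not anyone's actual pipeline. It assumes PDFBox 2.x's `ExtractText` command-line tool and a `java` binary on the PATH. The jar path, the helper names `pdfboxArgs` and `extractText`, and the 60-second timeout are illustrative assumptions that do not come from the thread.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

// pdfboxArgs builds the java arguments for PDFBox 2.x's ExtractText tool:
//   java -jar <jar> ExtractText <input.pdf> <output.txt>
// jarPath is a hypothetical location of the pdfbox-app jar on your machine.
func pdfboxArgs(jarPath, pdfPath, outPath string) []string {
	return []string{"-jar", jarPath, "ExtractText", pdfPath, outPath}
}

// extractText runs the external extractor, lets it write into a temporary
// file, and returns that file's contents as a string.
func extractText(ctx context.Context, jarPath, pdfPath string) (string, error) {
	dir, err := os.MkdirTemp("", "pdftext")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir) // clean up the temporary output directory

	outPath := filepath.Join(dir, "out.txt")
	cmd := exec.CommandContext(ctx, "java", pdfboxArgs(jarPath, pdfPath, outPath)...)
	if out, err := cmd.CombinedOutput(); err != nil {
		// Include the tool's own output, which usually explains the failure.
		return "", fmt.Errorf("extractor failed: %w: %s", err, out)
	}

	text, err := os.ReadFile(outPath)
	if err != nil {
		return "", err
	}
	return string(text), nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: pdftext <pdfbox-app.jar> <input.pdf>")
		os.Exit(2)
	}
	// Bound the run time so a hung extractor cannot block the pipeline.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	text, err := extractText(ctx, os.Args[1], os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

Keeping the argument list in its own function makes it easy to swap in another tool or another PDFBox version, whose command-line syntax may differ, without touching the process handling. Table extraction with a tool such as tabula would follow the same shape, with different arguments and an output format such as CSV.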
