Parsing the PDF documents in Golang

agolangf · 799 views


I need to be able to extract text and recognize tables (something pretty easy with PDFBox or iText in Java). So far I have found some libs:

- https://github.com/rsc/pdf
- https://github.com/yob/pdfreader
- https://github.com/hhrutter/pdfcpu

But I want some real users to share their experience or recommend something. Thanks!

**Comments:**

thucle: I have forked rsc/pdf to https://github.com/ledongthuc/pdf and added some funcs to read data from PDFs. The idea was that I wanted to get data from LinkedIn profiles. I hope it's useful for you.

peterwilliams97: I need to do this too. I couldn't find any Go libraries that come close to the considerable work PDFBox has done on text extraction. Exec'ing PDFBox from a Go program works fine, so I am not worried about the lack of a native Go library for this. https://github.com/unidoc/unidoc is by far the best Go PDF library I have worked with. Its text extractor was just a string extractor the last time I checked. It didn't do the hard work of locating characters on a page and building strings from the characters and their locations, as PDFBox does.

jerf: To the questioner, this is most likely the way to go. It's more than just "parsing" the PDF; a PDF is a program whose output is how to put pixels on a screen or ink on a page. Certain projects are so large that you don't necessarily expect every language to put out an individualized solution; this is probably in that category.

anacrolix: I gave them a spin. I found that rsc/pdf barfs on some optimised PDFs and reports unexpected page counts. The rest of the packages weren't appealing. I suspect another language is required here for better support.

codegladiator: Don't focus on one particular language for this kind of problem, because the problem is really hard. Table and text extraction from PDFs is really hard because of the way PDFs are created and rendered. A line you read in a rendered PDF is not necessarily a line in the PDF file itself. I chose tetpdf (a C executable) plus tabula (a Java lib) and invoke them as OS processes, writing their output to a file, then reading that file in Golang/PHP and building my pipeline on top of the extracted file.
