Is there any way to disable multiline CSV parsing in encoding/csv?

polaris · 532 views
<p>I have a large (52 GB, 2.1 billion lines) CSV file that I&#39;d like to read through and transform. I&#39;m using the standard <code>encoding/csv</code> package to read the file, but there is one particular line, about 11 million lines in, that looks like <code>104.193.255.92,&#34;lucas</code>. When <code>reader.Read()</code> reaches that line, it keeps appending the following lines to it until the application runs out of memory and dies. This happens because newlines and commas may be included in a quoted field, as per the <a href="https://golang.org/pkg/encoding/csv/" rel="nofollow">docs</a>.</p> <p>Is there any way to disable multiline CSV parsing? How can I handle such a situation?</p> <hr/>**Comments:**<br/><br/>
balloonanimalfarm: <pre><p>Since it&#39;s your CSV file that&#39;s the problem, just write a wrapper around io.Reader that checks each line for an unbalanced number of quotes. It can strip the stray quotes from the line before passing the data on.</p></pre>
mwholt: <pre><p>That CSV file is malformed. An overly generous CSV parser may accommodate it somehow, but here&#39;s what you should do: fix the file.</p> <p>Just as you wouldn&#39;t expect a compiler to assume you meant something else when it encounters a syntax error, neither should you expect your CSV parser to assume you meant something else when there&#39;s malformed input.</p></pre>
tivalt: <pre><p>As someone who deals with lexing/parsing almost daily, this is the best answer: just fix the file. The only alternative is spending hours trying to code around the issue, which is fine if you feel like that&#39;d be a good exercise, or if this file needs to be fed into an automated process and is updated all the time.</p></pre>
thepciet: <pre><p>Fix the 52GB file, or roll a custom comma separated value dialect parser. CSV is easy and probably a good Go learning exercise. 
Just add that special case for quote and comma interactions and hope it doesn&#39;t break somewhere else.</p> <p>It depends on the situation; fixing the source so that it produces valid CSV is ideal, of course.</p></pre>
mwholt: <pre><blockquote> <p>CSV is easy</p> </blockquote> <p>After writing <a href="http://papaparse.com" rel="nofollow">Papa Parse</a>, I disagree.</p> <p>I would fix the file, but:</p> <blockquote> <p>roll a custom comma separated value dialect parser</p> </blockquote> <p>this is a good plan B.</p></pre>
albatr0s: <pre><p>You could read the file line by line and feed each line to the CSV parser; that way it will never have more than one line to read. :-)</p></pre>
kylewolfe: <pre><p>I&#39;m assuming you&#39;ve turned on LazyQuotes and are running into this scenario? <a href="http://play.golang.org/p/jrdy1_zgpl" rel="nofollow">http://play.golang.org/p/jrdy1_zgpl</a></p> <p>Are there normally qualifiers (double quotes), or is it just this record that causes the error because it happens to have a qualifier as part of the value?</p></pre>
kylewolfe: <pre><p>What I was getting at is that you can roll your own reader: <a href="http://play.golang.org/p/NR-HcgCeZ_" rel="nofollow">http://play.golang.org/p/NR-HcgCeZ_</a></p></pre>
Blufalcon94: <pre><p>One problem I can foresee with rolling my own reader is that some lines might contain something like <code>hello, &#34;world, test&#34;</code>. The standard csv reader would parse that as <code>[hello world,test]</code> (two fields), whereas the code you linked would treat the commas inside the quotes as field separators. This is just one issue I can think of now, but there may be others down the line.</p></pre>
Blufalcon94: <pre><p>LazyQuotes wouldn&#39;t really help here. 
The file looks like:</p> <blockquote> <p>104.177.107.230,104-177-107-230.lightspeed.frokca.sbcglobal.net<br/>104.177.107.231,104-177-107-231.lightspeed.frokca.sbcglobal.net<br/>104.177.107.232,104-177-107-232.lightspeed.frokca.sbcglobal.net<br/>104.193.255.92,&#34;lucas</p> <p>... It&#39;s just this record that causes this issue, because it has a qualifier and the reader will keep reading until the next qualifier.</p> </blockquote></pre>
