Implementing a streaming CSV reader; reduced allocs from 85,000 to 3 for a 5K-line CSV file

By reusing a single buffer across all rows in a CSV file and using `[]byte` as the cell type instead of `string`, I was able to improve the parsing speed by ~40% and cut the allocations from 85K to 3. This memory improvement won't provide much benefit if you're going to throw everything into a `[][]string` anyway, but if you just need to stream through the rows, it does a good job. I'm still not sure whether I'm going to make this into a serious project (it's not at feature parity with the std lib csv reader, and bringing it up to parity will likely degrade execution speed a bit, though probably not memory consumption), but I thought it was an interesting proof of concept.

```
BenchmarkStdCsv-4     50    28297705 ns/op    2506375 B/op    85015 allocs/op
BenchmarkMyCsv-4     100    16865296 ns/op      11452 B/op        3 allocs/op
```
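To make the numbers concrete, here is a minimal sketch of the buffer-reuse idea described above. It is illustrative, not the benchmarked code: the `Reader`/`Next` names are invented, and, like the prototype's reduced feature set, it handles no quoting or escaping. Each row's fields come back as `[]byte` sub-slices of storage that is overwritten by the next call.

```go
package csvdemo

import (
	"bufio"
	"bytes"
	"io"
)

// Reader streams CSV rows, reusing one line buffer and one fields slice
// across calls. Returned fields are valid only until the next call to Next.
type Reader struct {
	scanner *bufio.Scanner
	fields  [][]byte
}

func NewReader(r io.Reader) *Reader {
	return &Reader{scanner: bufio.NewScanner(r)}
}

// Next returns the next row's fields as sub-slices of an internal buffer,
// or nil at EOF (check Err afterwards). No quoting or escaping support.
func (r *Reader) Next() [][]byte {
	if !r.scanner.Scan() {
		return nil
	}
	line := r.scanner.Bytes() // bufio.Scanner reuses this buffer too
	r.fields = r.fields[:0]
	for {
		i := bytes.IndexByte(line, ',')
		if i < 0 {
			r.fields = append(r.fields, line)
			return r.fields
		}
		r.fields = append(r.fields, line[:i])
		line = line[i+1:]
	}
}

// Err reports any error encountered while scanning.
func (r *Reader) Err() error { return r.scanner.Err() }
```

The only steady-state allocations are the scanner's internal buffer and the `fields` slice, both of which stop growing once the longest line and widest row have been seen; a caller that wants to keep a field past the next call must copy it, which is the trade against `[][]string` mentioned above.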
---

**Comments:**

**ChristophBerger:**

> bringing it up to parity will likely degrade execution speed a bit

It is perfectly valid to keep the feature set reduced in order to enable or maintain high performance.

**stevvooe:**

You might try suggesting this as a change to the standard library. It looks like the main difference is using `[]byte` over `string`.

One thing you could do to avoid ownership problems with the return value is to let the caller pass in the field buffers during the read, and support an error protocol to signal that the caller needs to allocate more space. Something like this might work:

`(r *CSVReader) ReadFields(fields ...[]byte) (nfields int, err error)`

Usage of this ends up looking something like [this](https://play.golang.org/p/WqI9LbJgm-).
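For concreteness, a caller of an API with that shape might look like the sketch below. This is a guess at the intended pattern, not the linked playground code; `FieldReader`, the column count, and the buffer pre-sizing are all assumptions. Because `fields` is passed variadically, the callee shares the caller's slice headers and can fix up a too-small buffer with `append`, which is exactly the in-place-modification caveat raised in the next reply.

```go
package csvdemo

import "io"

// FieldReader is an assumed interface matching the proposed signature.
type FieldReader interface {
	ReadFields(fields ...[]byte) (nfields int, err error)
}

// consume streams rows through caller-owned buffers. In the steady state
// (after the largest field of each column has been seen) the loop should
// not allocate at all.
func consume(r FieldReader, process func([][]byte)) error {
	fields := make([][]byte, 8) // assumed column count, for illustration
	for i := range fields {
		fields[i] = make([]byte, 0, 64) // pre-sized per-field buffers
	}
	for {
		n, err := r.ReadFields(fields...)
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		// If a field outgrew its buffer, fields[i] may now point at a new
		// backing array allocated by the reader, not the one made above.
		process(fields[:n])
	}
}
```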
**weberc2:**

There is a ticket to use `[]byte` in the std lib, but as of now there is no plan to give the user control over the allocations. I've considered passing in the byte slices as you suggest, but if one of the fields exceeds the capacity of its provided byte slice, a new slice would need to be allocated and put in its position. The user will have to take care to understand that the slices they provided might not be the ones that contain their data. This seems strange and possibly error-prone, so I'll have to think more on it. I do want to give the user a better way to control allocs, though.

**stevvooe:**

Because you are passing a slice of slices, you could just as well set those to point at the internal buffer, rather than copy. The problem with this approach is that you're returning internal buffers, which can be dangerous.

> The user will have to take care to understand that the slices they provided might not be the ones that contain their data.

I'm not sure I understand this fear. Either way, you end up having to [allocate when encountering a large field](https://bitbucket.org/snippets/weberc2/goEb7#csv.go-40). With the API proposed above, the user could control when this allocation happens, or let the call fix it up with an `append`, as shown in the example. On average, there shouldn't be any allocations once the largest fields have been encountered, and there will never be more memory allocated than it takes to handle a single row.

**weberc2:**

> Because you are passing a slice of slices, you could just as well set those to point at the internal buffer, rather than copy. The problem with this approach is that you're returning internal buffers, which can be dangerous.

I could do that, but that doesn't give the caller control over the allocations. Anyway, there's nothing particularly dangerous about returning the internal buffer. That's what `bufio.Scanner.Bytes()` does.

> I'm not sure I understand this fear. Either way, you end up having to allocate when encountering a large field.

My fear isn't that I'll have to allocate; it's that *when* I have to allocate, I'll no longer be using the slice the caller provided for that field; I'll have to allocate a new slice (and put the new slice at the same position in the outer slice). The caller will have to take care not to assume that the slices they passed were modified in place, since this sometimes won't be the case. Perhaps my fear is misplaced, hence the "I'll have to think about it" bit.

> With the API proposed above, the user could control when this allocation happens, or let the call fix it up with an append, as shown in the example.

I don't see how the API you provided allows the caller to control when the allocation happens. If the user provides a slice that is too small for the field, the append will allocate a bigger buffer and copy the old buffer's contents over. The caller's only "control" is providing an adequately large buffer.

> On average, there shouldn't be any allocations once the largest fields have been encountered, and there will never be more memory allocated than it takes to handle a single row.

Yes, this is the same strategy I'm employing. The mechanics are almost the same (minimizing allocs, etc.); the biggest difference is the ability for the caller to control allocations (namely, that my implementation doesn't give the user that power). I will probably move to an interface like yours; I just need to think it through first.

**crowl91:**

Will you share your code?

**weberc2:**

Sure, it's a prototype, but I'm open to suggestions.

https://bitbucket.org/snippets/weberc2/goEb7

**bonekeeper:**

I always thought that streaming CSV reading should be in the standard library in the first place (feeding back through a channel so you can range over it, etc.). Maybe add that as another helper to your version?

**weberc2:**

I agree, though I'm not sure about the channel bit; I'll have to profile to see the performance implications. I was thinking a scanner-like interface might be best; then a caller can always wrap it in a channel interface.

**fabstu:**

Just want to chime in: a channel is less performant than a mutex.

**barsonme:**

To elaborate: internally, channels use locks. So using a channel when a mutex would suffice is trading performance for (perhaps) ease of use.

**weberc2:**

This doesn't require synchronization of any kind, so the only benefit to using channels would be perceived ergonomics (being able to use `range`), but I've never found channels to be as ergonomic as scanners/iterators, even though there is no syntax support for those patterns. Never mind the performance hit.

**The_Sly_Marbo:**

Perhaps an API similar to `bufio.Scanner`?

```go
scanner := csv.NewScanner(r)
for scanner.Scan() {
    line := scanner.Row()
    // ...
}

if err := scanner.Err(); err != nil {
    // ...
}
```

**weberc2:**

Yeah, that's what I meant by a scanner-like interface. :)

**downvotes_puffins:**

This is exactly the kind of problem where C++ excels: the programmer has detailed control of memory allocations and de-allocations.

For a problem like this, the `std::getline` interface actually encourages performant code, with a single, re-used buffer. Throw in some `string_view` objects to give views into the buffer, plus some template functions for conversion (ahem, generics).

**weberc2:**

If a fast CSV parser were the primary criterion for this project, I might have chosen C++.

EDIT: Controlling allocations via profiling in Go has been surprisingly easy.

**ar1819:**

As a fellow C++ developer, I fail to see how this is relevant in any way. It's about object reuse, not about controlling allocations/deallocations, and Go allows this. By the way, getting memory from the OS and returning it is actually more expensive than getting it from a managed heap; in some scenarios, languages with GC are actually faster.

So, to say again, I see no point in your comment.
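On the channel-versus-scanner tangent above: wrapping a scanner-style reader in a channel is mechanical, but it forces a deep copy of every row (the scanner overwrites its buffers on the next `Scan`) plus channel synchronization per row, which is the performance cost alluded to. A rough sketch, where `RowScanner` is an assumed interface rather than anything from the prototype:

```go
package csvdemo

// RowScanner is an assumed scanner-style interface, per the discussion above.
type RowScanner interface {
	Scan() bool
	Row() [][]byte // valid only until the next call to Scan
}

// Rows adapts a RowScanner into a channel so callers can use range.
// Each row is deep-copied before sending, because the scanner reuses its
// buffers; the wrapper therefore reintroduces the per-row allocations
// that the zero-alloc reader was built to avoid.
func Rows(s RowScanner) <-chan [][]byte {
	ch := make(chan [][]byte)
	go func() {
		defer close(ch)
		for s.Scan() {
			row := s.Row()
			out := make([][]byte, len(row))
			for i, f := range row {
				out[i] = append([]byte(nil), f...) // copy each field
			}
			ch <- out
		}
	}()
	return ch
}
```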
