On the 'binary-trees' benchmark

polaris · · 465 次点击    
这是一个分享于 的资源,其中的信息可能已经有所发展或是发生改变。
<p>When looking at the <a href="http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=binarytrees" rel="nofollow">http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=binarytrees</a> benchmark, one could get the impression that Go has some really big memory allocation problems. Go clocking in at 39 seconds, when Java do the same in 11 seconds, Rust in 4 seconds, and C in 3 seconds.</p> <p>However, when I started looking at the source, I find it highly &#34;unfair&#34; that C and Rust seem to use &#34;arenas&#34;, which basically is pre-allocated memory, if I understand it right, while Go seems to be using more standard allocation. I guess that if similar code was in a hot spot in Go, one would pre-allocate all Nodes with make(Node, n) or similar.</p> <p>What surprises me though is that it seems that Java is very efficient at this without any special library approach, though I have heard their hotspot compiler is good. Is there any hope that Go might get to these levels?</p> <hr/>**评论:**<br/><br/>SportingSnow21: <pre><p>The benchmark is terribly flawed. You&#39;re only allowed to get clever if you can allocate raw memory using a library like the C/Rust/C++ pools (Java is allowed JIT warm-up, so the compiler will collapse those allocations behind the scenes). That&#39;s what all the best benchmarks are doing, collapsing expensive allocations via pool allocation of raw memory. Since the whole benchmark is supposed to be about allocation &amp; free costs, I don&#39;t even consider the pool allocators as valid data points. </p> <p>Go doesn&#39;t really work in raw memory outside of super-hacky runtime/asm tricks, so allocating a slab of Nodes to use in building the tree isn&#39;t allowed. The best apples-to-apples comparison is <code>C++ g++ #2</code> code at 39.12 seconds, which does the same number of small allocations/frees as the <code>Go #4</code> code at 39.88 seconds. ~750 ms difference for a GC/runtime language is a solid performance comparison. </p> <p>If this was a hot-spot, there would be a single allocation call before the first call to <code>bottomUpTree</code> and the tree would be built out of that slice of Nodes. </p></pre>weberc2: <pre><p>Can you show me where Java is allowed JIT warm-up? I&#39;ve not been able to find any information yay or nay on this except these comments on HN with outdated links: <a href="https://news.ycombinator.com/item?id=8436723" rel="nofollow">https://news.ycombinator.com/item?id=8436723</a></p></pre>igouy: <pre><p><a href="http://benchmarksgame.alioth.debian.org/how-programs-are-measured.html#time" rel="nofollow">http://benchmarksgame.alioth.debian.org/how-programs-are-measured.html#time</a></p></pre>weberc2: <pre><p>That seems to indicate that the JIT&#39;s warmup time is included in the benchmark. Am I misreading?</p></pre>igouy: <pre><blockquote> <p>The benchmark is terribly flawed.</p> </blockquote> <p>And yet you found an <em>&#34;apples-to-apples comparison&#34;</em> ;-)</p> <blockquote> <p>Java is allowed JIT warm-up</p> </blockquote> <p><a href="http://benchmarksgame.alioth.debian.org/how-programs-are-measured.html#time" rel="nofollow">Not true</a>, never has been.</p></pre>SportingSnow21: <pre><blockquote> <p>And yet you found an &#34;apples-to-apples comparison&#34; ;-)</p> </blockquote> <p>A flawed design doesn&#39;t invalidate data points, it invalidates extrapolated conclusions that require the design&#39;s assumption. </p> <blockquote> <p>Use default GC, use per node allocation or use a library memory pool. </p> </blockquote> <p>So, in order to draw valid conclusions about &#34;per node allocation&#34;, we can only draw conclusions from the data points that adhere to this requirement. A proper test would segregate the per-node and memory pool data, because the meaning of the data is completely different. As the code <a href="/u/VifArdente" rel="nofollow">/u/VifArdente</a> posted shows, a memory pool completely changes the benchmark; an order of magnitude change in times for the same language and compiler should be a huge flag that the data has caveats.</p> <blockquote> <p>Not true, never has been. </p> </blockquote> <p>Cool, that&#39;s changed. They used to run all the benchmarks in one session, too. Granted, I first dug into these benchmarks a decade ago, so I should have expected some changes. I usually compare source code before comparing numbers, anyway.</p></pre>igouy: <pre><blockquote> <p>extrapolated conclusions</p> </blockquote> <p>Where are those extrapolated conclusions on the website?</p> <blockquote> <p>Cool, that&#39;s changed</p> </blockquote> <p>No! That is not true. Java was not allowed JIT warmup.</p> <p>The Perl scripts Doug Bagley used did not allow it; the Perl scripts Brent Fulgham used did not allow it; the Python scripts I use do not allow it.</p></pre>SportingSnow21: <pre><blockquote> <p>Where are those extrapolated conclusions on the website?</p> </blockquote> <p>The data is presented as fulfilling the same requirements, ranked and compared directly. The only way to identify what the time actually means is to look at the code. So, the extrapolated conclusions are the presentation that does nothing to show important details. How are people like OP expected to come to any other conclusion than &#34; Go clocking in at 39 seconds, when Java do the same in 11 seconds, Rust in 4 seconds, and C in 3 seconds.&#34;? </p> <blockquote> <p>No! That is not true. Java was not allowed JIT warmup. </p> </blockquote> <p>I&#39;m not trying to argue history with you, it&#39;s been a damn decade. It doesn&#39;t change how we read the results from this specific benchmark. JVM is collapsing those small allocations behind the scenes. That&#39;s not a bad thing! That&#39;s a solid piece of engineering! But, in order to make valid comparisons we shouldn&#39;t be comparing code that&#39;s allowed to combine allocations with code that&#39;s not. My memory might suck about the exact setup of these benchmarks, but the lesson learned is still valid.</p></pre>tscs37: <pre><p>The best you can do is preallocate memory.</p> <p>I&#39;ve run into this problem too when I worked on a little protocol to transfer whole byte-slices over network.</p> <p>When you send a lot of data over wire, most time is spent for allocating the memory in Go.</p> <p>IIRC it was total runtime of about 400ms and 380ms spent in the allocator trying to make room for all the data. Then 0.something ms decoding the header and the remaining 20ms are basically raw transfer. The worst is that doubling the data amount only doubles the raw transfer but increases the amount spen tin the allocator by a lot more (to about 1.2 seconds).</p> <p>In my personal experience, Go does not like allocating a lot of memory. And when writing application I try to reuse allocated memory for big objects (like an image or thumbnail), but I guess that is true for any language.</p> <p>I&#39;m not sure if there might be workaround, but this is what I&#39;ve personally found in a few limited experiences working with high-throughput stuff.</p></pre>weberc2: <pre><blockquote> <p>In my personal experience, Go does not like allocating a lot of memory.</p> </blockquote> <p>Do you mean &#34;Go doesn&#39;t like allocating large blocks of memory&#34;? Or &#34;Go doesn&#39;t like many allocations&#34;? I understand the latter is true, but I&#39;ve not heard anything negative about the former.</p></pre>tscs37: <pre><p>allocating about 1.2GB of memory in go takes quite a long time.</p> <p>This was actually after optimizing from making many small allocations, which was many times worse.</p> <p>In general; both are a bit bad on go, allocations take a long time compared to how fast other languages can do it.</p></pre>weberc2: <pre><p>Interesting. Why does it take so long to preallocate memory? Is it the zeroing-out?</p></pre>tscs37: <pre><p>I&#39;m not exactly sure why it takes so long, it&#39;s probably some underlying mechanic.</p> <p>It would be nice to have some compiler flag to make allocation faster at some other cost.</p></pre>weberc2: <pre><p>Not sure what you mean by &#34;quite a long time&#34;, but I can allocate that on my MBP in 71 ms with Go 1.7. Is this too large, or were you observing a much larger duration?</p></pre>tscs37: <pre><p>Well, I did only work on it in Go 1.6, maybe they made some stuff go faster.</p> <p>I can only tell what I&#39;ve found.</p></pre>weberc2: <pre><p>Hmm, I got 75ms on 1.6. I wonder if there wasn&#39;t some other problem with your benchmark?</p></pre>VifArdente: <pre><p>Go version with pools (arenas): <a href="https://groups.google.com/d/msg/golang-nuts/sy8IuQQOQSI/MtDor18hCQAJ" rel="nofollow">https://groups.google.com/d/msg/golang-nuts/sy8IuQQOQSI/MtDor18hCQAJ</a></p></pre>FantomLancer: <pre><p>Wow, amazing improvement! Just shows the extent of possible optimizations! Seems like Go is unfairly treated in the benchmark shootout when they reject this code.</p></pre>VifArdente: <pre><p>That was just an exploration of Go&#39;s ability to handle offheap objects, I never submitted it to shootout.</p></pre>FantomLancer: <pre><p>Care if I do?</p></pre>weberc2: <pre><p>From the application instructions:</p> <blockquote> <p>Please don&#39;t implement your own custom &#34;arena&#34; or &#34;memory pool&#34; or &#34;free list&#34; - they will not be accepted.</p> </blockquote> <p><a href="http://benchmarksgame.alioth.debian.org/u64q/binarytrees-description.html#binarytrees" rel="nofollow">http://benchmarksgame.alioth.debian.org/u64q/binarytrees-description.html#binarytrees</a></p></pre>FantomLancer: <pre><p>When is it &#34;custom&#34;? This must be open to interpretation. I mean, if something similar to <a href="https://github.com/pi/goal/blob/master/gut/mempool.go" rel="nofollow">https://github.com/pi/goal/blob/master/gut/mempool.go</a> was in some part of the golibrary, this benchmark would be OK?</p> <p>I guess SportingSnow21 is spot on when he questions the validity of pool allocators in this benchmark for ANY language.</p></pre>weberc2: <pre><p>Yes, I think if it were part of the standard library, and I agree that pool allocators in any language defeat the usefulness of the benchmark.</p></pre>VifArdente: <pre><p>It&#39;s a modification of original shootout code so I have no objections. But I doubt it will be accepted.</p></pre>igouy: <pre><p>You aren&#39;t the author.</p></pre>mixedCase_: <pre><p>Uh... the author replied to him before you stating it was okay, what was the need to mention this?</p></pre>igouy: <pre><p>It&#39;s the first reason that it will not be accepted.</p></pre>igouy: <pre><blockquote> <p>…it seems that Java is very efficient at this without any special library approach…</p> </blockquote> <p>And without using any of the myriad GC options that are available.</p></pre>

入群交流(和以上内容无关):加入Go大咖交流群,或添加微信:liuxiaoyan-s 备注:入群;或加QQ群:692541889

465 次点击  
加入收藏 微博
暂无回复
添加一条新回复 (您需要 登录 后才能回复 没有账号 ?)
  • 请尽量让自己的回复能够对别人有帮助
  • 支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`
  • 支持 @ 本站用户;支持表情(输入 : 提示),见 Emoji cheat sheet
  • 图片支持拖拽、截图粘贴等方式上传