Go profiling still broken on OSX?

When I profile my Go code under OSX, I consistently see `runtime.mach_semaphore_signal` and `runtime.mach_semaphore_wait` at the top:

```
69.95s 46.86% 46.86%     69.95s 46.86%  runtime.mach_semaphore_signal
55.42s 37.13% 83.99%     55.42s 37.13%  runtime.mach_semaphore_wait
```

The same code profiled under Linux doesn't do this. This is apparently a common symptom of a bug in BSD/OSX, according to https://github.com/golang/go/issues/6047. However, the kernel bug that was causing that issue was supposedly fixed in El Capitan.

I'm running 10.11.4 with Go 1.6.2 and am still seeing this issue. Has anyone else run into this problem on El Cap?
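For reference, tables like the ones in this thread come from Go's sampling CPU profiler. A minimal collection harness looks something like this (a sketch: the post doesn't show its actual setup, and `work` is a stand-in for the measured code):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Sample the CPU while the workload runs; StopCPUProfile flushes
	// the profile before the file is closed.
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	work()
}

// work stands in for the CPU-bound code being measured.
func work() {
	sum := 0.0
	for i := 1; i < 100000000; i++ {
		sum += 1 / float64(i)
	}
	log.Println(sum)
}
```

The tables below are then presumably produced by loading the result with `go tool pprof yourbinary cpu.prof` and typing `top10` at the interactive prompt.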
---

**Comments:**

**4ad:** Yes, it's broken. It's subtly broken: you'll get results with no indication that anything is wrong, but the data is incomplete.

**pappogeomys:** It's been working for me all along.

Does profiling the same code on another system provide anything useful? Are you getting useful results along with these runtime functions, or is this all you get?

**CaffeineComa:** Yes, the Linux version seems to be doing the right thing (telling me essentially that gonum matrix ops are slow, and there's not much I can do about that). On Linux I get something like:

```
227.37s of 256.42s total (88.67%)
Dropped 285 nodes (cum <= 1.28s)
Showing top 10 nodes out of 53 (cum >= 11.15s)
      flat  flat%   sum%        cum   cum%
    48.03s 18.73% 18.73%     62.60s 24.41%  github.com/gonum/blas/native.dgemmSerialNotTrans
    40.59s 15.83% 34.56%     65.59s 25.58%  github.com/gonum/blas/native.dgemmSerialNotNot
    37.97s 14.81% 49.37%     37.97s 14.81%  github.com/gonum/internal/asm.DaxpyUnitaryTo
    25.65s 10.00% 59.37%     38.62s 15.06%  github.com/gonum/blas/native.dgemmSerialTransNot
    19.52s  7.61% 66.98%     20.12s  7.85%  github.com/gonum/matrix/mat64.(*Dense).Add
    14.57s  5.68% 72.67%     14.57s  5.68%  github.com/gonum/internal/asm.DdotUnitary
    14.46s  5.64% 78.31%     14.46s  5.64%  runtime.memclr
    12.21s  4.76% 83.07%     22.31s  8.70%  runtime.stringtoslicerune
     9.88s  3.85% 86.92%      9.88s  3.85%  runtime.charntorune
     4.49s  1.75% 88.67%     11.15s  4.35%  github.com/gonum/matrix/mat64.(*Dense).Apply
```

On OSX, it's:

```
4.23mins of 4.47mins total (94.66%)
Dropped 224 nodes (cum <= 0.02mins)
Showing top 10 nodes out of 71 (cum >= 0.04mins)
      flat  flat%   sum%        cum   cum%
  2.56mins 57.22% 57.22%   2.56mins 57.22%  runtime.mach_semaphore_signal
  0.64mins 14.31% 71.54%   0.64mins 14.31%  runtime.mach_semaphore_wait
  0.39mins  8.80% 80.34%   0.59mins 13.24%  github.com/gonum/blas/native.dgemmSerialNotNot
  0.21mins  4.77% 85.11%   0.21mins  4.77%  github.com/gonum/internal/asm.DaxpyUnitaryTo
  0.12mins  2.60% 87.71%   0.12mins  2.60%  runtime.charntorune
  0.10mins  2.18% 89.89%   0.22mins  4.90%  runtime.stringtoslicerune
  0.10mins  2.18% 92.07%   2.95mins 66.08%  github.com/armhold/piston.(*RNN).Run
  0.05mins  1.18% 93.25%   0.05mins  1.18%  runtime.usleep
  0.04mins  0.95% 94.20%   0.62mins 13.95%  github.com/gonum/blas/native.dgemmParallel.func1
  0.02mins  0.46% 94.66%   0.04mins  0.79%  github.com/gonum/blas/native.dgemmSerialTransNot
```

**howeman:** Interesting. I was seeing the same problem on OSX, but thought it might be garbage collection.

Let's assume for the time being that the profiling bug is fixed, and that instead it's a problem with the code. The code I mentioned above also had Dgemm as an inner loop. It could be that there's a scheduling issue on Darwin, and possibly the way gonum is coded is triggering it.

Do you have access to OpenBLAS? I don't have access to a Linux box at the moment. I'm curious whether the gap between OpenBLAS and gonum/blas/native is the same on both systems, or much larger on OSX.

I can also try to take another pass at the Dgemm code. It's been a while since I tried optimizing it, and the Go compiler has gotten much better. On the bright side, the upcoming SSA compiler brings some substantial speedups. On the down side, the gap between Dgemm and OpenBLAS matrix multiplication is still much larger than I'd like for large matrices. If you happen to be an expert in mat-mul algorithms when the scheduler is roughly FIFO...

**FiloSottile:** To throw a data point into the arena: I remember throwing out a profile made on El Capitan because runtime.mach_semaphore_signal and runtime.mach_semaphore_wait were dominating, while they weren't on Linux.

It was a big concurrent internal codebase, with no use of gonum.

**howeman:** Thanks for the data point!

**CaffeineComa:** I was wondering if perhaps the matrix ops were being internally parallelized, and due to some scheduling issue (poor goroutine management, or maybe an unlucky distribution of matrix work?) one was finishing early and blocking on the semaphore while the others chugged away. I've no *evidence* of that, just a guess. I don't know why it would behave differently on Linux, though.

> Do you have access to OpenBLAS?

Not sure what you mean. Are you saying there's a C binding for Go for OpenBLAS, and I should try that?

If you're interested in the source, my project is here: https://github.com/armhold/rnn. It's a recurrent neural net that I translated to Go from Python. I was surprised that it seemed to be about 10x slower than the original. I researched a bit, and a comment by /u/jerf [1] explains that gonum isn't able to use SSE, while Numpy is. That explains the general disparity in performance.

Still, I wanted to see where it was bottlenecking, hence my profiling. It's quite possible that I'm doing something less than optimal. I've wondered if keeping the matrices around and zeroing them out between uses would be better (see the sketch below), but I haven't tried it yet.

> If you happen to be an expert in mat-mul algorithms when the scheduler is roughly FIFO...

Hah, sorry, I'm not at all an expert. I had to reach back 20+ years to my distant linear algebra classes just to get this simple translation working. But if you have suggestions I'd be happy to try them. I don't want to waste your time, though, as this is just a for-fun project and nothing critical.

[1] https://www.reddit.com/r/golang/comments/2f6053/matrix_operations_in_go_inspired_by_numpy/
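A minimal sketch of that reuse-and-zero idea, assuming the era-appropriate github.com/gonum/matrix/mat64 API (`Mul` writes into a non-empty receiver of matching size, and `RawMatrix` exposes the backing slice); whether this actually helps this workload is untested:

```go
package main

import (
	"fmt"

	"github.com/gonum/matrix/mat64"
)

func main() {
	w := mat64.NewDense(100, 100, nil) // weights (stand-in sizes)
	x := mat64.NewDense(100, 1, nil)   // input vector

	// Allocate the destination once, outside the hot loop; Mul reuses the
	// receiver's backing array when the dimensions already match.
	y := mat64.NewDense(100, 1, nil)
	for step := 0; step < 1000; step++ {
		y.Mul(w, x)
	}

	// An accumulator that must start from zero on each pass can be
	// cleared in place instead of reallocated; writing through the raw
	// slice mutates acc directly.
	acc := mat64.NewDense(100, 1, nil)
	raw := acc.RawMatrix()
	for i := range raw.Data {
		raw.Data[i] = 0
	}

	r, c := y.Dims()
	fmt.Println(r, c)
}
```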
**howeman:**

> I was wondering if perhaps the matrix ops were being internally parallelized, and due to some scheduling issue (poor goroutine management, or maybe an unlucky distribution of matrix work?) ...

Matrix multiplication is coded concurrently, but it uses a worker queue, so nothing should wait that long. The only exception is if you're multiplying a skinny matrix by a tall one: the work along the shared inner dimension of the multiply is currently done serially.

> I don't know why it would behave differently on Linux, though.

Me either. I am curious whether it's a real effect.

> Not sure what you mean. Are you saying there's a C binding for Go for OpenBLAS, and I should try that?

There is: github.com/gonum/blas/cgo

If you have OpenBLAS installed and the CGO_LDFLAGS environment variable set properly, you can use

```go
blas64.Use(cgo.Implementation{})
```

(Be sure you're importing blas/cgo and not lapack/cgo, though you can also use both, depending on the rest of your calls.) There's a different set of trade-offs (longer compile times, some OS signals don't work the same), but cgo is definitely faster for decently sized matrix multiplications.

> I researched a bit, and a comment by /u/jerf [1] explains that gonum isn't able to use SSE, while Numpy is. That explains the general disparity in performance.

That's only sort of true. The real story is that that part of Numpy calls into a BLAS library, such as OpenBLAS, so if you also call into OpenBLAS through cgo, the matrix multiplication times will be almost equivalent. People say that BLAS is coded in Fortran/C, but that's not really true: it's mostly coded in assembly and uses a lot of processor-specific tricks to be fast. Part of that is SSE, and part is probably the algorithm. The gonum version does in fact use SSE, also via hand-coded assembly, just nowhere near as sophisticated; it's the Go compiler that can't generate SSE. I don't think any one thing in particular drives the speed difference; I think it's a factor of 2 each from a couple of different places.
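Putting that suggestion into a runnable shape, here is a minimal sketch of switching mat64 onto the cgo BLAS backend. The import paths are the gonum ones discussed in this thread; the OpenBLAS location in the build command below is only an example:

```go
package main

import (
	"fmt"

	"github.com/gonum/blas/blas64"
	"github.com/gonum/blas/cgo" // note: blas/cgo, not lapack/cgo
	"github.com/gonum/matrix/mat64"
)

func init() {
	// Route all blas64 calls -- and therefore mat64 operations -- through
	// OpenBLAS via cgo instead of the native Go implementation.
	blas64.Use(cgo.Implementation{})
}

func main() {
	w := mat64.NewDense(100, 100, nil)
	x := mat64.NewDense(100, 1, nil)

	var y mat64.Dense
	y.Mul(w, x) // executed by OpenBLAS
	r, c := y.Dims()
	fmt.Println(r, c)
}
```

Build with something like `CGO_LDFLAGS="-L/path/to/OpenBLAS -lopenblas" go build`, where the library path depends on where OpenBLAS is installed on your system.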
**CaffeineComa:**

> The only exception is if you're multiplying a skinny matrix by a tall one.

Yeah, I'm doing that, unfortunately. The original Python RNN code does lots of 100x100 * 100x1 math.

> blas64.Use(cgo.Implementation{})

Cool, got it working. It's about 3x faster now. Thanks!

**howeman:**

> Yeah, I'm doing that, unfortunately. The original Python RNN code does lots of 100x100 * 100x1 math.

That should be moderately parallelized. Our current block size is 64, so that should do 4 multiplications in parallel.

> Cool, got it working. It's about 3x faster now. Thanks!

Great. I really wish ours was fast enough that you didn't have to do that. Do be careful with signals, though: for example, when my computer goes to sleep, cgo sometimes crashes, while pure Go always restarts just fine.

Feel free to ask over at gonum-dev if you have other questions/comments/suggestions.

**CaffeineComa:** Also, FWIW, all of my code is single-threaded (single-goroutined? not sure what the term is here in Go-land). If there are any goroutines running besides main, they're coming from gonum.

**howeman:** Yeah, matrix multiplication is coded concurrently. If you set GOMAXPROCS to 1, I think the runtime sem stuff will go away.
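To try that, you can either set the environment variable when launching the program (`GOMAXPROCS=1 ./yourbinary`, where `yourbinary` is a stand-in name) or pin it in code before the workload starts. A minimal sketch:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// With a single P, gonum's parallel workers run one at a time, so
	// the Mach semaphore entries should drop out of the profile (per
	// the suggestion above; untested here).
	prev := runtime.GOMAXPROCS(1)
	fmt.Println("GOMAXPROCS was", prev, "- now 1")

	// ... run the workload being profiled ...
}
```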
