Trouble parallelizing go routines (on a Pi 3)

polaris · · 208 次点击    
这是一个分享于 的资源,其中的信息可能已经有所发展或是发生改变。
<p>I know, concurrency is not parallel :) But I assumed Go routines would at least fill up available CPU cores. So made this little test on a Raspberry Pi 3 where I find primes in serial and using Go-Routines.</p> <p>The latter is much slower - which caught me by surprise.</p> <p>Just doing it in serial (primes from 1 to 1.6 million) took 5.6 seconds, whereas using Go routines took a whopping 13.8 seconds.</p> <p>I see when running the program, the CPU meter do not budge much over 25% as well.</p> <p>So I must have done something wrong somewhere!?</p> <p><a href="">source and more here</a></p> <p>Any help figuring this out appreciated :)</p> <hr/>**评论:**<br/><br/>faiface: <pre><p>I think the problem is that the sequential version uses the cache much better than the concurrent version (one local for-loop vs sending values here and there).</p> <p>Try rewriting your code in such a way, that you fire up multiple goroutines, which won&#39;t actively communicate, instead, let each of them process a sub-interval of all of the numbers you want to check. This way, each goroutine will fully utilize cache and should achieve much better performance.</p> <p>At the end, aggregate the results from each goroutine.</p></pre>nathankerr: <pre><p>It&#39;s pretty common for a concurrent implementation to slow things down. I wrote about just such an experience and then detailed how I used the tools provided by Go to speed it up: <a href=""></a></p> <p>My first impression of your concurrent version is the channel operations you use are not needed. You would also probably get better performance if you used one fewer instance of isPrimeConc because primeConcurrent is also in a goroutine that has processing (sending and receiving) for every step.</p> <p>I will be back in a bit with my own version.</p></pre>nathankerr: <pre><p>Since I wanted to use <a href="" title="benchstat" rel="nofollow">benchstat</a>, the functions being benchmarked needed to not output their progress (benchstat can’t parse the output if they do). So I moved the output and timing stuff to <code>main</code> and had the prime functions return the number of primes found.</p> <p>I then found the minimal scalable unit (MSU) as described in the article and split the work apart in a similar way because the only interaction between the MSUs is the count of prime numbers found. I used the original <code>isPrime</code> for both the sequential and concurrent versions.</p> <h2><code>piPrime.go</code></h2> <pre><code>package main import ( &#34;fmt&#34; &#34;math&#34; &#34;runtime&#34; &#34;sync&#34; &#34;time&#34; ) // set to 1.6 million - not 16 million - to save some time when debugging // I made this a global so the benchmarks use the same number const countTo uint64 = 1600000 func main() { fmt.Println(&#34;here we go. Ol&#39;skool serial prime-finder&#34;) start := time.Now() primes := primeSequential(countTo) fmt.Printf(&#34;%8d || time: %6.2f || primes: %d\n&#34;, countTo, time.Since(start).Seconds(), primes) fmt.Println(&#34;and now the concurrent way!&#34;) start = time.Now() primes = primeConcurrent(countTo) fmt.Printf(&#34;%8d || time: %6.2f || primes: %d\n&#34;, countTo, time.Since(start).Seconds(), primes) } func primeSequential(limit uint64) int32 { var primeCount int32 for i := uint64(2); i &lt; limit; i++ { if isPrime(i) { primeCount++ } } return primeCount } func primeConcurrent(limit uint64) int32 { mu := &amp;sync.Mutex{} // protects primeCount var primeCount int32 wg := &amp;sync.WaitGroup{} nworkers := uint64(runtime.GOMAXPROCS(-1)) work := limit / nworkers for i := uint64(0); i &lt; nworkers; i++ { begin := work * i end := begin + work switch i { case 0: // start at 2 begin = 2 case nworkers - 1: // last worker picks up the slack end = limit } wg.Add(1) go func() { defer wg.Done() for i := begin; i &lt; end; i++ { if isPrime(i) { mu.Lock() primeCount++ mu.Unlock() } } }() } wg.Wait() return primeCount } func isPrime(number uint64) bool { var i uint64 var s uint64 s = uint64(math.Sqrt(float64(number))) if number &gt; 2 &amp;&amp; number%2 == 0 { return false } for i = 3; i &lt;= s; i += 2 { if number%i == 0 { return false } } return true } </code></pre> <h2><code>piPrime_test.go</code></h2> <pre><code>package main import &#34;testing&#34; func BenchmarkPrimes(b *testing.B) { for i := 0; i &lt; b.N; i++ { // primeSequential(countTo) primeConcurrent(countTo) } } </code></pre> <h2>Benchmarks</h2> <p>I ran the benchmarks with:</p> <pre><code>go test -bench Primes -count 10 |tee concurrent.txt </code></pre> <p>Switching the comment in the benchmark and <code>tee</code>ing to the correct file for the version I wanted to run and ended up with:</p> <pre><code>$ benchstat sequential.txt concurrent.txt name old time/op new time/op delta Primes-2 1.01s ± 1% 0.62s ± 1% -38.46% (p=0.000 n=9+8) </code></pre> <h2>Further Optimizations</h2> <p>Optimizations I can think of (without having profiled them):</p> <ul> <li>workers should aggregate primes locally, then update <code>primeCount</code> once at the end; this should reduce mutex contention</li> <li><code>isPrime</code> should calculate <code>Sqrt</code> after the even check is done</li> </ul></pre>siritinga: <pre><p>Your problem is that you are feeding 4 values to 4 goroutines and then waiting for the reply to send other 4 values, starving the goroutines.</p> <p>I&#39;ve changed your code to reuse isPrime in both versions, creating a goroutine that feeds the prime goroutines and also using time.Time to compute durations.</p> <p><a href="" rel="nofollow"></a></p> <p>In my PC I can see a clear improvement for big values, however for small values the overhead of channels doesn&#39;t worth the effort of parallelizing.</p> <p>Regards.</p></pre>Zy14rk: <pre><p>Thanks - that worked very nicely. Uploaded the new version to the repo with credit going to you where it is due :) </p> <p>I got to remember that pattern of using go-routines. Very nice. Thanks again!</p></pre>joushou: <pre><p>I&#39;ve left a comment elsewhere, but batching (split range in 4, instead of looping over it) and only sending primes back (with a &#34;done&#34; message at the end) would give a very drastic boost as well.</p></pre>joushou: <pre><p>It&#39;s slow because of two things:</p> <ol> <li>You&#39;re sending the result for every integer from all the goroutines down a channel that has a tiny 100 number buffer, to be consumed by 1 goroutine. That means that your code cannot run any faster than that single goroutine can consume results. Your workers are waiting to get rid of their result. Send fewer results (batch, only send on prime, ...).</li> <li>You&#39;re sending too small tasks to the goroutines - your primitive isPrime function only becomes expensive once the number is above a few hundred thousand, and for most of your run, sending the request to the worker is harder than the task itself. This means that your workers are waiting to get new requests, rather than spending time working. Instead, batch your requests (send a <em>range</em>, not a number).</li> </ol> <p>There are way more things to consider, and concurrent code will always have a penalty for its concurrency, but a good thing to consider is that you want your workers to run on their own, without talking to anyone, for as long as possible.</p></pre>epiris: <pre><p>To start use the time package instead of that milis stuff, it&#39;s probably not even accurate.</p> <pre><code>start := time.Now() fmt.Println(time.Now()-start) // 14ms </code></pre> <p>As sort of mentioned in another post, though not due to caching, it&#39;s not written to benefit from concurrency. It&#39;s just a sequential loop with the added overhead of synchronization. Instead make a Range struct that sends a worker a large batch of numbers to find. I.e:</p> <pre><code>type Range struct { start, end } for I &lt; 16m, I+= batchsize queue &lt;- Range{I, I + batchSize } </code></pre> <p>Now you can remove a total of 16*n million runtime scheduler events needed that pause your goroutine, including accepting results, sending results, etc. Now that you are working on larger ranges you can ditch trial division, it is the slowest method for finding primes for larger numbers. Since your script goes to 16 million you should use sieve of atkin for larger ranges or eratosthenes (check spelling) since they are much more efficient for large numbers. At this point things should be much faster than the sequential.</p></pre>dlsniper: <pre><p>Read and apply this <a href="" rel="nofollow"></a> and <a href="" rel="nofollow"></a></p></pre>nevyn: <pre><p>I&#39;d start with what others have done:</p> <p><a href="" rel="nofollow"></a></p> <p>Eg.</p> <p><a href="" rel="nofollow"></a></p></pre>NoEffex: <pre><p>I think you need to set the number of cores to use. </p> <p><code>import &#34;runtime&#34; runtime.GOMAXPROCS(runtime.NumCPU())</code></p> <p>This allows GO to use all cpu threads</p></pre>dlsniper: <pre><p>You don&#39;t need to this since a few releases now, iirc Go 1.4 or 1.5.</p></pre>NoEffex: <pre><p>Oh right I didn&#39;t know that</p></pre>

入群交流(和以上内容无关):加入Go大咖交流群,或添加微信:liuxiaoyan-s 备注:入群;或加QQ群:692541889

208 次点击  
加入收藏 微博
添加一条新回复 (您需要 登录 后才能回复 没有账号 ?)
  • 请尽量让自己的回复能够对别人有帮助
  • 支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`
  • 支持 @ 本站用户;支持表情(输入 : 提示),见 Emoji cheat sheet
  • 图片支持拖拽、截图粘贴等方式上传