Trouble parallelizing goroutines (on a Pi 3)

polaris · · 450 views
This is a resource shared earlier; the information in it may have since evolved or changed.
<p>I know, concurrency is not parallelism :) But I assumed goroutines would at least fill up the available CPU cores. So I made this little test on a Raspberry Pi 3 where I find primes serially and using goroutines.</p> <p>The latter is much slower - which caught me by surprise.</p> <p>Just doing it serially (primes from 1 to 1.6 million) took 5.6 seconds, whereas using goroutines took a whopping 13.8 seconds.</p> <p>I also see, when running the program, that the CPU meter does not budge much over 25%.</p> <p>So I must have done something wrong somewhere!?</p> <p><a href="https://github.com/rDybing/piPrime">source and more here</a></p> <p>Any help figuring this out is appreciated :)</p> <hr/>**Comments:**<br/><br/>faiface: <pre><p>I think the problem is that the sequential version uses the cache much better than the concurrent version (one local for-loop vs. sending values here and there).</p> <p>Try rewriting your code so that you fire up multiple goroutines which don&#39;t actively communicate; instead, let each of them process a sub-interval of all the numbers you want to check. This way, each goroutine will fully utilize the cache and should achieve much better performance.</p> <p>At the end, aggregate the results from each goroutine.</p></pre>nathankerr: <pre><p>It&#39;s pretty common for a concurrent implementation to slow things down. I wrote about just such an experience and then detailed how I used the tools provided by Go to speed it up: <a href="https://pocketgophers.com/concurrency-slower/">https://pocketgophers.com/concurrency-slower/</a></p> <p>My first impression of your concurrent version is that the channel operations you use are not needed.
You would also probably get better performance with one fewer instance of isPrimeConc, because primeConcurrent is also in a goroutine and does channel work (sending and receiving) for every step.</p> <p>I will be back in a bit with my own version.</p></pre>nathankerr: <pre><p>Since I wanted to use <a href="https://godoc.org/golang.org/x/perf/cmd/benchstat" title="benchstat" rel="nofollow">benchstat</a>, the functions being benchmarked needed to not output their progress (benchstat can&#39;t parse the output if they do). So I moved the output and timing stuff to <code>main</code> and had the prime functions return the number of primes found.</p> <p>I then found the minimal scalable unit (MSU) as described in the article and split the work apart in a similar way, because the only interaction between the MSUs is the count of prime numbers found. I used the original <code>isPrime</code> for both the sequential and concurrent versions.</p> <h2><code>piPrime.go</code></h2> <pre><code>package main

import (
	&#34;fmt&#34;
	&#34;math&#34;
	&#34;runtime&#34;
	&#34;sync&#34;
	&#34;time&#34;
)

// set to 1.6 million - not 16 million - to save some time when debugging
// I made this a global so the benchmarks use the same number
const countTo uint64 = 1600000

func main() {
	fmt.Println(&#34;here we go. Ol&#39;skool serial prime-finder&#34;)
	start := time.Now()
	primes := primeSequential(countTo)
	fmt.Printf(&#34;%8d || time: %6.2f || primes: %d\n&#34;, countTo, time.Since(start).Seconds(), primes)

	fmt.Println(&#34;and now the concurrent way!&#34;)
	start = time.Now()
	primes = primeConcurrent(countTo)
	fmt.Printf(&#34;%8d || time: %6.2f || primes: %d\n&#34;, countTo, time.Since(start).Seconds(), primes)
}

func primeSequential(limit uint64) int32 {
	var primeCount int32
	for i := uint64(2); i &lt; limit; i++ {
		if isPrime(i) {
			primeCount++
		}
	}
	return primeCount
}

func primeConcurrent(limit uint64) int32 {
	mu := &amp;sync.Mutex{} // protects primeCount
	var primeCount int32
	wg := &amp;sync.WaitGroup{}
	nworkers := uint64(runtime.GOMAXPROCS(-1))
	work := limit / nworkers
	for i := uint64(0); i &lt; nworkers; i++ {
		begin := work * i
		end := begin + work
		switch i {
		case 0: // start at 2
			begin = 2
		case nworkers - 1: // last worker picks up the slack
			end = limit
		}
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := begin; i &lt; end; i++ {
				if isPrime(i) {
					mu.Lock()
					primeCount++
					mu.Unlock()
				}
			}
		}()
	}
	wg.Wait()
	return primeCount
}

func isPrime(number uint64) bool {
	var i uint64
	var s uint64
	s = uint64(math.Sqrt(float64(number)))
	if number &gt; 2 &amp;&amp; number%2 == 0 {
		return false
	}
	for i = 3; i &lt;= s; i += 2 {
		if number%i == 0 {
			return false
		}
	}
	return true
}
</code></pre> <h2><code>piPrime_test.go</code></h2> <pre><code>package main

import &#34;testing&#34;

func BenchmarkPrimes(b *testing.B) {
	for i := 0; i &lt; b.N; i++ {
		// primeSequential(countTo)
		primeConcurrent(countTo)
	}
}
</code></pre> <h2>Benchmarks</h2> <p>I ran the benchmarks with:</p> <pre><code>go test -bench Primes -count 10 | tee concurrent.txt
</code></pre> <p>Switching the comment in the benchmark (and <code>tee</code>ing to the correct file for the version I wanted to run), I ended up with:</p> <pre><code>$ benchstat sequential.txt concurrent.txt
name      old time/op  new time/op  delta
Primes-2   1.01s ± 1%   0.62s ± 1%  -38.46%  (p=0.000 n=9+8)
</code></pre> <h2>Further Optimizations</h2> <p>Optimizations I can think of (without having profiled them):</p> <ul> <li>workers should aggregate primes locally, then update <code>primeCount</code> once at the end; this should reduce mutex contention</li> <li><code>isPrime</code> should calculate <code>Sqrt</code> after the even check is done</li> </ul></pre>siritinga: <pre><p>Your problem is that you are feeding 4 values to 4 goroutines and then waiting for the replies before sending the next 4 values, starving the goroutines.</p> <p>I&#39;ve changed your code to reuse isPrime in both versions, creating a goroutine that feeds the prime goroutines, and also using time.Time to compute durations.</p> <p><a href="https://pastebin.com/2h4M2ViG" rel="nofollow">https://pastebin.com/2h4M2ViG</a></p> <p>On my PC I can see a clear improvement for big values; however, for small values the overhead of channels isn&#39;t worth the effort of parallelizing.</p> <p>Regards.</p></pre>Zy14rk: <pre><p>Thanks - that worked very nicely. Uploaded the new version to the repo, with credit going to you where it is due :)</p> <p>I&#39;ve got to remember that pattern of using goroutines. Very nice. Thanks again!</p></pre>joushou: <pre><p>I&#39;ve left a comment elsewhere, but batching (splitting the range in 4 instead of looping over it) and only sending primes back (with a &#34;done&#34; message at the end) would give a very drastic boost as well.</p></pre>joushou: <pre><p>It&#39;s slow because of two things:</p> <ol> <li>You&#39;re sending the result for every integer from all the goroutines down a channel that has a tiny 100-number buffer, to be consumed by 1 goroutine. That means your code cannot run any faster than that single goroutine can consume results. Your workers are waiting to get rid of their results.
Send fewer results (batch, only send on prime, ...).</li> <li>You&#39;re sending tasks that are too small to the goroutines - your primitive isPrime function only becomes expensive once the number is above a few hundred thousand, and for most of your run, sending the request to the worker costs more than the task itself. This means your workers are waiting to get new requests rather than spending time working. Instead, batch your requests (send a <em>range</em>, not a number).</li> </ol> <p>There are many more things to consider, and concurrent code will always pay a penalty for its concurrency, but a good rule of thumb is that you want your workers to run on their own, without talking to anyone, for as long as possible.</p></pre>epiris: <pre><p>To start, use the time package instead of that millis stuff; it&#39;s probably not even accurate:</p> <pre><code>start := time.Now()
// ... work ...
fmt.Println(time.Since(start)) // 14ms
</code></pre> <p>As sort of mentioned in another post (though not due to caching), it&#39;s not written to benefit from concurrency. It&#39;s just a sequential loop with the added overhead of synchronization. Instead, make a Range struct and send each worker a large batch of numbers to check, i.e.:</p> <pre><code>type Range struct{ start, end uint64 }

for i := uint64(2); i &lt; 16000000; i += batchSize {
	queue &lt;- Range{i, i + batchSize}
}
</code></pre> <p>Now you remove a total of 16*n million runtime scheduler events that pause your goroutines: accepting results, sending results, etc. And now that you are working on larger ranges, you can ditch trial division; it is the slowest method for finding primes among larger numbers. Since your script goes to 16 million, you should use the Sieve of Atkin or the Sieve of Eratosthenes for larger ranges, since they are much more efficient for large numbers.
At this point, things should be much faster than the sequential version.</p></pre>dlsniper: <pre><p>Read and apply this: <a href="https://blog.golang.org/profiling-go-programs" rel="nofollow">https://blog.golang.org/profiling-go-programs</a> and <a href="https://making.pusher.com/go-tool-trace/" rel="nofollow">https://making.pusher.com/go-tool-trace/</a></p></pre>nevyn: <pre><p>I&#39;d start with what others have done:</p> <p><a href="https://www.google.com/search?q=golang+sieve+of" rel="nofollow">https://www.google.com/search?q=golang+sieve+of</a></p> <p>E.g.</p> <p><a href="https://play.golang.org/p/9U22NfrXeq" rel="nofollow">https://play.golang.org/p/9U22NfrXeq</a></p></pre>NoEffex: <pre><p>I think you need to set the number of cores to use:</p> <pre><code>import &#34;runtime&#34;

runtime.GOMAXPROCS(runtime.NumCPU())
</code></pre> <p>This allows Go to use all CPU threads.</p></pre>dlsniper: <pre><p>You don&#39;t need to do this since a few releases now, IIRC Go 1.4 or 1.5.</p></pre>NoEffex: <pre><p>Oh right, I didn&#39;t know that.</p></pre>
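The pattern several commenters converge on - send a *range* of numbers per channel operation, count primes locally in each worker, and report back once per worker instead of once per number - can be sketched as follows. This is a minimal illustration with made-up names (`Range`, `countPrimes`, `batchSize`), not code from the piPrime repo:

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// Range is a batch of numbers for one worker to scan.
type Range struct{ start, end uint64 }

func isPrime(n uint64) bool {
	if n < 2 {
		return false
	}
	if n%2 == 0 {
		return n == 2
	}
	for i := uint64(3); i <= uint64(math.Sqrt(float64(n))); i += 2 {
		if n%i == 0 {
			return false
		}
	}
	return true
}

// countPrimes counts primes below limit using workers that each
// consume whole batches, so channel traffic is per-batch, not per-number.
func countPrimes(limit, batchSize uint64, workers int) int {
	queue := make(chan Range, workers)
	results := make(chan int, workers) // buffered: one send per worker

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			local := 0 // aggregate locally; talk to other goroutines rarely
			for r := range queue {
				for n := r.start; n < r.end; n++ {
					if isPrime(n) {
						local++
					}
				}
			}
			results <- local // exactly one send per worker
		}()
	}

	for i := uint64(2); i < limit; i += batchSize {
		end := i + batchSize
		if end > limit {
			end = limit
		}
		queue <- Range{i, end}
	}
	close(queue)
	wg.Wait()
	close(results)

	total := 0
	for c := range results {
		total += c
	}
	return total
}

func main() {
	fmt.Println(countPrimes(1600000, 10000, 4))
}
```

With batches of around 10,000 numbers, each worker does substantial work between channel operations, so the four cores of a Pi 3 can stay busy instead of waiting on the scheduler.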
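As for the sieve suggested in the last few comments: even a plain sequential Sieve of Eratosthenes avoids per-number trial division entirely, which at this scale matters more than parallelism. A minimal sketch (my own version, not the linked playground code; `sievePrimeCount` is an illustrative name):

```go
package main

import "fmt"

// sievePrimeCount counts primes below limit with a classic
// Sieve of Eratosthenes: mark multiples of each prime, starting at p*p.
func sievePrimeCount(limit uint64) int {
	if limit < 3 {
		return 0
	}
	composite := make([]bool, limit)
	count := 0
	for n := uint64(2); n < limit; n++ {
		if composite[n] {
			continue
		}
		count++ // n was never marked, so it is prime
		for m := n * n; m < limit; m += n {
			composite[m] = true
		}
	}
	return count
}

func main() {
	fmt.Println(sievePrimeCount(1600000))
}
```

The sieve runs in O(n log log n) time in a single pass; if concurrency is still wanted, sub-ranges can be sieved per worker (a segmented sieve), which combines well with the batching pattern above.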
