<p>So we wrote a Go service which does 30k+ QPS using ZMQ (via cgo calls) on a 16-core EC2 instance.</p>
<p>We now want to put it on a private cloud with a 56-core VM. We want to scale up, and there are 2 ways we can do it.</p>
<ol>
<li>Launch 4 instances of the same service as-is, with config changes.</li>
<li>Launch 4 instances of the same service with the following option:
<ul>
<li><code>GOMAXPROCS=16 ./run_my_unoptimized_code</code></li>
</ul></li>
</ol>
<p>We can perhaps pin each service to 16 cores via taskset.</p>
<p>Which one would be the wiser choice? Does anyone have experience with scenarios like this?</p>
<p>FWIW, the service uses 3 sets of channels for a pipeline pattern and around 50 goroutines, if that matters.</p>
<p>Sorry, the code is owned by my employer, so I cannot share it :(</p>
<hr/>**Comments:**<br/><br/>kostix: <pre><p>Two points:</p>
<ul>
<li>The result of pinning your instances with <code>taskset</code> is <em>almost</em> the same as specifying <code>GOMAXPROCS</code>: the instance would see exactly the number of cores it was pinned to, and that would make it create that many <code>P</code>s ("processes"—thingies used to run goroutines on OS threads). Of course, the remaining difference is that in the case of using <code>taskset</code>, the OS won't schedule the instance's threads on the remaining cores.</li>
<li>The scheduler in the Go runtime is itself concurrent: that is, it is not implemented as something monolithic protected by a global lock—quite on the contrary, as much as can be done concurrently, is done concurrently. Specifically, each <code>P</code> has its own run queue of goroutines, and different <code>P</code>s are able to steal goroutines from the runqs of other <code>P</code>s without touching the global scheduler state.</li>
</ul>
<p>That said, another point to possibly consider is that goroutines are not <strong>fully</strong> preemptible (that's actually a good thing, but read on): this means that long runs of Go code which do not call any functions could effectively "pin" a goroutine to its underlying <code>P</code> (and hence to its underlying <code>M</code>—the "machine", an OS thread), preventing fair distribution of CPU quanta across goroutines. In such cases, the more <code>P</code>s the scheduler has, the better, but such cases are pathological anyway. You could try to see whether you have such a case by inspecting a so-called "scheduler trace" captured over a run under a typical workload—see <a href="https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs" rel="nofollow">this</a>.</p></pre>fakeNAcsgoPlayer: <pre><p>Thanks, we decided to move ahead with the default options and let the OS choose what is best for the 4 processes.</p></pre>robe_and_wizard_hat: <pre><p>Benchmark both scenarios and see which one performs better.</p></pre>tmornini: <pre><p>This.</p></pre>fakeNAcsgoPlayer: <pre><p>Don't have that luxury as we do not own the cloud, hence the question here. :)</p></pre>tuxlinuxien: <pre><p>Since you have more cores, why don't you let your process use all of them?</p></pre>fakeNAcsgoPlayer: <pre><p>I would gladly; I am curious what the strategy should be: let the OS handle each instance, or restrict each instance to a fixed number of cores?</p>
<p>I am leaning towards not touching anything and seeing how it goes. I am assuming that even though the runtimes of the instances cannot communicate with each other, the OS will fairly allocate resources to all 4 of them.</p></pre>tuxlinuxien: <pre><p>Well, if you want to limit the number of cores per process, it will require you to set up CPU scheduling.
Let's imagine you have 40 cores and you want to launch 4 processes:</p>
<ul>
<li>P1 => core 0-9</li>
<li>P2 => core 10-19</li>
<li>P3 => core 20-29</li>
<li>P4 => core 30-39</li>
</ul>
<p>As you can see, the processes will not share core resources (the system will still consume some processing power), but it's a bit harder to set up.</p>
<p>From my point of view, I would just let one process use all the cores; it's easier to set up. If you still want to go your way, then check <a href="https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html" rel="nofollow">cgroups</a>.</p></pre>fakeNAcsgoPlayer: <pre><p>Looks like that is what I am going to do.</p></pre>lexpi: <pre><p>This may be a stupid question, but since it's on the same box, is there any reason to have 4 instances? Can't you just have 1 larger one?</p></pre>fakeNAcsgoPlayer: <pre><p>Well, like I mentioned in the OP, the code uses cgo, so it won't scale the way you are expecting.</p>
<p>Also, redundancy is nice to have. Plus, scaling via the process model is not only simple, it is easy to reason about.</p></pre>lexpi: <pre><p>I understand the redundancy reason, but what is it about cgo calls that makes scaling up a single process instance difficult? Not arguing, just genuinely curious.</p></pre>tmornini: <pre><p>With that many goroutines, and assuming it maxes out all cores, simply running it on a larger machine may well result in higher throughput.</p>
<p>If not, I'd wrap the entire pipeline structure and launch as many pipelines (the current set of goroutines) as you desire...</p></pre>