Scaling options for a Go service on a 64-core machine

xuanbao · 531 views
<p>So we wrote a Go service which does 30k+ QPS using ZMQ (via cgo calls) on a 16-core EC2 instance.</p>
<p>We now want to put it on a private cloud with a 56-core VM. We want to scale up, and there are 2 ways we can do it.</p>
<ol> <li>Launch 4 instances of the same service as it is, with config changes.</li> <li>Launch 4 instances of the same service with the following option. <ul> <li><code>GOMAXPROCS=16 ./run_my_unoptimized_code</code></li> </ul></li> </ol>
<p>We can perhaps pin each service to 16 cores via <code>taskset</code>.</p>
<p>Which one would be the wise choice? Does anyone have experience with scenarios like this?</p>
<p>FWIW, the service uses 3 sets of channels for a pipeline pattern and around 50 goroutines, if it matters.</p>
<p>Sorry, the code is owned by my employer so I cannot share it :(</p>
<hr/>**Comments:**<br/><br/>
kostix: <pre><p>Two points:</p> <ul> <li>The result of pinning your instances with <code>taskset</code> is <em>almost</em> the same as specifying <code>GOMAXPROCS</code>: the instance would see exactly the number of cores it was pinned to, and that would make it create that many <code>P</code>s ("processors": thingies used to run goroutines on OS threads). Of course, the remaining difference is that when using <code>taskset</code>, the OS won't schedule the instance's threads on the other cores.</li> <li>The scheduler in the Go runtime is itself concurrent: that is, it is not implemented as something monolithic protected by a global lock. Quite the contrary: as much as can be done concurrently is done concurrently.
Specifically, each <code>P</code> has its own run queue of goroutines, and different <code>P</code>s are able to steal goroutines from the run queues of other <code>P</code>s without touching the global scheduler state.</li> </ul> <p>That said, another point to possibly consider is that goroutines are not <strong>fully</strong> preemptible (that's actually a good thing, but read on): this means that long runs of Go code which do not call any functions could effectively "pin" a goroutine to its underlying <code>P</code> (and hence to its underlying <code>M</code>, the "machine", an OS thread), preventing fair distribution of CPU quanta across goroutines. In such cases, the more <code>P</code>s the scheduler has, the better, but such cases are pathological anyway. You could check whether you have such a case by inspecting a so-called "scheduler trace" captured over a run under a typical workload; see <a href="https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs" rel="nofollow">this</a>.</p></pre>
fakeNAcsgoPlayer: <pre><p>Thanks, we decided to move ahead with the default options and let the OS choose what is best for the 4 processes.</p></pre>
robe_and_wizard_hat: <pre><p>Benchmark both scenarios and see which one performs better.</p></pre>
tmornini: <pre><p>This.</p></pre>
fakeNAcsgoPlayer: <pre><p>We don't have this luxury as we do not own the cloud, hence the question here. :)</p></pre>
tuxlinuxien: <pre><p>Since you have more cores, why don't you let your process use all the cores?</p></pre>
fakeNAcsgoPlayer: <pre><p>I would gladly. I am curious what the strategy should be: let the OS handle each instance, or restrict each instance to a fixed number of cores?</p> <p>I am leaning towards not touching anything and seeing how it goes.
I am assuming that even though the runtimes of the instances cannot communicate with each other, the OS would fairly allocate resources to all 4 of them.</p></pre>
tuxlinuxien: <pre><p>Well, if you want to limit the number of cores per process, it will require you to set up CPU scheduling. Let's imagine you have 40 cores and you want to launch 4 processes:</p> <ul> <li>P1 =&gt; cores 0-9</li> <li>P2 =&gt; cores 10-19</li> <li>P3 =&gt; cores 20-29</li> <li>P4 =&gt; cores 30-39</li> </ul> <p>As you can see, the processes will not share cores (the system itself will still consume some processing power), but it's a bit harder to set up.</p> <p>From my point of view, I would just let one process use all the cores; it's easier to set up. If you still want to go your way, then check <a href="https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch01.html" rel="nofollow">cgroups</a>.</p></pre>
fakeNAcsgoPlayer: <pre><p>Looks like that is what I am going to do.</p></pre>
lexpi: <pre><p>This may be a stupid question, but since it's on the same box, is there any reason to have 4 instances? Can't you just have 1 larger one?</p></pre>
fakeNAcsgoPlayer: <pre><p>Well, like I mentioned in the OP, the code uses cgo, so it won't scale as you are expecting.</p> <p>Also, redundancy is nice to have. Plus, scaling via the process model is not only simple, it is easy to reason about.</p></pre>
lexpi: <pre><p>I understand the redundancy reason, but what is it about cgo calls that makes scaling up a single process instance difficult? Not arguing, just genuinely curious.</p></pre>
tmornini: <pre><p>With that many goroutines, and assuming it maxes out all cores, simply running it on a larger machine may well result in higher throughput.</p> <p>If not, I'd wrap the entire pipeline structure and launch as many pipelines (the current set of goroutines) as you desire...</p></pre>
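<p>The last suggestion (wrapping the whole pipeline and launching several copies inside one process) might look roughly like the sketch below. The original code was never shared, so everything here is a hypothetical stand-in: <code>runPipeline</code>, <code>runPipelines</code>, the three toy stages and the integer jobs are all assumptions made for illustration, and no ZMQ or cgo calls are involved.</p>

```go
package main

import (
	"fmt"
	"sync"
)

// runPipeline is a hypothetical stand-in for the service's pipeline:
// three stages connected by channels. It reads jobs from in and
// writes results to out until in is closed.
func runPipeline(in <-chan int, out chan<- int, wg *sync.WaitGroup) {
	defer wg.Done()
	decoded := make(chan int)
	processed := make(chan int)

	// Stage 1: toy "decode" step.
	go func() {
		defer close(decoded)
		for v := range in {
			decoded <- v + 1
		}
	}()

	// Stage 2: toy "process" step.
	go func() {
		defer close(processed)
		for v := range decoded {
			processed <- v * 2
		}
	}()

	// Stage 3: forward results; runs in the calling goroutine.
	for v := range processed {
		out <- v
	}
}

// runPipelines starts n independent pipelines that share one input
// channel and one output channel, and returns the sum of all results.
func runPipelines(n int, jobs []int) int {
	in := make(chan int)
	out := make(chan int)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go runPipeline(in, out, &wg)
	}

	// Feed the jobs, then signal that no more work is coming.
	go func() {
		for _, j := range jobs {
			in <- j
		}
		close(in)
	}()

	// Close out once every pipeline has drained and exited.
	go func() {
		wg.Wait()
		close(out)
	}()

	sum := 0
	for v := range out {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(runPipelines(4, []int{1, 2, 3, 4, 5})) // prints 40
}
```

<p>The number of pipelines becomes a single parameter, so it can be tuned per machine without starting more processes. Whether this actually helps a cgo-heavy service is something only a measurement on the real workload can answer.</p>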

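<p>To see for yourself what kostix's first point means in practice, a small program can print what the Go runtime believes it has to work with. This is a minimal sketch: <code>schedulerView</code> is a hypothetical helper name, while <code>runtime.NumCPU</code> and <code>runtime.GOMAXPROCS</code> are the standard library calls (passing 0 to <code>GOMAXPROCS</code> only queries the current value).</p>

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerView reports how many logical CPUs the runtime considers
// usable by this process, and the current GOMAXPROCS setting
// (the number of Ps). GOMAXPROCS(0) queries without changing it.
func schedulerView() (cpus, procs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	cpus, procs := schedulerView()
	fmt.Printf("usable CPUs: %d, GOMAXPROCS: %d\n", cpus, procs)
}
```

<p>Assuming the binary is built as <code>./check</code> (a made-up name) on Linux, running it as <code>taskset -c 0-15 ./check</code> should report 16 for both values on a machine with at least 16 cores, whereas <code>GOMAXPROCS=16 ./check</code> should report the full core count for usable CPUs and 16 for <code>GOMAXPROCS</code>. Recent Go releases may also take container CPU limits into account when picking the default, so treat the numbers as something to verify rather than assume.</p>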