Kill child goroutines on SIGKILL / detect main thread has been terminated?

<p>** --final edit: **</p> <p>Here is the issue still happening on a windows 7 machine. <a href="https://imgur.com/a/K1XRY" rel="nofollow">https://imgur.com/a/K1XRY</a></p> <p>When I tried searching for this issue, the only thing I came across was that older versions of mingw bash send SIGKILL instead of SIGTERM when you ctrl-c. My issue has been resolved by running go with native terminals / modern operating systems, but there&#39;s the screenshot to show the issue caused by the linked go code when SIGKILL happens. </p> <p><strong>rest of the post</strong></p> <p>--edit: Here is a code sample to demonstrate what I&#39;m doing. It&#39;s in its early phases now so this is basically the entire thing, haha.</p> <p><a href="https://play.golang.org/p/60ucZvDhEA" rel="nofollow">https://play.golang.org/p/60ucZvDhEA</a></p> <p>--edit2: wow, the go playground is very impressive. It even prints the messages back with the delay. That&#39;s good to demonstrate my concern - if I SIGKILL a program, you will keep seeing those messages until the resource count is 0 - but in the next phase there will be a variable amount of goroutines and some increment that counter instead of decrementing it. I CAN use a platform native shell so ctrl-c is SIGINT instead of SIGKILL but I am just curious at this point. </p> <p>--edit3: <strong>ISSUE RESOLVED</strong> Well, I did not find a way to trap/detect SIGKILL, but I did test the mingw bug on my windows 10 machine and it appears to have been fixed. My windows 7 machine is what originally had the issue. I&#39;ll get a screenshot monday when I have access to that machine just so anyone curious can see the issue. It&#39;s pretty amusing to see the ctrl-c, the terminal prompt, and then the goroutines continuing to run in the background and print to the screen after that point. </p> <p><strong>Original Post</strong></p> <p>Hey, I have a ridiculous edge / use case. I am running a little simulation that may turn into a sort of game in the future. 
I have a struct with a mutex and a count. Several goroutines take turns at different random intervals locking the mutex, decrementing the counter, incrementing their own counter, and finally unlocking the mutex. When the count is 0, the goroutines break their infinite loop and send their own little kill signal to the main thread which closes a for-range loop that is logging messages from the many child processes decrementing the counter. </p> <p>I hit ctrl-c when running via git bash on a windows OS and this sends SIGKILL instead of SIGINT (documented mingw64 bug, probably not getting fixed anytime soon). I could just use a native windows terminal/cmd, but I&#39;m concerned about what happens if the program crashes or is killed for any other reason. When I killed the main process attached to the terminal, the goroutines were still running and the logic inside the main thread/program/whatever (relatively new to go, my terminology is weak) was still receiving and printing messages to the terminal! Originally I let each goroutine print to the console and thought that was a reasonable case for them to continue printing after the main program died. Now I have the main thread block and read messages from the channels... but that also appears to live on and continue running after the process is killed. </p> <p>My larger concern is about the next phase of the project. An unknown number of child processes in the simulation and some that can add to the counter instead of taking from it. It could go on until it is killed.... but tracking down and killing a few thousand goroutines sounds unreasonable. </p> <p>There is a lot of noise when searching for this issue because my intentions are a little different. I really hope someone knows of a way to either capture the SIGKILL so I can tell my goroutines to die / panic or detect that the main thread is no longer running. 
And maybe also get an explanation of how the channel code was still running after receiving a SIGKILL - maybe something to do with how go allocated memory to the heap instead of the stack when it is uncertain about the scope of a variable?</p> <p>Ideas I&#39;ve had so far but haven&#39;t worked out...</p> <ul> <li><p>use os/signal and Notify to observe the kill signal and break from the goroutines, failed</p></li> <li><p>use os/signal and Notify to mutate a global variable the goroutines observe and break if it changes, failed (because we can&#39;t observe / do anything when SIGKILL happens :( )</p></li> <li><p>communicate with the main thread with a channel so when it dies the goroutines panic, which kills them all, failed because the channel &amp; logic remained open after the program terminated... somehow....</p></li> <li><p>communicate via a system file that is updated by the main thread and checked by the goroutines so when it stops getting updated they know the program died, failed because the main thread is already blocked from reading channels and if I move either of those out of the main thread I won&#39;t be able to observe the kill signal / thread termination</p></li> <li><p>do some platform specific crap to monitor the main PID from the goroutines and die if the main one dies, failed because it is a stupid idea since PIDs can be reused and the solution requires a different implementation per OS...</p></li> </ul> <h1></h1> <p>Any help would be greatly appreciated :) I know I can do something like a server to tell the goroutines to die / control them manually, but there MUST be something I can do, right? </p> <hr/>**Comments:**<br/><br/>TheBeasSneeze: <pre><p>Use <a href="https://golang.org/pkg/context/" rel="nofollow">context</a> although this sounds like a really strange edge case bug :s</p></pre>Tomnnn: <pre><p>Thanks, that looks useful! 
However, some testing shows that sigkill doesn&#39;t exactly terminate the main thread, so the code I&#39;m using as a sort of keepalive with that timeout context would still be there. </p> <p>If a golang process <strong>crashes</strong>, is there a panic that will kill the child processes? What even causes a kill signal? Maybe I shouldn&#39;t worry about this since I will be able to catch ctrl-c if I get over my preferences and run a regular windows terminal :P</p> <h1></h1> <p>And yea I have a crappy little $300 machine dedicated to stuff like this. I&#39;m going to use all available resources on it to create as many routines as possible and each one will be a primitive ai and they will use mutex synced objects to interact with common resources and subtract from / add to those resources. It&#39;ll be like a colony of primitive organisms :D</p> <p>I don&#39;t know if anything useful will come out of this, but it&#39;s amusing. ctrl-c will be sufficient, 1 goroutine can watch that and set a flag / panic to shut off the rest of them. </p> <p>--edit:</p> <p>Oh and just to make sure I understand - is context a lightweight alternative to passing a channel to a goroutine for signaling? For example, I&#39;ve seen lots of examples for control signals where you pass 1 channel along for receiving data and 1 struct{} channel for control signals. I think that &#34;withvalue&#34; context might serve the purpose people usually use a struct{} channel for to tell the main thread that a goroutine did a thing. </p></pre>jerf: <pre><p>There&#39;s a lot of things to unpack here and it&#39;s not clear to me everything that&#39;s going on.</p> <p>First, go has &#34;goroutines&#34;. If you happen to call them &#34;threads&#34; every so often it isn&#39;t that big a deal, plus I still see the phrase &#34;thread-safe&#34; used sometimes, possibly because &#34;goroutine-safe&#34; is just a kinda klunky phrase. 
But it&#39;s important to be clear that they all live in one OS process, so there&#39;s no way for goroutines to be living on past the termination of their host process.</p> <p>Second, there is no way to &#34;kill&#34; a goroutine. You can only write code that will cause them to eventually kill themselves. So for instance if you have goroutines doing some calculation in a tight loop without ever &#34;looking up&#34;, the only way to control them is to modify them to &#34;look up&#34; for the sign they need to kill themselves every so often. If they need to be able to checkpoint their calculations or something, you have to write that in yourself.</p> <p>The most popular mechanism is to have a channel that gets closed, then using the checks on channels being closed to see if the process terminates. However, there&#39;s nothing necessarily special about that; channels are used in the default context package (referenced by TheBeasSneeze) because they work well in a context where the core thing being done is a <code>for { select { ... } }</code> loop, but anything that allows a goroutine to &#34;look up&#34; and see they need to stop works. Just be aware that if you&#39;re trying to get a whole bunch of goroutines to use the same signal you need to start worrying about the contention on the signal itself.</p> <p>(One of the things about using Go that took me a while to accept is that it&#39;s fine to take a channel, which has a whole bunch of behaviors, and use them for just a part of that behavior, like whether they are closed, or sending down <code>struct{}</code>s to just use them for their syncing, or whatever. The existence of a channel does not obligate you to use all of its functionality.)</p> <p>The other problem here I <em>think</em> is that you either have a bookkeeping issue, or a major runtime flaw in your environment that is the real problem. 
The duration of a Go program&#39;s OS process is <a href="https://golang.org/ref/spec#Program_execution" rel="nofollow">specified to be the duration of the initial goroutine that executes the program&#39;s <code>main</code> function</a>. Once that terminates, the OS process terminates. There is no way that goroutines can continue running in the background for very long past that, and even if you are able to witness some side effect that says they are continuing to run, the process is still on its way down and nothing can be relied on after that.</p> <p>I would carefully check your bookkeeping in the main routine, and whatever else it uses, and ensure that you do not accidentally end up running some <code>go</code> somewhere that accidentally moves the logic of what you thought was the &#34;main&#34; program into a new goroutine, while the original goroutine gets frozen into waiting on something. (This is pretty easy to do with closures.)</p> <p>Alternatively, if you can create a reduced test case that <em>proves</em> that under your environment program execution and other goroutines can substantially outlive the original goroutine, the only logical and correct move is to submit that as a bug to the Go bug tracker. This is the kind of bug you do <em>not</em> want to try to just bash around in your client code. Trust me. This is one of those cases that once you&#39;re sitting on top of a substrate acting incorrectly, the stuff sitting on top of it simply cannot properly &#34;fix up&#34; the underlying layer. I can assure you that the bug will be taken seriously. (I can&#39;t promise it will be immediately fixed or whatever, because there&#39;s other concerns and issues that arise there. But I am confident it will be taken quite seriously.) I kinda doubt this is it, but if it <em>is</em> the problem it&#39;s a huge one.</p></pre>Tomnnn: <pre><blockquote> <p>Second, there is no way to &#34;kill&#34; a goroutine. 
You can only write code that will cause them to eventually kill themselves.</p> </blockquote> <p>I did this with a panic, I think. I set a handful of goroutines to print to the console on a loop forever with no termination, and then I had another goroutine wait 5 seconds and then panic for no reason. This killed every goroutine started by the main process. My issue is I have no idea how to determine if the main / parent process is still running or not since we can&#39;t observe sigkill. </p> <blockquote> <p>The most popular mechanism is to have a channel that gets closed, then using the checks on channels being closed to see if the process terminates. However, there&#39;s nothing necessarily special about that; channels are used in the default context package (referenced by TheBeasSneeze) because they work well in a context where the core thing being done is a for { select { ... } } loop, but anything that allows a goroutine to &#34;look up&#34; and see they need to stop works. Just be aware that if you&#39;re trying to get a whole bunch of goroutines to use the same signal you need to start worrying about the contention on the signal itself.</p> </blockquote> <p>The issue for me is that the goroutines are watching a value that is being decremented. They all break when it is 0, and they take turns decrementing it, but in the next phase of the pointless project some goroutines will be adding to it. I&#39;d like some way to kill the processes for that reason because the point of the pointless venture is the thing they&#39;re watching to know when to die will be ever changing. </p> <blockquote> <p>Once that terminates, the OS process terminates. 
There is no way that goroutines can continue running in the background for very long past that, and even if you are able to witness some side effect that says they are continuing to run, the process is still on its way down and nothing can be relied on after that.</p> </blockquote> <p>I&#39;ll get a code sample to demonstrate the issue. Maybe that&#39;ll properly emphasize the terrible design is the purpose of the project ;) Thanks for taking a look!</p> <p>Took a few minutes to recreate it from memory, but here you go! <a href="https://play.golang.org/p/60ucZvDhEA" rel="nofollow">https://play.golang.org/p/60ucZvDhEA</a></p> <p>I&#39;ll add this to the main post as well so some people can give me some help after telling me how pointless my idea is.</p></pre>epiris: <pre><p>Okay- so the main issue here is you do not have clear program flow, you are using conditions and synchronizations to &#34;signal&#34; exiting rather than the natural way you do when writing software: returning to your caller. This problem could be solved following the general rules I posted in my other reply, the key takeaway being context.Context, <a href="https://godoc.org/golang.org/x/sync/errgroup" rel="nofollow">errgroup.Group</a> and indicating task completion by returning to the caller.</p></pre>Tomnnn: <pre><blockquote> <p>Okay- so the main issue here is you do not have clear program flow</p> </blockquote> <p>I wanted to do a little primitive organism simulation with each having its own <em>thread</em> to act on instead of iterating over them like most games would. It&#39;s designed strangely but intentionally. </p> <h1></h1> <blockquote> <p>indicate task completion by returning to caller.</p> </blockquote> <p>The program terminates when the counter is zero. Or in the eventual end goal of this project, when there is no food left for the little sims to consume. If the program receives a SIGKILL I seem to have no way to tell the goroutines to stop. Is this just a flaw I have to live with? 
In the code example I provided - that continues even if that &#34;main thread&#34; gets killed. </p></pre>epiris: <pre><blockquote> <p>I wanted to do a little primitive organism simulation with each having its own thread to act on instead of iterating over them like most games would. It&#39;s designed strangely but intentionally.</p> </blockquote> <p>Intentionally incorrect is still incorrect. You can achieve your desired behavioral properties and still maintain a correct program. Do you think your game is the first one that had a group of objects which wanted to act independently...? Then do you think that the best design pattern in game engines with complex rendering pipelines and game events based on various goals being reached really would approach this by just starting a ton of threads and atomically incrementing a counter? No. They don&#39;t. That is poorly designed software in any language in any problem domain. That is why you are replying to posts on Reddit, because your software design is causing you issues.</p> <p>Interface type Ticker has method Tick(time.Duration), Struct World (Worker) has slice of Struct GameObject (organisms) and both implement Ticker. While the context is not done, call method Tick on the world, with the time.Duration being the time since the last call to Tick. World&#39;s Tick method calls each child&#39;s Tick method. When an organism ticks, simulate organism behavior by updating fields, performing linear interpolation against the Tick time. Now you have a truly independent organism and you can use fields to increase or grow its attributes artificially or randomly to get the desired game behavior. You could also interact with a World game object in each organism because it&#39;s a thread-safe call. You could implement a behavior tree for complex interactions between organisms. 
</p> <p>There, now you&#39;re using proper software design and can actually simulate arbitrary conditions with your organisms rather than have their behavior based on the language runtime scheduler&#39;s underlying implementation of atomic primitives, which is hardly like independently operating organisms.</p> <blockquote> <p>If the program receives a SIGKILL I seem to have no way to tell the goroutines to stop.</p> </blockquote> <p>Yes, you are creating your own problems here. Again, you do not have clear program flow like the example I gave you. </p></pre>gargamelus: <pre><p>When the main program (thread) exits, all goroutines immediately go away. You don&#39;t need to detect anything and don&#39;t need to notify any goroutines. This is how go works. (And Linux does so that on SIGKILL the process goes away without any opportunity to do anything.)</p> <p>I don&#39;t see how your playground link demonstrates that goroutines continue running. Take a simpler example: <a href="https://play.golang.org/p/D1k-k0Vox7" rel="nofollow">https://play.golang.org/p/D1k-k0Vox7</a></p> <p>When the main process exits, the goroutine stops printing.</p></pre>Tomnnn: <pre><p>I just ran this myself on linux and it looks like interrupt does kill the threads. I guess this is a bug specific to windows &amp; mingw? </p> <p>Well, since it works on my target platform, I guess I don&#39;t care anymore :P I&#39;ll get a screenshot of the issue though for the doubters. Expect an update to the original post in ~10 minutes.</p> <p>--edit: oh my, it looks like this has been fixed on windows 10 and is only an issue for the out of date version on my other win7 machine! I guess if SIGKILL is not something we should expect to happen, then I&#39;ll stop worrying about it?</p></pre>losinggeneration: <pre><p>What does <a href="https://play.golang.org/p/razCG09UpL" rel="nofollow">https://play.golang.org/p/razCG09UpL</a> give you when you run it on Windows and press ^C ? It may be you were watching for the wrong signal. 
I don&#39;t have a Windows box to test currently.</p></pre>Tomnnn: <pre><p>Thanks for the suggestions, but I have tested this. ctrl-c from a command prompt does send sigint, but I tend to use mingw64 git bash because I like access to so many utilities in 1 terminal :)</p> <p>Any serious dev would be done on linux so it&#39;s a non-issue, but I am also concerned about the program crashing at some point and leaving behind a million goroutines in the background - because my project is a pointless simulation to strengthen my concurrency knowledge. </p></pre>epiris: <pre><p>Here is an <a href="https://gist.github.com/cstockton/77e38b16999f382fa0ef3060d785413a" rel="nofollow">example</a> I made some time ago with a signal handler added. Though in general you should avoid trapping signals, if you do and the intent is using it as a synchronization mechanism it&#39;s a strong indicator of a design issue with the program. As for everything else a few general concurrency rules that I follow which can apply to most programs to help ensure they are correct:</p> <ul> <li>You should always have a top level context in main, even if it&#39;s context.Background()</li> <li>This context should be given to all function calls that perform long-running tasks, they should always have at least this signature: func(Context) error, so they can cancel work when context is done and report an error to distinguish task failure from cancellation.</li> <li>All functions that create other goroutines should always follow the rule above, but never exit until all goroutines they have created have exited.</li> </ul> <p>There are exceptions to these rules at times, such as creating a service that can start/stop that runs workers and in those cases I ensure that Stop() does not return unless the goroutine started by Start returns. The general theme here though is you always &#34;close the loop&#34; so to speak, any call site that starts a goroutine is responsible for ensuring it exits. 
Golden rule: work is done when the call returns. If you start goroutines and try to join on a condition that is <em>not</em> them returning to their callers, programs get very difficult to debug. Without seeing code it&#39;s hard to know what you&#39;re running into, but maybe these rules may help.</p></pre>Tomnnn: <pre><p>My issue when testing this is that it could not capture sigkill. If something terminates the program that way, those child processes keep going..</p> <p>--edit: this appears to be a bug with mingw64 on windows 7. Will update tomorrow afternoon with proof. Looks like a non-issue since sigint stops child threads on a proper OS. </p></pre>kemitche: <pre><p>SIGKILL is intentionally uncatchable. It&#39;s the OS forcibly shutting down the program, usually because it&#39;s misbehaving by not responding to SIGTERM.</p> <p>SIGINT and/or SIGTERM are the &#34;soft&#34; kill signals that can be trapped by the program being interrupted.</p></pre>Tomnnn: <pre><p>I thought it was necessary to do additional cleanup, but yesterday when I tested go on galliumOS and win10, the goroutines died when the program received SIGINT. In about 2 hours I&#39;ll be updating the main post with a screenshot of the bug with mingw on win7. </p> <p>Though I did initially explain I tried signal trapping and acknowledge sigkill cannot be caught (bullet points in OP if anyone missed it), and that was still half of the responses, and I got downvotes for asking questions / clarifying I&#39;ve tried that to those responses. Gives a pretty bad impression of the community and probably won&#39;t post here again for issues :/</p> <p>Going to post that screenshot and abandon ship. </p> <p>--edit: Updated the main post with <a href="https://imgur.com/a/K1XRY" rel="nofollow">the image</a></p></pre>gargamelus: <pre><p>Can you please post the source code as well? I have been running Go on Win7 and msysgit mingw bash without problems.</p></pre>
