Monitoring for panics?

blov · · 518 views
This is a shared resource; the information in it may have since evolved or changed.
<p>I&#39;ve been using Go for 3+ years now and still have no good approach to panics. Obviously I have upstart/systemd/whatever managing the process so it automatically restarts after a panic, but I&#39;d like to be <em>alerted</em> when something panics.</p> <p>Googling quickly reveals that the Go team doesn&#39;t believe in any kind of <a href="https://www.google.com.hk/search?q=golang+reporting+panics&amp;oq=golang&amp;aqs=chrome.0.69i59j69i57j69i60l3j69i65.1005j0j7&amp;sourceid=chrome&amp;ie=UTF-8">global panic handler</a>, and <code>deferring</code> a <code>recover</code> inside <code>main</code> doesn&#39;t work either because it doesn&#39;t catch panics in other goroutines.</p> <p>Basically my &#34;state of the art&#34; for this is using statsd to draw vertical red lines in graphite whenever the process starts up, and if I see it start when it wasn&#39;t intentionally restarted by the team then I know it&#39;s probably recovering from a crash. So when I see that I just go crawl through the STDERR logs to find the traceback.</p> <p>It seems like a really dumb way of operating. I&#39;d like to, for example, automate dumping stacktraces into Slack whenever a service panics. I spent a few hours one day trying to write a program that tails my upstart logs and parses out panics but it was such a fragile pain in the ass I gave up.</p> <p>Anyone have any pointers here?</p> <hr/>**Comments:**<br/><br/>peterbourgon: <pre><p>Why are your services panicking so much? It&#39;s a legitimate question: I&#39;ve been writing Go since its release and I can&#39;t remember the last time I had a panic outside of a test or some really prototype code I was developing on my laptop. Rather than figuring out a way to effectively manage these things, it seems like your energy is better spent stopping them from occurring in the first place. It&#39;s possible! </p></pre>excited_by_typos: <pre><p>Fair question. 
I agree; we do spend time fixing panics, and by default our services don&#39;t panic often, if at all. But there have been instances when a change was made that introduced panics in production. I&#39;m mostly interested in detecting this.</p></pre>medecau: <pre><p><em>Serious question:</em> How do you know you are not getting panics in production?</p></pre>peterbourgon: <pre><p>If my program panics it typically crashes. When my program crashes it&#39;s picked up by my supervisor (typically runit) and my monitoring. If it&#39;s a contained panic, e.g. in an HTTP handler, it&#39;s picked up in the logs, which I see.</p> <p>edit: And the class of things that can induce panics is not particularly large. Nil pointers and slice out-of-bounds errors account for the majority, and once you train your eyes a little bit it&#39;s easy to spot the conditions where those things might occur on a skim-through.</p></pre>jerf: <pre><p>In my opinion: As cheap as the <code>go</code> keyword may superficially be to use at first, it isn&#39;t <em>quite</em> as cheap as the tutorials may indicate if you&#39;re going to be writing production code. I believe I&#39;ve seen Dave Cheney make the point that you should always know how a goroutine will end, and <em>technically</em> this is a special case of that consideration, but I&#39;d add that you should always explicitly know what will happen if a goroutine panics, because, indeed, every time you type <code>go</code> and start a goroutine that may panic with no shielding, you are writing something that could take down your entire program. Which could be doing tens of thousands of things at a time, or more. 
Quite inconvenient.</p> <p>Just as <code>if err != nil { return err }</code> is sort of a default you can slap down and then modify as you discover what else you need, I often in a very similar way slap down</p> <pre><code>defer func() {
    if r := recover(); r != nil {
        WhateverYouAreLoggingWith(&#34;some message for context: %v&#34;, r)
    }
}()
</code></pre> <p>on <em>every</em> goroutine, one way or another. Where possible I prefer to use a library to do it; <a href="https://github.com/thejerf/suture">suture</a> for my own stuff, the HTTP library automatically handles panics in its own handlers, etc., but if I manually <code>go</code> something I need to handle it myself, too. Just as the &#34;default error&#34; handler is not always appropriate, neither is that snippet, and I often find myself modifying it. One of my more common uses of <code>var</code> statements is to pull the scope of a variable up above that handler, so that the handler has access to it and can examine, to some extent, where the crash occurred, usually just by packing it into the log statement somehow. (Doing <em>logic</em> on what may be data in any arbitrary in-between state is one of those things that turns out to be <em>way</em> trickier than it looks and I strongly recommend against it, but simply <em>logging</em> the intermediate state&#39;s values can be valuable.)</p> <p>Another very common thing that shows up in my final handlers is channel closing, or some other mechanism for indicating that a request has failed. Otherwise it&#39;s easy to end up with crashed processes causing other processes to deadlock themselves, and before you know it you&#39;ve got a cascading problem.</p> <p>So, basically, my proposed solution to you would be to go through every <code>go</code> statement in your code, including any libraries that may be crashing on you, and audit that <code>go</code> statement for how the goroutine terminates, and what it does when there&#39;s a panic. 
The downside is that this will suck and take some time if you&#39;ve built a large backlog of such things. The upside is, well... you pretty much haven&#39;t got a choice and if you want to write reliable Go code you kinda have to do it anyhow.</p></pre>Redundancy_: <pre><p>Have you seen <a href="https://github.com/mitchellh/panicwrap" rel="nofollow">https://github.com/mitchellh/panicwrap</a> ?</p></pre>excited_by_typos: <pre><p>I think I had... my question in this thread is basically &#34;that&#39;s it?&#34;</p> <p>Pretty sad if this is the state of the art for golang crash reporting.</p></pre>Redundancy_: <pre><p>I&#39;d like to be able to use panicwrap with sentry (<a href="https://github.com/getsentry/raven-go/issues/95" rel="nofollow">https://github.com/getsentry/raven-go/issues/95</a>), but that would probably give me most of what I&#39;d want or expect compared to say, the global exception handler in Python.</p></pre>Taikumi: <pre><p>If you&#39;re open to paid solutions, I&#39;ve used Sentry (<a href="https://sentry.io/welcome/" rel="nofollow">https://sentry.io/welcome/</a>) at a few of my workplaces and only have good things to say about it.</p></pre>phonkee: <pre><p>You can host sentry on your own.</p></pre>tcrypt: <pre><p>I normally use a middleware that wraps the actual endpoint and catches any panics if they happen. But the first line of defense is to write code that can&#39;t panic wherever possible.</p> <p>Is this for cryptowatch? Good luck.</p></pre>excited_by_typos: <pre><p>Yeah we do that with HTTP handlers too. I&#39;m having trouble with a different service that has a ton of goroutines and isn&#39;t structured in such an easy way to wrap.</p> <p>And yeah it is. 
Thanks.</p></pre>gbitten: <pre><p>Why not wrap your goroutines in a function that defers a recover?</p> <p>I found (but have not tested) this code snippet as an example: <a href="https://gist.github.com/glyn/9527053" rel="nofollow">https://gist.github.com/glyn/9527053</a></p></pre>excited_by_typos: <pre><p>I&#39;m looking for a solution that I can use once for all of my code, not something I have to sprinkle all over the place inside my code.</p></pre>lumost: <pre><p>I used to be a fan of deferpanic for this, but they&#39;ve since shut down. <a href="https://stackimpact.com/" rel="nofollow">https://stackimpact.com/</a> offers monitoring for panics and I believe <a href="https://newrelic.com/golang" rel="nofollow">https://newrelic.com/golang</a> also offers panic monitoring. For HTTP services it&#39;s also useful to have a recovery middleware which catches panics and renders a 5xx page/API response.</p></pre>dericofilho: <pre><p>You might want to use an Erlang-like supervisor: cirello.io/supervisor or <a href="https://github.com/thejerf/suture" rel="nofollow">https://github.com/thejerf/suture</a> and wrap the part that calls your code with the instrumentation information.</p></pre>
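<p>For reference, the "wrap every goroutine" approach that several replies describe could be sketched roughly as below. This is only an illustration using the standard library: <code>SafeGo</code> and <code>PanicReport</code> are hypothetical names that do not come from the thread or from any library mentioned in it, and the <code>report</code> callback is assumed to be the place where a Slack or Sentry notification would be sent. It is not the global handler the original poster asked for; a panic in any goroutine that was not started through the wrapper still takes the process down.</p>

```go
package main

import (
	"log"
	"runtime/debug"
)

// PanicReport is a hypothetical record of a recovered panic:
// the value passed to panic() and the stack of the goroutine that panicked.
type PanicReport struct {
	Value interface{}
	Stack []byte
}

// SafeGo (hypothetical helper) runs fn in a new goroutine. If fn panics,
// the panic is recovered and handed to report instead of crashing the
// process. The returned channel is closed once fn has returned or its
// panic has been reported, so callers can wait on it and will not
// deadlock when fn crashes.
func SafeGo(fn func(), report func(PanicReport)) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		// Deferred calls run last-in first-out: the recover handler
		// runs first, then done is closed.
		defer close(done)
		defer func() {
			if r := recover(); r != nil {
				report(PanicReport{Value: r, Stack: debug.Stack()})
			}
		}()
		fn()
	}()
	return done
}

func main() {
	done := SafeGo(func() {
		var counts map[string]int
		counts["requests"]++ // writing to a nil map panics
	}, func(p PanicReport) {
		// Assumption: a real service would forward this to an alerting
		// system; here it is only written to the log.
		log.Printf("goroutine panicked: %v\n%s", p.Value, p.Stack)
	})
	<-done
	log.Print("process is still running")
}
```

<p>Closing the returned channel after the handler has run follows jerf's point about signalling failure so that other goroutines waiting on the crashed one do not deadlock.</p>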
