使用 Go 构建 Resilient Services - 技术会谈【未翻译】

· 2015-08-07 11:00:05 · 3297 次点击 · 预计阅读时间 31 分钟 · 大约8小时之前开始浏览

这是一个创建于 2015-08-07 11:00:05 的文章，其中的信息可能已经有所发展或是发生改变。

In this Tech Talk from GopherCon 2015, Blake Caldwell, a former Software Engineer here at Fog Creek who worked on the Kiln team, explains how he used Go to re-write and speed up KilnProxy, our SSH Reverse Proxy. Hear how he was able to re-write the service and reduce clone times by half, whilst making it more reliable and less noisy.

Blake writes about Go and software development on his blog. He has open-sourced the profiler mentioned and you can get the slides from the talk on his GitHub.

视频链接：https://youtu.be/PyBJQA4clf

About Fog Creek Tech Talks

At Fog Creek, we have weekly Tech Talks from our own staff and invited guests. These are short, informal presentations on something of interest to those involved in software development. We try to share these with you whenever we can.

还没有人翻译此段落

我来翻译

Content and Timings

Introduction (0:00)
Background (0:22)
About KilnProxy – SSH Reverse Proxy (1:33)
Results (3:35)
Handling Errors (5:20)
Channels (7:17)
Handling Panics (8:09)
Avoiding Race Conditions (10:00)
Implementing Timeouts (11:20)
Profiling (13:44)
Logging (21:20)

Transcript

Introduction

Blake:
Hello. This is another rewrite story that it went so well that I want to share with everybody. Like many talks, I’m going to be touching on a lot of things, I’m not going to deep dive on any, I just want to give you as much exposure as I can to things and tools and techniques that I found very helpful when writing my first Go production service.

Background

To give you some background, last year I was working at Fog Creek Software out of New York. If you guys have heard of us, you might be familiar with Kiwi. This is the FogBugz mascot. I was working on Kiln. Kiln is Fog Creek’s Git and Mercurial source code hosting service, it’s like a GitHub that works with Git and Mercurial. Before I worked there I hadn’t heard of Mercurial. I’m very familiar with it now. Most of my work was in C# and Python, I tried to gravitate more towards the back end stuff.

还没有人翻译此段落

我来翻译

Also last year I was lucky to attend Google I/O. I went there having no idea what Go was, I actually had never heard of it. Now everywhere you look, it’s everywhere. I don’t have to explain to you why I fell in love with it so quickly. I went to every one of the Go talks I could, I met some of the Go authors, and it was awesome. I knew that I had a solution to a problem that might come up some day.

I needed to ship something awesome in Go. There’re hobbies, there’re fun things I could do in my spare time, but I wanted to prove something at work, I wanted to use Go to make something awesome for Kiln and to make Kiln better. I looked around for something to rewrite, there’s always something you can rewrite. I settled on our SSH Reverse Proxy. I also didn’t know what a reverse proxy was when I started working there.

还没有人翻译此段落

我来翻译

About KilnProxy – SSH Reverse Proxy

To give you some background on what this is, when you’re dealing with Git or Mercurial there’re two ways to interact with a remote server. There’s HTTP and SSH. SSH uses public-private keys, it’s more secure. If you’re going to be cloning a repository from Kiln and you’re using SSH, you’re going to be talking to KilnProxy. You’re one of those people on the left there. When you talk to KilnProxy, KilnProxy has to make sure you are who you say you are by authenticating with your key and then it looks up to see where your data’s actually stored, we have lots of backend servers so it finds out which one is yours, where your repository sits. It opens up a connection to the backend server, it has a connection to you and then all it has to do at this point is proxy all the communication back and forth.

Why would I rewrite this? It was working. At the time, our SSH clones were a lot slower than HTTP and this really shouldn’t be. If anything, it feels like SSH should be faster. Also, we had some stability issues and because of this, our system administrators decided to just restart the service every day. You can imagine if you have a giant repository and you’re trying to clone it, you’re twenty minutes into a clone and then it’s time for us to restart the whole server. You’ve just lost everything, you have to start over, it’s not cool.

还没有人翻译此段落

我来翻译

This actually turned out to be a perfect project for Go because there’s tons of concurrency. To give you an idea of this, when you first connect, we’re listening on a loop, we kick off a Go routine. That Go routine is responsible for authenticating your public and private key, or your private key. It then connects to the backend server and once it’s got that connection, it then proxies standard in, standard out and standard error, each on their own Go routine.

Results

How’d it go? It went well. Way better than I could’ve imagined. I just wanted to get it to work and things went very well. Let me show you how well they went. Every few minutes or so, we would clone a small repository and we would keep track of the timing. This right here is when we transitioned from our Python implementation to Go. I’ll let you guess where it is, it’s a little bit past the pink mark there. This is just at the launch of the rewrite. At this point now, SSH and HTTP have parity.

还没有人翻译此段落

我来翻译

You could see before, this was like a 1-megabyte repository but before it was taking about one and a half seconds, afterward it was down to .75. It was almost exactly twice as fast as before. This does scale up as the repository gets bigger, so we saw some pretty good gains here. Also, you can notice that there’s a lot less noise also. What we have is a faster, more reliable, less noisy service.

Being my first Go service, I had to figure out how to write a stable service in Go and I’d like to share some of these tips here. What are we talking about with resiliency? We’re talking about writing a service that doesn’t crash, that doesn’t have to be restarted every day. It doesn’t leak memory and it doesn’t hang. Doesn’t get stuck. Obviously, this is a big process, there’s no magic bullet here. This is a process that spans development, profiling before launch and then after launch monitoring the service.

还没有人翻译此段落

我来翻译

Handling Errors

Let’s start with error handling. Luckily, not too many speakers have talked about … Have had to say we should handle every error. I think everyone here understands we should handle every error. I came from Java where I’m used to very terse code with exceptions that I have no idea if they’re thrown or how to handle them. We just don’t worry about that in Java.

We’ve all seen this, this is the pattern. You get a resource or you get a return value from a function and its error and you don’t use it under score for the error. Then after we check the error, we break out of the function if there’s an error. We defer the clean-up or the close. One thing I always make sure is to put no new lines in this little block here as I visually scan code, I want to see that this is one unit that can’t be broken apart.

还没有人翻译此段落

我来翻译

It looks like here we’re handling all the errors and so we’re good. We know that we should be checking for nil sometimes. Back to our example, what if OpenResourceA can return nil and what if it’s not an error condition? Maybe it’s a rare case that we’re trying to open the resource and for some reason it’s offline but this is not an error technically. Within our defer statement, could panic.

Not necessarily, of course we have the technology to avoid this and one way we can do this is with an inline function, an anonymous function. We can just do our little check there and if it’s not nil, then we’ll close. The one problem I find with this is that it’s gross. I don’t like this. Can everybody see this? I don’t know if that’s big enough.

还没有人翻译此段落

我来翻译

One way I like to handle this is for methods that I know will be deferred, I like terse defers, I like that to be nice and clean. One thing I actually forgot until recently is that if a method on a struct receives its struct by a pointer. If that struct pointer is nil, that function’s still called and it’s passed in a nil for its pointer. In this case here, I’m making my deferred statement, the statement that I know will be the clean-up method. I actually check for nil here and then we get back to our original example where DeferResourceA.close is actually nil proof. I find this to be a good pattern.

Channels

Let’s talk about channels. We all are familiar with channels and how much fun they are and they are fun, they are awesome but if you don’t know what you’re doing, you’re going to cause problems. I’m not going to deep dive on this because deep diving on channels would require half a day but I like to reference Dave Cheney’s blog post here which has gotten me through some tough times. It’s entitled Channel Axioms and literally every time I touch a channel, I just review this. I had it written down somewhere because I want to make sure that I don’t misstep with channels.

还没有人翻译此段落

我来翻译

The first three here are the ones I want to focus on is that if you have a nil channel and you try to write or read from it, it’s just going to block forever. When you block forever on a go-routine, that go-routine never exits and when a go-routine never exits it never frees up its resources and any resources like local variables it’s holding onto. When you send to a closed channel, you get a panic.

Handling Panics

Let’s talk about panics. They’re usually due, maybe always due, to program our error and if you get a panic it’s going to crash your service. I’m going to upset probably half the people in the room and say that I sometimes like to recover from panics. It’s a contentious issue, people say, “Well, it’s a programmer error. If you have a program error, you should let it crash and you should fix it.”

That’s true, but I do make errors and they do make it to production and I want to scale back their damage if possible. Without going into the details, you can recover from panics. You should not treat them like exceptions, it’s not what I’m doing. When I’m setting up a function, you saw that error block. I make sure that everything I’m setting up will be cleaned up when I leave the function. That way, if a panic happens, all those would be fired.

还没有人翻译此段落

我来翻译

I try to limit areas of the code, I set aside some areas of the code where panics are allowed to happen, I catch it, I do log it and I do take it very seriously and try to fix those bugs. Let me give you an example. This SSH proxy was very complicated. If I’m being honest, I didn’t read the full SSH spec beforehand. We did have one customer who was using a certain build server that was using Git in a certain way that I didn’t know until afterwards. It was crashing and we had hundreds of other customers, thousands of other customers that had no problems, didn’t have this problem.

If I had let this panic creep all the way to production, we would’ve had a fire and we would’ve had outages and we would’ve had to restart the service over and over. In that example, I handled the panic on the go-routines I mentioned earlier where you have … Which is just handling one client’s request. At the high level, the top level, if something goes wrong in the main loop, then that’s fine, that will crash.

还没有人翻译此段落

我来翻译

Avoiding Race Conditions

Let’s just pretend that the author of the race detector wasn’t just up here. I don’t need to say a whole lot about this, but race conditions, when you have all this concurrency, you’re going to have races. Again, I said I was from Java in the past and I’ve used all the concurrency stuff you can do there and I know all these tools exist for Java and despite that, in the ten years I was working with Java, I never bothered to look.

It’s super easy in Go because it’s part of the main tool suite here. Just like we heard earlier, in a report where a variable access is not synchronized, and when it does, it will crash for you and it will show you full stack trace, including exactly where the read and writes were and you should use this during your unit tests for sure but also development and integration testing.

还没有人翻译此段落

我来翻译

Again, here’s the output. This is awesome, tracking down this type of bug would’ve taken forever if you’re able to even catch it happening at all. Here we could see that the error happened where race.go line 14 was trying to read while race.go line 15 was trying to write. This literally takes minutes to solve. Again, you can enable this with the -race command line option on both test, run, build and install.

This next bit isn’t specific to Go but I think it’s important to address if you’re trying to make a service that won’t crash and that could handle some problems at run time.

Implementing Timeouts

Implementing timeouts. We need to guard against some situations here. The big one is network timeouts. Our software, our programs are connecting to remote servers quite a bit. For that, we dial and then we connect and then we transfer data and then we’re all done.

还没有人翻译此段落

我来翻译

Best practice is, if you are trying to dial, you should be dialling with a timeout. If you look at the standard library, those dial functions usually have a dial with timeouts. Let’s say it’s reasonable to expect a dial to another server to take two seconds. Make that time out twenty seconds, it doesn’t matter, just make it so that some point if there’s a problem you don’t hang forever. If you have twenty of these connections going a hundred, if all day long you’re trying to reach the server and it never answers, you’re just going to run out of memory or have a crash.

Once you have dialled, then you have to connect and you should also have a network connection and activity timeout. If it’s taking you … Actually, no, in this case, a connectivity timeout is if you’re transferring data back and forth and if you haven’t seen data in ten seconds, fifteen seconds, then you should just close connection and log it out. Also, you’re connected for a long time, maybe a minute’s reasonable, so a timeout after five minutes, a timeout after a number that’s ridiculous if it got that far.

还没有人翻译此段落

我来翻译

This next one goes without saying. I’m not going to deep dive on this but let’s not skimp on the tests. Tests are super important. For me, I don’t like to remember things so after I solve a problem, I like to set up a back stop where the test has my back, and when I hand off the code to somebody else or a future me, it’s still protected against these bugs. I’ll give you an example here, I’m a strong believer in integration tests and you should look in the Docker if you haven’t for these kind of tests.

For an example in KilnProxy, there’s a lot going on with Mercurial and with Git when they’re talking over SSH. What I set up was an environment in a Docker container, Docker Image which I run as a Docker container, which will actually make it get commands to a running server through my KilnProxy code, so I’m trying git pool, git clone, git push, all those things. Same with Mercurial. The nice thing about this is now maybe a new version of Git comes out, we can just have a separate instance of that container with a new version of Git and we could have all these tests running in parallel, always making sure that our proxy’s not broken.

还没有人翻译此段落

我来翻译

我来翻译

我来翻译

Uptime: preserved. As far as I know, I think the service has been, it’s been running for six months or so and I think it’s been restarted three or four times. I haven’t checked Wolfram Alpha, I believe that’s more than once a day. Things worked out really well and it was a good experience. Based on this, this was our first prototype, first production use of Go at Fog Creek. It was met with a lot of skepticism and since it’s been running for so long, months at a time without being restarted, it convinced everybody there that this is a technology worth exploring. It was a great experience for me and I want to thank you for listening.

还没有人翻译此段落

我来翻译

本文中的所有译文仅用于学习和交流目的，转载请务必注明文章译者、出处、和本文链接
我们的翻译工作遵照 CC 协议，如果我们的工作有侵犯到您的权益，请及时联系我们

有疑问加站长微信联系（非本文作者）

本文来自：开源中国翻译

感谢作者：

查看原文：使用 Go 构建 Resilient Services - 技术会谈【未翻译】

入群交流（和以上内容无关）：加入Go大咖交流群，或添加微信：liuxiaoyan-s 备注：入群；或加QQ群：692541889

3297 次点击

加入收藏微博

收入我的专栏

上一篇：使用 gdb 工具调试 Go

下一篇：当 DNS 解析器遇到 Go fuzzer 【未翻译】

git

pprof

http

docker

0 回复

暂无回复

添加一条新回复（您需要登录后才能回复没有账号？）

请尽量让自己的回复能够对别人有帮助
支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`
支持 @ 本站用户；支持表情（输入 : 提示），见 Emoji cheat sheet
图片支持拖拽、截图粘贴等方式上传

使用 Go 构建 Resilient Services - 技术会谈 【未翻译】

About Fog Creek Tech Talks

还没有人翻译此段落

Content and Timings

Transcript

Introduction

Background

还没有人翻译此段落

还没有人翻译此段落

About KilnProxy – SSH Reverse Proxy

还没有人翻译此段落

Results

还没有人翻译此段落

还没有人翻译此段落

Handling Errors

还没有人翻译此段落

还没有人翻译此段落

Channels

还没有人翻译此段落

Handling Panics

还没有人翻译此段落

还没有人翻译此段落

Avoiding Race Conditions

还没有人翻译此段落

Implementing Timeouts

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

Profiling

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

Logging

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

用户登录

今日阅读排行

一周阅读排行

关注我

给该专栏投稿 写篇新文章

收入到我管理的专栏 新建专栏

About Fog Creek Tech Talks

还没有人翻译此段落

Content and Timings

Transcript

Introduction

Background

还没有人翻译此段落

还没有人翻译此段落

About KilnProxy – SSH Reverse Proxy

还没有人翻译此段落

Results

还没有人翻译此段落

还没有人翻译此段落

Handling Errors

还没有人翻译此段落

还没有人翻译此段落

Channels

还没有人翻译此段落

Handling Panics

还没有人翻译此段落

还没有人翻译此段落

Avoiding Race Conditions

还没有人翻译此段落

Implementing Timeouts

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

Profiling

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

还没有人翻译此段落

使用 Go 构建 Resilient Services - 技术会谈【未翻译】

给该专栏投稿写篇新文章

收入到我管理的专栏新建专栏