How do you update a production server?

xuanbao · 2015-05-21 19:43:25 · 1220 次点击

这是一个分享于 2015-05-21 19:43:25 的资源，其中的信息可能已经有所发展或是发生改变。

For example, if I write a small server that runs on port 9090 and then I want to upgrade it with minimum down time. What are my options?

Thanks

评论：

Taikumi:

The basic answer is that you'll need to put it behind a load balancer (i.e. HAProxy) and have it route requests to multiple server instances. You then can do a rolling deploy to take down one instance at a time and swap it with the newer updated code. Ideally this should provide you with 100% uptime throughout the upgrade.

DarkRye:

Can nginx web server be used for this purpose? What about Apache?

rcklmbr:

Yes, nginx, httpd, and haproxy can all be used to do that. Alternatively, if you dont want to manage your own and are using EC2, you can use Amazon ELB or Google Cloud has one too.

mwholt:

I'd love for more people to put Caddy through its paces. If you or anyone tries it for load balancing, do let me have your feedback. Load balancing docs

ketralnis:

haproxy is really, really good and I'd definitely recommend giving it a shot first.

Whatever load balancer you use, it needs to be able to reliably determine which backends are up so it knows where it can send requests. This is usually called health checking. If nginx and apache can do that, they'll work just fine.

BraveNewCurrency:

One problem with HAProxy is that changing the configuration requires a restart, which is not zero downtime.

Nginx is zero downtime, but the tools for troubleshooting/logging have been more primitive. (The Nginx company recently did some work here, so maybe it's better.)

BlueDragonX:

The way you restart haproxy isn't a traditional restart - it's more of a replacement. The new haproxy instance takes over listening for connections while the old one completes requests on open connections. When those connections close the old instance shuts down. This results in no down time for haproxy.

I manage multiple containerized haproxy instances to load balance traffic in a production CoreOS cluster. Their configuration is entirely dynamic and they restart using this method. They never lose connections.

The infrastructure for doing all of this is written in Go and some of the load balanced applications are as well.

BraveNewCurrency:

This results in no down time for haproxy.

Citation needed.

It's really easy to test: Pound your server with a trivial request, and do a (soft)restart of haproxy. Count the "could not connect to server" errors.

If there was no problem, why does the HAProxy home page cite this extremely complicated work-around to the problem? http://engineeringblog.yelp.com/2015/04/true-zero-downtime-haproxy-reloads.html

BlueDragonX:

I don't deny that's the case, and perhaps for services that make such frequent connections you'll notice it, but for our use cases it has not been an issue. We do push quite a bit of traffic through our LB's and neither our clients or our monitoring systems indicate a problem.

BraveNewCurrency:

neither our clients or our monitoring systems indicate a problem.

Right. There is no error logged when HAProxy isn't listening, so how would you know?

If a client gets a single error, and the site is still up immediately after, would they call you? (Really?) Even if they did, would anyone "connect the dots" to haproxy? (If it's out to lunch for 100ms and you get 100 requests per second, you will average 10 errors per restart. If your site has 300 active users at a time, they will only see one error every 30 restarts, on average. )

Assuming haproxy is out to lunch for 100ms, and your monitoring checks every minute, monitoring won't see the problem 599 out of every 600 restarts. If you restart haproxy every day, it will take 1-2 years for your monitoring system to notice the problem.

Most monitoring systems won't alert unless you are down for a few seconds. So it's likely that your monitoring will never alert you to the problem.

I'm not saying you have to address the problem, only that you have to understand the trade-offs. You can restart every day and still claim "five nines" of reliability (assuming nothing else goes wrong...). But you can't ever claim 100% uptime.

UptownFunkLyrics:

I'm too hot!

Hot Damn!

santicl:

nginx works great to do it. It reloads almost instantly.

I have this in my nginx.conf:

   location / {
            proxy_pass http://localhost:8080;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $host;
    }

You just have to start your new go server in a diferent port, update nginx.conf and reload.

syf81:

As an alternative to what's already been mentioned here, if your OS supports SO_REUSEPORT, you could just make your server bind to that port with that socket option.

Upgrading it with minimal downtime is then as easy as launching your new server, and gracefully shutting down the old one, it can for example listen to a signal that closes the listen socket, finish whatever requests it's processing and then exit.

mwholt:

This sounds really cool. Where would I go to learn how to do this in Go specifically?

syf81:

There's no proper implementation of it in the net library yet, but basically you have to use syscall.SetsockoptInt in the os library with the appropriate file descriptor and socket option, which is pretty much how src/pkg/net/sockopt_linux.go works.
However if you do want to use it but don't want to go in too deep, there's a package available that does this.
https://github.com/jbenet/go-reuseport

docsavage:

You might consider using goagain: https://github.com/rcrowley/goagain

Example of zero-downtime HTTP server restarts using the manners HTTP server: https://github.com/cupcake/mannersagain

I plan on using this with a goji-based HTTP server.

arschles:

Some people have talked about the DevOps (or whatever we're calling it now) perspective - just want to expand on it.

I'm assuming your app is stateless, like a frontend. If you're running a frontend (like an HTTP server) that's not stateless, try to make it stateless. It'll be a lot easier to scale that way.

Anyway, with that out of the way, here's a list of steps to run and upgrade a service with no downtime. It has the added benefit of redundancy and fault tolerance. The cost of doing that is real money - you have to run more than 1 server. It also features some buzz technologies, but I actually like them so here goes...

Put your app in a Docker image (or your favorite container format. VMs work here too, but they're pretty heavy).
Run it on 2+ nodes, preferably in different "datacenters" (or if you're running in a cloud, whatever your cloud calls them) to gain some failure tolerance
Create a load balancer that routes to your 2+ nodes. Prefer a LB with an API like Google Cloud's or AWS's, so you can tell it to stop routing to a node before you take down that node.
Tell your LB to stop routing to node A.
Pull the new docker image with your new server to node A
Run the new image on node A
Tell the LB to start routing to node A again
Watch the logs/stats/anything else on node A
If everything looks good on node A, then do the same deploy for node B. Otherwise do the same deploy to roll back node A. Node A is called your "canary" release in this case. There are a few more gotchas to canary releases that are out of the scope of this comment.

Note that if you're running a load balancer that doesn't have an API (my understanding is that Nginx and HAProxy don't), it may be ok to let node A fail out. I'm not familiar enough with those LBs to give a definitive answer.

Also note that there are systems to automate this process. For example, Kubernetes has rolling deploys that do a similar process, without the full watch-the-logs process I talked about. CoreOS has a facility to deploy to only nodes with specific "labels" that you define, which you can, with some added glue to deal with the LBs, use to do canary releases with no downtime. I use CoreOS and I'm trying to automate this whole no-downtime+canary process now. Right now it's all manual...

Ok, I've gotten a little off track but hopefully this comment shows that once you've solved your initial problem of no-downtime deploys, you can expand it to do more things to make your deploys even more reliable.

Good luck.

jobenjo:

Fwiw, after looking into this for awhile, I decided it wasn't worth the trouble (running a consumer app). During deploy, the downtime is still just momentary (<100ms), and the app is built withstand random network errors, anyway.

perihelion9:

You've already had load balancers and traditional answers given to you, but i've found that there are always operational problems with them. You need to know the hosts and ports of all your webservers, configuration can get out of sync, balancing strategies can get wiry, and hosting multiple (small) services multiplies the problem. You can start layering more tools on top, like Serf to have cluster awareness and automatic load balancer addition/removal, but it gets ridiculous after a time - you should only need that once you reach a scale whereby you need that, not for rollbacks, upgrades, or poking around.

I've started advocating your public-facing servers never go down, only hotload configurations. Those servers do nothing except forward the parsed user request to a message broker (such as Rabbit), and from there your actual worker processes (which live elsewhere) can consume the messages, process them, and put another message back on the message queue to be delivered to the webserver, which serializes it and sends it to the client.

This means you never functionally touch your webserver (only the route mappings it has, which are hotloaded), and you can feel free to roll forward or back on your worker processes without repercussion - they just connect to Rabbit and begin processing requests when they're up.

In general this increases durability of your whole system, and makes it so that you will almost never drop a user connection, regardless of what you're doing with the business logic in your worker processes.

mioelnir:

This is also not either/or to loadbalancers etc - those you can still put before the public facing servers. Admittedly, that is then usually a whole different ballgame with regards to scale.

perihelion9:

Yeah, you do need them for scale, but not for deployments. I added this line not long after posting:

you should only need that once you reach a scale whereby you need that, not for rollbacks, upgrades, or poking around.

Mteigers:

Do you find that rabbit is fast enough to essentially mimic an entire web stack?

perihelion9:

Rabbit is only the message broker, and it's extremely quick, so yes it's fine. Splitting off worker processes (or calls to other internal APIs) are common in most setups anyway, the extra in-rack (or on-node) round-trips to Rabbit are negligible. Rabbit just simplifies everything to a pub/sub model rather than peer discovery (or DNS) in-network.

And anyway, this setup dramatically increases possible throughput-per-webserver (since the webservers exist only to parse requests and serialize responses). With the right nonblocking server setup, you can have a very linear rate of resource consumption on the webservers relative to the number of connections received per second (as opposed to traditional jetty/apache/etc setups, whereby resources consumed increase closer to logarithmically to open connections). Meaning you can serve more clients at a faster average rate, even if individually every request takes imperceptibly longer.

iends:

Do you have more reading material about this kind of architecture?

pinpinbo:

We version every binary with SemVer number scheme, change symlink to point to the new one, and do rolling deploy.

Extremely simple. I would go out of limbs to say that this is one of Go's biggest strength.

fenduru:

What aspect of this is a strength of Go?

ericanderton:

If I had to guess: Probably that most applications compile to a single executable making a filesystem based cut-over like this extremely easy.

anoobisus:

Versus... Renaming a directory that gasp might have more than one file inside.

Everyone has their half passed approaches to deployment and too many people are proud of how hacky and unscaleable they are.

Shrug

CapoFerro:

That assumes you have your app's dependencies 100% vendored into that directory that you are renaming or relinking. In Ruby or Python, for example, if your dependencies change between versions and you don't have everything vendored, it's often impossible to do what you describe.

anoobisus:

.... What? You mean your deployment might involve randomly updating nonvendored app dependencies?

That thing I said about hacky deployments.

If you can't blow away a prod machine and rebuild it instantly. You're doing it wrong

Vendor it
Build a redeployable vagrant, packer, ec2 disk image with deps pinned.
spend 5 minutes and throw it in a docker container with pinned deps.

I mean... If you're deploying a bare bones app and doing an npm restore on the server... Then you're not done deploying until you've restored... At which point atomicity applies.... Which of course you can't do if you rely on global environment state installing to some global rubyenv.

But then again virtual ruby and python and npm environments exist for very, very, very, very, I'm typing this by thumb, very very very important reasons.

And of course ignored by people because Hey my hacky deployment start works fine for us! Hope we never pull a bad package on the server.

Sure hope, you know, GitHub doesn't go down and we lose the ability to deploy our app ( which happens and is stupidly almost proudly announced on twitter)

入群交流（和以上内容无关）：加入Go大咖交流群，或添加微信：liuxiaoyan-s 备注：入群；或加QQ群：692541889

1220 次点击

加入收藏微博

nginx

github

apache

docker

0 回复

暂无回复

添加一条新回复（您需要登录后才能回复没有账号？）

请尽量让自己的回复能够对别人有帮助
支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`
支持 @ 本站用户；支持表情（输入 : 提示），见 Emoji cheat sheet
图片支持拖拽、截图粘贴等方式上传

How do you update a production server?

用户登录

今日阅读排行

一周阅读排行

最新主题