Limiting concurrent HTTP Requests?

agolangf · 136 clicks
<p>Hi <a href="/r/golang">/r/golang</a>,</p> <p>I started learning Go a few months ago because I built a Python scraper to scrape the .org zone for specific types of sites, and found Python&#39;s concurrency speed really lacking. My Python scraper/parser can handle about 10,000 requests per minute.</p> <p>I&#39;m at the point where I&#39;m ready to start building this in Go, but one of the big questions I have is how many requests Go can handle concurrently, and how to limit the number of requests/goroutines so it doesn&#39;t spin out of control.</p> <p>Do any of you have some insight into this? Even just some links to helpful articles/info would be very helpful.</p> <p>Thank you!</p> <hr/>**Comments:**<br/><br/>echophant: <pre><p>It&#39;s very unlikely that the number of goroutines will be your limiting factor here; you can spin up hundreds of thousands without issue on a low-powered laptop. If you still want to limit the number of concurrent requests, you can <a href="https://golang.org/pkg/net/#TCPListener.Accept" rel="nofollow">manually accept requests</a> and pass them to a pool of worker goroutines.</p> <p>I&#39;d test out the standard <a href="https://golang.org/pkg/net/http/" rel="nofollow">net/http</a> package first, and then optimize if you need to.</p></pre>Dat_Nig_Slim_Shady: <pre><p>I might not have worded my question well. I will be spinning off goroutines for each request, and want to limit my HTTP requests by limiting the number of goroutines.</p> <p>I&#39;m not concerned about too many goroutines; I&#39;m concerned about too many requests, and how many is reasonable to have open at any given time.
</p></pre>bkeroack: <pre><p>One pattern to accomplish this is to use a buffered channel with N goroutine &#34;workers&#34; consuming items from that channel, where N is your desired maximum number of outstanding HTTP requests.</p></pre>thepciet: <pre><p>Goroutines are meant to be cheap on top of how the operating system does the concurrency, so you may want to look at the open-source implementation for your platform, if that level of detail matters to you. Otherwise my guess is that the operating system code and the networking hardware will be the major factors in performance, not Python vs Go.</p></pre>Dat_Nig_Slim_Shady: <pre><p>The asyncio event loop in Python is just not that good yet; goroutines are considerably faster.</p> <p>The number of requests I need to complete would take about a year to run in Python, compared to a couple of days in Go, based on my very crude estimates.</p></pre>srikanthegdee: <pre><pre><code>package main

import (
	"fmt"
	"sync"
)

const (
	MAX_WORKERS      = 10 // Maximum worker goroutines
	HOLDING_CAPACITY = 30 // Holding capacity of the channel
)

type Scrapper struct {
	Url string
}

// Scrap will do the heavy lifting of scraping the urls one at a time
func (s *Scrapper) Scrap() {
	fmt.Printf("Scrapped %v \n", s.Url)
}

// Below is the endless list of incoming urls from the internet.
var list = []string{"google.com", "yahoo.com", "reddit.com", "golang.org", "js4.red"}

func main() {
	urls := make(chan *Scrapper, HOLDING_CAPACITY)
	var wg sync.WaitGroup
	for i := 0; i &lt; MAX_WORKERS; i++ {
		wg.Add(1)
		go func() {
			for url := range urls {
				url.Scrap()
			}
			wg.Done()
		}()
	}
	for i := 0; i &lt; len(list); i++ {
		urls &lt;- &amp;Scrapper{Url: list[i]}
	}
	close(urls)
	wg.Wait()
}
</code></pre> <p><a href="https://play.golang.org/p/PToaT_Gs9T" rel="nofollow">https://play.golang.org/p/PToaT_Gs9T</a></p></pre>olebedev: <pre><p>Take a look at this library: <a href="https://godoc.org/golang.org/x/time/rate" rel="nofollow">https://godoc.org/golang.org/x/time/rate</a>. I use it for scraper throttling.</p></pre>broady: <pre><p>I agree, this is the best baseline. From this, you can choose to queue requests or drop them.</p></pre>neopointer: <pre><p>If you started to learn Go a few months ago and you already know how channels work, then I would suggest that you use a buffered channel to limit the concurrent requests that your web scraper can do. How?</p> <p><a href="https://play.golang.org/p/1a9PKptOO6" rel="nofollow">https://play.golang.org/p/1a9PKptOO6</a></p></pre>Dat_Nig_Slim_Shady: <pre><p>So for each goroutine you launch, you&#39;re occupying a channel slot, then freeing it on completion of the goroutine?</p> <p>What&#39;s the purpose of using a blank struct as opposed to something else? Not challenging it, just don&#39;t understand.</p></pre>neopointer: <pre><blockquote> <p>So for each goroutine you launch, you&#39;re occupying a channel slot, then freeing it on completion of the goroutine?</p> </blockquote> <p>Yep! That way you limit the number of goroutines running in parallel.</p> <blockquote> <p>What&#39;s the purpose of using a blank struct as opposed to something else?
Not challenging it, just don&#39;t understand.</p> </blockquote> <p>By using an empty struct you&#39;re using less memory than if you used, say, a boolean. Since you are not exchanging information through the channel, there&#39;s no need to use any concrete type for it.</p></pre>Dat_Nig_Slim_Shady: <pre><p>Interesting, very cool.</p> <p>One last question: does sending to a full buffered channel block? Like, does it just pause the operation until the channel has an open slot?</p> <p>So I could just have a goroutine running a loop that reads the next url from my CSV zone file and waits to feed it?</p></pre>neopointer: <pre><blockquote> <p>One last question: does sending to a full buffered channel block? Like, does it just pause the operation until the channel has an open slot?</p> </blockquote> <p>Exactly. It will block the current goroutine until there&#39;s space in the channel. In this case that goroutine is the main one.</p> <blockquote> <p>So I could just have a goroutine running a loop that reads the next url from my CSV zone file and waits to feed it?</p> </blockquote> <p>You can have, say, the main goroutine read the file and start one goroutine for each URL. You could even arrange the logic so that, even while all the worker goroutines are busy, you keep reading the file.</p></pre>dobegor: <pre><p><a href="https://github.com/juju/ratelimit" rel="nofollow">https://github.com/juju/ratelimit</a> - a token bucket implementation.</p> <p>Take a look at creating middleware using this: <a href="https://github.com/go-kit/kit/blob/master/ratelimit/token_bucket.go#L20" rel="nofollow">https://github.com/go-kit/kit/blob/master/ratelimit/token_bucket.go#L20</a></p> <p>You just need to create a middleware for your HTTP requests.</p> <p>I recommend looking into chi: <a href="https://github.com/pressly/chi" rel="nofollow">https://github.com/pressly/chi</a></p> <p>Look at the &#39;ArticleCtx&#39; function in the README example.
It&#39;s a simple middleware example which is compatible with the standard <code>net/http</code> stack.</p> <p>EDIT: I&#39;m sorry, I was thinking you needed to limit incoming HTTP requests. I&#39;ll leave this comment as is in case someone needs that too.</p></pre>meehow808: <pre><p>There is a wrapper which lets you limit the total number of requests, or the number of requests per host: <a href="https://godoc.org/github.com/kr/http/limit" rel="nofollow">https://godoc.org/github.com/kr/http/limit</a></p></pre>TheMerovius: <pre><blockquote> <p>My Python scraper/parser can handle about 10,000 requests per minute.</p> </blockquote> <p>That is really low (it&#39;s 167 QPS; I would expect most servers to be able to handle at least an order of magnitude more). Have you looked into what the limiting factor actually is? It might just be network bandwidth or something.</p> <p>If you are scraping, it is definitely a good idea to add limits anyway. For simple limiting of concurrency, I like <a href="https://play.golang.org/p/bp5fyqsgwK" rel="nofollow">this very simple pattern</a>: simply add a Lock/Unlock pair as with a mutex, and it will make sure that at most n functions can hold a Lock at any time.</p> <p>But I&#39;m not sure concurrency is what you actually want to limit. Limiting it would still not make you a nice net citizen, because you don&#39;t actually limit traffic. You want to limit actual QPS, and you also want to add exponential backoff - the service owner will thank you for it. There are a bunch of Go libraries out there that will help you with that. For extra coolness, do this separately per domain or something.</p> <p>Even if you don&#39;t care about being nice, there really is no use in limiting concurrency. The runtime, HTTP library, OS networking stack and the available resources will do a much better job of utilizing the resources you have.
What would you hope to achieve by applying this limit?</p></pre>TheMerovius: <pre><p>And, fwiw:</p> <blockquote> <p>one of the big questions I have is how many requests go can handle concurrently and limiting the number of requests/goroutines so it doesn&#39;t spin out of control.</p> </blockquote> <p>I don&#39;t think there is any actual limit here. This will be dominated by networking resources, available file descriptors, CPU time and everything else long before it is limited by any overhead of goroutines. Just spin off a goroutine per request; it&#39;s the simplest code, and then you don&#39;t have to worry. How many requests that gives you will largely depend on the CPU/network/RAM you can allocate to the process, but even fairly naive code on cheap commodity hardware should far outpace the 167 QPS you are getting right now.</p></pre>Dat_Nig_Slim_Shady: <pre><p>All said and done, this scraper will be doing over a billion requests. If I could just spin those off all at once that would be awesome, but I can&#39;t imagine that&#39;s possible, so I need to limit it to whatever the computer/Go can reasonably handle.</p> <p>I used threads in Python to handle the requests, and as I sped it up I ran into errors where there were too many open files. The asyncio package in Python is quite a bit faster, but it&#39;s confusing and still isn&#39;t nearly as fast as I&#39;d like, so I decided to learn Go.</p></pre>TheMerovius: <pre><blockquote> <p>All said and done, this scraper will be doing over a billion requests. If I could just spin those off all at once that would be awesome, but I can&#39;t imagine that&#39;s possible, so I need to limit it to whatever the computer/Go can reasonably handle.</p> </blockquote> <p>AFAIK a goroutine will use ~2KB of RAM initially, so yes, a billion would probably be too much.
A million likely wouldn&#39;t be a problem, though; RAM is the only limitation you need to worry about, AFAIK. So yes, this might be one of the few cases where the number of work items is large enough to justify a worker pool.</p></pre>Dat_Nig_Slim_Shady: <pre><p>Do you know how I would go about testing the limits of my system, so I know how far I can throttle it up?</p> <p>I don&#39;t fully understand how requests work at a low level, but I think there might also be an issue with the number of open sockets sending/receiving at any given time - do you know if that could be a problem?</p></pre>tmornini: <pre><p>Check memory usage and CPU limitations, and adjust your worker count accordingly.</p> <p>A huge number of goroutines may also get network limited, particularly if your network bandwidth is limited.</p></pre>karma_vacuum123: <pre><p>At that number of requests you stand a decent chance of having your IP blocked; not all hosts are permissive about being crawled. Every ops team I have worked with has spent some part of their day banning request flooders. VPS providers will also ban your account, as this type of behavior is associated with malware hosters.</p> <p>Distribute the load across a number of hosts that are not topologically adjacent (in the same colo, etc.). Also respect HTTP Expires headers, and use content digests to limit the number of times you retrieve the same stuff over and over (a major &#34;script kiddie&#34; mistake).</p> <p>Also, get a new user name, seriously.</p></pre>Dat_Nig_Slim_Shady: <pre><p>I&#39;m not crawling, just scraping the home page. I&#39;m only making one request per website in my zone files.
I&#39;ve already made over 10 million requests with my Python scraper and haven&#39;t had any issues with my ISP (fingers crossed).</p></pre>tmornini: <pre><blockquote> <p>this scraper will be doing over a billion requests</p> </blockquote> <p>Feed the request details into a queue like AWS SQS, and write a worker that reads the queue and makes each request in a goroutine, using a WaitGroup as described above to make sure you don&#39;t plow the machine under.</p> <p>This way you can spread the requests over a bunch of small, inexpensive servers like AWS t2 instances.</p></pre>j_d_q: <pre><p><a href="https://play.golang.org/p/DMdICCMpC6" rel="nofollow">Here&#39;s the pattern I use</a></p> <p>I like this approach because it makes it easy to add functionality without adding much complexity. It&#39;s also very composable (<code>Cache(RateLimit(LimitConcurrency(...</code>).</p> <p>Edit: slightly off topic. I skimmed your message and thought you wanted to limit the concurrency of requests to your scraper. Go is great at concurrency, but it can be hard to wrap your head around if you&#39;re really used to synchronous code. The http handler is concurrent by default, though, so you don&#39;t have to do anything special.</p> <p>Edit edit: now I see why - your title specifically refers to limiting concurrency.</p></pre>