Why does the number of workers influence the number of links found by my web crawler?

blov · 665 views
Source code: [https://github.com/Marmeladenbrot/Crawler/tree/master/src/crawler](https://github.com/Marmeladenbrot/Crawler/tree/master/src/crawler) - it crawls a given website for all links with the same host, producing a list of all links found at the end.

- Problem 1: 4 workers find fewer links than 10 or 100 workers.
- Problem 2: I don't know how to check whether the crawler has finished (so I can create the CSV file).

Any advice is welcome to make it work and more Go-ish :).

---

**Comments:**

**Fwippy:** You've already got an open thread on the subject, please be patient.

Edit: Did you delete that?

**Fwippy:** Anyway, a few obvious problems:

- You call `wg.Done()` every time a link is crawled, and `wg.Add(1)` each time you start a new one. What happens if there are items in the queue waiting to be processed, but all your workers are between jobs? Your WaitGroup will have 0 items in it, `wg.Wait()` will return, and your code will exit prematurely.
- You don't need to wait to write in a goroutine ([here](https://github.com/Marmeladenbrot/Crawler/blob/master/src/crawler/main.go#L52)), and then you don't need to wait for Stdin to finish up - it'll exit once it's complete.

**Yojihito:**

- Yes, that's a problem I don't know how to fix yet.
- I don't know what you mean by "wait to write"; I don't see anything like that at line 52?

Small code update pushed to Git.

**Fwippy:** I mean that goroutine doesn't need to exist; you can just leave that code in main.

What if, instead of incrementing the WaitGroup on job start, you added to it when you added the URL to the queue? That way, the WaitGroup reflects "this is how many jobs have been submitted but have not yet completed" rather than "this is how many jobs are currently processing."

You'd need to make a little modification to where you call `wg.Done()`, as you want to make sure it doesn't hit 0 momentarily in between. It's still early and I haven't had my second coffee yet, but I'd suggest changing `output` to `chan []string`, and then structuring your adding-to-queue code like:

```go
go func() {
    for links := range output {
        for _, link := range links {
            if !visited[link] {
                visited[link] = true
                wg.Add(1)
                input <- link
            }
        }
        wg.Done()
    }
}()
```

**Yojihito:** Yes, I remodeled the entire program into a worker + pool architecture (as suggested by a nice person in the old thread), so I assumed a new thread would be best, since the comments in the old one were based on the old code.

The code went down from 8.6 MB to 6.6 MB, so a big chunk was changed or deleted.
