<p>Source code: <a href="https://github.com/Marmeladenbrot/Crawler/tree/master/src/crawler" rel="nofollow">https://github.com/Marmeladenbrot/Crawler/tree/master/src/crawler</a>. The program crawls a given website for all links with the same host and produces a list of every link found at the end.</p>
<ul>
<li>Problem 1: 4 workers find fewer links than 10 or 100 workers</li>
<li>Problem 2: I don't know how to check whether the crawler has finished (so that I can create the CSV file)</li>
</ul>
<p>Any advice to make it work and be more Go-like is welcome :).</p>
<hr/>**Comments:**<br/><br/>Fwippy: <pre><p>You've already got an open thread on the subject, please be patient.</p>
<p>Edit: Did you delete that?</p></pre>Fwippy: <pre><p>Anyways, a few obvious problems:</p>
<ul>
<li><p>You call <code>wg.Done()</code> every time a link is crawled, and <code>wg.Add(1)</code> each time you start a new one. What happens if there are ever items in the queue to be processed, but all your workers are between jobs? Your waitgroup will have 0 items in it, <code>wg.Wait()</code> will return, and your program will exit prematurely.</p></li>
<li><p>You don't need to wait to write in a goroutine (<a href="https://github.com/Marmeladenbrot/Crawler/blob/master/src/crawler/main.go#L52" rel="nofollow">here</a>), and then you don't need to wait for Stdin to finish up - it'll exit once it's complete.</p></li>
</ul></pre>Yojihito: <pre><ul>
<li><p>Yes, that's a problem I don't know how to fix yet</p></li>
<li><p>I don't know what you mean by the "wait to write"; I don't see anything like that at line 52?</p></li>
</ul>
<p>Small code update pushed to git.</p></pre>Fwippy: <pre><p>I mean, that goroutine doesn't need to exist, you can just leave that code in main.</p>
<p>What if instead of incrementing the waitgroup on job start, you added when you added the URL to the queue?</p>
<p>That way, the waitgroup reflects "this is how many jobs that have been submitted but have not yet completed" rather than "this is how many jobs are currently processing."</p>
<p>You'd need to make a little modification to where you call <code>wg.Done()</code>, as you want to make sure that the counter doesn't momentarily hit 0 in between. It's still early and I haven't had my second coffee yet, but I'd suggest changing output to <code>chan []string</code>, and then structuring your adding-to-queue code like:</p>
<pre><code>go func() {
	for links := range output {
		for _, link := range links {
			if !visited[link] {
				visited[link] = true
				wg.Add(1) // count the job as soon as it is queued
				input <- link
			}
		}
		wg.Done() // the job that produced these links is finished
	}
}()
</code></pre></pre>Yojihito: <pre><p>Yes, I remodeled the entire program to get a worker+pool architecture (as suggested by a nice person in the thread) so I assumed a new thread would be the best because the comments of the old one were based on the old code.</p>
<p>Went down from 8.6 MB to 6.6 MB, so a big chunk of code was changed or deleted.</p></pre>
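<p>For reference, the suggestion above (count a job when it is queued, mark it done only after its links have been queued) can be turned into a small runnable sketch. This is not the code from the repository: <code>crawl</code>, <code>fetch</code> and the four-page <code>site</code> map are hypothetical names used for illustration, and <code>fetch</code> stands in for downloading a page and extracting its same-host links. One deliberate difference from the snippet in the thread: each new link is sent to <code>input</code> from its own short-lived goroutine, on the assumption that both channels are unbuffered, because otherwise the coordinator can block on <code>input</code> while every worker is blocked on <code>output</code>.</p>

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawl visits every page reachable from seed and returns the sorted list of
// pages it saw. fetch is a hypothetical stand-in for "download the page and
// extract its same-host links".
func crawl(seed string, workers int, fetch func(string) []string) []string {
	input := make(chan string)    // URLs waiting to be crawled
	output := make(chan []string) // links found on one crawled page
	var wg sync.WaitGroup         // jobs submitted but not yet completed

	// Workers send exactly one result per job, even for a page without
	// links, so that every job is eventually marked done.
	for i := 0; i < workers; i++ {
		go func() {
			for url := range input {
				output <- fetch(url)
			}
		}()
	}

	visited := map[string]bool{seed: true}

	// Coordinator: the only goroutine that touches visited while crawling.
	go func() {
		for links := range output {
			for _, link := range links {
				if !visited[link] {
					visited[link] = true
					wg.Add(1) // count the job when it is queued
					// Send from a separate goroutine so the coordinator
					// never blocks while workers wait to hand in results.
					go func(l string) { input <- l }(link)
				}
			}
			wg.Done() // the job that produced these links is complete
		}
	}()

	wg.Add(1) // the seed is the first submitted job
	input <- seed

	wg.Wait()     // returns only when nothing is queued or in progress
	close(input)  // lets the workers exit
	close(output) // lets the coordinator exit

	result := make([]string, 0, len(visited))
	for url := range visited {
		result = append(result, url)
	}
	sort.Strings(result)
	return result
}

func main() {
	// A made-up four-page site, used instead of real HTTP requests.
	site := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/b", "/c"},
		"/b": {"/"},
		"/c": nil,
	}
	fetch := func(url string) []string { return site[url] }

	// The result should not depend on the number of workers.
	for _, n := range []int{1, 4, 100} {
		fmt.Println(n, "workers:", crawl("/", n, fetch))
	}
}
```

<p>With this structure the counter can only reach zero when no URL is queued and no worker is mid-job, so the point right after <code>wg.Wait()</code> is where the crawler is known to have finished; that is where the CSV file could be written, for example with the standard <code>encoding/csv</code> package.</p>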