My crawler doesn't want to crawl - could you help me?

polaris · 799 views
<p>I made this as my first Go project - <a href="https://github.com/Marmeladenbrot/Crawler/tree/master/src/crawler">https://github.com/Marmeladenbrot/Crawler/tree/master/src/crawler</a> - to crawl a given website for all links with the same host on that site, getting a list of all links in the end.</p> <p>But even for a site like <a href="http://www.example.de">www.example.de</a>, which has 17 unique links, I get something between 10 and 16 links, while the Python crawler from a friend takes ages but the results are accurate.</p> <p>Is there also a way to limit the number of connections to prevent ddosing a website? The number of goroutines is a bad indicator but I didn&#39;t know any better.</p> <p>Any help gets a cookie and is very appreciated (in the title it should say &#34;could you please help me&#34; but I can&#39;t change it anymore) :).</p> <hr/>**Comments:**<br/><br/>bcgraham: <pre><p>There&#39;s a lot to fix and I&#39;m on my phone, but what happens if the number of goroutines is more than 30? The dequeued item is discarded. </p></pre>Yojihito: <pre><blockquote> <p>There&#39;s a lot to fix</p> </blockquote> <p>Hoped so :).</p> <blockquote> <p>what happens if the number of goroutines is more than 30? The dequeued item is discarded.</p> </blockquote> <p>Without the &#34;if runtime.NumGoroutine() &lt; 30 {&#34; the results are the same? </p> <p>And I thought it waits until the number of goroutines is lower than 30 and goes on, resulting in a lower number of simultaneous connections (was my hope).</p></pre>TheMerovius: <pre><p>The source code doesn&#39;t build. collectLinks is not defined.</p></pre>Yojihito: <pre><p>Sorry, didn&#39;t get pushed for unknown reasons. Fixed :).</p></pre>TheMerovius: <pre><p>A couple of observations, without compiling and testing the code:</p> <ul> <li>You split the code up into lots and lots of files. 
It&#39;s a 400 line project, it&#39;s fine to do that in a single file :) In Go, you usually don&#39;t split up your source quite as much, people will be thankful if they can find a function in the same file it&#39;s used :)</li> <li>You use a sync.RWMutex, but only [Un]Lock() it.</li> <li>You mix channels and Locks a lot. It seems a lot more complicated than it needs to be, imho. A better architecture (that also solves your concurrent connection problem) would be to spawn a fixed number of workers that all read from a common queue-channel and write all found links to an output channel. A separate goroutine owns the map and reads from the output channel, looks each link up in the visited map (lock-free, as it owns the map) and, if it hasn&#39;t been visited, writes it to the queue channel. This should make your code much simpler and make it easier to reason about.</li> <li>nit: <code>fmt.Println(&#34;AbsoluteURL: &#34; + absoluteUrl)</code> has the same output as <code>fmt.Println(&#34;AbsoluteURL:&#34;, absoluteUrl)</code>. You should prefer the latter.</li> <li>You possibly have a race with wg. Imagine the following: Crawl passes a url into queue and returns. Crawl thus calls wg.Done(). Before the goroutine of main runs again and calls wg.Add(1), the separate goroutine runs and wg.Wait() returns (as all crawlers called done). It closes queue prematurely. I am not sure about this being possible or the case, but if you get inconsistent results, this points to a race condition anyway. Have you tried running with the race detector enabled?</li> </ul></pre>Yojihito: <pre><blockquote> <p>You split the code up into lots and lots of files.</p> </blockquote> <p>Coming from Java splitting up the code into pieces allows it to have smaller code pieces doing one thing, having a faster view about everything. 
I don&#39;t see that as a problem but as a solution instead of a 400 line file, more than 50 SLOC in one file drives me crazy :).</p> <blockquote> <p>You use a sync.RWMutex, but only [Un]Lock() it.</p> </blockquote> <p>What else should I do with it but lock() and unlock()?</p> <blockquote> <p>spawn a fixed number of workers</p> </blockquote> <p>Interesting, the whole paragraph seems much more goish and cleaner than my code and would solve some, maybe all issues. I have no idea how to build a worker but I will google around a bit to learn how.</p> <blockquote> <p>You possibly have a race, have you tried running with the race detector enabled?</p> </blockquote> <p>Will do, thanks for the tip.</p></pre>TheMerovius: <pre><blockquote> <p>Coming from Java splitting up the code into pieces allows it to have smaller code pieces doing one thing, having a faster view about everything. I don&#39;t see that as a problem but as a solution instead of a 400 line file, more than 50 SLOC in one file drives me crazy :).</p> </blockquote> <p>I&#39;m just telling you what&#39;s idiomatic in Go. And that I find it unnecessarily cumbersome to read your code because I have to switch files all the time.</p> <blockquote> <p>What else should I do with it but lock() and unlock()?</p> </blockquote> <p>R[Un]Lock(). 
That&#39;s the difference between a sync.Mutex and a sync.RWMutex :) Either you need RLock, or you should just use a sync.Mutex.</p></pre>Yojihito: <pre><p>For the data race, I get this all over the place and have no idea what it means:</p> <blockquote> <p>invalid spdelta __tsan_go_start 0x7a41a0 0x7a41eb 0x0 -1</p> </blockquote> <p>invalid spdelta __tsan_go_start 0x7a41a0 0x7a41eb 0x0 -1</p> <p>invalid spdelta __tsan_go_start 0x7a41a0 0x7a41eb 0x0 -1</p> <p>invalid spdelta __tsan_go_start 0x7a41a0 0x7a41eb 0x0 -1</p> <p>invalid spdelta __tsan_go_start 0x7a41a0 0x7a41eb 0x0 -1</p> <p>invalid spdelta __tsan_go_start 0x7a41a0 0x7a41eb 0x0 -1</p> <p>invalid spdelta __tsan_go_start 0x7a41a0 0x7a41eb 0x0 -1</p> <p>invalid spdelta __tsan_go_start 0x7a41a0 0x7a41eb 0x0 -1</p> <blockquote> <p>Found 1 data race(s) exit status 66</p> </blockquote></pre>TheMerovius: <pre><p>Me neither. How are you running it? If I try with <code>go run -race *.go</code> I get the output:</p> <pre><code>================== WARNING: DATA RACE Read by goroutine 16: main.main.func3() /tmp/Crawler/src/crawler/main.go:101 +0x67 Previous write by main goroutine: main.main() /tmp/Crawler/src/crawler/main.go:96 +0x65e Goroutine 16 (running) created at: main.main() /tmp/Crawler/src/crawler/main.go:103 +0x71d </code></pre> <p>Not sure what&#39;s the problem yet, though.</p></pre>TheMerovius: <pre><p>Ah, I see the problem. 
You use uri in a separate goroutine.</p></pre>TheMerovius: <pre><p><a href="http://p.nnev.de/7129" rel="nofollow">Patch</a></p></pre>TheMerovius: <pre><p>Still get inconsistent results, though.</p></pre>Yojihito: <pre><p>I get the same output as you above + the &#34;invalid spdelta __tsan_go_start 0x7a41a0 0x7a41eb 0x0&#34; with</p> <blockquote> <p>go run -race main.go urlTest.go csv.go crawl.go collectLinks.go checkSitegroup.go</p> </blockquote> <p>but with your go run -race *.go I get:</p> <blockquote> <p>GetFileAttributesEx *.go: The filename, directory name, or volume label syntax is incorrect.</p> </blockquote></pre>TheMerovius: <pre><p>Hm, the proposed channel-architecture has the obvious problem that you have a circular data dependency, which might lead to deadlocks… Don&#39;t know how to best solve this problem right now.</p></pre>Yojihito: <pre><p>Wouldn&#39;t the worker pool architecture solve this issue?</p></pre>captainju: <pre><p>I did something similar as my first project, a Google search crawler. It&#39;s not much better, but maybe you can find something useful in it. <a href="https://github.com/captainju/goGoogleSearch/blob/master/Search.go" rel="nofollow">https://github.com/captainju/goGoogleSearch/blob/master/Search.go</a></p></pre>

