<p>Hi Everyone,</p>
<p>A few days ago I decided to rewrite the <a href="https://github.com/antchfx/antch" rel="nofollow">Antch</a> project. It's an open-source web crawler framework inspired by the <a href="https://scrapy.org/" rel="nofollow">Scrapy</a> project.</p>
<p>Antch has two core components: <strong>Middleware</strong> and <strong>Item Pipeline</strong>. The Middleware component handles the HTTP download side (e.g., robots.txt, cookies, gzip), and the Item Pipeline component processes the Item data received from the Spider Handler.</p>
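<p>To make the two roles concrete, here is a minimal sketch of what the component shapes might look like in Go. The names (HttpMessageHandler, Middleware, Item, PipelineHandler) are illustrative assumptions for this post, not necessarily the actual Antch API:</p>
<pre><code>// Hypothetical shapes of Antch's two core components; the names are
// illustrative and may not match the real Antch API.
package sketch

import "net/http"

// HttpMessageHandler sends an HTTP request and returns the response;
// it plays the role of the downloader.
type HttpMessageHandler interface {
    Send(req *http.Request) (*http.Response, error)
}

// Middleware decorates the downloader with extra behavior
// (robots.txt checks, cookies, gzip decoding, proxying, ...).
type Middleware func(next HttpMessageHandler) HttpMessageHandler

// Item is a unit of scraped data emitted by a spider handler.
type Item map[string]interface{}

// PipelineHandler processes each Item the spider emits
// (cleaning, validation, storage, ...).
type PipelineHandler interface {
    ProcessItem(item Item)
}</code></pre>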
<p>The built-in middleware includes:</p>
<ul>
<li>robots.txt</li>
<li>proxy(HTTP,HTTPS,SOCKS5)</li>
<li>cookies</li>
<li>gzip</li>
</ul>
<p>Almost anything can be implemented as a middleware if you want; the framework is designed to be easy to extend, as the sketch below shows.</p>
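<p>As an illustration of that extensibility, the same decorator pattern is available in the standard library via http.RoundTripper: wrap a transport, adjust the request, and delegate. The middleware name and User-Agent string below are made up for illustration:</p>
<pre><code>package main

import (
    "fmt"
    "net/http"
)

// userAgentTransport wraps another RoundTripper and stamps a
// User-Agent header on every outgoing request.
type userAgentTransport struct {
    next http.RoundTripper
}

func (t *userAgentTransport) RoundTrip(req *http.Request) (*http.Response, error) {
    // A RoundTripper must not mutate the caller's request, so clone it first.
    clone := req.Clone(req.Context())
    clone.Header.Set("User-Agent", "antch-example/0.1")
    return t.next.RoundTrip(clone)
}

func main() {
    client := &http.Client{
        Transport: &userAgentTransport{next: http.DefaultTransport},
    }
    resp, err := client.Get("https://example.com/")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status)
}</code></pre>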
<p>This diagram gives an overview of the whole architecture design (<a href="https://raw.githubusercontent.com/wiki/antchfx/antch/imgs/antch_architecture_01.png" rel="nofollow">view diagram</a>); it is similar to the Scrapy architecture.</p>
<p>Project: <a href="https://github.com/antchfx/antch" rel="nofollow">https://github.com/antchfx/antch</a></p>
<p>The next plan is to build a distributed web crawler on top of the Antch project.</p>
<p>Does anyone have any ideas or suggestions?</p>
<hr/>**Comments:**<br/><br/>flatMapds: <pre><p>I have done something similar before in my security research.</p>
<p>Honestly, a big thing you should invest in for a distributed web crawler is scheduling / load-balancing algorithms built into the clients; I followed this variation of P2C: <a href="https://cs.stanford.edu/%7Ematei/papers/2013/sosp_sparrow.pdf" rel="nofollow">https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf</a>. Another thing that saves a good bit of work as you go along is writing an endpoint with a bloom filter that checks whether a link has already been seen before it's submitted to the workers, preferably one that is mergeable along with any other state. Also, to avoid sending duplicate work, instead of simply replicating requests to N masters to trigger the state change, I suggest you have the masters periodically poll each other and merge such state.</p>
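<p>As a concrete illustration of that idea, here is a minimal mergeable bloom filter sketch in Go; the sizes, hash choices, and names are illustrative, not tuned for production:</p>
<pre><code>package main

import (
    "fmt"
    "hash/fnv"
)

// bloomFilter is a minimal set-membership sketch: Add records a URL,
// MayContain reports whether it was (probably) seen before, and Merge
// ORs in another filter so state can be combined across masters.
// Parameters here are illustrative only.
type bloomFilter struct {
    bits []uint64
    k    int // number of derived bit positions per key
}

func newBloomFilter(mBits, k int) *bloomFilter {
    return &bloomFilter{bits: make([]uint64, (mBits+63)/64), k: k}
}

// positions derives k bit positions via double hashing (h1 + i*h2)
// using two FNV variants from the standard library.
func (b *bloomFilter) positions(s string) []uint64 {
    h1 := fnv.New64a()
    h1.Write([]byte(s))
    h2 := fnv.New64()
    h2.Write([]byte(s))
    v1, v2 := h1.Sum64(), h2.Sum64()|1 // force v2 odd
    m := uint64(len(b.bits) * 64)
    pos := make([]uint64, b.k)
    for i := range pos {
        pos[i] = (v1 + uint64(i)*v2) % m
    }
    return pos
}

func (b *bloomFilter) Add(s string) {
    for _, p := range b.positions(s) {
        b.bits[p/64] |= 1 << (p % 64)
    }
}

func (b *bloomFilter) MayContain(s string) bool {
    for _, p := range b.positions(s) {
        if b.bits[p/64]&(1<<(p%64)) == 0 {
            return false
        }
    }
    return true
}

// Merge assumes both filters were created with the same parameters.
func (b *bloomFilter) Merge(other *bloomFilter) {
    for i := range b.bits {
        b.bits[i] |= other.bits[i]
    }
}

func main() {
    seen := newBloomFilter(1<<20, 4)
    url := "https://example.com/page"
    fmt.Println(seen.MayContain(url)) // false
    seen.Add(url)
    fmt.Println(seen.MayContain(url)) // true
}</code></pre>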
<p>As for what kinds of state you may have: worker membership, the bloom filter, and master membership.</p></pre>menuvb: <pre><p>Thanks for your suggestion. This project has only just started and still lacks some middleware/components, such as duplicate-URL filtering, HTTP link depth tracking, etc.</p>
<p>Before starting on a distributed web crawler (if I still want to build one), I need to spend a lot of time making this framework more reliable and extensible.</p></pre>matiasbaruch: <pre><p>Interesting, I've recently started playing around with the idea of a web crawling platform (something similar to ScrapingHub or Apify, but completely open source) using Go. I won't provide a framework for implementing the crawlers; the goal is to support whatever you need to deploy/orchestrate the crawlers, collect the data from them, etc., even if they're written in different languages.</p>
<p>In case you're curious you may find it here: <a href="https://github.com/zcrawl/zcrawl" rel="nofollow">https://github.com/zcrawl/zcrawl</a></p></pre>