New web crawler framework, any ideas?

xuanbao · 19 views
<p>Hi everyone,</p> <p>A few days ago I decided to rewrite the <a href="https://github.com/antchfx/antch" rel="nofollow">Antch</a> project. It is an open source web crawler framework inspired by the <a href="https://scrapy.org/" rel="nofollow">Scrapy</a> project.</p> <p>Antch has two core components: <strong>Middleware</strong> and <strong>Item Pipeline</strong>. The Middleware component handles HTTP downloading (e.g. robots.txt, cookies, gzip), and the Item Pipeline component processes the Item data received from the Spider Handler.</p> <p>The built-in Middleware includes:</p> <ul> <li>robots.txt</li> <li>proxy (HTTP, HTTPS, SOCKS5)</li> <li>cookies</li> <li>gzip</li> </ul> <p>Anything can be written as Middleware if you want, so the framework is easy to extend.</p> <p>This diagram gives an overview of the whole architecture design (<a href="https://raw.githubusercontent.com/wiki/antchfx/antch/imgs/antch_architecture_01.png" rel="nofollow">view image</a>); it is similar to the Scrapy architecture.</p> <p>Project: <a href="https://github.com/antchfx/antch" rel="nofollow">https://github.com/antchfx/antch</a></p> <p>The next plan is to build a distributed web crawler on top of the Antch project.</p> <p>Does anyone have any ideas or suggestions?</p> <hr/>**Comments:**<br/><br/>flatMapds: <pre><p>I have done something similar before in my security research. 
</p> <p>Honestly, a big thing you should invest in for a distributed web crawler is scheduling / load-balancing algorithms built into the clients. I followed this variation of P2C (power of two choices): <a href="https://cs.stanford.edu/%7Ematei/papers/2013/sosp_sparrow.pdf" rel="nofollow">https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf</a>. Another thing that saves a good bit of work as you go along is writing an endpoint backed by a Bloom filter that checks whether a link has already been seen before it is submitted to the workers, preferably one that is mergeable along with any other state. Also, to avoid sending duplicate work, instead of simply replicating requests to N masters to trigger the state change, I suggest having the masters periodically pull from each other and merge that state.</p> <p>As for what kind of state you may have: worker membership, the Bloom filter, and master membership.</p></pre>menuvb: <pre><p>Thanks for your suggestion. This project is just getting started and still lacks some middleware/components, such as duplicate URL filtering, HTTP link depth tracking, etc.</p> <p>Before starting on a distributed web crawler (if I still want to), I need to spend a lot of time making this framework more usable and extensible.</p></pre>matiasbaruch: <pre><p>Interesting, I&#39;ve recently started to play around with the idea of a web crawling platform (something similar to ScrapingHub or Apify but completely open source), using Go. I won&#39;t provide a framework for implementing the crawlers; the goal is to support whatever you need to deploy/orchestrate them, collect the data from them, etc., even if they&#39;re written in different languages.</p> <p>In case you&#39;re curious you may find it here: <a href="https://github.com/zcrawl/zcrawl" rel="nofollow">https://github.com/zcrawl/zcrawl</a></p></pre>
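<p>flatMapds suggests a Bloom filter for seen links that can be merged between masters. A minimal sketch of that idea follows, assuming both sides use the same size and hash count so that a merge is a bitwise OR. The type <code>Bloom</code>, its methods, and the double-hashing scheme built on FNV-1a are choices made for this example, not something specified in the thread.</p>

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

// Bloom is a fixed-size Bloom filter for URL strings (illustrative only).
type Bloom struct {
	bits []uint64
	m    uint64 // number of bits
	k    uint64 // number of hash probes per item
}

// NewBloom creates a filter with m bits and k probes; zero values are
// replaced with small defaults so the filter is always usable.
func NewBloom(m, k uint64) *Bloom {
	if m == 0 {
		m = 64
	}
	if k == 0 {
		k = 1
	}
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes derives two base hashes from FNV-1a for double hashing.
func hashes(s string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(s))
	h1 := h.Sum64()
	h.Write([]byte{0xff}) // extend the input to get a second value
	h2 := h.Sum64() | 1   // force odd so the step is never zero
	return h1, h2
}

// Add records url in the filter.
func (b *Bloom) Add(url string) {
	h1, h2 := hashes(url)
	for i := uint64(0); i < b.k; i++ {
		idx := (h1 + i*h2) % b.m
		b.bits[idx/64] |= uint64(1) << (idx % 64)
	}
}

// MayContain reports whether url was possibly added. False positives are
// possible; false negatives are not.
func (b *Bloom) MayContain(url string) bool {
	h1, h2 := hashes(url)
	for i := uint64(0); i < b.k; i++ {
		idx := (h1 + i*h2) % b.m
		if b.bits[idx/64]&(uint64(1)<<(idx%64)) == 0 {
			return false
		}
	}
	return true
}

// Merge ORs other into b, so b then covers everything either filter has seen.
func (b *Bloom) Merge(other *Bloom) error {
	if b.m != other.m || b.k != other.k {
		return errors.New("bloom: filters have different parameters")
	}
	for i := range b.bits {
		b.bits[i] |= other.bits[i]
	}
	return nil
}

func main() {
	masterA, masterB := NewBloom(1024, 3), NewBloom(1024, 3)
	masterA.Add("https://example.com/a")
	masterB.Add("https://example.com/b")

	// Master A periodically pulls B's filter and merges it.
	if err := masterA.Merge(masterB); err != nil {
		fmt.Println("merge failed:", err)
		return
	}
	fmt.Println("A knows /b after merge:", masterA.MayContain("https://example.com/b"))
}
```

<p>Because the merge is a plain OR, it is idempotent and order-independent, which suits the "masters periodically pull from each other" approach: pulling the same state twice does no harm.</p>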