New web crawler framework, any ideas?

xuanbao · 72 views
This is a resource shared some time ago; the information in it may have since evolved or changed.
Hi everyone,

A few days ago I decided to rewrite the [Antch]() project, an open-source web crawler framework inspired by the [Scrapy]() project.

Antch has two core components: **Middleware** and the **Item Pipeline**. The Middleware component handles the HTTP download side (e.g. robots.txt, cookies, gzip), and the Item Pipeline component processes the Item data received from the Spider handler.

The built-in middleware includes:

- robots.txt
- proxy (HTTP, HTTPS, SOCKS5)
- cookies
- gzip

Anything can be implemented as middleware if you want; the framework is easy to extend.

This image gives an overview of the whole architecture design ([view jpg]()); it is similar to the Scrapy architecture.

Project: []()

The next plan is to build a distributed web crawler on top of the Antch project.

Does anyone have any ideas or suggestions?

---

**Comments:**

flatMapds:

> I have done something similar before, in my security research.
>
> Honestly, one big thing you should invest in for a distributed web crawler is the scheduling / load-balancing algorithms built into the clients; I followed this variation of P2C ([link]()). Another thing that saves a good bit of work as you go along is an endpoint with a Bloom filter that checks whether a link has already been seen before it is submitted to the workers, preferably one that is mergeable along with any other state. Also, to avoid duplicate work being dispatched, instead of simply replicating requests to N masters to trigger the state change, I suggest having the masters periodically pull from each other to merge that state.
>
> As for what kind of state you may have: worker membership, the Bloom filter, and master membership.

menuvb:

> Thanks for your suggestions. This project has only just started and still lacks some middleware/components, such as duplicate-URL filtering, HTTP link-depth tracking, etc.
>
> Before starting on a distributed web crawler (if I still want to), I need a lot of time to make this framework more robust and extensible.

matiasbaruch:

> Interesting. I've recently started playing around with the idea of a web crawling platform (something similar to ScrapingHub or Apify, but completely open source), using Go. It won't provide a framework for implementing the crawlers; the goal is to support whatever you need to deploy/orchestrate them, collect the data from them, etc., even if they're written in different languages.
>
> In case you're curious, you can find it here: []()