Open sourcing our web crawler?

xuanbao · 447 views
This is a resource shared some time ago; the information in it may have since evolved or changed.
Hi Everyone,

A requirement of my new startup was eventually building our own web crawler. I've been reading about it for quite a while now, seeing how others have solved the problem of performing extremely broad web crawls. Very recently, I even tried using the popular [Scrapy](https://github.com/scrapy/scrapy) crawler, but it just didn't meet our goals. Apache Nutch is widely known, but it is complicated to set up and [requires a crazy amount of disk space and memory](https://wiki.apache.org/nutch/HardwareRequirements), among other issues. [Gocrawl](https://github.com/PuerkitoBio/gocrawl) was another option, and it's coded in Go, but it's suited more to depth-first crawls.

We wanted a crawler that could:

- Perform a true [BFS](https://en.wikipedia.org/wiki/Breadth-first_search) (breadth-first search) crawl of the web
- Crawl billions of web pages on a single machine, with 1 (or more) CPUs, using only a few GB of memory
- Keep the entire URL queue on disk while remaining performant, so the only limit on crawl reach is disk space (much cheaper than memory)
- Crawl concurrently
- Crawl politely

So, after spending a long time learning Go and doing some groundwork like [Goque](https://github.com/beeker1121/goque), we now have our own crawler built 100% in Go that meets these goals.

Here are the current stats of it running on a machine with 1 vCPU, 3.75GB of RAM, and 1TB of disk space, running ~10 regular expressions on 512KB of data or less: http://i.imgur.com/H33JYCD.png

My question is: how much interest is there in us open sourcing this crawler? I'm also thinking of writing a post about it on our blog, detailing how it works.

If this is something you would be interested in, let us know by upvoting, commenting, or both!
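As a rough illustration of the goals listed above (this is not the author's code), here is a minimal sketch of a single-threaded, disk-backed, breadth-first crawl loop in Go. It assumes goque's `OpenQueue`/`EnqueueString`/`Dequeue` API as described in its README, uses `golang.org/x/net/html` for link extraction, and deliberately leaves out URL deduplication, robots.txt handling, and the concurrency the post mentions; the fixed sleep stands in for real per-host rate limiting. The seed URL, the 512KB cap, and the email regex are placeholders echoing details from the post.

```go
// Minimal sketch, not the author's implementation: a single-threaded,
// disk-backed BFS crawl loop. The goque API names are assumed from its README.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"regexp"
	"time"

	"github.com/beeker1121/goque"
	"golang.org/x/net/html"
)

// A stand-in for one of the "~10 regular expressions" mentioned in the post.
var emailRe = regexp.MustCompile(`[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}`)

func main() {
	// The URL frontier lives entirely on disk, so crawl reach is bounded by
	// disk space rather than memory.
	frontier, err := goque.OpenQueue("frontier-db")
	if err != nil {
		panic(err)
	}
	defer frontier.Close()

	frontier.EnqueueString("https://example.com/") // hypothetical seed URL

	for {
		item, err := frontier.Dequeue()
		if err == goque.ErrEmpty {
			break // frontier exhausted
		}
		if err != nil {
			panic(err)
		}
		crawl(item.ToString(), frontier)
		time.Sleep(500 * time.Millisecond) // crude politeness; a real crawler rate-limits per host
	}
}

func crawl(pageURL string, frontier *goque.Queue) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return
	}
	defer resp.Body.Close()

	// Cap each response at 512KB, echoing the limit in the post's stats.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 512*1024))
	if err != nil {
		return
	}

	for _, m := range emailRe.FindAllString(string(body), -1) {
		fmt.Println("match:", m)
	}

	// FIFO order is what makes the crawl breadth-first: newly discovered
	// links go to the back of the on-disk queue. A real crawler would also
	// deduplicate URLs and honor robots.txt here.
	for _, link := range extractLinks(body, pageURL) {
		frontier.EnqueueString(link)
	}
}

// extractLinks pulls absolute http(s) links out of an HTML document.
func extractLinks(body []byte, base string) []string {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil
	}
	doc, err := html.Parse(bytes.NewReader(body))
	if err != nil {
		return nil
	}
	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key != "href" {
					continue
				}
				if u, err := baseURL.Parse(a.Val); err == nil && (u.Scheme == "http" || u.Scheme == "https") {
					links = append(links, u.String())
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links
}
```

The key property is that the frontier is a FIFO persisted on disk: new links always go to the back of the queue, so the crawl proceeds level by level, and the queue can grow far beyond what would fit in RAM. Whether goque alone is enough, or whether the author's crawler layers more on top (deduplication, per-host scheduling), is exactly the kind of detail a blog post or the released source would answer.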
---

**Comments:**

**Patatatarte:** The community will always benefit from your work, and the inverse is likely to be true.

**bkeroack:** I'm a huge FOSS advocate, but this is clearly an address-harvesting crawler for spammers. I don't see how supporting this will benefit the community.

**Veonik:** So would you say that we should censor knowledge because it might be misused?

**mwholt:** I'd like to see the code powering email harvesters; they're going to run either way.

**unimportant1234567:** I would use something like this for general web archiving and curation purposes. I bookmark sites, then they die and I lose information because the site is gone. This may be a great tool for that use case.

**jerf:** The web crawler is the hard part. Extracting email addresses is like three lines attached to the crawler at the end. Even if that is what they happen to be using it for, the crawler itself may still be useful.

**morethanaprogrammer:** Would enjoy seeing the process as well.

**nycalibjj:** "Emails found" implies that this is an email address harvesting spambot.

**Yojihito:** Which would explain the regex stuff in the description.

**Mr_Psmith:** Yes, by all means! Would love to read your blog post as well.

**Traim:** A blog post would really be appreciated. Currently I personally won't use the crawler, but maybe in the future.

**lluad:** I don't need a web crawler, but I'm still interested in seeing what approaches you used.

**unimportant1234567:** Very interested in this type of work and article. Please let us see it :)

**bateller:** Please write this up. I'd find this very useful.

**thesilentwitness:** I'm curious why you would ask if you should open source it instead of just open sourcing it and then linking to it.

**pandawithpie:** Does your crawler deal with client-side rendered pages (e.g. React, Angular)? I've found that there's a lot of complexity related to rendering these pages before saving them to disk. Curious to know if you managed to solve this!

**Yojihito:** A crawler in pure Go can't deal with those; no JavaScript means no client-side rendering. You only get the HTML code.

**mwholt:** That's not *entirely* true: https://github.com/robertkrimen/otto

**robertmeta:** There are multiple JavaScript engines for Go.

**Yojihito:** Do you have a list? I need a replacement for PhantomJS to execute JS on crawled pages.

**iamafuckingrobot:** Yes, I'm very interested! Please keep us updated. Edit: I've been working on a very similar solution, essentially an alternative to Nutch, for over a year as my full-time job. Unfortunately not in Go, but you can see why I'm interested.

**UnchartedFr:** I'm interested! Since I'm a heavy user of Scrapy, I feel that it has too many limitations for our needs, and I was thinking of migrating to Go too. Do you handle XPath like Scrapy?

**dAnjou:** What are those limitations of Scrapy? Last time I used it, it seemed very flexible.

**Asti_:** Please write it up. I love learning from people who use Go to solve actual problems and not just the basic "We are going to create a tasks app..." form. Great job.

**BestKarmaEver:** I know I would be very interested in the way it works.

**Southclaw:** I was actually thinking of writing a crawler in Go recently. This looks interesting, so I'd say definitely open source it!

**itsamemmario:** Please do!

**chmouelb:** I'll definitely be interested to look at how this is implemented.

**matiasbaruch:** Sounds interesting!

**shazow:** Sounds like a great codebase with lots of interesting things to learn from. I'd love to read through it, and would definitely contribute back if I used it. Please open source it sooner rather than later. :)

**Yojihito:** I've built my own crawler as a side project at the office in Go, but it's limited (by choice) to crawling a single domain and writing all links, status codes, and link sources to a .csv file. I would be very interested in your crawler. What is your crawler crawling? What do you need 1TB full of links for?

**sarcasmkills:** Open source! Sounds awesome!

**Asdayasman:** Not being super comfortable with reading Go code yet, I'm always down to read a writeup.
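A note on the client-side rendering sub-thread above: [otto](https://github.com/robertkrimen/otto) is a JavaScript interpreter written in pure Go, so a Go program can evaluate scripts it finds, but it ships no DOM or browser environment, which is why it is not a drop-in replacement for PhantomJS-style rendering. A minimal sketch of evaluating a script with otto (the script itself is just a hypothetical example):

```go
package main

import (
	"fmt"

	"github.com/robertkrimen/otto"
)

func main() {
	vm := otto.New()

	// Evaluate a standalone script. There is no DOM here: "document",
	// "window", etc. do not exist unless you define them on the VM yourself.
	value, err := vm.Run(`
		var parts = ["user", "example.com"]; // hypothetical script
		parts[0] + "@" + parts[1];
	`)
	if err != nil {
		panic(err)
	}

	s, err := value.ToString()
	if err != nil {
		panic(err)
	}
	fmt.Println(s) // prints: user@example.com
}
```

Other pure-Go engines exist (goja, at github.com/dop251/goja, is a commonly cited one), but none of them provides the full rendering environment that PhantomJS or a headless browser does.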
