gocrawl Analysis

harrysun


1. gocrawl struct layout

    // The crawler itself, the master of the whole process
    type Crawler struct {
        Options *Options

        // Internal fields
        logFunc         func(LogFlags, string, ...interface{})
        push            chan *workerResponse
        enqueue         chan interface{}
        stop            chan struct{}
        wg              *sync.WaitGroup
        pushPopRefCount int
        visits          int

        // keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
        // is of no use, but this is the smallest type possible - it uses no memory at all.
        visited map[string]struct{}
        hosts   map[string]struct{}
        workers map[string]*worker
    }
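
The comment on visited and hosts describes the usual Go idiom for a set. A minimal sketch of that idiom follows; urlSet and its add method are hypothetical names for illustration and are not part of gocrawl. Note that only the struct{} value is zero-sized: the keys and the map's own bookkeeping still take memory.

```go
package main

import "fmt"

// urlSet is a hypothetical helper, not part of gocrawl: a set of strings
// backed by map[string]struct{}, the same shape as Crawler.visited.
type urlSet map[string]struct{}

// add inserts u and reports whether it was newly added.
func (s urlSet) add(u string) bool {
	if _, seen := s[u]; seen {
		return false
	}
	s[u] = struct{}{} // the empty struct value carries no data
	return true
}

func main() {
	visited := urlSet{}
	urls := []string{"http://a.example/", "http://b.example/", "http://a.example/"}
	for _, u := range urls {
		if visited.add(u) {
			fmt.Println("first visit:", u)
		} else {
			fmt.Println("already visited:", u)
		}
	}
	fmt.Println("distinct URLs:", len(visited))
}
```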

    // The Options available to control and customize the crawling process.
    type Options struct {
        UserAgent             string
        RobotUserAgent        string
        MaxVisits             int
        EnqueueChanBuffer     int
        HostBufferFactor      int
        CrawlDelay            time.Duration // Applied per host
        WorkerIdleTTL         time.Duration
        SameHostOnly          bool
        HeadBeforeGet         bool
        URLNormalizationFlags purell.NormalizationFlags
        LogFlags              LogFlags
        Extender              Extender
    }

    // Extension methods required to provide an extender instance.
    type Extender interface {
        // Start, End, Error and Log are not related to a specific URL, so they don't
        // receive a URLContext struct.
        Start(interface{}) interface{}
        End(error)
        Error(*CrawlError)
        Log(LogFlags, LogFlags, string)

        // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
        // is related to a URLContext (holds a ctx field).
        ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

        // All other extender methods are executed in the context of an URL, and thus
        // receive an URLContext struct as first argument.
        Fetch(*URLContext, string, bool) (*http.Response, error)
        RequestGet(*URLContext, *http.Response) bool
        RequestRobots(*URLContext, string) ([]byte, bool)
        FetchedRobots(*URLContext, *http.Response)
        Filter(*URLContext, bool) bool
        Enqueued(*URLContext)
        Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
        Visited(*URLContext, interface{})
        Disallowed(*URLContext)
    }

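Entry point:

    package main

    import (
        "time"
        "github.com/PuerkitoBio/gocrawl"
    )

    // Ext is not shown in the original listing; the literal in main implies it embeds *gocrawl.DefaultExtender.
    type Ext struct{ *gocrawl.DefaultExtender }

    func main() {
        ext := &Ext{&gocrawl.DefaultExtender{}}
        // Set custom options
        opts := gocrawl.NewOptions(ext)
        opts.CrawlDelay = 1 * time.Second
        opts.LogFlags = gocrawl.LogError
        opts.SameHostOnly = false
        opts.MaxVisits = 10

        c := gocrawl.NewCrawlerWithOptions(opts)
        c.Run("http://0value.com")
    }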

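main does three things:

1) get an Extender

2) create Options with the given Extender

3) create the gocrawl Crawler from those Options and run it

As the comment on Crawler says, the crawler controls the whole process. Options supplies the configuration, and the Extender's methods (Fetch, Filter, Visit and so on) do the real work for each URL.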
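
The Ext literal in main suggests the common Go pattern of embedding a default implementation and overriding only the methods you care about. The sketch below shows that pattern with the standard library only. The extender interface, defaultExtender and ext are hypothetical, heavily cut-down stand-ins with simplified signatures; they are not gocrawl's actual types.

```go
package main

import (
	"fmt"
	"strings"
)

// extender is a hypothetical, cut-down stand-in for gocrawl's Extender.
type extender interface {
	Filter(url string, isVisited bool) bool
	Visited(url string)
}

// defaultExtender supplies baseline behavior for every method.
type defaultExtender struct{}

// Filter accepts any URL that has not been visited yet.
func (*defaultExtender) Filter(url string, isVisited bool) bool { return !isVisited }

// Visited does nothing by default.
func (*defaultExtender) Visited(url string) {}

// ext embeds the default and overrides only Filter.
type ext struct {
	*defaultExtender
}

// Filter keeps the default rule and additionally requires an https URL.
func (e *ext) Filter(url string, isVisited bool) bool {
	return e.defaultExtender.Filter(url, isVisited) && strings.HasPrefix(url, "https://")
}

func main() {
	var e extender = &ext{&defaultExtender{}}
	fmt.Println(e.Filter("https://example.com/", false)) // true
	fmt.Println(e.Filter("http://example.com/", false))  // false
	fmt.Println(e.Filter("https://example.com/", true))  // false
	e.Visited("https://example.com/")                    // promoted from defaultExtender
}
```

Because Visited is promoted from the embedded pointer, ext satisfies the interface without restating every method, which is what makes a large interface such as Extender practical to implement.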

2. Other key structs

worker, workerResponse and sync.WaitGroup

    // Communication from worker to the master crawler, about the crawling of a URL
    type workerResponse struct {
        ctx           *URLContext
        visited       bool
        harvestedURLs interface{}
        host          string
        idleDeath     bool
    }

    // The worker is dedicated to fetching and visiting a given host, respecting
    // this host's robots.txt crawling policies.
    type worker struct {
        // Worker identification
        host  string
        index int

        // Communication channels and sync
        push    chan<- *workerResponse
        pop     popChannel
        stop    chan struct{}
        enqueue chan<- interface{}
        wg      *sync.WaitGroup

        // Robots validation
        robotsGroup *robotstxt.Group

        // Logging
        logFunc func(LogFlags, string, ...interface{})

        // Implementation fields
        wait           <-chan time.Time
        lastFetch      *FetchInfo
        lastCrawlDelay time.Duration
        opts           *Options
    }
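
The field names above hint at how master and workers talk to each other: one worker per host, a pop channel going down, a shared push channel coming back, and a stop channel to end everything. The following is a rough sketch of that shape under my own assumptions, not gocrawl's actual control flow. The target, response and hostWorker types and the crawl function are hypothetical, and no real fetching happens.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// target and response are hypothetical, reduced stand-ins for a URL to
// crawl and for workerResponse.
type target struct{ host, path string }
type response struct{ host, path string }

// hostWorker is a hypothetical, reduced worker: one goroutine per host.
type hostWorker struct {
	host string
	pop  chan string         // paths handed down by the master
	push chan<- response     // results reported back to the master
	stop <-chan struct{}     // closed by the master to end the crawl
	wg   *sync.WaitGroup
}

// run handles paths until the master closes stop.
func (w *hostWorker) run() {
	defer w.wg.Done()
	for {
		select {
		case <-w.stop:
			return
		case path := <-w.pop:
			// A real worker would fetch and visit the URL here.
			w.push <- response{host: w.host, path: path}
		}
	}
}

// crawl starts a worker the first time a host is seen, routes each target
// to the worker owning its host, and collects one response per target.
func crawl(targets []target) []string {
	push := make(chan response)
	stop := make(chan struct{})
	var wg sync.WaitGroup
	workers := make(map[string]*hostWorker)

	for _, t := range targets {
		w, ok := workers[t.host]
		if !ok {
			// pop is buffered so the master never blocks while dispatching.
			w = &hostWorker{host: t.host, pop: make(chan string, len(targets)),
				push: push, stop: stop, wg: &wg}
			workers[t.host] = w
			wg.Add(1)
			go w.run()
		}
		w.pop <- t.path
	}

	visited := make([]string, 0, len(targets))
	for range targets {
		r := <-push
		visited = append(visited, r.host+r.path)
	}
	close(stop) // every target is accounted for; tell the workers to exit
	wg.Wait()   // do not return until all worker goroutines are gone
	sort.Strings(visited)
	return visited
}

func main() {
	for _, u := range crawl([]target{{"a.example", "/"}, {"b.example", "/"}, {"a.example", "/about"}}) {
		fmt.Println(u)
	}
}
```

Keying the workers map by host is what lets a per-host policy such as CrawlDelay live inside a single goroutine instead of being coordinated across many.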

For background on sync.WaitGroup, see http://mindfsck.net/example-golang-makes-concurrent-programming-easy-awesome/ and http://soniacodes.wordpress.com/2011/02/28/channels-vs-sync-package/
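
In short, a WaitGroup is a counter: Add before starting a goroutine, Done when it finishes, Wait to block until the counter reaches zero. A minimal sketch, where fetchAll is a hypothetical helper that only pretends to fetch:

```go
package main

import (
	"fmt"
	"sync"
)

// fetchAll is a hypothetical helper: it "fetches" each host in its own
// goroutine and returns only after all of them have finished.
func fetchAll(hosts []string) []string {
	results := make([]string, len(hosts))
	var wg sync.WaitGroup
	for i, h := range hosts {
		wg.Add(1) // register before starting the goroutine
		go func(i int, h string) {
			defer wg.Done()
			results[i] = "fetched " + h // each goroutine writes only its own slot
		}(i, h)
	}
	wg.Wait() // blocks until every Done has been called
	return results
}

func main() {
	for _, r := range fetchAll([]string{"a.example", "b.example"}) {
		fmt.Println(r)
	}
}
```

The Crawler and every worker share one *sync.WaitGroup in the structs above, which presumably serves the same purpose: the master can wait for all workers to exit before Run returns.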

3. I will walk through the whole gocrawl workflow in a few days. (6/20/2014)


Originally published on 博客园 (cnblogs).
