gocrawl Analysis

harrysun

1. gocrawl type structure

 

// The crawler itself, the master of the whole process
type Crawler struct {
    Options *Options

    // Internal fields
    logFunc         func(LogFlags, string, ...interface{})
    push            chan *workerResponse
    enqueue         chan interface{}
    stop            chan struct{}
    wg              *sync.WaitGroup
    pushPopRefCount int
    visits          int

    // keep lookups in maps, O(1) access time vs O(n) for slice. The empty struct value
    // is of no use, but this is the smallest type possible - it uses no memory at all.
    visited map[string]struct{}
    hosts   map[string]struct{}
    workers map[string]*worker
}

 

// The Options available to control and customize the crawling process.
type Options struct {
    UserAgent             string
    RobotUserAgent        string
    MaxVisits             int
    EnqueueChanBuffer     int
    HostBufferFactor      int
    CrawlDelay            time.Duration // Applied per host
    WorkerIdleTTL         time.Duration
    SameHostOnly          bool
    HeadBeforeGet         bool
    URLNormalizationFlags purell.NormalizationFlags
    LogFlags              LogFlags
    Extender              Extender
}

 

// Extension methods required to provide an extender instance.
type Extender interface {
    // Start, End, Error and Log are not related to a specific URL, so they don't
    // receive a URLContext struct.
    Start(interface{}) interface{}
    End(error)
    Error(*CrawlError)
    Log(LogFlags, LogFlags, string)

    // ComputeDelay is related to a Host only, not to a URLContext, although the FetchInfo
    // is related to a URLContext (holds a ctx field).
    ComputeDelay(string, *DelayInfo, *FetchInfo) time.Duration

    // All other extender methods are executed in the context of an URL, and thus
    // receive an URLContext struct as first argument.
    Fetch(*URLContext, string, bool) (*http.Response, error)
    RequestGet(*URLContext, *http.Response) bool
    RequestRobots(*URLContext, string) ([]byte, bool)
    FetchedRobots(*URLContext, *http.Response)
    Filter(*URLContext, bool) bool
    Enqueued(*URLContext)
    Visit(*URLContext, *http.Response, *goquery.Document) (interface{}, bool)
    Visited(*URLContext, interface{})
    Disallowed(*URLContext)
}
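
An interface this large is normally satisfied by embedding a default implementation and overriding only the methods you care about. The sketch below shows that embedding pattern using the standard library only. The visitor interface, defaultVisitor and myVisitor are hypothetical, reduced stand-ins, not gocrawl types, and their method signatures are simplified on purpose.

```go
package main

import "fmt"

// visitor is a hypothetical two-method stand-in for the much larger
// Extender interface; it exists only to demonstrate embedding.
type visitor interface {
	Filter(url string, isVisited bool) bool
	Visit(url string) string
}

// defaultVisitor plays the role of a default implementation.
type defaultVisitor struct{}

// Filter accepts every URL that has not been visited yet.
func (defaultVisitor) Filter(url string, isVisited bool) bool { return !isVisited }

// Visit is the fallback behaviour used when nothing overrides it.
func (defaultVisitor) Visit(url string) string { return "default visit: " + url }

// myVisitor embeds the default and overrides only Visit.
type myVisitor struct {
	*defaultVisitor
}

// Visit shadows the promoted defaultVisitor.Visit method.
func (myVisitor) Visit(url string) string { return "custom visit: " + url }

func main() {
	var v visitor = myVisitor{&defaultVisitor{}}
	fmt.Println(v.Filter("http://example.com", false)) // promoted from defaultVisitor
	fmt.Println(v.Visit("http://example.com"))         // overridden by myVisitor
}
```

Because the outer type's method is declared at a shallower depth than the promoted one, Go selects the override without any explicit registration.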

 

Entry point:

package main

import (
    "time"

    "github.com/PuerkitoBio/gocrawl"
)

// Ext embeds the default extender, so only selected methods need overriding.
type Ext struct {
    *gocrawl.DefaultExtender
}

func main() {
    ext := &Ext{&gocrawl.DefaultExtender{}}
    // Set custom options
    opts := gocrawl.NewOptions(ext)
    opts.CrawlDelay = 1 * time.Second
    opts.LogFlags = gocrawl.LogError
    opts.SameHostOnly = false
    opts.MaxVisits = 10

    c := gocrawl.NewCrawlerWithOptions(opts)
    c.Run("http://0value.com")
}

 

main does three things:

1) Get an Extender.

2) Create Options with the given Extender.

3) Create the Crawler from those Options and run it.

As the source comments say, the Crawler controls the whole process, Options supplies the configuration, and the Extender does the real work.

 

2. Other key structs

worker, workerResponse and sync.WaitGroup

// Communication from worker to the master crawler, about the crawling of a URL
type workerResponse struct {
    ctx           *URLContext
    visited       bool
    harvestedURLs interface{}
    host          string
    idleDeath     bool
}

 

// The worker is dedicated to fetching and visiting a given host, respecting
// this host's robots.txt crawling policies.
type worker struct {
    // Worker identification
    host  string
    index int

    // Communication channels and sync
    push    chan<- *workerResponse
    pop     popChannel
    stop    chan struct{}
    enqueue chan<- interface{}
    wg      *sync.WaitGroup

    // Robots validation
    robotsGroup *robotstxt.Group

    // Logging
    logFunc func(LogFlags, string, ...interface{})

    // Implementation fields
    wait           <-chan time.Time
    lastFetch      *FetchInfo
    lastCrawlDelay time.Duration
    opts           *Options
}
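
The field names suggest a master/worker layout: the master hands URLs to a per-host worker over pop, the worker reports back over push, and closing stop shuts everything down. The following is a rough sketch of that layout, inferred from the struct fields alone and not taken from gocrawl's actual implementation. The response type and the runWorker and crawl functions are hypothetical, and no fetching, robots.txt check or crawl delay is performed.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// response is a hypothetical, reduced version of workerResponse.
type response struct {
	host    string
	url     string
	visited bool
}

// runWorker handles the URLs of one host until stop is closed.
func runWorker(host string, pop <-chan string, push chan<- response,
	stop <-chan struct{}, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-stop:
			return
		case u := <-pop:
			// A real worker would consult robots.txt, honour the crawl
			// delay and fetch the page here; this sketch only reports back.
			push <- response{host: host, url: u, visited: true}
		}
	}
}

// crawl starts one worker per host, collects one response per URL,
// then stops the workers and waits for them to exit.
func crawl(urlsByHost map[string][]string) []string {
	push := make(chan response)
	stop := make(chan struct{})
	var wg sync.WaitGroup

	total := 0
	for host, urls := range urlsByHost {
		pop := make(chan string, len(urls))
		for _, u := range urls {
			pop <- u
		}
		total += len(urls)
		wg.Add(1)
		go runWorker(host, pop, push, stop, &wg)
	}

	var done []string
	for i := 0; i < total; i++ {
		r := <-push
		done = append(done, r.host+r.url)
	}
	close(stop) // every worker sees the closed channel and returns
	wg.Wait()
	sort.Strings(done) // arrival order is nondeterministic, so sort for display
	return done
}

func main() {
	fmt.Println(crawl(map[string][]string{
		"a.example": {"/1", "/2"},
		"b.example": {"/x"},
	}))
}
```

Using one worker per host keeps per-host state, such as the robots.txt group and the last fetch time, local to a single goroutine, so it needs no locking.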

 

For more about sync.WaitGroup, see http://mindfsck.net/example-golang-makes-concurrent-programming-easy-awesome/ and http://soniacodes.wordpress.com/2011/02/28/channels-vs-sync-package/
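
As a quick reminder of what sync.WaitGroup does, here is a small standalone sketch. The sumSquares function is a made-up example unrelated to gocrawl. Add is called before each goroutine starts, Done is called when it finishes, and Wait blocks until the counter returns to zero.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares squares each number in its own goroutine and waits for all
// of them to finish before returning the total.
func sumSquares(nums []int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex // guards total
		total int
	)
	for _, n := range nums {
		wg.Add(1) // register the goroutine before starting it
		go func(n int) {
			defer wg.Done()
			mu.Lock()
			total += n * n
			mu.Unlock()
		}(n)
	}
	wg.Wait() // blocks until every Done has been called
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3})) // 14
}
```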

3. I will describe the complete gocrawl workflow in a few days. (6/20/2014)

 

 

 

 


Source: 博客园 (cnblogs)
