Creeper: a next-generation crawler framework implemented in Go

plutonist • 2989 views
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg?style=flat)](https://opensource.org/licenses/Apache-2.0) [![PyPI](https://img.shields.io/pypi/status/Django.svg?style=flat)]()

![Creeper](https://raw.githubusercontent.com/wspl/creeper/master/art/Creeper.png)

## About

Creeper is a *next-generation* crawler that fetches web pages according to a Creeper script. It is a cross-platform, embeddable crawler that you can use in a news app, a subscription program, and similar projects.

**Warning:** This project is still in stage-1 development. Please do not use it in a production environment.

## Get Started

#### Installation

```
$ go get github.com/wspl/creeper
```

#### Hello World!

Create `hacker_news.crs`:

```
page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href
```

Then create `main.go`:

```go
package main

import "github.com/wspl/creeper"

func main() {
	// Load the Creeper script that describes what to crawl.
	c := creeper.Open("./hacker_news.crs")

	// Visit every item of the `news` array node and print its fields.
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}
```

Build and run. The console prints something like:

```
title: Samsung chief Lee arrested as S.Korean corruption probe deepens
site: reuters.com
link: http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title: ReactOS 0.4.4 Released
site: reactos.org
link: https://reactos.org/project-news/reactos-044-released
===
title: FeFETs: How this new memory stacks up against existing non-volatile memory
site: semiengineering.com
link: http://semiengineering.com/what-are-fefets/
```

## Script Spec

### Town

A town is a lambda-like expression for storing an (im)mutable string. Most of the time it is used to store a URL.
```
page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"
```

When you need a town, use it as if you were calling a function:

```
news[]: page(ext="Hello World!") -> $("tr.athing")
```

You might have noticed that the `@page` parameter is not passed in this call. That is because it is a special parameter.

In a town definition, an expression like `name="something"` means that the parameter `name` has the default value `"something"`. In addition, `@page` is a parameter that increases automatically when the current page has no more content.

### Node

Nodes form a tree that represents the structure of the data you are going to crawl.

```
news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href
```

As in YAML, nodes express their hierarchy through indentation.

#### Node Name

Every node has a name. `title` is a field name and represents a plain string value. `news[]` is an array name and represents a parent structure with multiple child items.

#### Page

Page indicates where to fetch the field data from. It can be a town expression or a field reference. Field references are an advanced use of nodes; you can find the details in [./eh.crs](./eh.crs).

If a node has both a page and a fun, the page goes on the left of `->` and the fun on the right, that is, `page -> fun`.

#### Fun

A fun represents a data-processing step applied to the fetched content.
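One way to picture a chain of funs, such as a selector followed by `.text` and then `match` or `expand`, is as a sequence of string-to-string steps applied from left to right. The Go sketch below is illustrative only and is not Creeper's implementation: the names `Fun`, `matchFun`, `expandFun`, and `chain` are hypothetical, the `${1}` template syntax comes from Go's standard `regexp` package and is only an assumption about how `expand` behaves, and `strings.TrimSpace` merely stands in for an earlier extraction step.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Fun is a hypothetical type for this sketch: one processing step that
// turns a string into another string. It is not part of Creeper's API.
type Fun func(string) string

// matchFun returns a Fun that keeps the first substring matching pattern,
// or "" when nothing matches.
func matchFun(pattern string) Fun {
	re := regexp.MustCompile(pattern)
	return func(s string) string {
		return re.FindString(s)
	}
}

// expandFun returns a Fun that rewrites the first match of pattern using
// target, where ${1}, ${2}, ... refer to capture groups. This template
// syntax is Go's own and is assumed here purely for illustration.
func expandFun(pattern, target string) Fun {
	re := regexp.MustCompile(pattern)
	return func(s string) string {
		loc := re.FindStringSubmatchIndex(s)
		if loc == nil {
			return ""
		}
		return string(re.ExpandString(nil, target, s, loc))
	}
}

// chain applies the given funs from left to right.
func chain(funs ...Fun) Fun {
	return func(s string) string {
		for _, f := range funs {
			s = f(s)
		}
		return s
	}
}

func main() {
	// Made-up input standing in for text already extracted by a selector.
	raw := "  42 points by plutonist  "

	points := chain(strings.TrimSpace, matchFun(`\d+`))
	fmt.Println(points(raw)) // prints "42"

	author := chain(strings.TrimSpace, expandFun(`by (\w+)`, "@${1}"))
	fmt.Println(author(raw)) // prints "@plutonist"
}
```

Modeling each step as a plain function keeps the chain easy to extend: adding another step only requires another value of the same function type.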
The supported funs are:

| Name      | Parameters                       | Description                                      |
| --------- | -------------------------------- | ------------------------------------------------ |
| $         | (selector: string)               | CSS selector                                     |
| html      |                                  | inner HTML                                       |
| text      |                                  | inner text                                       |
| outerHTML |                                  | outer HTML                                       |
| attr      | (attr: string)                   | attribute value                                  |
| style     |                                  | style attribute value                            |
| href      |                                  | href attribute value                             |
| src       |                                  | src attribute value                              |
| calc      | (prec: int)                      | calculate arithmetic expression                  |
| match     | (regexp: string)                 | match first substring via regular expression     |
| expand    | (regexp: string, target: string) | expand matched strings to target string          |

## Author

Plutonist

> [impl.moe](https://impl.moe) · GitHub [@wspl](https://github.com/wspl)
Development language: Golang
Operating system: all platforms