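I recently came across a quote-library site: https://www.goodreads.com
That led me to Colly, a Go scraping framework with more than 6K stars on GitHub.
Project goal
First I want to scrape all of the quotes from goodreads and save them to a file. Then I parse the scraped quotes and format each one as Markdown so that it renders nicely on a web page, like this:
Each quote has three elements: the quote type, the quote body, and the author or source.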
“We are what we pretend to be, so we must be careful about what we pretend to be.”
Kurt Vonnegut, Mother Night
“Sometimes you wake up. Sometimes the fall kills you. And sometimes, when you fall, you fly.”
Neil Gaiman, Fables & Reflections
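The three elements map naturally onto a small struct. The following is only a minimal sketch: the `Quote` type and the `render` function are hypothetical names introduced for illustration, and the admonition shortcode syntax is assumed from the draft program in the quick-start section.

```go
package main

import (
	"fmt"
	"strings"
)

// Quote holds the three elements of a single quote.
// The type and its field names are illustrative, not part of Colly.
type Quote struct {
	Kind   string // admonition type, e.g. "quote"
	Text   string // the quote body
	Author string // author or source
}

// render formats a quote as an admonition shortcode block.
// The shortcode syntax is an assumption based on the draft program.
func render(q Quote) string {
	var b strings.Builder
	fmt.Fprintf(&b, "{{%% admonition %s \"%s\" %%}}\n", q.Kind, q.Author)
	b.WriteString(q.Text)
	b.WriteString("\n{{% /admonition %}}\n")
	return b.String()
}

func main() {
	q := Quote{
		Kind:   "quote",
		Text:   "“We are what we pretend to be, so we must be careful about what we pretend to be.”",
		Author: "Kurt Vonnegut, Mother Night",
	}
	fmt.Print(render(q))
}
```

Keeping the formatting in a pure function like this makes it possible to check the Markdown output without touching the network or the file system.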
Preparation
Brief introduction
Lightning Fast and Elegant Scraping Framework for Gophers.
Colly provides a clean interface to write any kind of crawler/scraper/spider.
With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.
Colly Git repository
gocolly/colly : https://github.com/gocolly/colly
Installation
$ go get -u github.com/gocolly/colly/...
Go environment
$ go version
go version go1.12.8 linux/amd64
You can optionally export GO111MODULE=on.
Quick start
draft:
package main

import (
	"fmt"
	"os"
	"strings"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	// Create the output file that collects the formatted quotes.
	fileName := "quote.md"
	file, errFile := os.Create(fileName)
	if errFile != nil {
		fmt.Fprintf(os.Stderr, "create file error: %s\n", errFile)
		panic(errFile)
	}
	defer func() {
		if err := file.Close(); err != nil {
			fmt.Fprintln(os.Stderr, "file close error:", err)
		}
	}()

	c := colly.NewCollector()

	// Route requests through a local proxy; remove this block if you do not need one.
	errProxy := c.SetProxy("http://127.0.0.1:1080/")
	if errProxy != nil {
		fmt.Fprintf(os.Stderr, "colly set proxy error: %s\n", errProxy)
		panic(errProxy)
	}

	// AllowedDomains takes host names, not full URLs.
	// c.AllowedDomains = []string{"www.goodreads.com"}
	c.AllowURLRevisit = true
	extensions.RandomUserAgent(c)

	// The text before the "―" separator is the quote body;
	// the .authorOrTitle children hold the author or source.
	c.OnHTML(".quoteText", func(e *colly.HTMLElement) {
		text := strings.TrimSpace(strings.Split(e.Text, "―")[0])
		author := TrimSpaceNewlineInString(strings.TrimSpace(e.ChildText(".authorOrTitle")))
		fileWriteForMarkdown(file, text, author)
	})

	// Follow the "next page" link to walk through the paginated results.
	c.OnHTML(".next_page", func(e *colly.HTMLElement) {
		href := e.Attr("href")
		if href == "" {
			// Skip elements that carry no href.
			return
		}
		next := e.Request.AbsoluteURL(href)
		fmt.Println("visit:", next)
		if errHrefVisit := c.Visit(next); errHrefVisit != nil {
			panic(errHrefVisit)
		}
	})

	errVisit := c.Visit("https://www.goodreads.com/quotes/tag/philosophy")
	if errVisit != nil {
		panic(errVisit)
	}
}

// TrimSpaceNewlineInString replaces every newline in s with a space.
func TrimSpaceNewlineInString(s string) string {
	return strings.ReplaceAll(s, "\n", " ")
}

// fileWriteForMarkdown wraps the quote text in an admonition shortcode
// whose title is the author or source.
func fileWriteForMarkdown(file *os.File, text, author string) {
	head := fmt.Sprintf("\n{{%% admonition quote \"%s\" %%}}\n", author)
	const admonitionBottom = "\n{{% /admonition %}}\n"
	for _, s := range []string{head, text, admonitionBottom} {
		if _, err := file.WriteString(s); err != nil {
			fmt.Fprintln(os.Stderr, "file write error:", err)
		}
	}
}

// fileWriteDirect writes the quote text and the author without any markup.
func fileWriteDirect(file *os.File, text, author string) {
	for _, s := range []string{text, author} {
		if _, err := file.WriteString(s); err != nil {
			fmt.Fprintln(os.Stderr, "file write error:", err)
		}
	}
}
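The text handling in the `.quoteText` callback can be tried on its own, without any network access. The sketch below assumes the element text looks like the hard-coded `raw` sample (quote body, a `―` separator, then author and title on separate lines); the real markup on goodreads may differ. `splitQuote` is a hypothetical helper, and it swaps the newline replacement for `strings.Fields`, which also collapses runs of spaces.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuote separates a raw element text into quote body and attribution.
// It assumes the two parts are separated by "―" (U+2015, horizontal bar).
func splitQuote(raw string) (text, author string) {
	parts := strings.SplitN(raw, "―", 2)
	text = strings.TrimSpace(parts[0])
	if len(parts) == 2 {
		// strings.Fields splits on any whitespace, so joining with a single
		// space collapses newlines and repeated spaces.
		author = strings.Join(strings.Fields(parts[1]), " ")
	}
	return text, author
}

func main() {
	// Sample input made up for this sketch, not captured from the site.
	raw := "\n  “Sometimes you wake up.”\n  ―\n  Neil Gaiman,\n    Fables & Reflections\n"
	text, author := splitQuote(raw)
	fmt.Printf("%s\n%s\n", text, author)
}
```

If the separator is missing, `splitQuote` returns the whole trimmed text and an empty author, so the caller can decide whether to skip the entry.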