Parsing HTML in Go

nop4ss · 2015-08-06 20:00:07

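Two libraries work well for this.

The first is goquery:

https://github.com/PuerkitoBio/goquery

The second is html:

golang.org/x/net/html (originally published at http://code.google.com/p/go.net/html)

html is an HTML parser: it turns HTML text into a parse tree. goquery builds on the html package and combines it with the cascadia package (a CSS selector library) to provide jQuery-like functionality, which makes working with HTML very convenient.

goquery is used to find and select HTML nodes. To modify or delete the selected nodes, however, you need to work with the html package more directly.

The html package parses HTML text into a tree made up of many Node values, and working with HTML comes down to operating on those nodes.

A few examples illustrate this: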

doc, err := goquery.NewDocument("http://sports.sina.com.cn")

This creates a goquery document, doc.

The goquery function used most often is Find. Like jQuery's $(), it selects parts of the DOM.

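Example 1:

dhead := doc.Find("head")
dcharset := dhead.Find("meta[http-equiv]")
charset, _ := dcharset.Attr("content")

This example looks up the page's charset: it reads the content attribute of the meta http-equiv tag inside head, which typically holds a value such as text/html; charset=gb2312.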
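The content attribute usually carries the full media type rather than the bare charset name, so one more step is needed to isolate it. Here is a minimal sketch of that step using only the standard library's mime package. The helper name charsetFromContent is made up for illustration:

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// charsetFromContent extracts the charset parameter from the value of a
// meta tag's content attribute, e.g. "text/html; charset=GB2312".
// It returns "" when the value cannot be parsed or names no charset.
func charsetFromContent(content string) string {
	_, params, err := mime.ParseMediaType(content)
	if err != nil {
		return ""
	}
	// Parameter names are lower-cased by ParseMediaType; values are not.
	return strings.ToLower(params["charset"])
}

func main() {
	fmt.Println(charsetFromContent("text/html; charset=GB2312")) // gb2312
}
```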

Example 2:

logo := doc.Find("#retina_logo")

This selects an element by its HTML id.

Example 3:

bread := doc.Find("div.blkBreadcrumbLink")

This selects the div elements in doc whose class is blkBreadcrumbLink.

Example 4:

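var faceImg string    // src of the image in the first wrapper
var innerImg []string // src of every image found

// dom_body is not defined in this snippet; it is assumed to be a selection
// holding the page body, for example doc.Find("body").
dom_body.Find("div.img_wrapper").Each(func(i int, s *goquery.Selection) {
	imgpath, exists := s.Find("img").Attr("src")
	if !exists {
		return
	}

	// faceImg is only set when the first wrapper contains an image.
	if i == 0 {
		faceImg = imgpath
	}
	innerImg = append(innerImg, imgpath)
})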

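This finds every div whose class is img_wrapper, searches each one for an img, and reads the img's src.

Example 5:

dom_node := doc.Find("[bosszone='ztTopic']").Find("a")

This finds elements by attribute name and value, then selects the a elements inside them.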

To edit the HTML you need to work with html.Node. Here is a recursive function that cleans up a div:

// clear_dom walks the children of pn. It removes script, style, and comment
// nodes, recurses into span elements, and hands every other element and
// text node to convert_dom for transcoding.
// Needs the "strings" package and golang.org/x/net/html.
func clear_dom(pn *html.Node, isgb2312 bool) error {
	for nd := pn.FirstChild; nd != nil; {
		// Save the next sibling first: RemoveChild resets nd.NextSibling.
		next := nd.NextSibling

		switch nd.Type {
		case html.ElementNode:
			switch strings.ToLower(nd.Data) {
			case "script", "style":
				pn.RemoveChild(nd)
			case "span":
				// Propagate errors from the recursive call as well.
				if err := clear_dom(nd, isgb2312); err != nil {
					return err
				}
			default:
				// "a" and all remaining elements are transcoded.
				if err := convert_dom(nd, isgb2312); err != nil {
					return err
				}
			}
		case html.CommentNode:
			pn.RemoveChild(nd)
		case html.TextNode:
			if err := convert_dom(nd, isgb2312); err != nil {
				return err
			}
		}

		nd = next
	}

	return nil
}

convert_dom, which is not shown here, transcodes the text of a node. If you do not need transcoding, you can leave those calls out.

// Nodehtml renders n and its subtree back to HTML markup.
// Needs the "bytes" package and golang.org/x/net/html.
// The error returned by html.Render is ignored here.
func Nodehtml(n *html.Node) string {
	var buf bytes.Buffer
	html.Render(&buf, n)
	return buf.String()
}

// Nodetext returns the concatenated text of node and all its descendants.
func Nodetext(node *html.Node) string {
	if node.Type == html.TextNode {
		// Keep newlines and spaces, like jQuery
		return node.Data
	}

	var buf bytes.Buffer
	for c := node.FirstChild; c != nil; c = c.NextSibling {
		buf.WriteString(Nodetext(c))
	}
	return buf.String()
}

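The two functions above return a node's HTML and its text, respectively. The difference is that the HTML is the markup itself, tags included, while the text is only the content that the markup wraps. For example, for an HTML snippet that wraps the word "example" in tags, the text is just "example".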

Source: OSChina blog (开源中国博客)

Thanks to the author: nop4ss

Original article: go语言解析html

2 replies | until 2018-03-11 17:48:01

qkb_75_go · #1 · 10 years ago

Are you planning to build a web crawler?

Wusuluren · #2 · 7 years ago

Here is a simple library that imitates jQuery: https://github.com/Wusuluren/gquery
