Go web page scraper

xuanbao · 2015-05-28 03:52:16 · 902 views
This resource was shared on 2015-05-28 03:52:16; the information in it may have since evolved or changed.

Hey guys, first I'd like to say thanks to everybody who helped me out late last week... So thanks!

I've got a decent-sized project in mind that I'd like to make with Go, but I'd like to create a really solid base first. For the moment, I'd like to make a web page scraper that, for now, grabs specific info I want out of a specific web page. To give this some context, because I'm terrible at describing things: I want to take the item name, rarity, capacity, value, and "how to get" text for each item on this web page: http://monsterhunter.wikia.com/wiki/MH4U:_Item_List and then put them into a comma-delimited text file.

This is just for a personal project so I can gain some experience in the language, and I won't be redistributing this website's info or anything like that.

Do you guys know of any tutorials or articles I should read that would help me create this? All suggestions and tips are seriously greatly appreciated!

I forgot to mention above that I've gone through a couple of tutorials like this one: http://schier.co/blog/2015/04/26/a-simple-web-scraper-in-go.html and the reason I'm posting this question instead of just using Google to find info is that the stuff you guys had was MUCH better than what I found, and hearing multiple opinions/ideas is extremely helpful!

Thanks again so much!


Comments:

gschier2:

Hey there. I'm the author of that blog post.

I wrote that post after finishing http://tour.golang.org/ and watching a few Go talks on concurrency. I personally find the best way of learning is to just try things, so I recommend building a base knowledge of Go concurrency patterns (very useful for a web scraper) and finding some non-go-specific posts on web scraper design (although not required).
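For a rough idea of what one such concurrency pattern can look like in a scraper, here is a minimal worker-pool sketch. It is only an illustration, not the approach from the blog post: `fetchAll`, `result`, and `fakeFetch` are made-up names, and the fetch function is injected so the example runs without touching the network. In a real scraper you would presumably pass something built on `http.Get`.

```go
package main

import (
	"fmt"
	"sync"
)

// result pairs a URL with what the fetch function returned for it.
type result struct {
	URL  string
	Body string
	Err  error
}

// fetchAll calls fetch once per URL using at most `workers` goroutines
// and returns the results in the same order as the input slice.
func fetchAll(urls []string, workers int, fetch func(string) (string, error)) []result {
	if workers < 1 {
		workers = 1 // avoid blocking forever when no worker would read jobs
	}
	results := make([]result, len(urls))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is handed to exactly one worker, so writes
			// to results[i] never overlap.
			for i := range jobs {
				body, err := fetch(urls[i])
				results[i] = result{URL: urls[i], Body: body, Err: err}
			}
		}()
	}

	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// fakeFetch is a stand-in for a real HTTP request.
	fakeFetch := func(url string) (string, error) {
		return "<html>" + url + "</html>", nil
	}
	for _, r := range fetchAll([]string{"page-a", "page-b", "page-c"}, 2, fakeFetch) {
		fmt.Println(r.URL, len(r.Body), r.Err)
	}
}
```

Capping the number of workers keeps the scraper from opening too many connections to the target site at once, which is usually the polite thing to do.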

It's a pretty open-ended project and there are a lot of different ways to go about it, so be creative and have fun! Also, screwing up the first five times you write it will teach you a lot more than getting it right the first time by reading someone else's tutorials! :D

I'd be happy to help out as well if you have any questions.

Elegantmetal:

OK, great, thanks for the advice! Also, thanks for that tutorial; it seriously helped.

Bromlife:

Here's something: http://rockyj.in/2014/12/12/scraping_with_go.html

You might also find this interesting: https://github.com/ernesto-jimenez/scraperboard

Elegantmetal:

Those both look really interesting, thanks!

stone_henge:

My tip is to scrape from the edit page.

Then you can write a simple regular expression to match all the fields, like this.
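As a minimal sketch of the regex idea in Go: the row layout below is an assumption made purely for illustration (a wikitext table row with five cells separated by `||`), and the sample rows are made-up data. The markup on the real edit page may well differ, in which case the pattern would need adjusting. `Item`, `rowPattern`, and `parseItems` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Item holds the five fields the original post asks for.
type Item struct {
	Name, Rarity, Capacity, Value, HowToGet string
}

// field captures one cell, trimmed of spaces and tabs, that contains
// no pipe character and no newline.
const field = `[ \t]*([^|\n]+?)[ \t]*`

// rowPattern matches a HYPOTHETICAL wikitext row such as
//   | Potion || 1 || 10 || 66z || Shop
// The real edit page may be laid out differently.
var rowPattern = regexp.MustCompile(
	`(?m)^\|` + strings.Join([]string{field, field, field, field, field}, `\|\|`) + `$`)

// parseItems returns one Item per line that matches rowPattern.
func parseItems(wikitext string) []Item {
	var items []Item
	for _, m := range rowPattern.FindAllStringSubmatch(wikitext, -1) {
		items = append(items, Item{m[1], m[2], m[3], m[4], m[5]})
	}
	return items
}

func main() {
	// Made-up sample input in the assumed format.
	sample := "{|\n|-\n| Potion || 1 || 10 || 66z || Shop\n|-\n| Mega Potion || 2 || 10 || 150z || Combine Potion + Honey\n|}"
	for _, it := range parseItems(sample) {
		fmt.Printf("%+v\n", it)
	}
}
```

Lines that do not have exactly this five-cell shape (table delimiters, row separators) are simply skipped, which is what makes the edit-page source easier to handle with a regex than rendered HTML.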

everdev:

Fastest performance would be a regex on the HTML response from the server.

Fastest dev would probably be GoQuery: https://github.com/PuerkitoBio/goquery

There are lots of tutorials on writing content to a file. With CSV, just be sure to scrub the data for commas before writing to your file.
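On the comma point, one alternative to scrubbing by hand is the standard library's `encoding/csv` package, whose writer quotes any field that contains a comma, a double quote, or a line break. A minimal sketch follows; `toCSV` is a made-up helper name and the rows are sample data. It builds the output in memory so it is easy to inspect; to write a file instead, you could hand the result of `os.Create` to `csv.NewWriter`.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// toCSV renders rows as CSV text. Fields containing commas or quotes
// are quoted by the csv package, so they do not need manual scrubbing.
func toCSV(rows [][]string) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	// WriteAll writes every record and then flushes the writer.
	if err := w.WriteAll(rows); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Made-up sample rows; the second one has a comma inside a field.
	rows := [][]string{
		{"Name", "Rarity", "How to get"},
		{"Potion", "1", "Shop, or combine items"},
	}
	out, err := toCSV(rows)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The field with the embedded comma comes out wrapped in double quotes, so a CSV reader on the other end still sees three columns.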

Simpfally:

Goquery is good, even if it was a bit annoying to have to rely on jQuery's docs to figure out how to use goquery.

KimIlYong:

There is a blog post that describes in detail how to use GoQuery to crawl posts from Reddit and store them in a database.

http://intogooglego.blogspot.co.at/2015/05/day-7-goquery-html-parsing.html

