Go web page scraper

Hey guys, first I'd like to say thanks to everybody that helped me out late last week. So thanks!

I've got a decent-sized project in mind that I'd like to build with Go, but I'd like to create a really solid base first. For the moment, I'd like to make a web page scraper that grabs specific info out of a specific web page. To give this some context, because I'm terrible at describing things: I want to pull the item name, rarity, capacity, value, and "how to get" text out of this web page: http://monsterhunter.wikia.com/wiki/MH4U:_Item_List and then put them into a comma-delimited text file.

This is just for a personal project so I can gain some experience in the language, and I won't be redistributing this website's info or anything like that.

Do you guys know of any tutorials or articles I should read that would help me create this? All suggestions and tips are seriously greatly appreciated!

I forgot to mention above that I've gone through a couple of tutorials like this one: http://schier.co/blog/2015/04/26/a-simple-web-scraper-in-go.html. The reason I'm posting this question instead of just using Google is that the stuff you guys had was MUCH better than what I found, and hearing multiple opinions and ideas is extremely helpful!

Thanks again so much!

Comments:

gschier2:

Hey there. I'm the author of that blog post.

I wrote that post after finishing http://tour.golang.org/ and watching a few Go talks on concurrency. I personally find the best way of learning is to just try things, so I recommend building a base knowledge of Go concurrency patterns (very useful for a web scraper) and finding some non-go-specific posts on web scraper design (although not required).

It's a pretty open-ended project and there are a lot of different ways to go about it, so be creative and have fun! Also, screwing up the first five times you write it will teach you a lot more than getting it right the first time by reading someone else's tutorials! :D

I'd be happy to help out as well if you have any questions.
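As a rough illustration of the kind of concurrency pattern mentioned above, here is a minimal sketch that fetches several pages at once while capping how many requests run at the same time. The names `fetchAll`, `httpGet`, `result`, and the `limit` parameter are hypothetical and not taken from the blog post; `limit` is assumed to be at least 1.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
)

// result pairs a URL with the body that was fetched for it (or the error).
type result struct {
	URL  string
	Body string
	Err  error
}

// fetchAll calls fetch once per URL, each in its own goroutine, allowing at
// most limit calls to run at the same time. Results keep the input order.
// limit must be >= 1, otherwise the goroutines block forever.
func fetchAll(urls []string, limit int, fetch func(string) (string, error)) []result {
	results := make([]result, len(urls))
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore
	var wg sync.WaitGroup
	for i, u := range urls {
		wg.Add(1)
		go func(i int, u string) {
			defer wg.Done()
			sem <- struct{}{}        // take a slot
			defer func() { <-sem }() // give it back
			body, err := fetch(u)
			results[i] = result{URL: u, Body: body, Err: err} // each goroutine owns one index
		}(i, u)
	}
	wg.Wait()
	return results
}

// httpGet is a plain fetcher that can be passed to fetchAll for real pages.
func httpGet(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("GET %s: %s", url, resp.Status)
	}
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	// A stub fetcher keeps this demo offline; pass httpGet instead to hit real URLs.
	stub := func(u string) (string, error) { return "<html>" + u + "</html>", nil }
	for _, r := range fetchAll([]string{"page-1", "page-2", "page-3"}, 2, stub) {
		fmt.Printf("%s: %d bytes, err=%v\n", r.URL, len(r.Body), r.Err)
	}
}
```

For a single page like the item list this is overkill, but the same shape carries over once the scraper needs to follow links to many pages.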

Elegantmetal:

OK, great, thanks for the advice! Also, thanks for that tutorial, it seriously helped.

Bromlife:

Here's something: http://rockyj.in/2014/12/12/scraping_with_go.html

You might also find this interesting: https://github.com/ernesto-jimenez/scraperboard

Elegantmetal:

Those both look really interesting, thanks!

stone_henge:

My tip is to scrape from the edit page.

Then you can write a simple regular expression to match all the fields, like this.
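A sketch of what such a regular expression could look like in Go follows. The row format used here (`{{Item|name|rarity|capacity|value|how to get}}`) is purely hypothetical, as are the names `Item`, `rowPattern`, and `parseItems`; the real wikitext on the edit page will look different, so the pattern would need adjusting to match it.

```go
package main

import (
	"fmt"
	"regexp"
)

// Item holds the five fields the original post wants to extract.
type Item struct {
	Name, Rarity, Capacity, Value, HowToGet string
}

// rowPattern matches one line of a HYPOTHETICAL template row with five
// pipe-separated fields. It is an assumption for illustration only.
var rowPattern = regexp.MustCompile(
	`(?m)^\{\{Item\|([^|}]*)\|([^|}]*)\|([^|}]*)\|([^|}]*)\|([^|}]*)\}\}$`)

// parseItems returns one Item per line that matches rowPattern and
// silently skips every other line.
func parseItems(wikitext string) []Item {
	var items []Item
	for _, m := range rowPattern.FindAllStringSubmatch(wikitext, -1) {
		items = append(items, Item{m[1], m[2], m[3], m[4], m[5]})
	}
	return items
}

func main() {
	// Made-up sample rows, not real data from the wiki.
	sample := "{{Item|Potion|1|10|66z|Buy from shop}}\n" +
		"some unrelated line\n" +
		"{{Item|Herb|1|10|6z|Gather in the field}}\n"
	for _, it := range parseItems(sample) {
		fmt.Printf("%+v\n", it)
	}
}
```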

everdev:

Fastest performance would be a regex on the HTML response from the server.

Fastest dev would probably be GoQuery: https://github.com/PuerkitoBio/goquery

Lots of tutorials on outputting content to a file. With CSV just be sure to scrub the data for commas before writing to your file.
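One way to sidestep the comma problem is the standard library's `encoding/csv` package, which quotes any field containing a comma, quote, or newline, so the data does not need scrubbing by hand. A minimal sketch, with a hypothetical helper `toCSV` and made-up sample rows:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// toCSV encodes rows as CSV text. Fields that contain commas, quotes, or
// newlines are quoted automatically by csv.Writer.
func toCSV(rows [][]string) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	// WriteAll writes every record and then flushes the writer.
	if err := w.WriteAll(rows); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	rows := [][]string{
		{"name", "rarity", "capacity", "value", "how_to_get"},
		{"Potion", "1", "10", "66z", "Buy from shop, or combine"}, // note the comma
	}
	out, err := toCSV(rows)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

To write straight to a file, the same `csv.NewWriter` call accepts the `*os.File` returned by `os.Create` in place of the `strings.Builder`.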

Simpfally:

Goquery is good, even if it was a bit annoying to use jquery's doc to use goquery.

KimIlYong:

There is a blog post that describes in detail how to use GoQuery to crawl posts from reddit and store them in a database.

http://intogooglego.blogspot.co.at/2015/05/day-7-goquery-html-parsing.html
