Web Scraper Help!

agolangf · 790 views
Hello Reddit, here's our problem: I've been assigned the task of creating a .csv file that updates itself automatically. I have to download a .csv file and add its contents to a master list, and this update must run every day at a specific time. I have to complete the task in Go, and my knowledge of Go is limited. What I need help with: I believe I need a web scraper to obtain the download link for the .csv file. How should I go about this? Thanks!

---

**Comments:**

**Dummies102:** You probably want to use something like https://github.com/PuerkitoBio/goquery

**tclineks:** It sounds like he's consuming CSV (not scraping HTML).

**jamra06:** I think OP is under the impression that he needs to scrape the web page to find the CSV download link. I would personally try to find the direct download link to the CSV manually and then use that, unless of course it isn't consistent.

**steakholder69420:** Thanks, that really helps!

**tclineks:** I'd recommend reading some introductory material, then looking into how to fetch resources over HTTP and how to read and write CSV files. The standard library has packages for both (specifically, net/http and encoding/csv).

**steakholder69420:** Thanks so much! Sounds like a great plan!

**raff99:** Does the download link change every day? Is it something you can guess without having to parse the containing page? Maybe prototype it with a couple of curl commands (fetch the containing page, fetch the CSV), then write down the exact steps you need to execute (you mention a "master list"; is that another CSV file, a database, or something else?), and then ask how to implement those steps in Go (the first two are basic HTTP requests, for which net/http has examples, as others suggested).

**steakholder69420:** Thanks a lot for your help! I'm pretty sure the download link is constant. The master list is another CSV file that has to be updated based on changes in other CSV files from the Internet.

**nyoungman:** If that's the case, you won't need goquery or golang.org/x/net/html (which underlies goquery). Just hard-code the URLs to the CSV files, or pass them as command-line arguments.

Just download the file using an http.Client. You will likely want to os.Open a file and then io.Copy the response Body into it.

Are you updating the existing CSV file in place?

To make it a little more robust, download the file to a temporary location (use ioutil.TempFile) and only overwrite the previous CSV if there were no errors (use os.Rename with the Name() of the temp file and the final location).

