I wrote an Image Scraper in go.
I tried to implement concurrency.
I would love to know what I can improve/best practices.
Comments:
SeerUD: I'd only suggest you create your own http.Client and set a read timeout; otherwise you might see some very long wait times while connections just dangle. The default client will literally wait forever.
Djent_: 1) Definitely take a look into enabling gofmt or goimports support in your editor (if available), or get something that can trigger it on save. I can see some style issues in your code (mixed indentation, for example).
2) Local variables usually use camelCase, not PascalCase.
3) Comments like these are pointless:
var wg sync.WaitGroup // Used for waiting for channels
Also, you're not waiting for the channels here, you're waiting for the goroutines. You could also use channels to do the waiting; receiving from a channel blocks too.
4) In DownloadImg you're deferring right at the end of the function. You don't need the defer there; putting it at the top would make sense, though, from both a readability and a use-of-defer standpoint.
5) Could you replace resp.Find("*") with resp.Find("img")? It looks like it could speed things up.
6) Have you considered how testable this code is? It can be very tricky when working with IO, but I've recently learned it's quite an important area to learn how to test effectively.
7) Do the functions you've exposed need to be exposed? Probably not.
readytoruple: Your variable names could use some work, specifically in SliceUniq and when you dispatch your downloaders. Try renaming "l" to "min" and "i" to "max" there. As others have mentioned, run gofmt.
Djent_: Cool. One thing that jumps out: you are currently waiting on your slowest URL to finish scanning before you start downloading. Why not dispatch the downloaders as soon as img URIs are scanned?
nnunley: They are checking that all URLs are unique once they have all been crawled, so they can't start downloading until they know a URL is unique. However, they could check each URL before adding it to Images, which may be more efficient.
poxopox: Why not use a map of URL to boolean, and use that to gate whether to fetch the image or not?
https://blog.golang.org/go-maps-in-action
Then you can have one goroutine scraping URLs while you have another (or more) fetching the images. Set the map entry for the src to true just before fetching (you might want a mutex to protect the access).
You should probably handle that first error immediately after you call that function.
