Concurrent File Processing

Hi,

Is there a good example of how to implement concurrent file readers? The examples I see on the Internet open the files, do something on the fly, and are done. I need to store the data in a slice or map, for example.

In my specific case, I have several XML files that take some time to read; they are mapped to structs.

Thanks

---

**Comments:**

**tv64738:** Isn't this just an instance of https://blog.golang.org/pipelines ?

**prvst:** OK, I'll take a look. Thanks.

**justinisrael:** https://gobyexample.com/worker-pools

**prvst:** That looks promising, thanks. I'll take a look.

**residentbio:** Do you mean reading multiple files, or one file with many processes?

Edit: What kind of coordination do you expect after the files are read?

**prvst:** I meant reading multiple XML files, converting them to structs, and storing them in a map or slice.

**SeerUD:** I've done something similar before with tarring multiple directories at once, up to a limit (usually the number of CPU cores Go has access to). You just make several readers. This is the function in question: https://github.com/SeerUK/foldup/blob/dc1fc2e13149cdd8d4695d8a5a89b8c4cf404489/pkg/archive/archive.go#L33

**prvst:** Thanks, I'll take a look.

**prvst:** Sorry, I don't understand what you mean by coordination. I just want a way to read files in parallel/concurrent mode and to be able to store the processed data somewhere. The files are independent from each other.

**Creshal:** As /u/tv64738 mentioned, this is a case for https://blog.golang.org/pipelines

You'd collect all the results in the `main()` function, and then do whatever you need to do with them.

**prvst:** OK, I'll take a look. Thanks.

**tmornini:** Loop over the file pathnames/URLs. Put each one into a channel that is also passed to N goroutines. Each of those ranges over the channel and processes the pathname/URL. Not sure how you plan to use the results, or I'd continue :-)

**prvst:** I keep finding similar implementations, but I can't find an example showing how to store the processed data.

**SeerUD:** Where do you want to store it? If it's in a database, or a file, you could throw the results into a channel, return that from some function, and then consume it in another function that handles storing the results.

**prvst:** Some sort of list, like a slice or a map.

**SeerUD:** Well, a channel would still work for that. Maps and slices aren't safe for concurrent writes, so if you push all of the results into a channel, they can be consumed from the channel by a single goroutine very quickly, and you don't need to bother with things like mutexes either.

So, you have this channel of results, and you just loop over it when you've got all of them. That channel would have been filled with values by several goroutines before that point, concurrently (i.e. when you're processing the XML).
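A minimal sketch of the shape SeerUD describes: one goroutine per file pushes its parsed struct into a shared channel, and a single loop drains it into a slice. The `Result` struct, its field tag, and the file names are placeholders for whatever the real XML maps to:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
	"os"
	"sync"
)

// Result is a hypothetical stand-in for the struct the XML unmarshals into.
type Result struct {
	Name string `xml:"name"`
}

func main() {
	paths := []string{"a.xml", "b.xml", "c.xml"} // hypothetical input files

	results := make(chan Result)
	var wg sync.WaitGroup

	// One goroutine per file pushes its parsed struct into the shared channel.
	for _, path := range paths {
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			data, err := os.ReadFile(path)
			if err != nil {
				log.Println(err)
				return
			}
			var r Result
			if err := xml.Unmarshal(data, &r); err != nil {
				log.Println(err)
				return
			}
			results <- r
		}(path)
	}

	// Close the channel once every reader is done, so the range below ends.
	go func() {
		wg.Wait()
		close(results)
	}()

	// A single goroutine (main) drains the channel into a slice: no mutex needed.
	var all []Result
	for r := range results {
		all = append(all, r)
	}
	fmt.Printf("parsed %d files\n", len(all))
}
```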
**tmornini:** Store it where?

**prvst:** Some sort of list, like a slice or a map.

**tmornini:** In `main`, a pathname channel and N struct channels are created.

N async deserializer goroutines are started, each with the pathname channel and a struct channel of its own.

When all pathnames have been channeled, `main` closes the pathname channel.

Each async deserializer puts each struct into its own struct channel, then closes that channel after its range loop, which ends on the next iteration after the pathname channel is closed.

`main` ranges over each struct channel and stuffs the structs into the final slice or map.

Hope that helps.
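A rough sketch of that layout, with the same caveats as above (`Result` and `loadXML` are hypothetical stand-ins for the real struct and deserializer; N is taken to be the CPU count):

```go
package main

import "runtime"

type Result struct{} // hypothetical struct the XML maps to

// loadXML is a stand-in for the real deserializer.
func loadXML(path string) Result { return Result{} }

// collectAll wires up the pattern described above: one pathname channel
// feeding N deserializer goroutines, each owning one struct channel.
func collectAll(pathnames []string) []Result {
	paths := make(chan string)
	outs := make([]chan Result, runtime.NumCPU())

	for i := range outs {
		out := make(chan Result)
		outs[i] = out
		go func() {
			defer close(out) // close own struct channel once its range loop ends
			for p := range paths {
				out <- loadXML(p)
			}
		}()
	}

	go func() {
		for _, p := range pathnames {
			paths <- p
		}
		close(paths) // all pathnames channeled; the workers' range loops terminate
	}()

	// main ranges over each struct channel and stuffs the structs into the final slice.
	var all []Result
	for _, out := range outs {
		for r := range out {
			all = append(all, r)
		}
	}
	return all
}

func main() {
	_ = collectAll([]string{"a.xml", "b.xml"})
}
```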
**residentbio:** OK, let me try to figure out what you want.

You have N files on a hard drive. They could also be behind some REST API (you read them over the Internet).

Since there are many files, you could use one goroutine per file. That way you could read N files at the same time.

I asked whether, after the files are read, they interact with each other via some sort of business logic. The reason is that you could keep each goroutine alive and keep processing the data as you see fit for each file. "do something on the fly and they are done" led me to believe this is the case.

If, for example, you want to show the end user that everything is done once every file has been handled, you could use a WaitGroup: the final response happens when all goroutines have finished.

All in all, read about:

* goroutines
* WaitGroups
* channels

It seems like you just need many goroutines fired inside a loop. That is the basics of goroutines, and I would advise you to go read about them.

**prvst:** Thanks for the detailed commentary.

What I have is a series of files (i.e. XML, kinda big), and the program needs to read them and put the mapped structs in a list or map. There is no interaction between them at this point. After all the XML is parsed, the structs will be passed to other functions that show the user the results or apply some other logic to them.

What I have now: my test data consists of 24 XML files. I have a loop that goes one by one, reads each file, gets it "unmarshalled" (sorry, not sure if this expression exists), and appends the result to a list. This takes about 15 min (some of the fields need to be transformed, but that is already part of the function that works on the XML file).

My ultimate goal is just to speed up the process by having them processed simultaneously.

**residentbio:** OK, here is how I would do it if I were you.

Have a slice of strings with the paths to the files you are going to read, and loop over it:

```go
for _, path := range paths {
	go func(path string) {
		loadXML(path)
	}(path) // pass path as an argument so each goroutine gets its own copy
}
```

That just launches a goroutine per file, which should really speed up the loading. But here comes the tricky part.

Is this a background job, where the user gets no feedback? Then I would just add the business logic inside `loadXML` and be done with it. You end up processing the 24 files concurrently.

If not, I see two ways about it. The simple one is to use a WaitGroup:

```go
func loadXML(path string) Structure {
	var s Structure
	// unmarshalling code goes here
	return s
}

var wg sync.WaitGroup
results := make([]Structure, len(paths)) // one slot per file, so no append race and no mutex

for i, path := range paths {
	wg.Add(1)
	go func(i int, path string) {
		defer wg.Done()
		results[i] = loadXML(path) // each goroutine writes only its own index
	}(i, path)
}

wg.Wait() // the program waits for all the goroutines to finish
// After this you have all your structures in memory and can decide what to do with them.
```

The more "complex" one is to use channels, which would allow you to get even more performance: you would spawn one routine to unmarshal the data, and then spawn another to apply the business logic. But then you need to coordinate the routines to give feedback to the user (or to the program itself) that all the processing is done; see the sketch after this comment for the rough shape.

See if the basic setup works for you before trying the complex method. Use this as a guide: https://nathanleclaire.com/blog/2014/02/15/how-to-wait-for-all-goroutines-to-finish-executing-before-continuing/ — I know I did a few months ago.
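For the "complex" channel option, a tentative two-stage sketch; `loadXML` and `applyLogic` are hypothetical stand-ins for the unmarshalling and the business logic:

```go
package main

import (
	"fmt"
	"sync"
)

type Result struct{} // hypothetical struct the XML maps to

func loadXML(path string) Result { return Result{} } // stage 1 stand-in: unmarshal one file

func applyLogic(r Result) Result { return r } // stage 2 stand-in: the business logic

func main() {
	paths := []string{"a.xml", "b.xml"} // hypothetical inputs

	// Stage 1: one goroutine per file unmarshals and feeds the next stage.
	parsed := make(chan Result)
	var wg sync.WaitGroup
	for _, path := range paths {
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			parsed <- loadXML(path)
		}(path)
	}
	go func() {
		wg.Wait()
		close(parsed) // lets the stage-2 range loop end
	}()

	// Stage 2: a single goroutine applies the business logic as structs arrive.
	done := make(chan []Result)
	go func() {
		var out []Result
		for r := range parsed {
			out = append(out, applyLogic(r))
		}
		done <- out
	}()

	// The receive here is the "all processing is done" feedback.
	results := <-done
	fmt.Printf("processed %d files\n", len(results))
}
```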
**residentbio:** All in all, the biggest gain I see so far is for you to read the 24 files concurrently.

**cheemosabe:** Unless you're reading them from independent drives, you'll be I/O-bound and concurrency won't help you.

**Creshal:** You'd be surprised how many CPU cycles you can sink into parsing (needlessly) complex XML documents.

And then you have shit like remote XSLT references that need to be downloaded, which is I/O-bound but can be parallelized in most cases (latency-bound, not bandwidth-bound).

**Redundancy_:** It's also worth mentioning that having multiple read requests queued up and waiting allows the disk to keep working, rather than stopping while you do something with each file.

**tmornini:** What if the files are on S3? :-)