What is everyone doing for batch inserts?

I'm at a point in an application where I have to insert around 15,000 entries into each of two different tables, one after the other.

I could do this via pipelines or various other methods, but the main question I have is: how does everyone handle gigantic database inserts where you need an error back for each individual row that fails?

My current solution for one of the inserts is this:

```go
type Req struct {
	FieldA, FieldB, FieldC, FieldD, FieldE, FieldF string
}

type Result struct {
	ID int
}

type InsertResult struct {
	LastInsertedID int64
	Error          error
}

// InsertRequests starts one goroutine (and one transaction) per row and
// streams per-row results back on a channel.
func (d *DB) InsertRequests(reqs []Req) <-chan InsertResult {
	outChan := make(chan InsertResult, 500)
	go func() {
		defer close(outChan)
		wg := &sync.WaitGroup{}
		for _, req := range reqs {
			wg.Add(1)
			go func(req Req) {
				defer wg.Done()
				tx, err := d.db.Begin()
				if err != nil {
					outChan <- InsertResult{Error: err}
					return
				}
				res, err := tx.Exec(
					"INSERT INTO table (a, b, c, d, e, f) VALUES (?, ?, ?, ?, ?, ?)",
					req.FieldA, req.FieldB, req.FieldC, req.FieldD, req.FieldE, req.FieldF,
				)
				if err != nil {
					tx.Rollback()
					outChan <- InsertResult{Error: err}
					return
				}
				id, err := res.LastInsertId()
				if err != nil {
					tx.Rollback()
					outChan <- InsertResult{Error: err}
					return
				}
				tx.Commit()
				outChan <- InsertResult{LastInsertedID: id}
			}(req)
		}
		wg.Wait()
	}()
	return outChan
}
```

It's fast, but it feels like I could murder the database (MySQL) this way. I'd love to hear some input on better ways.

Thanks.

Edit --

The max batch size I used was 2k, the minimum was 20. The goroutine limit was 30 for any given request.

Since I needed, for this case, at least "relative failure" feedback rather than exact per-row errors, I found a way that fits my needs:

- Take the batch (sample size of 60k) and do the math on the parameters per row (8) so each batch fits within the limit, since the real constraint is apparently the total parameter count, not the row count.
- Split the rows into right-sized batches (2k each) and run them on goroutines (while limiting the number of goroutines so they don't take ALL of the CPU and block other requests).
- If any batch fails, recursively run the same function with that batch split into 5 smaller batches. Keep doing this until the batch size is smaller than the minimum (20). At the end you'll have around 5 entries fail, one of which is the actually bad row. For this use case, that works.

With retries and tail recursion (or the Go equivalent), I was able to insert 60k entries in 12 seconds on my MacBook talking to a small MySQL dev instance, and it didn't kill the database, hog the CPU, or bring the database down.

---

**Comments:**

**eyesoftheworld4:**

What you're describing isn't really a batch insert. You're just inserting one record at a time, with a transaction for each insert. This is going to be significantly slower than an actual batch insert, for example a single multi-row statement executed with `db.Exec`, which accepts variadic arguments for the values passed to the statement. That way you can run some tests to find the optimal batch size (I'd probably start at 2.5k/5k) to insert your data as quickly as possible. (A rough sketch of this is below.)
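A minimal sketch of that multi-row approach, reusing the `Req` struct from the post above; the `requests` table, its columns, and the helper name `insertBatch` are made up for illustration and are not from the thread:

```go
import (
	"database/sql"
	"strings"
)

// insertBatch builds one INSERT statement with a "(?, ?, ?, ?, ?, ?)" value
// group per row and executes it in a single round trip. A failure here fails
// the whole batch rather than a single row.
func insertBatch(db *sql.DB, reqs []Req) error {
	if len(reqs) == 0 {
		return nil
	}
	placeholders := make([]string, 0, len(reqs))
	args := make([]interface{}, 0, len(reqs)*6)
	for _, r := range reqs {
		placeholders = append(placeholders, "(?, ?, ?, ?, ?, ?)")
		args = append(args, r.FieldA, r.FieldB, r.FieldC, r.FieldD, r.FieldE, r.FieldF)
	}
	query := "INSERT INTO requests (a, b, c, d, e, f) VALUES " +
		strings.Join(placeholders, ", ")
	_, err := db.Exec(query, args...)
	return err
}
```

Note that MySQL caps a prepared statement at 65,535 placeholders and caps the statement size at `max_allowed_packet`, which is why the workable batch size ends up being driven by rows times parameters rather than by row count alone, as the OP found.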
I would ask: why do you need an error value for each individual record insert? I would suggest that you ensure your data is "clean" before inserting it, or write database triggers to clean your data on insert so it meets whatever standards you require.

Another advantage of this approach is that within a transaction you get "all or nothing" behavior: either all of your records go in, or none do. So if a batch fails to insert, you can run some standardization process on your data and then attempt to insert it all again without worrying about duplicates.

Edit: here's an example from Stack Overflow: https://stackoverflow.com/a/25192138

**natdm:**

You're so right about testing for an optimal batch size. With 60k rows, 2k was the optimal size... I started out at 5k.

Since I needed, for this case, at least "relative failure" feedback rather than exact per-row errors, I found a way that fits my needs.

Take the batch (sample size of 60k) and do the math on the parameters per row (8) so each batch fits within the limit, since the real constraint is apparently the parameter count, not the row count.

Split the rows into right-sized batches (2k each) and run them on goroutines (while limiting the number of goroutines).

If any batch fails, recursively run the same function with that batch split into 5 smaller batches. Keep doing this until the batch size is smaller than the minimum (20). At the end you'll have around 5 entries fail, one of which is the actually bad row.

Even with retries and tail recursion (or the Go equivalent), I was able to insert 60k entries in 12 seconds, and it didn't kill the database or hog the CPU.

**jackielii:**

For Postgres and MSSQL there are `CopyIn` APIs for doing database-specific bulk inserts:

https://godoc.org/github.com/lib/pq#hdr-Bulk_imports

https://godoc.org/github.com/denisenkom/go-mssqldb#CopyIn

https://github.com/denisenkom/go-mssqldb/blob/master/examples/bulk/bulk.go

**R2A2:**

I agree, this looks a little greedy on resources. We use a single transaction with multi-row inserts batched into groups.

We do something similar to this, but with batching (MySQL has limits on the number of parameters allowed in parameterized queries): https://stackoverflow.com/a/21112176/303698

Make sure to write some db-enabled tests so you can optimize batch sizes for performance.

**natdm:**

Users could put in a duplicate entry, a duplicate of something already in the database.

**eyesoftheworld4:**

If that's your concern, then instead of trying to handle it in Go, check whether your database has a standard way to perform an action when there's a duplicate key, for [example in PostgreSQL](https://www.postgresql.org/docs/9.5/static/sql-insert.html): `ON CONFLICT DO [ACTION] ...`, where the action can be an update of the row in question, or do nothing. [MySQL also has an implementation](https://dev.mysql.com/doc/refman/5.7/en/insert-on-duplicate.html) of this (a rough sketch is below).
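A minimal sketch of the MySQL variant (`INSERT ... ON DUPLICATE KEY UPDATE`), again reusing the `Req` struct from the post and assuming a hypothetical `requests` table with a unique key on column `a`; this illustrates the suggestion and is not code from the thread:

```go
import "database/sql"

// upsertRequest lets the database handle duplicates: if a row with the same
// unique key already exists, it is updated in place instead of the insert
// failing. Table and column names are illustrative only.
func upsertRequest(db *sql.DB, r Req) error {
	_, err := db.Exec(
		`INSERT INTO requests (a, b, c, d, e, f) VALUES (?, ?, ?, ?, ?, ?)
		 ON DUPLICATE KEY UPDATE
		     b = VALUES(b), c = VALUES(c), d = VALUES(d),
		     e = VALUES(e), f = VALUES(f)`,
		r.FieldA, r.FieldB, r.FieldC, r.FieldD, r.FieldE, r.FieldF,
	)
	return err
}
```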
Your Go code shouldn't need to care about data rules that the database can take care of; that's what databases are good for. Your Go code should just say "I need this data in the database," and then the database can take care of duplicates and conflicts. Unless, of course, you need to give the user immediate feedback about the data they're trying to input, but since you're processing many rows at once, it doesn't seem like that's the case.

**schumacherfm:**

Another idea:

Create a CSV file on disk or in memory and have MySQL read that CSV file in. That is the fastest way to insert data.

You can create a "shadow table" to read the data into, then replace the old table with the new one via a rename-table query. Rename queries are atomic. (A rough sketch is below.)
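A minimal sketch of that idea using `database/sql` with the go-sql-driver/mysql driver, which supports `LOAD DATA LOCAL INFILE` once the file is whitelisted via `mysql.RegisterLocalFile` (or `allowAllFiles=true` is set in the DSN); the file path, table names, and column list here are assumptions for illustration, not from the thread:

```go
import (
	"database/sql"

	"github.com/go-sql-driver/mysql"
)

// loadCSVIntoShadowTable bulk-loads a CSV into a shadow table, then swaps it
// with the live table atomically via RENAME TABLE. Names are illustrative.
func loadCSVIntoShadowTable(db *sql.DB, csvPath string) error {
	// Whitelist the file for LOCAL INFILE (required by go-sql-driver/mysql
	// unless allowAllFiles=true is set in the DSN).
	mysql.RegisterLocalFile(csvPath)
	defer mysql.DeregisterLocalFile(csvPath)

	// LOAD DATA does not accept a parameter placeholder for the file name,
	// so the (trusted) path is interpolated directly into the statement.
	query := "LOAD DATA LOCAL INFILE '" + csvPath + "' INTO TABLE requests_shadow " +
		"FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' " +
		"(a, b, c, d, e, f)"
	if _, err := db.Exec(query); err != nil {
		return err
	}

	// RENAME TABLE is atomic; requests_old can be dropped afterwards.
	_, err := db.Exec("RENAME TABLE requests TO requests_old, requests_shadow TO requests")
	return err
}
```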
