Picking a database for a project

polaris · · 459 views
This is a resource shared some time ago; the information in it may have since evolved or changed.
<p>Hello, I&#39;m learning Go while building a little project on the side. I have to say Go has been awesome to learn in its own right, and an awesome tool for learning how to do more than frontend stuff in JS.</p> <p>I need a db for my project now, and everywhere I look, people say I should pick the right db for the job; only I don&#39;t really know the characteristics of each db. :-(</p> <p>I do know the project though. So maybe if I share it with you awesome people, you can give me some pointers :)</p> <ul> <li>I will not &#34;build&#34; up my own data, starting from nothing.</li> <li>Instead I will get all the data as XML files that I&#39;ll need to parse into structs (roughly once a day).</li> <li>I will never change parts of these structs (say, update the name of some item).</li> <li>There are around 3-4 tables of data that relate to each other.</li> <li>I will need to make queries into the data, like &#34;give me all the red flowers&#34; or &#34;cars with more than 2 doors&#34;.</li> <li>We are talking &gt;1,000,000 entries.</li> <li>... and many queries/sec.</li> <li>Not sure if it matters, but the entry point will be web.</li> </ul> <p>My initial thought was Postgres because of the relations between the data, but since the data is not built (and validated) by me, the performance hit might not be worth it.</p> <p>Next I considered MongoDB because it can handle a lot of data fairly well. But it seems wrong to run back to JS for me.</p> <p>Then, as part of looking into the different dbs, I played around with BoltDB, but a key/value store does not seem right for queries into arbitrary parts of the data. (Would you do that with bloom filters and then loop through every single entry for every single query?)</p> <p>Now, writing this, I wonder if I&#39;m foolish even putting the stuff in a db. Could I work with just the raw .xml files?</p> <p>Any help with this would be awesome. 
I am a little overwhelmed by the number of dbs to pick from.</p> <p>Thanks!</p> <hr/>**Comments:**<br/><br/>very-little-gravitas: <pre><p>Just use PostgreSQL, don&#39;t worry about performance unless you actually see a problem (you won&#39;t). The only hard part will be getting the data in (not so hard: just import, convert to SQL and insert); after that it&#39;ll be very easy and very fast to search through it. You&#39;re correct in thinking that trying to put relational data into a document database will not end well. I think you need the db if you want fast queries over millions of documents. </p></pre>jerf: <pre><p>I&#39;d actually expect performance issues with Postgres, not &#34;it can&#39;t do it&#34; or anything, but just because that is enough data to probably need an index. I recommend <a href="http://use-the-index-luke.com" rel="nofollow">this resource about indexing</a>, as it is designed for devs in pretty much exactly this use case.</p> <p>But Postgres is still the right choice here, and it&#39;s hard to overstate the utility of learning how to use relational databases for this task.</p></pre>brentadamson: <pre><p>I agree w/ the others w/ PostgreSQL. Depending on the data you might want to try PostgreSQL&#39;s jsonb. I would also recommend using an interface so that if you decide to switch databases later, the transition will be much easier. See <a href="https://appliedgo.net/di/" rel="nofollow">https://appliedgo.net/di/</a></p></pre>Golanq: <pre><p>I guess I&#39;ll go with Postgres then. Thanks for the responses, guys. And thanks for the link. 
I need to do some reading up :-)</p></pre>karma_vacuum123: <pre><p>Start out with something well-known like Postgres or MySQL.</p> <p>You will almost certainly never get anywhere close to the performance limits of these tools given a reasonably well-designed schema and reasonable hardware.</p> <p>A decently provisioned MySQL server should scale to thousands of txns per second.</p> <p>It will be more important for you to be able to find good help online and good libraries.</p></pre>Floooof: <pre><p>Give Elasticsearch a shot. Your updates are once per day, so ACID transaction support isn&#39;t a necessity. It sounds like you&#39;re loading this data only to query it, and your query examples are faceted search terms. If you need to do any sort of merging or upsert to append the daily data files to what&#39;s already present, you may want to load the data into a relational database or other intermediate system of record for pre-processing before building your search indexes in Elasticsearch, but if your daily files always contain all the data, just load them into Elasticsearch and your project is mostly done. It&#39;s super easy to cluster (if your data or query volume grows), especially with your once-daily offline write and 100% read workload. This is exactly the kind of workload databases like this were built to handle.</p></pre>Golanq: <pre><p>I seriously thought Elasticsearch was a service and not a db. I&#39;ll do some reading up.</p> <p>Didn&#39;t know the term &#34;faceted search terms&#34;. Thanks! More reading up :)</p> <p>I do need to merge data (if my understanding is correct); say, add the manufacturer to the red cars when querying for them.</p> <p>Yes, the data I get is all of the data, and I shouldn&#39;t need to validate it as such.</p> <p>I&#39;ll do some reading up on Elasticsearch and see where that takes me. 
Sleeping on this question, though, made me realise that precisely because the data gets reloaded every day, it wouldn&#39;t be too difficult to switch later if I pick the &#34;wrong&#34; option.</p></pre>: <pre><p>[deleted]</p></pre>bass_case: <pre><p>What do you mean how is it overwhelming? Do you live under a rock? There have been new dbs popping up every month for the last 2 years. It&#39;s certainly overwhelming.</p></pre>: <pre><p>[deleted]</p></pre>Golanq: <pre><p>I think you are forgetting that I am learning as I go. I find it overwhelming because I&#39;m new to dbs in general. I&#39;ve tried a little Mongo and a little Postgres and MySQL, but always as the frontend guy who would sometimes make my own endpoint to query for something.</p> <p>In this thread alone (with a very specific use case), three different ones have been suggested to me, and one of them (Elasticsearch) I thought was a service and not a db.</p></pre>: <pre><p>[deleted]</p></pre>Golanq: <pre><p>I actually went with MySQL for my previous side project (an RSS reader with a web interface) and I&#39;ve been generally happy with it (except that one time when I had to learn about utf8mb4).</p> <p>I think I&#39;ll go with Postgres now though. It seems like a really good &#34;general purpose&#34; db. It has lots of help and documentation, we use it at work already, and I get to use the excellent sqlx library. :)</p></pre>bass_case: <pre><p>I agree with you, many aren&#39;t a fit. Yet many are. One example of this is choosing a time series db. There are at least 5+ production-ready choices out there.</p> <p>In general there are lots of databases these days for a variety of use cases, and it IS overwhelming to choose from them. Consider the same question from the point of view of choosing the correct ML algorithm for a particular problem. 
Sure, there are various problem families like regression / classification / dimensionality reduction / clustering, but it&#39;s really difficult to choose the algorithm that models your data unless you have a high-level view / knowledge of all of them. For example, inside the classification problem set you have algorithm choices like: Bayes, k-neighbors, SVC, ensemble, kernel approximation, SGD, and linear SVC. Overwhelming!</p></pre>R2A2: <pre><p>MongoDB might be great for this. Document databases are generally a good match for XML. The mgo driver makes persisting structs super easy. You define struct tags just like with XML/JSON.</p></pre>
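The parse-XML-then-load pipeline the thread converges on can be sketched in Go with the standard `encoding/xml` package. This is a minimal illustration only; the `catalog`/`flower` element names and the `Flower` struct fields are invented, since the original post never shows its actual XML schema:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Flower is a hypothetical record type matching one entry in the daily XML dump.
type Flower struct {
	Name  string `xml:"name"`
	Color string `xml:"color"`
}

// catalog mirrors a top-level XML document wrapping all entries.
type catalog struct {
	Flowers []Flower `xml:"flower"`
}

// parseFlowers unmarshals a full XML dump into structs,
// ready to be bulk-inserted into the database of choice.
func parseFlowers(data []byte) ([]Flower, error) {
	var c catalog
	if err := xml.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return c.Flowers, nil
}

func main() {
	dump := []byte(`<catalog>` +
		`<flower><name>rose</name><color>red</color></flower>` +
		`<flower><name>tulip</name><color>yellow</color></flower>` +
		`</catalog>`)
	flowers, err := parseFlowers(dump)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(flowers), flowers[0].Color)
}
```

Since the whole dataset arrives fresh each day, the simplest load strategy is to parse everything, insert into a staging table, and swap it in, which sidesteps the merge/upsert question raised above.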
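brentadamson's advice to put an interface in front of the store (so the database can be swapped later) can be sketched like this. All type and method names here are invented for illustration; a real Postgres-backed implementation would satisfy the same interface using sqlx or database/sql:

```go
package main

import "fmt"

// Car is a hypothetical entry parsed from the daily XML dump.
type Car struct {
	Model string
	Doors int
}

// CarStore is the interface the web handlers depend on; the concrete
// backend (Postgres, Elasticsearch, ...) stays swappable behind it.
type CarStore interface {
	CarsWithMoreDoorsThan(n int) ([]Car, error)
}

// memStore is a trivial in-memory implementation, handy for tests
// before a real database-backed store exists.
type memStore struct{ cars []Car }

func (m *memStore) CarsWithMoreDoorsThan(n int) ([]Car, error) {
	var out []Car
	for _, c := range m.cars {
		if c.Doors > n {
			out = append(out, c)
		}
	}
	return out, nil
}

func main() {
	var store CarStore = &memStore{cars: []Car{
		{Model: "coupe", Doors: 2},
		{Model: "sedan", Doors: 4},
	}}
	hits, _ := store.CarsWithMoreDoorsThan(2)
	fmt.Println(len(hits), hits[0].Model)
}
```

Because handlers only see `CarStore`, switching from the "wrong" database later means writing one new implementation, not rewriting the query sites.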
