<p>I need to write a text indexer which operates similarly to a basic SQL database. I.e., rows/objects of data, and the ability to perform basic queries against that data like <code>columnA == 5 && columnB == 6</code>. The rows will be schemaless as well.</p>
<p>With that said, I've never implemented this sort of project before, and am curious if anyone has any good writeups that might help me get started on this.</p>
<p>Naively, if you were going to implement this, you could iterate over a set of objects, such as might be stored in BoltDB, and check them one by one, ignoring the ones that don't match and returning the ones that do. However this seems naive at best, haha.</p>
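<p>Roughly, I mean something like this minimal sketch (assuming JSON-encoded maps stored in a BoltDB bucket; the bucket and field names are just illustrative, and only comparable values like strings, numbers, and bools are handled):</p>
<pre><code>package naivescan

import (
	"encoding/json"

	"github.com/boltdb/bolt"
)

// matches reports whether a schemaless row satisfies a set of simple
// equality predicates, e.g. {"columnA": 5.0, "columnB": 6.0}.
// Note: encoding/json decodes all JSON numbers into float64.
func matches(row, preds map[string]interface{}) bool {
	for k, want := range preds {
		got, ok := row[k]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// scan walks every record in the bucket, decodes it, and keeps the ones
// that satisfy the predicates — the naive full scan described above.
func scan(db *bolt.DB, bucket string, preds map[string]interface{}) ([]map[string]interface{}, error) {
	var out []map[string]interface{}
	err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte(bucket))
		if b == nil {
			return nil // nothing stored yet
		}
		return b.ForEach(func(k, v []byte) error {
			var row map[string]interface{}
			if err := json.Unmarshal(v, &row); err != nil {
				return err
			}
			if matches(row, preds) {
				out = append(out, row)
			}
			return nil
		})
	})
	return out, err
}
</code></pre>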
<p>Anyway, I realize this is a very open-ended question, quite in the nature of "do my homework for me", but I don't mean it like that at all. Just looking for any pointers from anyone who has experience in dealing with fairly large datasets and query operations like this. A couple of notes:</p>
<ol>
<li>I'm not doing this because I want to, but because all other solutions I've found are either not pure Go, or are unable to index the type of data I have <em>(schema-less maps. E.g., Storm is out, but with some work I could use something like QL)</em>.</li>
<li>I realize the performance of something naively coded is not likely to be up to snuff. However, if I can match through one or two million objects in a sane timeframe, I'll consider that "good enough". Beyond that I would need something non-Go, like SQLite or a proper DB.</li>
</ol>
<p>Thanks to any replies!</p>
<hr/>**Comments:**<br/><br/>lobster_johnson: <pre><p>Your question is odd, since the first and only example in your post is not about text, but some basic integer matching, and you don't even mention text (which is a separate can of worms to just matching by equality — tokenization, ranking, etc.) other than in your first sentence. Are you really indexing text?</p>
<p>If so, have you looked at <a href="https://github.com/blevesearch/bleve" rel="nofollow">Bleve</a>? It's a text-indexing library.</p>
<p>If your data doesn't need to be on disk, consider a column-store-type approach stored in RAM. The cache efficiency potentially allows you to race through huge amounts of data in a short time without having to use complicated indexes such as B-trees.</p></pre>PhoRaptor: <pre><blockquote>
<p>Your question is odd, since the first and only example in your post is not about text, but some basic integer matching, and you don't even mention text (which is a separate can of worms to just matching by equality — tokenization, ranking, etc.) other than in your first sentence. Are you really indexing text?</p>
</blockquote>
<p>Apologies. I'm "indexing" text, integers, bools, time, and possibly a couple of other data formats I'm not thinking of at the moment.</p>
<p>Keep in mind I'm not concerned about full-text indexing, because I planned on using Bleve for that. However, Bleve alone doesn't quite have the set of SQL-like features I need. I've actually already used Bleve in the past for this task, and wasn't able to implement the querying requirements I needed. E.g., IIRC Bleve made some SQL-like queries difficult, but I can't remember which specific operator is missing offhand. With that said, I should probably whip up a quick implementation in Bleve so as to know what is lacking. Perhaps I could augment Bleve more easily than writing my own.</p>
<blockquote>
<p>If your data doesn't need to be on disk, consider a column-store-type approach stored in RAM. The cache efficiency potentially allows you to race through huge amounts of data in a short time without having to use complicated indexes such as B-trees.</p>
</blockquote>
<p>I'm likely going to leave the storage up to something else, i.e. BoltDB, so in-memory may or may not be on the table. Granted, I would like to plan for not-in-memory, since the indexed datasets might be a bit too large.</p>
<p>Your B-tree example is good though; that would have been a far better question to ask: which CompSci data structures lend themselves to searching and filtering large datasets? Seems like that is really what I'm asking.</p>
<p>Appreciate your reply!</p></pre>lobster_johnson: <pre><p>All right, I think I have a better understanding now.</p>
<p>Re BoltDB, keep in mind that it's just a key/value store. It's good for storing data, but not for querying on anything except for a value's key.</p>
<p>It's hard to use a key/value store as a general-purpose index on collections of data. For example, let's say you wanted to index data on an integer column called <code>age</code>. You could use the key <code>age:<value></code>. So for a record with <code>age=42</code>, you'd store it as <code>age:42</code>. Each BoltDB value would be a list of record IDs that matched the value.</p>
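<p>As a rough sketch of what that scheme looks like against BoltDB's actual API (bucket and key names are just illustrative), adding a record to the index might look like this:</p>
<pre><code>package kvindex

import (
	"encoding/json"
	"fmt"

	"github.com/boltdb/bolt"
)

// addToIndex appends recordID to the posting list stored under the key
// age:<value>. Note that the whole list has to be read, decoded,
// appended to, re-encoded and rewritten on every update.
func addToIndex(db *bolt.DB, age int, recordID string) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("index"))
		if err != nil {
			return err
		}
		key := []byte(fmt.Sprintf("age:%d", age))

		var ids []string
		if raw := b.Get(key); raw != nil {
			if err := json.Unmarshal(raw, &ids); err != nil {
				return err
			}
		}
		ids = append(ids, recordID)

		raw, err := json.Marshal(ids)
		if err != nil {
			return err
		}
		return b.Put(key, raw)
	})
}
</code></pre>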
<p>But that presents several problems. One is that in order to update the index (add or remove records), you'll have to rewrite the entire value under the <code>age:42</code> key every time. If there are 1 million records with <code>age=42</code>, that's a huge value to write. Secondly, the lookup becomes extremely slow; there's no way to stream the lookup.</p>
<p>You could partition it into buckets (e.g. <code>age:<bucket-ID>:42</code>), but then you need to keep track of the sizes of the buckets so you know where to insert. It's a can of worms, and probably not the path to good performance.</p>
<p>A system like this would couple tightly to BoltDB's key lookup mechanism. The efficiency of this approach depends on your ability to map values to keys. Of course, for a long time Lucene did pretty well by indexing integers as strings.</p>
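<p>For instance, a hypothetical helper in that spirit (a sketch, not what Lucene actually does): zero-pad integers into fixed-width keys so that BoltDB's sorted keys and cursor give you range scans.</p>
<pre><code>package sortkeys

import (
	"bytes"
	"fmt"

	"github.com/boltdb/bolt"
)

// ageKey encodes an integer as a fixed-width, zero-padded key such as
// "age:0000000042", so lexicographic key order matches numeric order.
func ageKey(age uint32) []byte {
	return []byte(fmt.Sprintf("age:%010d", age))
}

// idsWithAgeAtLeast seeks to age:<min> and walks forward while the key
// still has the "age:" prefix, collecting the stored record IDs
// (one ID per key here, to keep the sketch small).
func idsWithAgeAtLeast(db *bolt.DB, min uint32) ([][]byte, error) {
	var ids [][]byte
	err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("index"))
		if b == nil {
			return nil
		}
		c := b.Cursor()
		prefix := []byte("age:")
		for k, v := c.Seek(ageKey(min)); k != nil && bytes.HasPrefix(k, prefix); k, v = c.Next() {
			// Values are only valid for the life of the transaction, so copy.
			ids = append(ids, append([]byte(nil), v...))
		}
		return nil
	})
	return ids, err
}
</code></pre>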
<hr/>
<p>A better way, in my opinion, would be to use BoltDB as a transactional page management system, and implement your own indexing on top of this. This is roughly how modern databases work. </p>
<p>In this system, your data structures would be serialized to disk in the form of "pages". A single page contains index nodes in some efficient structure, limited by size (e.g. 4K). When saved to BoltDB, they'd be serialized in a compact binary format, and deserialized when read back. (The size limit should be set according to how expensive it is to modify just one entry in the page, which is the worst case. You don't want it to be large.)</p>
<p>If you give each page an ID, the index nodes can then refer to each other by page ID.</p>
<p>You can implement a B-tree structure — or any other indexing data structure — on top of a page system like this. The stupidest algorithm you can do is probably a plain, unbalanced binary tree, for example. Each node has a left/right pointer, which is a <code>[page_id, node_index]</code> tuple. Leaf nodes then have values. Binary trees will eventually get too unbalanced, and aren't efficiently packed. B-trees are pretty much ideal for page-oriented systems because they can pack many leaf nodes together in a single node.</p>
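<p>A rough sketch of what those pages and nodes might look like as Go structs (the field names and page-size cap are assumptions, just to make the shape concrete):</p>
<pre><code>package pageindex

// NodeRef identifies a node by the page it lives in and its position
// within that page — the [page_id, node_index] tuple mentioned above.
type NodeRef struct {
	PageID    uint64
	NodeIndex uint16
}

// Node is one entry of an (unbalanced) binary tree spread across pages.
// Interior nodes use Left/Right; leaf nodes carry the IDs of the
// records that match Key.
type Node struct {
	Key       []byte
	Left      *NodeRef // nil when there is no left child
	Right     *NodeRef // nil when there is no right child
	RecordIDs [][]byte // only set on leaf nodes
}

// Page is the unit of storage: it is encoded to a compact binary form
// (encoding/gob, or a hand-rolled format) and written to BoltDB as a
// single value keyed by ID, capped at a few KB once encoded.
type Page struct {
	ID    uint64
	Nodes []Node
}
</code></pre>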
<p>The benefit of using BoltDB here, and not inventing your own disk storage, is that you get transactions for free. Last I checked, BoltDB implemented an MVCC-type transactional system, which means it's got great concurrency and is crash-proof.</p>
<hr/>
<p>But the reason I mentioned indexing in RAM is that if you use a columnar approach, you can zip through the columns extremely fast — RAM is so fast that you don't necessarily even need "fancy" structures like B-trees, which are complex to implement, especially if you can parallelize queries.</p>
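<p>A minimal sketch of that columnar idea — parallel slices, one per column, scanned in a tight loop (column and predicate names are illustrative):</p>
<pre><code>package columnar

// Store keeps one slice per column; index i across the slices is row i.
type Store struct {
	Age  []int64
	Name []string
}

// Filter returns the row numbers where Age > minAge and Name == name.
// A tight loop over contiguous slices is what makes the cache-friendly
// "race through the data" approach described above work.
func (s *Store) Filter(minAge int64, name string) []int {
	var hits []int
	for i := range s.Age {
		if s.Age[i] > minAge && s.Name[i] == name {
			hits = append(hits, i)
		}
	}
	return hits
}
</code></pre>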
<p>You can even <code>mmap</code> the index as a disk file. This is how kdb, which is one of the world's fastest column databases, works. I believe BoltDB also uses <code>mmap</code> on Unix to put its data files into RAM.</p>
<p>The downside, of course, is that if you don't do something like mmap, you need to reindex every time your program starts. There are some projects that do this, and rely on clustering (redundant nodes) to reduce the chance of a restart incurring any downtime.</p>
<p>(But if you do choose to use disk, don't invent your own storage mechanism.)</p>
<hr/>
<p>Lastly: This is a big topic, and I only lightly brushed the surface of it in the above.</p>
<p>I recommend getting some proper books on this, because there are so many solved problems that have clever solutions that you don't want to reinvent, poorly. I'll see if I can come up with titles for you. I remember <a href="https://www.amazon.com/exec/obidos/ISBN=0321197844/portlandpatternrA/" rel="nofollow">An Introduction to Database Systems</a> as being decent, but it doesn't go into much detail about the physical storage of databases. There's a lot of information online, too, of course; but some older works, still in paper book form, are a lot better.</p>
<p>You'll also want something that goes into relational algebra. Modern systems use relational algebra to split query execution into several phases, creating a logical execution plan that can be optimized into a physical plan. For example, imagine someone queries on <code>age > 42 and name = "bob"</code>. If there's only one person named "bob", and that person can be found with a single index lookup, it would be a bad idea to also run an index scan for <code>age > 42</code>, because the remaining predicates can be evaluated directly on the fetched row.</p></pre>xiegeo: <pre><p>You can use SQLite in Go too, such as <a href="https://github.com/mattn/go-sqlite3" rel="nofollow">https://github.com/mattn/go-sqlite3</a>.</p></pre>PhoRaptor: <pre><p>In Go yes, but not as pure Go, unfortunately. E.g., cross-compiling is troublesome. Oddly enough, there is a pure Go SQLite in the works. :)</p></pre>xiegeo: <pre><p>Only if easy cross-compiling is that important to you. There has been a pure Go version in the works for years, so I wouldn't wait for that.</p>
<p>I too had looked for pure Go solutions in the beginning, but I don't think there is enough benefit for anyone to rewrite SQLite in Go and make it production quality. Personally I don't find cross-compiling critical; if I need to support a platform, then I need to have access to an instance of that platform for debugging anyway.</p></pre>