Faster alternative to strings.Count?

polaris · · 664 次点击

这是一个分享于的资源，其中的信息可能已经有所发展或是发生改变。

Currently my program, which is based on string matching and replacement, spends about 98% of its time in strings.Count(). All of those calls are necessary, but it seems that strings.Count() in Go is very slow. In Python, the same algorithm goes over 5x as fast with s.count(). I've checked the file strings.go, and the strings.Count() code is implemented in Go, and just uses an ordinary string counting algorithm. Is there any faster way? I should also mention that these strings are Unicode so each character / rune may be multiple bytes. Edit: Code for the function declaration... <pre><code>func getels(text string, minlen int, maxlen int, threshold int) map[string]int </code></pre> Edit: ...and the function body... <pre><code> saved := make(map[string]int) width := maxlen left_i := 0 right_i := width-1 runes := []rune(text) text_len := len(runes) var count int var e, x string for width = maxlen; width >= minlen; width-- { x = strings.Repeat("\x1F", width) for right_i < text_len { e = string(runes[left_i:right_i]) if !strings.Contains(e, "\x1F") { count = strings.Count(text, e) if count >= threshold { saved[e] = count text = strings.Replace(text, e, x, -1) left_i += width right_i += width } else { left_i++ right_i++ } } else { left_i++ right_i++ } } width-- left_i = 0 right_i = width - 1 } return saved </code></pre> When fed a large string of Unicode characters, with 500,000+ characters or so, it spends over 99% of its time in strings.Count(). As stated previously, the Python version is shorter and much faster currently. It uses the same algorithm, but there is no rune/string stuff (just Python's native string type). <hr/>**评论：** SportingSnow21: <pre>BUG: Your code isn't necessarily returning what you're expecting it to return, because it runs the risk of an early sub-string overwriting another. Given text="abcdefgdefgabcd", maxlen=4, and threshold=2 as input, the substring "abcd" will clear out the first instance of "defg" simply because it appears earlier in the string. In this case, "defg" won't even be identified, because the <code>strings.Replace()</code> call will overwrite the 'd' and reduce it's sub-string count below the threshold. To be certain that all sub-strings are found, the underlying string shouldn't be modified. Instead, the identified sub-strings should be filtered to adjust the counts of those that are contained within other sub-strings. This lends itself to a concurrent searching go-routines with results and counts stored in a retrieval trie structure. Traversing the resulting trie would allow easy identification of those matches that are wholly contained as a prefix of other matches. Post-fix identification would need a bit more inventive approach, though. Additional note: Slicing a string is the same as converting to a rune slice and slicing that. Therefore, you don't need <code>runes := []rune(text)</code> and <code>e = string(runes[left_i:right_i])</code> can be replaced by <code>e = text[left_i:right_i]</code>. EDIT: Switching from a rune slice to indexing the string is almost a 3x gain. (I'm using your post as the input to the benchmark) <pre><code>BenchmarkRuneSlice-8 30 39231163 ns/op 362502 B/op 6225 allocs/op BenchmarkString-8 100 13277123 ns/op 260645 B/op 124 allocs/op </code></pre></pre>xena-warrior-prince: <pre><blockquote> Additional note: Slicing a string is the same as converting to a rune slice and slicing that. </blockquote> This is incorrect; maybe you're thinking about converting a string to a byte slice. See <a href="https://golang.org/ref/spec#Conversions_to_and_from_a_string_type" rel="nofollow">https://golang.org/ref/spec#Conversions_to_and_from_a_string_type</a> Given <code>s := "aはb"; r := []rune(s)</code>, <code>len(s) = 5</code>, <code>len(r) = 3</code> and <code>s[:2]</code> yields a string with broken utf-8 data.</pre>SportingSnow21: <pre>Ugh, you're right. I was too focused on the pattern interaction and forgot the utf-8 catch. </pre>masterpi: <pre>I can't think of a way to beat linear time for a single call, and the constants seem like they'd be trivially optimized as well, so I'm not really sure what's going on there. It seems like if you're calling it a lot on the same string you could benefit from some precomputation though. Just compute the totals at each index, and then subtract when you want the count between.</pre>jp599: <pre>Yeah, but the problem is that the input could be any Unicode string, and the is modified within the function as well.</pre>xena-warrior-prince: <pre>It might help to share more details about what you're doing or some code. <ul> <li>maybe a better algorithm is what's needed</li> <li>maybe you should be working with <code>[]byte</code></li> <li>maybe you can work with <code>[]rune</code> instead of strings directly</li> <li>maybe a regexp will do</li> <li>maybe you don't actually need to care about unicode at all</li> <li>etc.</li> </ul></pre>FUZxxl: <pre>What is that code supposed to do? Maybe there is a faster algorithm for the complete problem.</pre>jp599: <pre>The algorithm is a brute force way of finding all common substrings within a Unicode string, starting with the largest. It's kind of an ugly problem.</pre>FUZxxl: <pre>It's not particularly hard. I <a href="http://codegolf.stackexchange.com/a/45426/134" rel="nofollow">solved</a> a similar problem a while ago (where I wanted to count the number of unique substrings for each length instead of getting them). A fast solution is this: <ol> <li><a href="https://en.wikipedia.org/wiki/Suffix_array" rel="nofollow">suffix-sort</a> the array in O(n) or O(n log n) and compute an <a href="https://en.wikipedia.org/wiki/LCP_array" rel="nofollow">LCP array</a> at the same time.</li> <li>Observe the differences between each adjacent entry in the suffix sorted array, use the LCP array to cut down the time to find this difference. See the linked post on how to get unique substrings from that.</li> </ol> Due to the properties of UTF-8, you can pretend that the input uses a single-byte encoding. Then throw out all substrings you found that don't begin or end in a full UTF-8 sequence.</pre>FUZxxl: <pre>You could implement a variant of KMP for this purpose. Seems like the Go implementation uses a somewhat naïve algorithm (but that's okay for the intended purpose).</pre>pauldhankin: <pre>You're modifying "text" in the inside loop, but your code never reads text except at the beginning of the function. Did you mean to modify "runes" instead?</pre>elusivegopher: <pre>How many strings are you counting? Would HyperLogLog be an option?</pre>SportingSnow21: <pre>Rather than trying to hack the Count method, you could just put a map/reduce setup together and spread the heavy work across more cores: <pre><code>func getels(text string, minlen int, maxlen int, threshold int) (saved map[string]int) { saved = make(map[string]int) runes := []rune(text) text_len := len(runes) done := make(chan *map[string]int) for w := maxlen; w >= minlen; w-- { //Map go func(width int) { var count int var e string tmp := make(map[string]int) for left_i := 0; left_i+width < text_len-1; left_i++ { e = string(runes[left_i : left_i+width]) if _, ok := tmp[e]; ok { continue } count = strings.Count(text, e) if count >= threshold { tmp[e] = count } } done <- &tmp }(w) } for w := maxlen; w >= minlen; w-- { //Reduce for k, v := range *(<-done) { saved[k] += v } } close(done) for k, v := range saved { //Filter overlaps t := []rune(k) for i := minlen; i < len(t)-1; i++ { for j := 0; j+i <= len(t)-1; j++ { if tv, ok := saved[string(t[j:j+i])]; ok { if v >= tv { delete(saved, string(t[j:j+i])) } else { saved[string(t[j:j+i])] -= v } } } } } for k, v := range saved { //Filter below thresholds if v < threshold { delete(saved, k) } } return saved } </code></pre></pre>jasonmoo: <pre>Are you considering that <code>strings.Count</code> is called within <code>strings.Replace</code> as well? It could as much as double the amount of time spent in <code>Count</code>. <a href="https://golang.org/src/strings/strings.go#L681" rel="nofollow">https://golang.org/src/strings/strings.go#L681</a></pre>gohacker: <pre>1.py: <pre><code>def test(): 'abcdefghijklmnopqrstuvwxyz'.count('no') if __name__ == '__main__': import timeit seconds = timeit.timeit("test()", setup="from __main__ import test", number=20000000) sec_per_op = seconds / 20000000. ns_per_op = 10**9 * sec_per_op print(ns_per_op, "ns/op") </code></pre> <hr/> <pre><code>$ python 1.py 650.573050976 ns/op $ python3 1.py 307.5997867010301 ns/op </code></pre> <hr/> 1_test.go: <pre><code>package main import ( "strings" "testing" ) func Benchmark(b *testing.B) { for i := 0; i < b.N; i++ { strings.Count("abcdefghijklmnopqrstuvwxyz", "no") } } </code></pre> <hr/> <pre><code>$ go test -bench=. testing: warning: no tests to run PASS Benchmark-4 20000000 78.6 ns/op </code></pre> <hr/> Questions?</pre>purpleidea: <pre>op said: <blockquote> I should also mention that these strings are Unicode so each character / rune may be multiple bytes. </blockquote></pre>FUZxxl: <pre>That doesn't make a difference due to how UTF-8 is designed.</pre>gohacker: <pre>Let him/her show their own benchmarks. Talks are cheap.</pre>purpleidea: <pre>it is a good point that op should show code and data</pre>jp599: <pre>I'm using the CPU profiler and it's showing me over 99% of the time is spent in strings.Count(), not in my own code. Any string operation can seem fast when the string is tiny.</pre>SportingSnow21: <pre>Are you reading in as byte slices? Keeping them that way and using bytes.Count() might save you some cycles. I've seen a couple blog posts saying that helped. </pre>WellAdjustedOutlaw: <pre>OP mentioned the strings are Unicode, and therefore may contain multi-byte characters/runes. Byte counting would return an incorrect result, so those blogs are only valid for ASCII characters.</pre>SportingSnow21: <pre><a href="https://golang.org/pkg/bytes/#Count" rel="nofollow">bytes.Count</a> doesn't count bytes, it counts instances of a pattern within the byte slice. ASCII and Unicode are accepted in a byte slice, so the result wouldn't change.</pre>

入群交流（和以上内容无关）：加入Go大咖交流群，或添加微信：liuxiaoyan-s 备注：入群；或加QQ群：692541889

664 次点击

加入收藏微博

slice

python

0 回复

添加一条新回复（您需要登录后才能回复没有账号？）

请尽量让自己的回复能够对别人有帮助
支持 Markdown 格式, **粗体**、~~删除线~~、`单行代码`
支持 @ 本站用户；支持表情（输入 : 提示），见 Emoji cheat sheet
图片支持拖拽、截图粘贴等方式上传

Faster alternative to strings.Count?

用户登录

今日阅读排行

一周阅读排行

最新主题