<p>Hello, I've been trying to improve performance of some code I wrote a while back,
a write-up about the problem and some optimization ideas I tried to work with can be found here: <a href="http://www.geeksforgeeks.org/find-the-smallest-positive-number-missing-from-an-unsorted-array/">http://www.geeksforgeeks.org/find-the-smallest-positive-number-missing-from-an-unsorted-array/</a></p>
<p>First - I am not educated in math and barely understand what O(n<sup>2)</sup> means.</p>
<p>I have implemented three of the methods suggested;</p>
<ol>
<li><p>SolutionBinarySearch(), uses a naive binary search and should have a time complexity of O(n<sup>2)</sup></p></li>
<li><p>SolutionLinear(), uses a linear search and should be O(nLogn + n)</p></li>
<li><p>SolutionHashing(), uses hashing, should be O(n)</p></li>
</ol>
<p>According to the article, they should perform like this:</p>
<ul>
<li>SolutionBinarySearch - slowest</li>
<li>SolutionLinear - faster than binary search</li>
<li>SolutionHashing - faster than linear search</li>
</ul>
<p>However, benchmarking my implementation shows</p>
<pre><code>$ go test -bench=. -benchtime=10s
BenchmarkSolutionLinear-8 5000 2781682 ns/op
BenchmarkSolutionHashing-8 200000 86691 ns/op
BenchmarkSolutionBinarySearch-8 1000000 18583 ns/op
</code></pre>
<p>This confuses me, it seems the binary search is fastest (it should be slowest?), and linear is super slow, it should be fast.</p>
<p>I am obviously doing this wrong, so I am just curious if anyone has ideas?</p>
<p>Here is my application: <a href="https://play.golang.org/p/SsmgzB3EVz">https://play.golang.org/p/SsmgzB3EVz</a></p>
<hr/>**评论:**<br/><br/>ImLopshire: <pre><p>Algorithmic complexity begins to matter when n is very large. In many cases, especially when n is small, a simpler solution with less overhead can be much faster.</p>
<p>I notice your benchmarks are only using a slice of 50k items. This makes your n 50k. Try increasing n substantially. There will likely be a value of n when the less algorithmic complexity solution is faster.</p></pre>birkbork: <pre><blockquote>
<p>I notice your benchmarks are only using a slice of 50k items. This makes your n 50k. Try increasing n substantially. There will likely be a value of n when the less algorithmic complexity solution is faster.</p>
</blockquote>
<p>I increased them to 5 million, still seeing the same pattern:</p>
<pre><code>$ go test -bench=. -benchtime=30s
BenchmarkSolutionLinear-8 100 411649700 ns/op
BenchmarkSolutionHashing-8 10000 3728741 ns/op
BenchmarkSolutionBinarySearch-8 20000 3070626 ns/op
</code></pre></pre>tsdtsdtsd: <pre><p>In your linear solution you have to sort your input first. Could it be that this makes the benchmark slower? Just a guess after a peek, as I can't try it out right now at work.</p></pre>birkbork: <pre><p>Possibly, but the link in OP suggests it should be faster;</p>
<blockquote>
<p>We can use sorting to solve it in lesser time complexity. We can sort the array in O(nLogn) time. Once the array is sorted, then all we need to do is a linear scan of the array. So this approach takes O(nLogn + n) time which is O(nLogn).</p>
</blockquote>
<p>Maybe I am just using too small arrays to notice, as <a href="/u/ImLopshire" rel="nofollow">/u/ImLopshire</a> noticed</p></pre>vhodges: <pre><p>I didn't look too long (5 minutes or so) but it looks like you've swapped the Linear and Binary Search implementations.</p></pre>birkbork: <pre><p>You're probably right, I am fumbling here :-)</p>
<p>That said, I tried following the descriptions from [1]</p>
<p>for binary search:</p>
<blockquote>
<p>search all positive integers, starting from 1 in the given array. We may have to search at most n+1 numbers in the given array. So this solution takes O(n<sup>2)</sup> in worst case.</p>
</blockquote>
<p>which i tried to implement in SolutionBinarySearch(),</p>
<p>and linear:</p>
<blockquote>
<p>We can sort the array in O(nLogn) time. Once the array is sorted, then all we need to do is a linear scan of the array. So this approach takes O(nLogn + n) time which is O(nLogn).</p>
</blockquote>
<p>which I tried to implement in SolutionLinear()</p>
<p>1: <a href="http://www.geeksforgeeks.org/find-the-smallest-positive-number-missing-from-an-unsorted-array/" rel="nofollow">http://www.geeksforgeeks.org/find-the-smallest-positive-number-missing-from-an-unsorted-array/</a></p></pre>CHAOSFISCH: <pre><p>First, your implementation contains many errors. I fixed the most obvious ones.
Your "BinarySearch" is essentially the "naive" algorithm?</p>
<p>For n = 25000000</p>
<pre><code>BenchmarkSolutionLinear-4 1 19230408500 ns/op
BenchmarkSolutionHashing-4 1 14998439500 ns/op
BenchmarkSolutionNaive-4 1 11039920000 ns/op
</code></pre>
<p>Results might be a bit off because of potential thermal throttling. I'd assume that for some larger n Hashing might be faster. However, it seems that the Naive algorithm works best for most cases here. Unless there's another unspotted bug or performance improvement possible within the respective algorithms.</p>
<p><del><a href="https://play.golang.org/p/P9g58_VNYt" rel="nofollow">https://play.golang.org/p/P9g58_VNYt</a></del>
<a href="https://play.golang.org/p/Yym9hqQLW7" rel="nofollow">https://play.golang.org/p/Yym9hqQLW7</a></p>
<p><strong>EDIT 1</strong>:
I've played a bit further with the code and performed a bit of profiling. </p>
<ul>
<li>Naive: 2/3 of time is spent on the for loop, 1/3 of time on the comparison.</li>
<li>Linear: 99% of time is spent on sorting the numbers</li>
<li>Hashing: 99% of time is spent on assigning entries to the map.</li>
</ul>
<p>Thus, only for very large n the naive version could be slower.</p></pre>birkbork: <pre><blockquote>
<p>Your "BinarySearch" is essentially the "naive" algorithm?</p>
</blockquote>
<p>Yes. I guess this means I don't know what a binary search is!</p>
<p>Very interesting profiling results. PS, you can increase the time the benchmark is run with <code>-benchtime=30s</code> or something, to get more accurate results with long running benchmarks. The default is 1 second.</p>
<p>I'll have a look at your version, I am very curious of all them errors I made :-) Thank you so much for your detailed feedback!</p></pre>CHAOSFISCH: <pre><p>A more updated version (fixed/improved the way benchmarks are done):
<a href="https://play.golang.org/p/Yym9hqQLW7" rel="nofollow">https://play.golang.org/p/Yym9hqQLW7</a></p></pre>JHunz: <pre><p>Your test set generation, while improved from the way OP was doing it, doesn't insert any negative numbers. This artificially boosts the naive solution</p></pre>CHAOSFISCH: <pre><p>Negative numbers have no effect and do not favor the <em>naive</em> solution (just tested it to verified it).
More important is the density of the slice, i.e., which is highest if we need <em>k</em> iterations to find the solution <em>k</em> (assuming positive numbers).</p>
<p>Thus, it highly depends on the input. </p>
<p>Is it sparse? -> <em>naive</em></p>
<p>Is it dense? -> <em>hashing</em> (according to O(n)). However, <em>linear</em> is faster because the creation of the hashmap is too expensive. </p>
<p>Using random numbers as done by OP will most likely favor <em>naive</em>.</p>
<p><strong>EDIT</strong>: I can benchmark only for up to <em>n=100000000</em>. After that my memory is insufficient.</p></pre>ChristophBerger: <pre><p>About the O() notation, the math may become clearer when looking at the related function plots for n<sup>2,</sup> log(n), 2<sup>n,</sup> etc. <a href="https://appliedgo.net/big-o/" rel="nofollow">Here</a> is a blog post I wrote about this a while ago.</p></pre>birkbork: <pre><p>Thank you for the link, the article looks really interesting!</p></pre>JHunz: <pre><p>There are a number of problems I see: </p>
<p>First, you're generating a test set that doesn't contain any negative numbers because rand.Int returns non-negative values. Having them in would very drastically affect the performance of the naive solution, so that benchmark is artificially sped up with your current benchmarking code. </p>
<p>Second, you're not actually handling the problem description correctly in some of your solutions. It specifies that negative numbers can be present in the array, but for example SolutionLinear will always return 1 on any array containing a negative number, whether or not 1 is present. This will make it artificially slightly faster once your test set contains negative numbers, although it would still be nLogn due to the sorting happening first.</p>
<p>Third, you are handling the problem description incorrectly in another way by capping the values you search for i at 1 million in SolutionBinarySearch and SolutionHashing. This makes SolutionBinarySearch artificially fast at n>1000000 because its complexity caps at 1000000*n rather than n squared. </p>
<p>So I think most of the discrepancy you're seeing is a discrepancy in what you thought you were testing vs what you were actually testing. </p>
<p>Edit: Also, go has a map type which is basically a hash table already. You'd probably see better performance out of the hashing solution as well as not having to allocate a specific size up front by switching to using that rather than using a slice.</p></pre>birkbork: <pre><p>Thanks for all the pointers, will have another go at this</p>
<blockquote>
<p>Edit: Also, go has a map type which is basically a hash table already. You'd probably see better performance out of the hashing solution as well as not having to allocate a specific size up front by switching to using that rather than using a slice.</p>
</blockquote>
<p>Did try this initially, using map[int]bool, but it was considerably slower than the solution used in the OP.
This might be because I only tested it with arrays with 50k elements. Should re-try.</p></pre>birkbork: <pre><p>Did another test regarding using map. It seems to be much slower:</p>
<pre><code>BenchmarkSolution/Hashing-8 20 72253585 ns/op
BenchmarkSolution/HashingMap-8 1 22666093134 ns/op
BenchmarkSolution/HashingMapNoInterface-8 1 21006878796 ns/op
</code></pre>
<p>Where SolutionHashingMap() is by <a href="/u/CHAOSFISCH" rel="nofollow">/u/CHAOSFISCH</a></p>
<p>and SolutionHashingMapNoInterface() is slightly modified to use map[int]bool instead.</p>
<pre><code>// uses hasing, O(n)
func SolutionHashing(A []int) int {
hash := make([]bool, 1000000)
for _, val := range A {
if val > 0 && val < 100000 {
hash[val] = true
}
}
for i := 1; i < len(A)+1; i++ {
if !hash[i] {
return i
}
}
return len(A) + 1
}
// uses hasing, O(n), with map
func SolutionHashingMap(A []int) int {
var a struct{}
hash := make(map[int]struct{}, len(A))
for _, val := range A {
hash[val] = a
}
for i := 1; i < len(A)+1; i++ {
if _, ok := hash[i]; !ok {
return i
}
}
return len(A) + 1
}
// uses hasing, O(n), with map
func SolutionHashingMapNoInterface(A []int) int {
hash := make(map[int]bool, len(A))
for _, val := range A {
hash[val] = true
}
for i := 1; i < len(A)+1; i++ {
if !hash[i] {
return i
}
}
return len(A) + 1
}
</code></pre>
<p>updated code here, based on changes from <a href="/u/CHAOSFISCH" rel="nofollow">/u/CHAOSFISCH</a> <a href="https://play.golang.org/p/7Ha98L8eep" rel="nofollow">https://play.golang.org/p/7Ha98L8eep</a></p></pre>CHAOSFISCH: <pre><p>Well, sure the solution you propose using a fixed size <em>[]bool</em> is faster. To some extent you're cheating. My solutions work for all numbers <em>MinInt64</em> to <em>MaxInt64</em>. You're limiting it to 100000.</p></pre>birkbork: <pre><p>True. The 100k limit is mentioned in <a href="https://codility.com/demo/take-sample-test/" rel="nofollow">https://codility.com/demo/take-sample-test/</a>, not the description posted in OP.</p></pre>
这是一个分享于 的资源,其中的信息可能已经有所发展或是发生改变。
入群交流(和以上内容无关):加入Go大咖交流群,或添加微信:liuxiaoyan-s 备注:入群;或加QQ群:692541889
- 请尽量让自己的回复能够对别人有帮助
- 支持 Markdown 格式, **粗体**、~~删除线~~、
`单行代码`
- 支持 @ 本站用户;支持表情(输入 : 提示),见 Emoji cheat sheet
- 图片支持拖拽、截图粘贴等方式上传