Same file-reading algorithm, different results in Go and Ruby. What am I doing wrong?

agolangf · 187 views
<p>I'm trying to hash only a small part of a large file. My approach is "if the file is &lt; 1M bytes, SHA256 the entire thing; otherwise, SHA256 the first 100K bytes, then 500K bytes starting at len(file)/2."</p>
<p>I wrote it in Ruby, which is hopefully readable even if you don't know Ruby:</p>
<pre><code>require 'digest'

def fingerprint(filename)
  size = File.size(filename)
  return Digest::SHA256.file(filename).hexdigest if size &lt; 1_000_000

  hash = Digest::SHA256.new
  File.open(filename) do |f|
    hash &lt;&lt; f.read(100_000)  # first 100K bytes
    f.seek(size / 2, :SET)    # jump to the middle of the file
    hash &lt;&lt; f.read(500_000)  # next 500K bytes
  end
  hash.hexdigest
end
</code></pre>
<p>Here's my attempt to translate it into Go:</p>
<pre><code>func fingerprint(filename string) [32]byte {
	file, _ := os.Open(filename)
	defer file.Close()
	size := fileSize(file)
	if size &lt; 1000000 {
		data := make([]byte, size)
		file.Read(data)
		return sha256.Sum256(data)
	}
	data := make([]byte, 600000)
	io.ReadAtLeast(file, data[0:99999], 100000)
	file.Seek(size/2, 0)
	io.ReadAtLeast(file, data[100000:599999], 500000)
	return sha256.Sum256(data)
}
</code></pre>
<p>I make a 600,000-byte buffer, pass a slice of its first 100,000 bytes to read at least 100,000 bytes, seek to halfway through the file, and do the same again for the next 500,000 bytes.</p>
<p>My problem is that the exact same file produces the hash <code>1358f4ce65f0d1ed482d572e4eac6ea90d465c0ab878f477297474f8f23226c3</code> in Go but <code>f19ca1d8e6a68d539fe13a714b50e96a1480feb03cedeee5deeddafdd9d8b038</code> in Ruby. I tried to write the two versions identically, but evidently there's something I messed up or didn't understand. I'm asking for help identifying what that was, since it's totally going over my head. Any tip would be really appreciated!</p>
<hr/>**Comments:**<br/><br/>epiris: <pre><p>Hashes implement io.Writer, so you can use io.Copy and avoid a massive buffer like that. If the file size is less than 1 MB, just io.Copy(hash, file); otherwise, use the fact that file is an io.Seeker to apply the same copy method.
You would call Copy twice: once with your file wrapped in a LimitReader of 100K, then again after a call to file.Seek with a LimitReader of 500K. After the final call to Copy in either case, call Sum with nil to get the hash. There are more efficient tricks we could use here, but this is clear and correct.</p></pre>
Eraxley: <pre><p>The error is in how you reslice. You should write <code>io.ReadAtLeast(file, data[0:100000], 100000)</code> and <code>io.ReadAtLeast(file, data[100000:600000], 500000)</code>. Read more about slicing slices <a href="" rel="nofollow">here.</a></p> <p>Also, right now, both of those calls should be returning errors. Remember to handle your errors! You would have spotted this bug instantly had you not ignored them.</p> <p>I hope this helps!</p></pre>
danredux: <pre><p>Thanks for being the only one to answer the question at hand.</p></pre>
jere_jones: <pre><p>This is the correct answer. Your slices hold one byte fewer than you think they do.</p> <p>Slices are [low:high], where high is the index of the last element + 1.</p></pre>
bonekeeper: <pre><p>FYI, there's a way to calculate the hash as you write to it; no need to send everything to a buffer.</p></pre>
mcastilho: <pre><p>Take a look at this article: <a href="" rel="nofollow"></a></p></pre>
