Same file-reading algorithm, different results in Go and Ruby. What am I doing wrong?

agolangf · 2017-07-21 00:00:15 · 579 views

This resource was shared on 2017-07-21 00:00:15; the information in it may have since evolved or changed.

I'm trying to hash only a small part of a large file. My approach is "if the file is < 1M bytes, SHA256 the entire thing; otherwise, SHA256 the first 100K bytes, then 500K bytes starting at len(file)/2."

I wrote it in Ruby, which is hopefully readable even if you don't know Ruby:

require 'digest'

def fingerprint(filename)
  size = File.size(filename)
  return Digest::SHA256.file(filename).hexdigest if size < 1_000_000

  hash = Digest::SHA256.new
  File.open(filename) do |f|
    hash << f.read(100_000)
    f.seek(size / 2, :SET)
    hash << f.read(500_000)
  end

  hash.hexdigest
end

Here's my attempt to translate it into Go:

func fingerprint(filename string) [32]byte {
    file, _ := os.Open(filename)
    defer file.Close()
    size := fileSize(file)

    if size < 1000000 {
        data := make([]byte, size)
        file.Read(data)
        return sha256.Sum256(data)
    }

    data := make([]byte, 600000)
    io.ReadAtLeast(file, data[0:99999], 100000)
    file.Seek(size/2, 0)
    io.ReadAtLeast(file, data[100000:599999], 500000)
    return sha256.Sum256(data)
}

I make a 600,000-byte buffer, pass a slice reference to the first 100,000 bytes to read at least 100,000 bytes, seek to halfway through the file, and do this again for the next 500,000 bytes.

My problem is that the exact same file will result in a hash 1358f4ce65f0d1ed482d572e4eac6ea90d465c0ab878f477297474f8f23226c3 for Go but f19ca1d8e6a68d539fe13a714b50e96a1480feb03cedeee5deeddafdd9d8b038 for Ruby. I tried to write the two approaches completely identically, but evidently there's something I messed up or didn't understand. I'm asking for help identifying what that was, since it's totally going over my head. Any tip would be really appreciated!

Comments:

epiris:

Hashes implement io.Writer, so you can use io.Copy and avoid a massive buffer like that. If the file size is less than 1 MB, just io.Copy(hash, file); otherwise, use the fact that file is an io.Seeker and apply the same copy method. You would call Copy twice: once with your file wrapped in a LimitReader of 100k, then again, after a call to file.Seek, with a LimitReader of 500k. After your final call to Copy in either case, call Sum with nil to get the hash. There are more efficient tricks we could use here, but this is clear and correct.

Eraxley:

The error is that you should reslice differently. You should write io.ReadAtLeast(file, data[0:100000], 100000) and io.ReadAtLeast(file, data[100000:600000], 500000). Read more about slicing slices here.

Also, right now, both of those calls should be returning errors. Remember to handle your errors! You would have spotted this bug instantly, had you not ignored the errors.

I hope this helps!

danredux:

Thanks for being the only one to answer the question at hand.

jere_jones:

This is the correct answer. You are reading in one less byte than you think you are.

Slices are [low:high], where high is the index of the last element + 1.

https://golang.org/ref/spec#Slice_expressions
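
A quick way to check the half-open rule is to print the lengths of the slices involved. This is only an illustration; sliceLen is a made-up helper:

```go
package main

import "fmt"

// sliceLen (hypothetical helper) returns len(buf[low:high]), which is high - low.
func sliceLen(buf []byte, low, high int) int {
	return len(buf[low:high])
}

func main() {
	data := make([]byte, 600000)

	fmt.Println(sliceLen(data, 0, 99999))       // 99999: one byte short
	fmt.Println(sliceLen(data, 0, 100000))      // 100000
	fmt.Println(sliceLen(data, 100000, 599999)) // 499999: one byte short
	fmt.Println(sliceLen(data, 100000, 600000)) // 500000
}
```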

bonekeeper:

FYI, there's a way to calculate the hash as you write to it; no need to send everything to a buffer.

mcastilho:

Take a look at this article: http://marcio.io/2015/07/calculating-multiple-file-hashes-in-a-single-pass/
