I'm trying to hash only a small part of a large file. My approach is "if the file is < 1M bytes, SHA256 the entire thing; otherwise, SHA256 the first 100K bytes, then 500K bytes starting at len(file)/2."
I wrote it in Ruby, which is hopefully readable even if you don't know Ruby:
def fingerprint(filename)
  size = File.size(filename)
  return Digest::SHA256.file(filename).hexdigest if size < 1_000_000
  hash = Digest::SHA256.new
  File.open(filename) do |f|
    hash << f.read(100_000)
    f.seek(size / 2, :SET)
    hash << f.read(500_000)
  end
  hash.hexdigest
end
Here's my attempt to translate it into Go:
func fingerprint(filename string) [32]byte {
	file, _ := os.Open(filename)
	defer file.Close()
	size := fileSize(file)
	if size < 1000000 {
		data := make([]byte, size)
		file.Read(data)
		return sha256.Sum256(data)
	}
	data := make([]byte, 600000)
	io.ReadAtLeast(file, data[0:99999], 100000)
	file.Seek(size/2, 0)
	io.ReadAtLeast(file, data[100000:599999], 500000)
	return sha256.Sum256(data)
}
I make a 600,000-byte buffer, pass a slice reference to the first 100,000 bytes to read at least 100,000 bytes, seek to halfway through the file, and do this again for the next 500,000 bytes.
My problem is that the exact same file results in the hash 1358f4ce65f0d1ed482d572e4eac6ea90d465c0ab878f477297474f8f23226c3 for Go but f19ca1d8e6a68d539fe13a714b50e96a1480feb03cedeee5deeddafdd9d8b038 for Ruby. I tried to write the two approaches completely identically, but evidently there's something I messed up or didn't understand. I'm asking for help identifying what that was, since it's totally going over my head. Any tip would be really appreciated!
Comments:
epiris:
Eraxley:Hashes implement io.Writer, so you can use io.Copy and avoid a massive buffer like that. If the file size is less than 1 MB, just io.Copy(hash, file); otherwise, use the fact that file is an io.Seeker and apply the same copy method. You would call io.Copy twice: once with your file wrapped in an io.LimitReader of 100K, then again, after a call to file.Seek, with an io.LimitReader of 500K. After your final call to io.Copy in either case, call Sum with nil to get the hash. There are more efficient tricks we could use here, but this is clear and correct.
danredux:The error is that you should reslice differently. You should write
io.ReadAtLeast(file, data[0:100000], 100000)
and
io.ReadAtLeast(file, data[100000:600000], 500000)
Read more about slicing slices here. Also, right now, both of those calls should be returning errors. Remember to handle your errors! You would have spotted this bug instantly had you not ignored the errors.
I hope this helps!
jere_jones:Thanks for being the only one to answer the question at hand.
bonekeeper:This is the correct answer. You are reading in one less byte than you think you are.
Slices are [low:high], where high is the index of the last element + 1.
mcastilho:FYI, there's a way to calculate the hash as you write to it, so there's no need to send everything to a buffer.
Take a look at this article: http://marcio.io/2015/07/calculating-multiple-file-hashes-in-a-single-pass/
