I'm trying to do duplicate detection in my photo library, and I was experimenting with ImageMagick's identify -verbose
tool to get a signature (SHA256 hash) of the pixel data.
A problem arose: the signatures computed on my Mac, which uses the Homebrew distribution of LibRaw (the latest, 0.21.1), did not match those from my Ubuntu 22.04 machine (which gets LibRaw 0.20.2). Once I built 0.21.1 on the Linux machine, the signatures were identical as expected, so I don't think it's an ImageMagick issue.
I'm posting here because I want to understand if I was wrong in expecting LibRaw to produce stable output in terms of the raw pixel data; i.e., the same file will yield the same pixel data regardless of which version of LibRaw you use. (If I'm using the wrong term when I say "raw pixel data", please educate me. :) )
If the answer is "sorry, the pixels you get are subject to change according to how our raw processing logic evolves" that's fine -- but I would be curious to know if there's a way to get something that is entirely stable out of LibRaw so I can compute a hash from that instead. To be clear, I don't need to do any image processing -- I just want to know if the image data is identical even if metadata got changed, like the capture time or something.
Sorry, we do not know how the identify -verbose tool works. Is it possible that it checksums the rendered image rather than the RAW data?
-- Alex Tutubalin @LibRaw LLC
RAW data vs. rendered image
Ahh thanks, I think you just helped me understand something.
I haven't checked, but I'm almost certain you're right about it using the rendered image. My understanding is that ImageMagick delegates all the decoding to libraw, libjpeg, libtiff, libpng, etc., and in my case it doesn't necessarily know it's dealing with a RAW image by the time it creates the signature.
So let's say I wanted to write my own signature program using LibRaw that only operates on the image data, leaving the metadata completely out of it. After a quick look at the API, my best guess is that I'd want to hash the contents of libraw_rawdata_t. Does that sound right?
Yes, one of the *image pointers in libraw_rawdata_t will be non-zero after LibRaw::unpack(). It will point to imgdata.sizes.raw_height rows of imgdata.sizes.raw_width items each, with a row pitch of imgdata.sizes.raw_pitch bytes.
-- Alex Tutubalin @LibRaw LLC
Thank you very much. This is all new to me but I'm eager to get into it.
One last question - I'm completely new to working with RAW processing, but I've been a coder for 25 years. Can you recommend any conceptual documentation or reference material that will help me understand RAW processing better? There are lots of search results, but if you have a recommended resource I would love to know about it. (Apologies if it's on page 1 of your documentation and I just missed it.)
Thanks again!
It is difficult to recommend any specific document or site. Maybe 'dcraw annotated' will help: https://ninedegreesbelow.com/files/dcraw-c-code-annotated-code.html
(at least, LibRaw postprocessing is derived from dcraw.c code, so this specific document is applicable to LibRaw::dcraw_process() code)
-- Alex Tutubalin @LibRaw LLC
Thank you. This is helpful.
My solution
I stripped down the unprocessed_raw sample and achieved what I want (I think) by piping the output through a hash utility like sha256sum or xxh128sum. Posting here in case it helps someone later:
Compile with g++ rawbytes.cpp -o rawbytes -Ofast -lraw -lm.