Does Anyone Make a Photo De-Duplicator For Linux? Something That Reads EXIF?

"Imagine having thousands of images on disparate machines. many are dupes, even among the disparate machines. It's impossible to delete all the dupes manually and create a singular, accurate photo image base? Is there an app out there that can scan a file system, perhaps a target sub-folder system, and suck in the images-- WITHOUT creating duplicates? Perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible." -- source: http://linux.slashdot.org/story/14/01/23/2227241 Worthwhile reading the comments, mentioning various tools. Cheers, Peter -- Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ http://www.cms.waikato.ac.nz/~fracpete/ Ph. +64 (7) 858-5174

fdupes does a great job if it's just one filesystem. It can delete the dupes or just hard-link them, which is sometimes more useful.

But before I bothered to look and see that someone had already solved my problem, I wrote a script (in bash) that would list all the md5sums and paths in two columns, sort by md5 and delete all but the first of each. It wouldn't be hard to do this for 11 filesystems: extend it a bit so that you include the drive each file came from, and perhaps the date, apply whatever logic you want to the lists, and generate 11 scripts as output that get run back on the original machines. (A rough Python sketch of the basic approach follows the quoted message below.)

On 24 January 2014 15:16, Peter Reutemann <fracpete(a)waikato.ac.nz> wrote:
"Imagine having thousands of images on disparate machines. many are dupes, even among the disparate machines. It's impossible to delete all the dupes manually and create a singular, accurate photo image base? Is there an app out there that can scan a file system, perhaps a target sub-folder system, and suck in the images-- WITHOUT creating duplicates? Perhaps by reading EXIF info or hashes? I have eleven file systems saved, and the task of eliminating dupes seems impossible."
-- source: http://linux.slashdot.org/story/14/01/23/2227241
The comments are worth reading; they mention various tools.
Cheers, Peter
--
Peter Reutemann, Dept. of Computer Science, University of Waikato, NZ
http://www.cms.waikato.ac.nz/~fracpete/
Ph. +64 (7) 858-5174
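
A rough Python sketch of the approach Bruce describes above (hash every file, keep the first copy of each hash). The mount points are hypothetical, and it only prints the removals it would make instead of deleting anything:

#!/usr/bin/env python3
# Sketch of the hash-and-keep-first approach: hash every file, group by
# hash, keep only the first copy seen. ROOTS below are hypothetical.
import hashlib
import os

ROOTS = ["/mnt/disk1/photos", "/mnt/disk2/photos"]  # hypothetical mount points

def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

seen = {}  # md5 digest -> first path seen with that digest
for root in ROOTS:
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = md5_of(path)
            if digest in seen:
                # Duplicate content: print rather than delete, so the output
                # can be reviewed (or turned into a per-machine script).
                print(f"rm '{path}'  # dup of {seen[digest]}")
            else:
                seen[digest] = path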

On Fri, 24 Jan 2014 15:58:09 +1300, Bruce Kingsbury wrote:
But before I bothered to look and see that someone had already solved my problem, I wrote a script (in bash) that would list all the md5sums and paths in two columns, sort by md5 and delete all but the first of each.
If you want to speed things up, you could be lazy about computing the hash of each file's contents. Start by just recording the length of each file; only if you find two files of the same length do you need to compute their respective content hashes to check for a match. This is the point where I would give up on bash and resort to Python.
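
A minimal Python sketch of that lazy variant (group by size first, hash only when sizes collide); the root directory is a hypothetical placeholder:

#!/usr/bin/env python3
# Pass 1 records only file sizes; pass 2 hashes just the files whose sizes
# collide. ROOT below is a hypothetical path.
import hashlib
import os
from collections import defaultdict

ROOT = "/mnt/photos"  # hypothetical

# Pass 1: group paths by file size (cheap, no reads of file contents).
by_size = defaultdict(list)
for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        by_size[os.path.getsize(path)].append(path)

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file's contents in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Pass 2: only files that share a size are worth hashing.
for size, paths in by_size.items():
    if len(paths) < 2:
        continue
    by_hash = defaultdict(list)
    for path in paths:
        by_hash[sha256_of(path)].append(path)
    for digest, group in by_hash.items():
        if len(group) > 1:
            print(f"{size} bytes, {digest[:12]}: {group}")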

On Fri, Jan 24, 2014 at 03:58:09PM +1300, Bruce Kingsbury wrote:
fdupes does a great job if it's just one filesystem.
Not on images. What if one image has extra metadata added to it but is otherwise exactly the same as another? What if you have two copies of the same image and one is in TIFF format and the other in JPEG? What if one is a slight modification of the other, say a contrast adjustment for better viewing? Or, say, it could detect that they are the same image, but that the JPEG has been created from the original TIFF and is thus the lower-quality copy due to lossy compression, and therefore keep the TIFF. I would want a photo de-duplicator to be able to detect and report such situations.

Cheers, Michael.
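
For the first of those cases (identical pixels, different metadata), one option is to hash the decoded pixel data rather than the file bytes. A minimal sketch, assuming the Pillow imaging library is installed and using hypothetical command-line arguments; it will not catch re-encoded or adjusted copies, which need a perceptual fingerprint of the kind findimagedupes computes:

#!/usr/bin/env python3
# Treat two files as the same photo if their decoded pixels match, even
# when EXIF/IPTC/XMP metadata differ. Assumes Pillow is installed and that
# two image paths are passed on the command line (hypothetical usage).
import hashlib
import sys

from PIL import Image  # pip install Pillow

def pixel_digest(path):
    """Hash the decoded pixel data, ignoring any metadata in the file."""
    with Image.open(path) as im:
        rgb = im.convert("RGB")           # normalise mode so equal pixels compare equal
        h = hashlib.sha256()
        h.update(str(rgb.size).encode())  # include dimensions in the key
        h.update(rgb.tobytes())           # raw pixel bytes, no metadata
    return h.hexdigest()

if __name__ == "__main__":
    a, b = sys.argv[1:3]
    same = pixel_digest(a) == pixel_digest(b)
    print("same pixels" if same else "different pixels")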

On Fri, 24 Jan 2014 16:26:24 +1300, Michael Cree wrote:
What if one is a slight modification of the other, say a contrast adjustment for better viewing?
Maybe this <http://packages.debian.org/wheezy/findimagedupes> might help.

There are better programs, but the tradeoff is speed; fdupes is quite fast. I hadn't really thought about just comparing file sizes and then md5summing only the ones that are the same size; that would speed things up a lot. And it's probably what fdupes already does.

On 24 January 2014 16:35, Lawrence D'Oliveiro <ldo(a)geek-central.gen.nz> wrote:
On Fri, 24 Jan 2014 16:26:24 +1300, Michael Cree wrote:
What if one is a slight modification of the other, say a contrast adjustment for better viewing?
Maybe this <http://packages.debian.org/wheezy/findimagedupes> might help.

On Fri, 24 Jan 2014 16:26:24 Michael Cree wrote:
On Fri, Jan 24, 2014 at 03:58:09PM +1300, Bruce Kingsbury wrote:
fdupes does a great job if it's just one filesystem.
Not on images. What if one image has extra metadata added to it but is otherwise exactly the same as another? What if you have two copies of the same image and one is in TIFF format and the other in JPEG? What if one is a slight modification of the other, say a contrast adjustment for better viewing? Or, say, it could detect that they are the same image, but that the JPEG has been created from the original TIFF and is thus the lower-quality copy due to lossy compression, and therefore keep the TIFF. I would want a photo de-duplicator to be able to detect and report such situations.
I use "similar." http://sourceforge.net/projects/stic/
participants (5)
- Bruce Kingsbury
- Lawrence D'Oliveiro
- Michael Cree
- Peter Reutemann
- Wayne Rooney