API Documentation

Find Dupes Fast by Stephan Sokolow (ssokolow.com)

A simple script which identifies duplicate files several orders of magnitude more quickly than fdupes by using smarter algorithms.


Todo

Figure out how to do ePyDoc-style grouping here without giving up automodule-level comfort.

fastdupes.CHUNK_SIZE = 65536

Size for chunked reads from file handles

fastdupes.DEFAULTS = {'min_size': 25, 'exclude': ['*/.svn', '*/.bzr', '*/.git', '*/.hg'], 'delete': False}

Default settings used by optparse and some functions

fastdupes.HEAD_SIZE = 16384

Limit how many bytes will be read to compare headers

class fastdupes.OverWriter(fobj)[source]

Bases: object

Output helper for handling overdrawing the previous line cleanly.

write(text, newline=False)[source]

Use \r to overdraw the current line with the given text.

When used consistently, this function transparently tracks how much overdrawing is necessary to erase the previous line.

Parameters:
  • text (str) – The text to be written
  • newline (bool) – Whether to start a new line and reset the length count.
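
As an illustrative sketch of typical use (the loop and messages are hypothetical, built only from the signatures documented above):

    import sys

    from fastdupes import OverWriter

    ui = OverWriter(sys.stderr)
    for idx in range(1, 101):
        # Each call overdraws the previous status line using \r.
        ui.write("Scanning file %d of 100" % idx)
    ui.write("Scan complete.", newline=True)  # end with a real newline
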
fastdupes.compareChunks(handles, chunk_size=65536)[source]

Group a list of file handles based on equality of the next chunk of data read from them.

Parameters:
  • handles – A list of open handles for file-like objects with potentially-identical contents.
  • chunk_size – The amount of data to read from each handle every time this function is called.
Returns:

Two lists of lists:

  • Lists to be fed back into this function individually
  • Finished groups of duplicate paths. (including unique files as single-file lists)

Return type:

(list, list)

Attention

File handles will be closed when no longer needed

Todo

Discard chunk contents immediately once they’re no longer needed
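
A hedged sketch of the driving loop this function supports. The exact shape of the elements inside each group of handles is an assumption here (bare binary file objects); groupByContent() below is the supported entry point:

    from fastdupes import compareChunks

    # Hypothetical candidate files, already known to have the same size.
    paths = ["copy_a.bin", "copy_b.bin", "copy_c.bin"]
    # Assumption: a group is a plain list of open binary handles.
    pending = [[open(path, "rb") for path in paths]]
    finished = []
    while pending:
        still_matching, done = compareChunks(pending.pop(0))
        pending.extend(still_matching)  # groups that still agree so far
        finished.extend(done)           # finished groups of duplicate paths
    # Handles are closed by compareChunks() once no longer needed.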

fastdupes.delete_dupes(groups, prefer_list=None, interactive=True, dry_run=False)[source]

Code to handle the --delete command-line option.

Parameters:
  • groups (iterable) – A list of groups of paths.
  • prefer_list – A whitelist to be compiled by multiglob_compile() and used to skip some prompts.
  • interactive (bool) – If False, assume the user wants to keep all copies when a prompt would otherwise be displayed.
  • dry_run (bool) – If True, only pretend to delete files.

Todo

Add a secondary check for symlinks for safety.
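
A minimal non-interactive sketch (the path and glob are hypothetical):

    from fastdupes import delete_dupes, find_dupes

    groups = find_dupes(["/home/user/photos"])
    # Keep anything under an "originals" folder without prompting,
    # and only pretend to delete thanks to dry_run.
    delete_dupes(groups, prefer_list=["*/originals/*"], dry_run=True)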

fastdupes.find_dupes(paths, exact=False, ignores=None, min_size=0)[source]

High-level code to walk a set of paths and find duplicate groups.

Parameters:
  • paths (list of str) – Relative or absolute paths to files or folders to be walked.
  • exact (bool) – If True, compare file contents byte-for-byte rather than trusting a full-content hash.
  • ignores (list of str) – A list of fnmatch globs to avoid walking and omit from results.
  • min_size (int) – Files smaller than this size (in bytes) will be ignored.
Returns:

A list of groups of files with identical contents

Return type:

[[path, ...], [path, ...]]
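
A minimal sketch of driving the whole pipeline with one call (hypothetical paths):

    from fastdupes import find_dupes

    groups = find_dupes(["/home/user/Downloads"],
                        ignores=["*/.git", "*/.svn"], min_size=25)
    for group in groups:
        print("Identical files:")
        for path in group:
            print("  %s" % path)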

fastdupes.getPaths(roots, ignores=None)[source]

Recursively walk a set of paths and return a listing of contained files.

Parameters:
  • roots (list of str) – Relative or absolute paths to files or folders.
  • ignores (list of str) – A list of fnmatch globs to avoid walking and omit from results
Returns:

Absolute paths to files only.

Return type:

list of str

Todo

Try to optimize the ignores matching. Running a regex on every filename is a fairly significant percentage of the time taken according to the profiler.
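
A minimal sketch (hypothetical root and globs):

    from fastdupes import getPaths

    files = getPaths(["/home/user/projects"], ignores=["*/.git", "*/.svn"])
    print("%d files found" % len(files))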

fastdupes.groupBy(groups_in, classifier, fun_desc='?', keep_uniques=False, *args, **kwargs)[source]

Subdivide groups of paths according to a function.

Parameters:
  • groups_in (dict of iterables) – Grouped sets of paths.
  • classifier (function(list, *args, **kwargs) -> str) – Function to group a list of paths by some attribute.
  • fun_desc (str) – Human-readable term for what the classifier operates on. (Used in log messages)
  • keep_uniques (bool) – If False, discard groups with only one member.
Returns:

A dict mapping classifier keys to groups of matches.

Return type:

dict

Attention

Grouping functions generally store each group as a set as extra protection against accidentally counting a given file twice. (Complementary to the use of os.path.realpath() in getPaths())

Todo

Find some way to bring back the file-by-file status text
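
As an illustration, one pass of the find_dupes() pipeline might look like this (the paths are hypothetical; sizeClassifier() is documented below):

    from fastdupes import groupBy, sizeClassifier

    # Begin with a single group holding every candidate path.
    initial = {"": ["/tmp/a.bin", "/tmp/b.bin", "/tmp/c.bin"]}
    by_size = groupBy(initial, sizeClassifier, fun_desc="sizes", min_size=25)
    # by_size maps each on-disk size to the set of paths sharing it;
    # one-member groups are discarded because keep_uniques is False.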

fastdupes.groupByContent(paths)[source]

Byte-for-byte comparison on an arbitrary number of files in parallel.

This operates by opening all files in parallel and comparing chunk-by-chunk. This has the following implications:

  • Reads the same total amount of data as hash comparison.
  • Performs a lot of disk seeks. (Best suited for SSDs)
  • Vulnerable to file handle exhaustion if used on its own.
Parameters:
  • paths (iterable) – List of potentially identical files.
Returns:

A dict mapping one path to a list of all paths (self included) with the same contents.

Todo

Start examining the while handles: block to figure out how to minimize thrashing in situations where read-ahead caching is active. Compare savings by read-ahead to savings due to eliminating false positives as quickly as possible. This is a 2-variable min/max problem.

Todo

Look into possible solutions for pathological cases of thousands of files with the same size and same pre-filter results. (File handle exhaustion)
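
A minimal sketch (hypothetical file names):

    from fastdupes import groupByContent

    results = groupByContent(["report.pdf", "report (copy).pdf", "notes.txt"])
    for keeper, same in results.items():
        if len(same) > 1:
            print("%s has %d identical copies" % (keeper, len(same) - 1))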

fastdupes.groupify(function)[source]

Decorator to convert a function which takes a single value and returns a key into one which takes a list of values and returns a dict of key-group mappings.

Parameters:
  • function (function(value) -> key) – A function which takes a value and returns a hash key.
Return type:

function(iterable) -> {key: set([value, ...]), ...}
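
As an illustration, a hypothetical classifier built with this decorator (not part of the module):

    import os

    from fastdupes import groupify

    @groupify
    def extClassifier(path):
        """Hypothetical example: key each path by its file extension."""
        return os.path.splitext(path)[1]

    # extClassifier() now accepts an iterable of paths and returns
    # {extension: set([path, ...]), ...}
    groups = extClassifier(["a.txt", "b.txt", "c.jpg"])
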
fastdupes.hashClassifier(paths, *args, **kwargs)[source]

Sort a file into a group based on its SHA1 hash.

Parameters:
  • paths – See fastdupes.groupify()
  • limit (int) – Only this many bytes will be counted in the hash. Values which evaluate to False indicate no limit.
Returns:

See fastdupes.groupify()
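
For example, grouping by a hash of only the first HEAD_SIZE bytes (hypothetical paths):

    from fastdupes import HEAD_SIZE, hashClassifier

    groups = hashClassifier(["/tmp/a.bin", "/tmp/b.bin"], limit=HEAD_SIZE)
    # groups maps each SHA1 digest to the set of paths sharing it.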

fastdupes.hashFile(handle, want_hex=False, limit=None, chunk_size=65536)[source]

Generate a hash from a potentially long file. Digesting reads in chunk_size blocks to conserve memory.

Parameters:
  • handle – A file-like object or path to hash from.
  • want_hex (bool) – If True, returned hash will be hex-encoded.
  • limit (int) – Maximum number of bytes to read (rounded up to a multiple of chunk_size)
  • chunk_size (int) – Size of read() operations in bytes.
Return type:

str

Returns:

A binary or hex-encoded SHA1 hash.

Note

It is your responsibility to close any file-like objects you pass in
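
A minimal sketch (hypothetical path):

    from fastdupes import hashFile

    # Hash only the first 16 KiB and ask for a printable digest.
    digest = hashFile("/tmp/example.bin", want_hex=True, limit=16384)
    print(digest)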

fastdupes.main()[source]

The main entry point, compatible with setuptools.

fastdupes.multiglob_compile(globs, prefix=False)[source]

Generate a single “A or B or C” regex from a list of shell globs.

Parameters:
  • globs (iterable of str) – Patterns to be processed by fnmatch.
  • prefix (bool) – If True, then match() will perform prefix matching rather than exact string matching.
Return type:

re.RegexObject
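
For example (hypothetical path):

    from fastdupes import multiglob_compile

    ignore_re = multiglob_compile(["*/.svn", "*/.git"], prefix=True)
    # With prefix=True, anything below an ignored folder also matches.
    print(bool(ignore_re.match("/home/user/project/.git/config")))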

fastdupes.print_defaults()[source]

Pretty-print the contents of DEFAULTS

fastdupes.pruneUI(dupeList, mainPos=1, mainLen=1)[source]

Display a list of files and prompt for ones to be kept.

The user may enter "all" or one or more numbers separated by spaces and/or commas.

Note

It is impossible to accidentally choose to keep none of the displayed files.

Parameters:
  • dupeList (list) – A list of duplicate file paths.
  • mainPos (int) – Used to display “set X of Y”
  • mainLen (int) – Used to display “set X of Y”
Returns:

A list of files to be deleted.

Return type:

list

fastdupes.sizeClassifier(paths, *args, **kwargs)[source]

Sort a file into a group based on on-disk size.

Parameters:
  • paths – See fastdupes.groupify()
  • min_size (int) – Files smaller than this size (in bytes) will be ignored.
Returns:

See fastdupes.groupify()

Todo

Rework the calling of stat() to minimize the number of calls. It’s a fairly significant percentage of the time taken according to the profiler.