API Documentation
Find Dupes Fast, by Stephan Sokolow (ssokolow.com)
A simple script which identifies duplicate files several orders of magnitude more quickly than fdupes by using smarter algorithms.
Todo: Figure out how to do ePyDoc-style grouping here without giving up automodule-level comfort.
fastdupes.CHUNK_SIZE = 65536
Size for chunked reads from file handles.
fastdupes.DEFAULTS = {'min_size': 25, 'exclude': ['*/.svn', '*/.bzr', '*/.git', '*/.hg'], 'delete': False}
Default settings used by optparse and some functions.
fastdupes.HEAD_SIZE = 16384
Limit for how many bytes will be read to compare headers.
-
class
fastdupes.
OverWriter
(fobj)[source]¶ Bases:
object
Output helper for handling overdrawing the previous line cleanly.
write(text, newline=False)
Use \r to overdraw the current line with the given text.

This function transparently tracks how much overdrawing is necessary to erase the previous line when used consistently.

Parameters:
- text (str) – The text to be output.
- newline (bool) – Whether to start a new line and reset the length count.
fastdupes.compareChunks(handles, chunk_size=65536)
Group a list of file handles based on equality of the next chunk of data read from them.

Parameters:
- handles – A list of open handles for file-like objects with potentially-identical contents.
- chunk_size – The amount of data to read from each handle every time this function is called.

Returns: Two lists of lists:
- Lists to be fed back into this function individually
- Finished groups of duplicate paths (including unique files as single-file lists)

Return type: (list, list)

Attention: File handles will be closed when no longer needed.

Todo: Discard chunk contents immediately once they're no longer needed.
fastdupes.delete_dupes(groups, prefer_list=None, interactive=True, dry_run=False)
Code to handle the --delete command-line option.

Parameters:
- groups (iterable) – A list of groups of paths.
- prefer_list – A whitelist to be compiled by multiglob_compile() and used to skip some prompts.
- interactive (bool) – If False, assume the user wants to keep all copies when a prompt would otherwise be displayed.
- dry_run (bool) – If True, only pretend to delete files.

Todo: Add a secondary check for symlinks for safety.
fastdupes.find_dupes(paths, exact=False, ignores=None, min_size=0)
High-level code to walk a set of paths and find duplicate groups.

Parameters:
- exact (bool) – Whether to compare file contents by hash or by reading chunks in parallel.
- paths – See getPaths()
- ignores – See getPaths()
- min_size – See sizeClassifier()

Returns: A list of groups of files with identical contents.
Return type: [[path, ...], [path, ...]]
fastdupes.getPaths(roots, ignores=None)
Recursively walk a set of paths and return a listing of contained files.

Parameters:
- roots (list of str) – Relative or absolute paths to files or folders.
- ignores (list of str) – A list of fnmatch globs to avoid walking and omit from results.

Returns: Absolute paths to files only.
Return type: list of str

Todo: Try to optimize the ignores matching. Running a regex on every filename is a fairly significant percentage of the time taken according to the profiler.
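A minimal sketch of the walk described above, assuming fnmatch-style ignore semantics. The `get_paths` name and the in-place pruning of ignored directories are illustrative assumptions, not the module's actual code.

```python
import fnmatch
import os
import re

def get_paths(roots, ignores=None):
    """Walk `roots` and return absolute paths to files only."""
    # Combine the ignore globs into one regex; r'(?!)' never matches.
    ignore_re = re.compile('|'.join(
        fnmatch.translate(g) for g in (ignores or [])) or r'(?!)')
    paths = []
    for root in roots:
        root = os.path.realpath(root)
        if os.path.isfile(root):
            paths.append(root)
            continue
        for dirpath, dirnames, filenames in os.walk(root):
            # Prune ignored directories in place so os.walk skips them.
            dirnames[:] = [d for d in dirnames
                           if not ignore_re.match(os.path.join(dirpath, d))]
            paths.extend(os.path.join(dirpath, f) for f in filenames
                         if not ignore_re.match(os.path.join(dirpath, f)))
    return paths
```

Pruning `dirnames` in place is what lets the walk avoid descending into ignored trees rather than merely filtering their contents afterward.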
fastdupes.groupBy(groups_in, classifier, fun_desc='?', keep_uniques=False, *args, **kwargs)
Subdivide groups of paths according to a function.

Parameters:
- groups_in (dict of iterables) – Grouped sets of paths.
- classifier (function(list, *args, **kwargs) -> str) – Function to group a list of paths by some attribute.
- fun_desc (str) – Human-readable term for what the classifier operates on. (Used in log messages.)
- keep_uniques (bool) – If False, discard groups with only one member.

Returns: A dict mapping classifier keys to groups of matches.
Return type: dict

Attention: Grouping functions generally use a set for groups as extra protection against accidentally counting a given file twice. (Complementary to the use of os.path.realpath() in getPaths().)

Todo: Find some way to bring back the file-by-file status text.
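The regrouping step can be sketched like this. It is a hypothetical illustration of the documented contract, not the module's code; it assumes the classifier takes a list of paths and returns a dict of key-to-group mappings, as groupify()-wrapped classifiers do.

```python
def group_by(groups_in, classifier, fun_desc='?', keep_uniques=False,
             *args, **kwargs):
    """Subdivide each existing group by the classifier's keys.

    `fun_desc` is only used for log messages in the real tool and is
    unused in this sketch.
    """
    groups_out = {}
    for paths in groups_in.values():
        for key, group in classifier(paths, *args, **kwargs).items():
            # Sets guard against counting the same path twice.
            groups_out.setdefault(key, set()).update(group)
    if not keep_uniques:
        groups_out = {k: v for k, v in groups_out.items() if len(v) > 1}
    return groups_out
```

Dropping single-member groups by default is what lets each successive pass (size, then header hash, then full comparison) shrink the candidate set.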
fastdupes.groupByContent(paths)
Byte-for-byte comparison on an arbitrary number of files in parallel.

This operates by opening all files in parallel and comparing them chunk-by-chunk. This has the following implications:
- Reads the same total amount of data as hash comparison.
- Performs a lot of disk seeks. (Best suited for SSDs.)
- Vulnerable to file handle exhaustion if used on its own.

Parameters: paths (iterable) – List of potentially identical files.
Returns: A dict mapping one path to a list of all paths (self included) with the same contents.

Todo: Start examining the while handles: block to figure out how to minimize thrashing in situations where read-ahead caching is active. Compare the savings from read-ahead to the savings from eliminating false positives as quickly as possible. This is a two-variable min/max problem.

Todo: Look into possible solutions for pathological cases of thousands of files with the same size and same pre-filter results. (File handle exhaustion.)
fastdupes.groupify(function)
Decorator to convert a function which takes a single value and returns a key into one which takes a list of values and returns a dict of key-group mappings.

Parameters: function (function(value) -> key) – A function which takes a value and returns a hash key.
Return type: function(iterable) -> {key: set([value, ...]), ...}
fastdupes.hashClassifier(paths, *args, **kwargs)
Sort a file into a group based on its SHA1 hash.

Parameters:
- paths – See fastdupes.groupify()
- limit (int) – Only this many bytes will be counted in the hash. Values which evaluate to False indicate no limit.
fastdupes.hashFile(handle, want_hex=False, limit=None, chunk_size=65536)
Generate a hash from a potentially long file. Digesting will obey CHUNK_SIZE to conserve memory.

Parameters:
- handle – A file-like object or path to hash from.
- want_hex (bool) – If True, the returned hash will be hex-encoded.
- limit (int) – Maximum number of bytes to read (rounded up to a multiple of CHUNK_SIZE).
- chunk_size (int) – Size of read() operations in bytes.

Returns: A binary or hex-encoded SHA1 hash.
Return type: str

Note: It is your responsibility to close any file-like objects you pass in.
fastdupes.multiglob_compile(globs, prefix=False)
Generate a single "A or B or C" regex from a list of shell globs.
fastdupes.pruneUI(dupeList, mainPos=1, mainLen=1)
Display a list of files and prompt for ones to be kept.

The user may enter all or one or more numbers separated by spaces and/or commas.

Note: It is impossible to accidentally choose to keep none of the displayed files.

Parameters:
- dupeList (list) – A list of duplicate file paths.
- mainPos (int) – Used to display "set X of Y".
- mainLen (int) – Used to display "set X of Y".

Returns: A list of files to be deleted.
Return type: list
fastdupes.sizeClassifier(paths, *args, **kwargs)
Sort a file into a group based on its on-disk size.

Parameters:
- paths – See fastdupes.groupify()
- min_size (int) – Files smaller than this size (in bytes) will be ignored.

Todo: Rework the calling of stat() to minimize the number of calls. It's a fairly significant percentage of the time taken according to the profiler.