
How Duplicate Detection Works

Sorty identifies duplicate files using content-based hashing (SHA-256), not just filename matching. This means:
  • Files with identical content are grouped, regardless of name
  • Renamed files are detected as duplicates
  • Files in different locations are found
  • Safe deletion with recovery options
Duplicates are found by computing a cryptographic hash of each file's contents, so files are only grouped when they are byte-for-byte identical.

Detection Methods

Sorty offers three comparison methods: Exact (SHA-256), Fast (Name + Size), and Metadata (Name + Size + Date).

Exact (Most Accurate)

Computes the SHA-256 hash of the entire file contents:
// Requires CryptoKit (import CryptoKit) and Foundation (Data, URL).
public static func computeSHA256(for url: URL) -> String? {
    // Returns nil if the file cannot be read.
    guard let data = try? Data(contentsOf: url) else { return nil }
    let hash = SHA256.hash(data: data)
    // Hex-encode the 32-byte digest.
    return hash.map { String(format: "%02x", $0) }.joined()
}
Pros:
  • 100% accurate
  • Detects renamed files
  • Cryptographically secure
Cons:
  • Slower for large files
  • CPU-intensive
Use Exact mode for critical files. Use Fast mode for quick scans of large directories.
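For contrast with the hash-based Exact mode, the Fast (Name + Size) method can be sketched as a simple grouping key. The type and function names below are assumptions, not Sorty's actual implementation:

```swift
import Foundation

// Hypothetical sketch of the Fast (Name + Size) comparison key: files
// are treated as duplicate candidates when their file name and byte
// size match, with no content hashing at all.
struct FastKey: Hashable {
    let name: String
    let size: Int64
}

func fastKey(forPath path: String, size: Int64) -> FastKey {
    // Compare by last path component so the directory is ignored.
    FastKey(name: (path as NSString).lastPathComponent, size: size)
}
```

This is why Fast mode is quicker but can produce false positives: two different files can share a name and size.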

Duplicate Groups

Files are grouped by hash:
public struct DuplicateGroup: Identifiable {
    public let id: UUID
    public let hash: String
    public let files: [FileItem]
    public let totalSize: Int64
    public let potentialSavings: Int64 // Size - one copy
    
    public var duplicateCount: Int {
        max(0, files.count - 1)
    }
}
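Since exact duplicates have identical contents, every copy in a group has the same size, and the savings arithmetic reduces to "all copies minus the one you keep". A minimal sketch (the helper name is an assumption):

```swift
// Hypothetical sketch: totalSize sums every copy, and potentialSavings
// is the total minus the single copy that is kept.
func groupSavings(copyCount: Int, copySize: Int64) -> (total: Int64, savings: Int64) {
    let total = Int64(copyCount) * copySize
    // Keep one copy; everything else is reclaimable.
    let savings = copyCount > 1 ? total - copySize : 0
    return (total, savings)
}
```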

Example Group

Group: 3 files (2 duplicates)
Hash: a7f3b9c8...
Total Size: 15.2 MB
Potential Savings: 10.1 MB (keeping 1, removing 2)

Files:
  1. /Users/me/Downloads/photo.jpg (5.1 MB) ← Oldest
  2. /Users/me/Desktop/photo.jpg (5.1 MB)
  3. /Users/me/Photos/photo.jpg (5.1 MB)

Semantic Duplicates

In addition to exact matches, Sorty can detect semantic duplicates (similar but not identical):
public struct SemanticDuplicateGroup: Identifiable {
    public let files: [FileItem]
    public let similarity: Double // 0.0-1.0
    public let groupType: GroupType
    public let recommendation: DuplicateRecommendation
}

public enum GroupType: String {
    case nearDuplicate = "Near Duplicate"
    case versionedFile = "Versioned File"
    case resizedImage = "Resized Image"
    case reencoded = "Re-encoded"
}

Semantic Detection Methods

Resized Image

Detects images with the same content at different resolutions:
  • Compare aspect ratios
  • Check EXIF similarity
  • Visual similarity (if vision mode enabled)
Example:
photo_original.jpg (4032x3024)
photo_thumbnail.jpg (800x600)
Similarity: 95%
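The aspect-ratio step above can be sketched as a tolerance comparison. The function name and tolerance value are assumptions for illustration:

```swift
// Hypothetical sketch of the resized-image aspect-ratio check: two
// resolutions are candidates when their width/height ratios agree
// within a small tolerance.
func aspectRatiosMatch(_ a: (w: Int, h: Int), _ b: (w: Int, h: Int),
                       tolerance: Double = 0.01) -> Bool {
    let ra = Double(a.w) / Double(a.h)
    let rb = Double(b.w) / Double(b.h)
    return abs(ra - rb) <= tolerance
}
```

Both 4032x3024 and 800x600 are 4:3, so they pass this check and move on to EXIF or visual comparison.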
Versioned File

Identifies file versions by name patterns:
  • file_v1.pdf, file_v2.pdf
  • report_draft.docx, report_final.docx
  • design_2025-01-01.psd, design_2025-01-15.psd
Recommendation: Keep newest or largest
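The name-pattern matching above can be sketched by stripping a trailing version marker so related versions share a base name. The patterns and function name here are illustrative assumptions only:

```swift
import Foundation

// Hypothetical sketch of versioned-file matching: remove a trailing
// version marker (_v1, _draft, _final, or a date) from the file stem
// so that versions of the same document group together.
func versionBaseName(_ fileName: String) -> String {
    let stem = (fileName as NSString).deletingPathExtension
    let pattern = "(_v\\d+|_draft|_final|_\\d{4}-\\d{2}-\\d{2})$"
    return stem.replacingOccurrences(of: pattern, with: "", options: .regularExpression)
}
```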
Re-encoded

Detects media files with similar content but different encoding:
  • Compare duration (videos/audio)
  • Check bitrate differences
  • Detect transcoded files
Example:
song.flac (30 MB, lossless)
song.mp3 (5 MB, 320kbps)
Similarity: 88%
Near Duplicate

Finds files with minor differences:
  • Slight edits
  • Cropped images
  • Compressed versions
Requires manual review.

Semantic Similarity Threshold

Configure in Settings → Duplicates:
public var semanticSimilarityThreshold: Double = 0.85 // 85% similar
Lower threshold = more matches (higher false positive rate). Higher threshold = fewer matches (more conservative).
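Settings store the threshold as an integer percentage (0-100) while the detector works with a 0.0-1.0 value, which is what `normalizedSemanticSimilarityThreshold` implies. The body below is an assumption about that conversion:

```swift
// Sketch of the 0-100 -> 0.0-1.0 normalization implied by
// `normalizedSemanticSimilarityThreshold`; clamps out-of-range input.
func normalizedThreshold(_ percent: Int) -> Double {
    Double(min(max(percent, 0), 100)) / 100.0
}
```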

Unified Duplicate Groups

Exact and semantic duplicates are presented together:
public enum UnifiedDuplicateGroup: Identifiable {
    case exact(DuplicateGroup)
    case semantic(SemanticDuplicateGroup)
    
    public var confidenceLevel: ConfidenceLevel {
        switch self {
        case .exact:
            return .high
        case .semantic(let group):
            if group.similarity >= 0.98 { return .high }
            else if group.similarity >= 0.90 { return .medium }
            else { return .low }
        }
    }
}

public enum ConfidenceLevel: String {
    case high = "Safe to Merge"
    case medium = "Review Suggested"
    case low = "Manual Review"
}

Safe Deletion

When enabled (recommended), “deleted” duplicates aren’t immediately removed:
1. Mark for Deletion: Files are flagged but not deleted from disk.
2. Track in History: The deletion is recorded in organization history.
3. Restore if Needed: Go to History → find the cleanup session → click Restore.
4. Confirm Deletion: Only after confirmation are files permanently removed.
Disabling Safe Deletion means files are immediately sent to Trash and cannot be recovered through Sorty.
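The safe-deletion steps can be sketched as a small state model. All names below are hypothetical, not Sorty's actual types:

```swift
// Hypothetical sketch of the safe-deletion lifecycle: a flagged file is
// either restored from history or, on confirmation, permanently removed.
enum DeletionState { case marked, restored, confirmed }

struct PendingDeletion {
    let path: String
    var state: DeletionState = .marked

    mutating func restore() { state = .restored }
    mutating func confirm() { state = .confirmed }
}
```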

Bulk Operations

Quick actions for managing duplicates:

Delete All (Keep Newest)

Keeps the most recently modified file:
let newest = group.files.max { ($0.modificationDate ?? .distantPast) < ($1.modificationDate ?? .distantPast) }
let toDelete = group.files.filter { $0.id != newest?.id }

Delete All (Keep Oldest)

Keeps the original (oldest) file:
let oldest = group.files.min { ($0.creationDate ?? .distantFuture) < ($1.creationDate ?? .distantFuture) }
let toDelete = group.files.filter { $0.id != oldest?.id }

Delete All (Keep Largest)

Keeps the file with the largest size (e.g., highest quality image):
let largest = group.files.max { $0.size < $1.size }
let toDelete = group.files.filter { $0.id != largest?.id }

Custom Selection

Manually select which files to keep/delete:
  1. Review each duplicate group
  2. Select files to delete (checkboxes)
  3. Click Delete Selected

Duplicate Detection Manager

Manages the scanning process:
@MainActor
public class DuplicateDetectionManager: ObservableObject {
    @Published public var state: DuplicateScanState = .idle
    @Published public var duplicateGroups: [DuplicateGroup] = []
    @Published public var semanticGroups: [SemanticDuplicateGroup] = []
    @Published public var scanProgress: Double = 0
    
    public var totalDuplicates: Int {
        duplicateGroups.reduce(0) { $0 + $1.duplicateCount }
    }
    
    public var potentialSavings: Int64 {
        duplicateGroups.reduce(0) { $0 + $1.potentialSavings }
    }
}

Scan Process

1. Preparing: Initialize the scan state and clear previous results.
2. Computing Hashes: Calculate the SHA-256 for each file:
for i in 0..<files.count {
    if files[i].sha256Hash == nil {
        files[i].sha256Hash = HashUtility.computeSHA256(
            for: URL(fileURLWithPath: files[i].path)
        )
    }
    scanProgress = Double(i + 1) / Double(files.count)
}
3. Grouping: Group files by hash:
var hashGroups: [String: [FileItem]] = [:]
for file in files {
    guard let hash = file.sha256Hash else { continue }
    hashGroups[hash, default: []].append(file)
}

let duplicates = hashGroups
    .filter { $0.value.count > 1 }
    .map { DuplicateGroup(hash: $0.key, files: $0.value) }
    .sorted { $0.potentialSavings > $1.potentialSavings }
4. Semantic Analysis (Optional): If enabled, run semantic duplicate detection:
if settings.includeSemanticDuplicates {
    let semanticDetector = SemanticDuplicateDetector(
        similarityThreshold: settings.normalizedSemanticSimilarityThreshold
    )
    semanticGroups = await semanticDetector.findSemanticDuplicates(in: files)
}
5. Complete: Update the state and display results.

Scan Settings

public struct DuplicateSettings: Codable {
    public var comparisonMethod: ComparisonMethod = .exact
    public var includeSemanticDuplicates: Bool = false
    public var semanticSimilarityThreshold: Int = 85 // 0-100
    public var safeDeletion: Bool = true
    public var scanHiddenFiles: Bool = false
}

public enum ComparisonMethod: String, Codable {
    case exact = "Exact (SHA-256)"
    case fast = "Fast (Name + Size)"
    case metadata = "Metadata (Name + Size + Date)"
}

Performance Optimization

Hash Caching

Hashes are cached in FileItem.sha256Hash:
if files[i].sha256Hash == nil {
    files[i].sha256Hash = computeHash()
}
Re-scanning the same directory uses cached hashes, making subsequent scans much faster.

Incremental Progress

UI updates are yielded periodically:
if i % 10 == 0 {
    await Task.yield() // Let UI update
}

Cancellation Support

Scans can be cancelled mid-process:
if Task.isCancelled {
    isScanning = false
    state = .idle
    return
}

Potential Savings Display

Formatted savings with human-readable units:
public var formattedSavings: String {
    ByteCountFormatter.string(fromByteCount: potentialSavings, countStyle: .file)
}

// Examples:
// 1.5 GB
// 245.3 MB
// 12.8 KB

Integration with Organization

Duplicate detection runs automatically during organization:
private func duplicateDetectionPhase(files: [FileItem]) async throws -> ([FileItem], String) {
    updateProgress(0.21, stage: "Checking for duplicates...")
    
    let detector = DuplicateDetector()
    var updatedFiles = files
    if updatedFiles.contains(where: { $0.sha256Hash == nil }) {
        await detector.computeHashes(for: &updatedFiles)
    }
    
    let duplicates = await detector.findDuplicates(in: updatedFiles)
    await MainActor.run {
        self.detectedDuplicates = duplicates
    }
    
    if aiConfig?.detectDuplicates ?? true {
        return (updatedFiles, PromptContextHelper.duplicateContext(from: duplicates))
    }
    
    return (updatedFiles, "")
}

Duplicate Context in AI Prompt

Duplicates are included in the AI’s organization context:
DUPLICATE FILES DETECTED:
- 5 groups of duplicate files found
- Total potential savings: 2.3 GB

Group 1: photo.jpg (3 copies)
  /Downloads/photo.jpg
  /Desktop/photo.jpg
  /Photos/photo.jpg

Consider organizing these files together and removing duplicates.

Duplicate Handling Strategies

Leave all copies where they are and flag them for manual review.

CLI Commands

# Scan for duplicates
sorty duplicates /path/to/scan

# Auto-start scan
sorty duplicates /path/to/scan --auto

# Use specific comparison method
sorty duplicates /path/to/scan --method exact
sorty duplicates /path/to/scan --method fast
Deeplink                                                      Description
sorty://duplicates                                            Open duplicates view
sorty://duplicates?path=/Users/me/Downloads                   Scan specific path
sorty://duplicates?path=/Users/me/Downloads&autostart=true    Auto-start scan
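The deeplinks above can be parsed with Foundation's URLComponents. The function name and return shape are assumptions, not Sorty's actual handler:

```swift
import Foundation

// Sketch: extract the path and autostart flag from a duplicates
// deeplink; returns nil for URLs that are not sorty://duplicates.
func parseDuplicatesLink(_ link: String) -> (path: String?, autostart: Bool)? {
    guard let comps = URLComponents(string: link),
          comps.scheme == "sorty", comps.host == "duplicates" else { return nil }
    let items = comps.queryItems ?? []
    let path = items.first { $0.name == "path" }?.value
    let autostart = items.first { $0.name == "autostart" }?.value == "true"
    return (path, autostart)
}
```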
