Python Script to Auto-Sort 3TB of Documents (PDF, DOCX, TXT, HWP) Based on Korea

  • #3960345
    David Daehyong Hong 73.***.251.52 46

    city and county san francisco

    Description:
    Hello,
    I am looking for an experienced Python developer to create a file-sorting script. I have a massive archive (approx. 3TB) of research documents stored on an external hard drive. I need a script that reads the internal text content of these files and automatically copies them into 6 specific folders based on predefined Korean keywords.
    [Key Requirements]
    Target File Types: .txt, .docx, .pdf, and .hwp (Korean Word Processor). The script must be able to extract text from all four formats. Support for Korean text encodings (UTF-8 and EUC-KR) is mandatory.
    Content-Based Sorting: The script must read the body text of the files, not just the file names.
    Keyword Mapping: I have 6 categories, each with its own list of Korean keywords (for example, Category 1 includes “음양오행” and “조후”).
    Action: The script should COPY (not move/delete) the matching files to a designated “Sorted” folder structure.
    Cross-Referencing: If a single document contains keywords belonging to both Category 1 and Category 4, the file must be copied into both folders.
    Error Handling & Performance: Since the volume is 3TB, the script must be robust. If a file is corrupted or unreadable, the script should simply skip it, log the error, and continue running without crashing.
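To make the requirements above concrete, here is a minimal sketch of the behavior for plain .txt files only, using the Python standard library. The names `CATEGORIES`, `decode_korean`, `matching_categories`, and `sort_txt_file` are hypothetical, and the "Category4" keyword is a placeholder rather than one of the real keywords. It assumes a UTF-8 attempt followed by a CP949 fallback (CP949 is a superset of EUC-KR) is good enough for this archive.

```python
import logging
import shutil
from pathlib import Path

logger = logging.getLogger("sorter")

# Hypothetical sample mapping. "예시" is a placeholder, not a real keyword.
CATEGORIES = {
    "Category1": ["음양오행", "조후"],
    "Category4": ["예시"],
}


def decode_korean(raw):
    """Decode bytes as UTF-8 (with or without BOM), else as CP949/EUC-KR."""
    for encoding in ("utf-8-sig", "cp949"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("text is neither UTF-8 nor EUC-KR/CP949")


def matching_categories(text, categories):
    """Return every category that has at least one keyword in the text."""
    return [
        name
        for name, keywords in categories.items()
        if any(keyword in text for keyword in keywords)
    ]


def sort_txt_file(src, sorted_root, categories=CATEGORIES):
    """Copy src into one folder per matching category; skip and log on error."""
    try:
        text = decode_korean(Path(src).read_bytes())
        matches = matching_categories(text, categories)
        for name in matches:
            dest_dir = Path(sorted_root) / name
            dest_dir.mkdir(parents=True, exist_ok=True)
            # copy2 copies; the original file is left untouched.
            shutil.copy2(src, dest_dir / Path(src).name)
        return matches
    except Exception as exc:  # unreadable or corrupted file: log and move on
        logger.error("Skipped %s: %s", src, exc)
        return []
```

One limitation of this sketch: two source files with the same name in different folders would overwrite each other in the destination, so a real script would need a naming scheme for collisions.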
    [Deliverables]
    No-Code Setup: I am not a programmer. The final deliverable MUST be a standalone Windows executable (.exe) OR a very simple one-click batch file (.bat). I cannot set up complex coding environments.
    Config File: Please include a simple configuration file (like config.txt or keywords.json) so I can easily update or add new keywords later without modifying the code.
    Clear Instructions: A brief, step-by-step English manual on how to run it.
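As an illustration of the config file idea, the sketch below loads a hypothetical keywords.json that maps each category name to a list of keywords, for example {"Category1": ["음양오행", "조후"]}. The file layout and the function name `load_keywords` are assumptions, not a fixed format. Reading with `utf-8-sig` is meant to tolerate the byte-order mark that some Windows editors add when saving as UTF-8.

```python
import json
from pathlib import Path


def load_keywords(config_path):
    """Load {category: [keyword, ...]} from a UTF-8 JSON file and check its shape."""
    data = json.loads(Path(config_path).read_text(encoding="utf-8-sig"))
    if not isinstance(data, dict):
        raise ValueError("config must be a JSON object of category -> keyword list")
    cleaned = {}
    for category, keywords in data.items():
        if not isinstance(keywords, list) or not all(
            isinstance(keyword, str) for keyword in keywords
        ):
            raise ValueError(f"keywords for {category!r} must be a list of strings")
        # Drop surrounding whitespace and empty entries left by manual editing.
        cleaned[category] = [k.strip() for k in keywords if k.strip()]
    return cleaned
```

With this layout, adding a keyword later means editing one line of the JSON file and rerunning the tool; no code changes are involved.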
    [Skills Required]
    Python
    Text Extraction (PyPDF2 or its successor pypdf, python-docx, olefile/pyhwp for .hwp)
    File & Data Management
    Experience with CJK (Korean) text encoding is a big plus.
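For reference, the kind of text extraction meant here can be sketched for .docx without any third-party package, since a .docx file is a ZIP archive whose body text lives in word/document.xml. This is a standard-library stand-in for python-docx, not a replacement for it, and the function name `docx_text` is hypothetical. It reads the main body only (no headers, footers, or footnotes); .pdf and .hwp are not covered.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path_or_file):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join them back together
        # so keywords that span several runs still match.
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

A corrupted file raises `zipfile.BadZipFile` (or `KeyError` if the document part is missing), which the caller can catch, log, and skip.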
    If you understand these requirements, please start your proposal with the word “Sorted” so I know you read the description. Please provide your estimated time of delivery and a brief explanation of how you plan to handle the .hwp and .pdf text extraction.

    • . 172.***.162.177

      How much will you pay for the proposal and the final deliverable?