Description:
Hello,
I am looking for an experienced Python developer to create a file-sorting script. I have a massive archive (approx. 3TB) of research documents stored on an external hard drive. I need a script that reads the internal text content of these files and automatically copies them into 6 specific folders based on predefined Korean keywords.
[Key Requirements]
Target File Types: .txt, .docx, .pdf, and .hwp (Korean Word Processor). The script must be able to extract text from these formats. Support for Korean encoding (UTF-8, EUC-KR) is mandatory.
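To illustrate the encoding requirement, here is a minimal sketch of how raw bytes from a .txt file could be decoded. The function name decode_korean is hypothetical, and the choice of cp949 (Microsoft's superset of EUC-KR) as the fallback codec is an assumption, not something specified above.

```python
def decode_korean(raw: bytes) -> str:
    """Decode bytes, trying UTF-8 first and then a Korean legacy codec."""
    # "utf-8-sig" behaves like UTF-8 but also drops a leading byte-order mark.
    # "cp949" is assumed here because it accepts everything EUC-KR does.
    for encoding in ("utf-8-sig", "cp949"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep the readable parts and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

UTF-8 is tried first because it is strict: most EUC-KR byte sequences are not valid UTF-8, so a failed UTF-8 decode is a reasonable signal to fall back.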
Content-Based Sorting: The script must read the body text of the files, not just the file names.
Keyword Mapping: I have 6 categories, each with its own list of Korean keywords (e.g., Category 1 includes “음양오행” and “조후”).
Action: The script should COPY (not move/delete) the matching files to a designated “Sorted” folder structure.
Cross-Referencing: If a single document contains keywords from more than one category (for example, both Category 1 and Category 4), the file must be copied into every matching folder.
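The sorting rules above could be sketched roughly as follows. The function names, the folder layout (one subfolder per category under a "Sorted" root), and plain substring matching are all assumptions for illustration. The sketch does not handle two source files that share the same name.

```python
import shutil
from pathlib import Path


def matching_categories(text, keywords_by_category):
    """Return every category that has at least one keyword in the text."""
    return [
        category
        for category, keywords in keywords_by_category.items()
        if any(keyword in text for keyword in keywords)
    ]


def copy_to_categories(source, categories, sorted_root):
    """Copy (never move) the source file into each matching category folder."""
    copied = []
    for category in categories:
        target_dir = Path(sorted_root) / category
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / Path(source).name
        # copy2 leaves the original in place and preserves timestamps.
        shutil.copy2(source, target)
        copied.append(target)
    return copied
```

A document whose text matches keywords in two categories gets two entries back from matching_categories, so it ends up in both folders while the original stays untouched.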
Error Handling & Performance: Since the volume is 3TB, the script must be robust. If a file is corrupted or unreadable, the script should simply skip it, log the error, and continue running without crashing.
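One possible shape for the skip-and-log behaviour is sketched below. The names process_all and handle_file are hypothetical; handle_file stands in for whatever routine extracts text and copies the file.

```python
import logging
from pathlib import Path


def process_all(root, handle_file, logger):
    """Call handle_file on every file under root; log and skip failures."""
    processed, skipped = 0, 0
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            handle_file(path)
            processed += 1
        except Exception:
            # logger.exception records the traceback together with the path,
            # then the loop moves on to the next file instead of crashing.
            logger.exception("Skipped unreadable file: %s", path)
            skipped += 1
    return processed, skipped
```

In a real run the logger would typically write to a file, for instance via logging.basicConfig(filename="errors.log", level=logging.ERROR), so the skipped files can be reviewed afterwards.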
[Deliverables]
No-Code Setup: I am not a programmer. The final deliverable MUST be a standalone Windows executable (.exe) OR a very simple one-click batch file (.bat). I cannot set up complex coding environments.
Config File: Please include a simple configuration file (like config.txt or keywords.json) so I can easily update or add new keywords later without modifying the code.
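As an illustration of the config file idea, the sketch below reads a keywords.json file. The JSON layout (category name mapped to a list of keywords) and the function name load_keywords are assumptions, since the exact format is left open above.

```python
import json

# Assumed layout of keywords.json (hypothetical example):
# {
#   "Category 1": ["음양오행", "조후"],
#   "Category 2": ["..."]
# }


def load_keywords(config_path):
    """Load a {category: [keywords]} mapping from a JSON config file."""
    # "utf-8-sig" tolerates the byte-order mark that Windows Notepad may add.
    with open(config_path, encoding="utf-8-sig") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("keywords.json must contain a JSON object")
    keywords = {}
    for category, words in data.items():
        if not isinstance(words, list):
            raise ValueError(f"Keywords for {category!r} must be a list")
        # Trim stray spaces and drop empty entries left over from editing.
        keywords[category] = [str(w).strip() for w in words if str(w).strip()]
    return keywords
```

With this layout, adding a keyword means editing one line of the JSON file in a text editor, with no change to the script itself.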
Clear Instructions: A brief, step-by-step English manual on how to run it.
[Skills Required]
Python
Text Extraction (PyPDF2, python-docx, olefile/pyhwp for .hwp)
File & Data Management
Experience with CJK (Korean) text encoding is a big plus.
If you understand these requirements, please start your proposal with the word “Sorted” so I know you read the description. Please provide your estimated time of delivery and a brief explanation of how you plan to handle the .hwp and .pdf text extraction.