The following diagram illustrates how the script currently decides whether to call the Gemini API.
```mermaid
graph TD
    A[Start: Find PDF Files] --> B[detectFileType: Chapter vs Book vs Article]
    B --> C{Check: Already Processed?}
    C -- "Matches processedFormatRegex AND No Long Words" --> D[RECORD: SKIPPED_NO_CHANGE]
    C -- "Does NOT match Regex OR Has Long Words" --> E[Check: Journal Memory]
    E -- "Previous Rename Failed but NewName exists" --> F[RECOVERY: Use Cached Name]
    E -- "No Previous Success or Failed AI Call" --> G[Read PDF Content]
    G --> H[AI COGNITION: Call Gemini API]
    H --> I[Format New Filename]
    I --> J{Compare: New Name vs Old Name}
    J -- "Identical" --> K[RECORD: SKIPPED_NO_CHANGE]
    J -- "Different" --> L[EXECUTE RENAME]
    F --> L
    D --> M[End Process for File]
    K --> M
    L --> M
```
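The branching above can be condensed into a small, self-contained sketch. All names here are illustrative stand-ins for your script's real helpers, and the regex is an assumed placeholder, not your actual pattern:

```javascript
// Hypothetical sketch of the decision flow in the diagram above.
function decide(fileName, memoryEntry, aiSuggestedName) {
  const processedFormatRegex = /^[^_]+_[^_]+_\d{4}\.pdf$/i; // assumed placeholder
  const hasLongWord = fileName
    .split(/[_.]/)
    .some((word) => word.length > 20); // the "abnormal word" check

  // C -> D: already processed and no long words, so skip
  if (processedFormatRegex.test(fileName) && !hasLongWord) {
    return 'SKIPPED_NO_CHANGE';
  }
  // E -> F -> L: a prior failed rename already cached the new name
  if (memoryEntry && memoryEntry.status === 'FAILED' && memoryEntry.newName) {
    return `RENAME:${memoryEntry.newName}`;
  }
  // G..J: the AI call happens elsewhere; compare its suggestion with the current name
  return aiSuggestedName === fileName
    ? 'SKIPPED_NO_CHANGE'
    : `RENAME:${aiSuggestedName}`;
}

console.log(decide('Title_Author_2021.pdf', null, 'Title_Author_2021.pdf')); // SKIPPED_NO_CHANGE
```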
Based on your code, there are three primary reasons why the script is re-processing files:
- **Regex Mismatch for Articles:** Your `processedFormatRegex` looks for `Title_Author_Year`. However, your `formatBaseFilename` for articles appends `_J-Journal` and `_VVolume`. The regex does not account for these extra underscore-separated segments, so it treats the file as "unprocessed."
- **Strict "Abnormal Word" Check:** The function `hasAbnormallyLongWord` checks whether any word (after the PascalCase split) exceeds 20 characters. If a scientific title contains a long word (e.g., "Pathophysiological"), the script flags the file as "broken" and forces a re-analysis.
- **Lack of Content Hashing:** The script relies on the filename alone to decide whether a file is done. If the AI's output drifts slightly on a title (due to temperature or model updates), the filename changes, and the system can get caught in a loop or re-process the file on the next run.
To stop the waste, apply the following changes to your script:

Change the regex in `CONFIG` to account for the article metadata (Journal and Volume).
```javascript
// Replace the old regex with this one:
// It now allows for multiple underscore-separated segments after the year
processedFormatRegex: /^(\d{2,}_)?[^_]+_[^_]+_\d{4}(_[^_]+)*\.pdf$/i,
```

Modify `processEntity` to check whether the file was successfully processed in a previous session by consulting the journal before reading the PDF content.
```javascript
// Inside processEntity (SystemCore class)
const previousAttempt = journalMemory.get(entity.fullPath);

// ADD THIS LOGIC:
if (previousAttempt && previousAttempt.status === 'SUCCESS') {
  logger.info(`Skipping: "${entity.name}" was successfully processed in a previous run.`);
  return null;
}
```

Scientific and academic titles often contain valid long words, so increase the threshold (or exclude known patterns) in `hasAbnormallyLongWord`.
```javascript
// Inside hasAbnormallyLongWord
const MAX_WORD_LENGTH = 30; // Increased from 20 to 30
```

Currently, your `NodeCache` isn't being used to skip AI calls. Store the result of each AI analysis, indexed by a hash of the file's first 1000 characters:
- Generate a hash of `content.substring(0, 1000)`.
- Check the cache first: `if (this.cache.has(hash)) return this.cache.get(hash);`
- Save to the cache: after `cognitionEngine.analyze` succeeds, store the `validatedData` under that hash.
The most immediate fix to stop the waste is to ensure that, when a file already matches the naming pattern, the script returns `null` immediately, without ever calling `entity.readContent()` or `cognitionEngine.analyze()`. Verify that `looksProcessed` actually returns `true` for your successfully renamed files by testing those filenames against the regex.
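One quick way to run that verification is to test the proposed regex against a few representative filenames (the samples below are hypothetical examples, not taken from your library):

```javascript
// The updated regex from the CONFIG fix above.
const processedFormatRegex = /^(\d{2,}_)?[^_]+_[^_]+_\d{4}(_[^_]+)*\.pdf$/i;

const samples = [
  'DeepLearning_Goodfellow_2016.pdf',         // book: matches
  'NeuralScaling_Kaplan_2020_J-ArXiv_V1.pdf', // article with journal/volume: now matches
  '01_Introduction_Smith_2019.pdf',           // numbered chapter: matches
  'scanned_document.pdf',                     // raw scan: should NOT match
];

for (const name of samples) {
  console.log(`${processedFormatRegex.test(name)}  ${name}`);
}
```

Any renamed file that prints `false` here would be re-sent to the API on every run, so this loop is a cheap way to audit the pattern before trusting it.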