Post-processing
Post-processing refines the OCR output after the text has been recognized. This step helps correct common OCR errors, remove unwanted characters, and format the text properly before translation.
When to Use Post-processing
Use post-processing when:
OCR consistently misreads certain characters (for example, "l" as "|" or "0" as "O")
You need to remove specific characters or symbols
Text formatting needs adjustment (line breaks, quotation marks)
You want to standardize character patterns
OCR output contains unwanted characters
Regular Expression (RegExp)
Regular Expressions (RegExp) are patterns used to search and manipulate text. VNTranslator supports two types of RegExp operations:
1. RegExp Matching
Identifies and extracts specific text patterns from the OCR output. Only text that matches the pattern will be kept.
Use cases:
Extract only Japanese characters and ignore other symbols
Keep only specific language characters
Remove everything except the main dialogue text
Example:
This pattern matches and extracts only Japanese characters (Kanji, Hiragana, Katakana, and Japanese symbols).
["[一-龠]+|[ぁ-ゔ]+|[ァ-ヴー]+|[々〆〤]+|[⺀-⿕]+|[、-〿]+|[ㇰ-ㇿ㈠-㉃㊀-㍿]+", "gmu"]
For more details, see RegExp Matching.
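To make the "only matching text is kept" behavior concrete, here is a minimal JavaScript sketch of how a matching rule of the form [pattern, flags] could be evaluated. The helper name applyMatchRule is hypothetical, and joining the matched pieces by plain concatenation is an assumption; VNTranslator's actual implementation may combine matches differently.

```javascript
// Hypothetical helper (not VNTranslator's API): applies one matching rule
// of the form [pattern, flags] and keeps only the text that matches.
function applyMatchRule(text, rule) {
  const [pattern, flags] = rule;
  const re = new RegExp(pattern, flags);
  // With the "g" flag, match() returns all matches, or null when there are none.
  const matches = text.match(re);
  // Assumption: matched pieces are concatenated without a separator.
  return matches ? matches.join("") : "";
}

// The Japanese-only rule from the example above.
const japaneseOnlyRule = [
  "[一-龠]+|[ぁ-ゔ]+|[ァ-ヴー]+|[々〆〤]+|[⺀-⿕]+|[、-〿]+|[ㇰ-ㇿ㈠-㉃㊀-㍿]+",
  "gmu",
];

const rawMatchInput = "|| こんにちは、世界。 ##";
console.log(applyMatchRule(rawMatchInput, japaneseOnlyRule));
// prints "こんにちは、世界。"
```

In this sketch the stray "|", "#", and ASCII spaces are dropped because none of the character ranges in the pattern cover them.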
2. RegExp Replacement (Search & Replace)
Searches for specific text patterns and replaces them with other text. This is the most commonly used post-processing technique.
Use cases:
Fix common OCR recognition errors
Replace wrong quotation marks with correct ones
Remove unwanted characters or symbols
Normalize text formatting
Fix line breaks and spacing issues
Common Examples:
Replace quotation marks:
["『", "g", "「"]
["』", "g", "」"]
Remove music symbols:
["♪", "g", ""]
Fix ellipsis:
["。。。", "g", "..."]
Remove line breaks:
["(\r\n|\n|\r)", "gm", " "]
Fix a common OCR error (a vertical bar "|" recognized in place of "I"):
["\\|", "g", "I"]
For more details, see RegExp Replacement.
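The replacement rules above can be read as [pattern, flags, replacement] triples. The following JavaScript sketch shows one way such a list could be applied. The helper name applyReplaceRules is hypothetical, and running the rules one after another in list order is an assumption, not a description of VNTranslator's internals.

```javascript
// Hypothetical helper (not VNTranslator's API): applies replacement rules of
// the form [pattern, flags, replacement] sequentially, in list order.
function applyReplaceRules(text, rules) {
  return rules.reduce(
    (current, [pattern, flags, replacement]) =>
      current.replace(new RegExp(pattern, flags), replacement),
    text
  );
}

// The common examples from this section, collected into one list.
const exampleReplaceRules = [
  ["『", "g", "「"],               // replace quotation marks
  ["』", "g", "」"],
  ["♪", "g", ""],                  // remove music symbols
  ["。。。", "g", "..."],          // fix ellipsis
  ["(\r\n|\n|\r)", "gm", " "],     // replace line breaks with a space
  ["\\|", "g", "I"],               // "|" recognized in place of "I"
];

const rawReplaceInput = "『そうだね♪』\n| think so。。。";
console.log(applyReplaceRules(rawReplaceInput, exampleReplaceRules));
// prints "「そうだね」 I think so..."
```

If rules are applied in sequence as assumed here, order matters: a later rule sees the output of the earlier ones, so a rule that removes a character should come before any rule whose pattern depends on that character being absent.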