Post-processing

Post-processing refines the OCR output after the text has been recognized. This step helps correct common OCR errors, remove unwanted characters, and format the text properly before translation.

Note: Post-processing is useful for all OCR engine types. Even modern and AI-based OCR engines may produce text that needs formatting or correction.

When to Use Post-processing

Use post-processing when:

  • OCR consistently misreads certain characters (for example, "l" as "|" or "0" as "O")

  • You need to remove specific characters or symbols

  • Text formatting needs adjustment (line breaks, quotation marks)

  • You want to standardize character patterns

  • OCR output contains unwanted characters

Regular Expression (RegExp)

Regular Expressions (RegExp) are patterns used to search and manipulate text. VNTranslator supports two types of RegExp operations:

1. RegExp Matching

Identifies and extracts specific text patterns from the OCR output. Only text that matches the pattern will be kept.

Use cases:

  • Extract only Japanese characters and ignore other symbols

  • Keep only specific language characters

  • Remove everything except the main dialogue text

Example:

This rule, written as a [pattern, flags] pair, matches and extracts only Japanese characters (Kanji, Hiragana, Katakana, and Japanese symbols).

["[一-龠]+|[ぁ-ゔ]+|[ァ-ヴー]+|[々〆〤]+|[⺀-⿕]+|[、-〿]+|[ㇰ-ㇿ㈠-㉃㊀-㍿]+", "gmu"]
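
To make the "only matching text is kept" behavior concrete, here is a minimal JavaScript sketch of how a [pattern, flags] rule could be applied. The helper name applyMatchRule and the choice to join the matches with no separator are assumptions made for illustration. They are not taken from VNTranslator, which may combine matches differently.

```javascript
// Hypothetical helper (not part of VNTranslator): keep only the parts of
// `text` that match a [pattern, flags] rule.
function applyMatchRule(text, rule) {
  const [pattern, flags] = rule;
  const regex = new RegExp(pattern, flags);
  // With the "g" flag, match() returns an array of every match,
  // or null when nothing matches.
  const matches = text.match(regex);
  // Assumption: matches are joined back together without a separator.
  return matches ? matches.join("") : "";
}

// The rule from the example above.
const japaneseOnly = [
  "[一-龠]+|[ぁ-ゔ]+|[ァ-ヴー]+|[々〆〤]+|[⺀-⿕]+|[、-〿]+|[ㇰ-ㇿ㈠-㉃㊀-㍿]+",
  "gmu",
];

// Sample OCR output with stray symbols around the dialogue.
const ocrOutput = "|| こんにちは、世界。 ##";
console.log(applyMatchRule(ocrOutput, japaneseOnly));
// → こんにちは、世界。
```

In this sketch the vertical bars, spaces, and "#" characters fall outside every character range in the pattern, so they are dropped, while the Japanese punctuation is kept because it falls inside the [、-〿] range.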

For more details, see RegExp Matching.

2. RegExp Replacement (Search & Replace)

Searches for specific text patterns and replaces them with other text. This is the most commonly used post-processing technique.

Use cases:

  • Fix common OCR recognition errors

  • Replace wrong quotation marks with correct ones

  • Remove unwanted characters or symbols

  • Normalize text formatting

  • Fix line breaks and spacing issues

Common examples (each rule is written as [pattern, flags, replacement]):

Replace quotation marks:

["『", "g", "「"]
["』", "g", "」"]

Remove music symbols:

["♪", "g", ""]

Fix an ellipsis recognized as three ideographic full stops:

["。。。", "g", "..."]

Replace line breaks with a space:

["(\r\n|\n|\r)", "gm", " "]

Fix a common OCR error (a vertical bar "|" recognized in place of "I"):

["\\|", "g", "I"]
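
The rules above can be combined into a single pass over the OCR output. The following JavaScript sketch assumes that rules are applied one after another, in the order listed. The helper name applyReplaceRules and the sample text are hypothetical and only illustrate the idea; VNTranslator's actual processing order is not confirmed here.

```javascript
// Hypothetical helper (not part of VNTranslator): apply a list of
// [pattern, flags, replacement] rules in order.
function applyReplaceRules(text, rules) {
  return rules.reduce(
    (result, [pattern, flags, replacement]) =>
      result.replace(new RegExp(pattern, flags), replacement),
    text
  );
}

// The example rules from this section.
const rules = [
  ["『", "g", "「"],
  ["』", "g", "」"],
  ["♪", "g", ""],
  ["。。。", "g", "..."],
  ["(\r\n|\n|\r)", "gm", " "],
  // "\\|" is a string escape: the resulting pattern is \| which matches
  // a literal "|" instead of acting as the alternation operator.
  ["\\|", "g", "I"],
];

// Made-up OCR output containing several of the problems at once.
const ocrOutput = "『|t works。。。』♪\nNext line";
console.log(applyReplaceRules(ocrOutput, rules));
// → 「It works...」 Next line
```

Order can matter when one rule's output is another rule's input, so in a setup like this the more specific rules are best placed before the more general ones.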

For more details, see RegExp Replacement.