Tesseract OCR

Download & Install Tesseract

  • Download the 64-bit installer, tesseract-ocr-w64-setup-v5.2.x.x.exe

  • Once it has downloaded, run the installer and follow the installation prompts

Make sure the 64-bit version of Tesseract is installed in C:\Program Files\Tesseract-OCR
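A quick way to confirm the installation is to look for tesseract.exe in that folder and ask it for its version. The following is a minimal Python sketch, not part of Tesseract itself; the helper names `find_tesseract` and `tesseract_version` are made up for illustration, and the path is the default install location assumed on this page.

```python
import subprocess
from pathlib import Path

# Assumed install location from this guide; change it if you installed elsewhere.
DEFAULT_INSTALL_DIR = Path(r"C:\Program Files\Tesseract-OCR")


def find_tesseract(install_dir=DEFAULT_INSTALL_DIR):
    """Return the path to tesseract.exe inside install_dir, or None if it is missing."""
    exe = Path(install_dir) / "tesseract.exe"
    return exe if exe.is_file() else None


def tesseract_version(exe):
    """Run `tesseract --version` and return the first line of what it prints."""
    result = subprocess.run(
        [str(exe), "--version"], capture_output=True, text=True, check=True
    )
    # Depending on the build, the banner may go to stdout or stderr.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    exe = find_tesseract()
    if exe is None:
        print(f"tesseract.exe not found in {DEFAULT_INSTALL_DIR}")
    else:
        print(tesseract_version(exe))
```

If the script reports that tesseract.exe was not found, rerun the installer and check the destination folder.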

Trained Data Files (Languages)

Download the .traineddata file for each language you need and place it in the tessdata folder of your Tesseract installation: C:\Program Files\Tesseract-OCR\tessdata. Three sets of trained data are available:

tessdata: https://github.com/tesseract-ocr/tessdata

  • Speed: faster than tessdata-best

  • Accuracy: slightly less accurate than tessdata-best

tessdata-best (recommended for video games): https://github.com/tesseract-ocr/tessdata_best

  • Speed: slowest

  • Accuracy: most accurate

tessdata-fast: https://github.com/tesseract-ocr/tessdata_fast

  • Speed: fastest

  • Accuracy: least accurate
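To check that a language file ended up in the right folder, you can list the .traineddata files Tesseract will see. This is a minimal Python sketch; `installed_languages` and `has_language` are illustrative helper names, and the tessdata path is the default one assumed on this page.

```python
from pathlib import Path

# Assumed tessdata location from this guide; adjust for a custom install.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def installed_languages(tessdata_dir=TESSDATA_DIR):
    """Return the sorted language codes that have a .traineddata file."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    # "eng.traineddata" -> "eng", "chi_sim.traineddata" -> "chi_sim"
    return sorted(p.stem for p in folder.glob("*.traineddata"))


def has_language(lang, tessdata_dir=TESSDATA_DIR):
    """Return True if <lang>.traineddata exists in the tessdata folder."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


if __name__ == "__main__":
    print("Installed languages:", installed_languages())
    print("English available:", has_language("eng"))
```

The language code is simply the file name without the .traineddata extension, so jpn.traineddata provides the language `jpn`.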

Page Segmentation Modes

The page segmentation mode (PSM) tells Tesseract how to split an image into text regions before recognition. Choose the mode that matches the layout of your image and the way it was captured.

Page segmentation modes (numbered 0 to 13, as listed by tesseract --help-psm)

 0  Orientation and script detection (OSD) only.
 1  Automatic page segmentation with OSD.
 2  Automatic page segmentation, but no OSD, or OCR. (not implemented)
 3  Fully automatic page segmentation, but no OSD. (Default)
 4  Assume a single column of text of variable sizes.
 5  Assume a single uniform block of vertically aligned text.
 6  Assume a single uniform block of text.
 7  Treat the image as a single text line.
 8  Treat the image as a single word.
 9  Treat the image as a single word in a circle.
10  Treat the image as a single character.
11  Sparse text. Find as much text as possible in no particular order.
12  Sparse text with OSD.
13  Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

The number one reason I see budding OCR practitioners fail to obtain the correct OCR result is that they are using the incorrect page segmentation mode. To quote the Tesseract documentation: “By default, Tesseract expects a page of text when it segments an input image” (Improving the quality of the output).

That “page of text” assumption is so incredibly important. If you’re OCR’ing a scanned chapter from a book, the default Tesseract PSM may work well for you. But if you’re trying to OCR only a single line, a single word, or maybe even a single character, then this default mode will result in either an empty string or nonsensical results.

Read more: https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/
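As an illustration of overriding the default mode, the sketch below builds a `tesseract` command line that passes the mode with `--psm`, assuming the 0 to 13 numbering used by the Tesseract command line. The helper names `build_tesseract_command` and `run_ocr`, and the file name line.png, are hypothetical; `tesseract` is assumed to be on your PATH (otherwise pass the full path to tesseract.exe as `exe`).

```python
import subprocess


def build_tesseract_command(image_path, psm=3, lang="eng", exe="tesseract"):
    """Build the argument list for: tesseract <image> stdout --psm N -l LANG."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    # "stdout" as the output base makes Tesseract print the text instead of writing a file.
    return [exe, str(image_path), "stdout", "--psm", str(psm), "-l", lang]


def run_ocr(image_path, psm=3, lang="eng", exe="tesseract"):
    """Run Tesseract on one image and return the recognized text."""
    command = build_tesseract_command(image_path, psm=psm, lang=lang, exe=exe)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # A hypothetical screenshot containing a single line of text: use PSM 7.
    print(" ".join(build_tesseract_command("line.png", psm=7)))
```

For a crop that contains one line of text, such as a subtitle, mode 7 is a reasonable first choice; for a single word, try mode 8.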

Troubleshooting

TESSDATA_PREFIX is not set to your tessdata directory

  • Run Command Prompt as administrator

  • Type setx TESSDATA_PREFIX "C:\Program Files\Tesseract-OCR\tessdata" and press Enter

  • Restart the OS, because setx only affects programs started after the variable is set
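After restarting, you can confirm that the variable is visible to new programs. The following Python sketch is one way to do that; `check_tessdata_prefix` is an illustrative helper name, not part of Tesseract.

```python
import os
from pathlib import Path


def check_tessdata_prefix(environ=os.environ):
    """Describe the state of TESSDATA_PREFIX in the given environment mapping."""
    value = environ.get("TESSDATA_PREFIX")
    if not value:
        return "TESSDATA_PREFIX is not set"
    if not Path(value).is_dir():
        return f"TESSDATA_PREFIX points to a missing directory: {value}"
    return f"TESSDATA_PREFIX is set to {value}"


if __name__ == "__main__":
    print(check_tessdata_prefix())
```

If the script still reports that the variable is not set, repeat the setx step and make sure the path is enclosed in quotes.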
