Tesseract OCR
Download & Install Tesseract
Visit the Tesseract at UB Mannheim page (https://github.com/UB-Mannheim/tesseract/wiki)
Select the tesseract-ocr-w64-setup-v5.2.x.x.exe (64-bit) file to download the Tesseract installer
Once it has downloaded, open the executable and follow the installation prompts
Make sure the 64-bit version of Tesseract is installed in C:\Program Files\Tesseract-OCR
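To confirm the installation landed where the steps above expect, a small check such as the following can help. This is a minimal sketch: it assumes the default install folder named above, and the helper name tesseract_version is made up for illustration.

```python
import subprocess
from pathlib import Path

# Install location taken from the step above; adjust it if you chose another folder.
TESSERACT_EXE = Path(r"C:\Program Files\Tesseract-OCR") / "tesseract.exe"


def tesseract_version(exe=TESSERACT_EXE):
    """Return the first line of `tesseract --version`, or None if the file is missing."""
    exe = Path(exe)
    if not exe.is_file():
        return None
    result = subprocess.run(
        [str(exe), "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version or f"Tesseract not found at {TESSERACT_EXE}")
```

If the script reports that Tesseract was not found, the installer was probably pointed at a different folder.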
Trained Data Files (Languages)
You can download the .traineddata file for the language you need from one of the repositories listed below and place it in the tessdata folder of the Tesseract OCR installation directory, C:\Program Files\Tesseract-OCR\tessdata (this should be the same location where the tessdata directory was installed).
tessdata
https://github.com/tesseract-ocr/tessdata
Speed: Faster than tessdata-best
Accuracy: Slightly less accurate than tessdata-best

tessdata-best (Recommended for video games)
https://github.com/tesseract-ocr/tessdata_best
Speed: Slowest
Accuracy: Most accurate

tessdata-fast
https://github.com/tesseract-ocr/tessdata_fast
Speed: Fastest
Accuracy: Least accurate
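After copying a .traineddata file into place, it can be useful to check which languages Tesseract will actually find. The sketch below simply lists the .traineddata files in the tessdata folder; the folder path is the default from this guide, and the function names are illustrative, not part of any Tesseract API.

```python
from pathlib import Path

# Default tessdata folder from this guide; change it if you installed elsewhere.
TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def installed_languages(tessdata_dir=TESSDATA_DIR):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    tessdata_dir = Path(tessdata_dir)
    if not tessdata_dir.is_dir():
        return []
    # "eng.traineddata" -> "eng", "chi_sim.traineddata" -> "chi_sim"
    return sorted(p.stem for p in tessdata_dir.glob("*.traineddata"))


def has_language(code, tessdata_dir=TESSDATA_DIR):
    """Return True if the language code has a trained data file installed."""
    return code in installed_languages(tessdata_dir)


if __name__ == "__main__":
    print(installed_languages())
```

The same information is available from the command line with tesseract --list-langs.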
Page Segmentation Modes
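The page segmentation mode (PSM) lets you select a segmentation method suited to your particular image and the environment in which it was captured. The modes are numbered 0 to 13, matching the values accepted by the --psm option: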
0
Orientation and script detection (OSD) only.
1
Automatic page segmentation with OSD.
2
Automatic page segmentation, but no OSD, or OCR. (not implemented)
3
Fully automatic page segmentation, but no OSD. (Default)
4
Assume a single column of text of variable sizes.
5
Assume a single uniform block of vertically aligned text.
6
Assume a single uniform block of text.
7
Treat the image as a single text line.
8
Treat the image as a single word.
9
Treat the image as a single word in a circle.
10
Treat the image as a single character.
11
Sparse text. Find as much text as possible in no particular order.
12
Sparse text with OSD.
13
Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
The number one reason I see budding OCR practitioners fail to obtain the correct OCR result is that they are using the incorrect page segmentation mode. To quote the Tesseract documentation, by default, Tesseract expects a page of text when it segments an input image (Improving the quality of the output).
That “page of text” assumption is so incredibly important. If you’re OCR’ing a scanned chapter from a book, the default Tesseract PSM may work well for you. But if you’re trying to OCR only a single line, a single word, or maybe even a single character, then this default mode will result in either an empty string or nonsensical results.
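To make the point concrete, here is one way to pass an explicit PSM to the Tesseract command line from Python. It is a sketch under assumptions: the tesseract executable is assumed to be on PATH, the image name hud.png is hypothetical, and the constants use the command line's own numbering (0 to 13, as printed by tesseract --help-psm), where 7 means a single text line.

```python
import subprocess

# PSM values as defined by the Tesseract command line (`tesseract --help-psm`).
PSM_SINGLE_BLOCK = 6
PSM_SINGLE_LINE = 7
PSM_SINGLE_WORD = 8
PSM_SINGLE_CHAR = 10


def build_command(image_path, psm, lang="eng", exe="tesseract"):
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return [exe, str(image_path), "stdout", "-l", lang, "--psm", str(psm)]


def ocr(image_path, psm, lang="eng", exe="tesseract"):
    """Run Tesseract and return the recognized text (requires Tesseract installed)."""
    result = subprocess.run(
        build_command(image_path, psm, lang, exe),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Show the command for a cropped single line of text, e.g. a game HUD label.
    print(" ".join(build_command("hud.png", PSM_SINGLE_LINE)))
```

For a cropped line of in-game text, ocr("hud.png", PSM_SINGLE_LINE) would be a reasonable first attempt; if the crop contains only one word or one character, try PSM_SINGLE_WORD or PSM_SINGLE_CHAR instead.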
Troubleshooting
TESSDATA_PREFIX is not set to your tessdata directory
Run Command Prompt as administrator
Type the following command, and then press Enter:
setx TESSDATA_PREFIX "C:\Program Files\Tesseract-OCR\tessdata"
Restart the OS
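Once the variable has been set, a quick check like the one below can confirm that it points at a folder containing trained data. This is only a sketch; the function name check_tessdata_prefix is made up for illustration, and the check looks for .traineddata files rather than validating them.

```python
import os
from pathlib import Path


def check_tessdata_prefix(env=os.environ):
    """Return (ok, message) describing the state of TESSDATA_PREFIX."""
    value = env.get("TESSDATA_PREFIX")
    if not value:
        return False, "TESSDATA_PREFIX is not set"
    folder = Path(value)
    if not folder.is_dir():
        return False, f"TESSDATA_PREFIX points to a missing folder: {value}"
    if not any(folder.glob("*.traineddata")):
        return False, f"No .traineddata files found in {value}"
    return True, f"TESSDATA_PREFIX looks usable: {value}"


if __name__ == "__main__":
    ok, message = check_tessdata_prefix()
    print(message)
```

Note that setx only affects processes started after it runs, so run the check from a new terminal window.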