Tesseract OCR
Visit the Tesseract at UB Mannheim page.
Select the tesseract-ocr-w64-setup-v5.3.x.exe (64-bit) file to download the Tesseract installer.
Once downloaded, open the executable and follow the installation prompts.
Download the .traineddata file for the language you need and place it in the tessdata directory of your Tesseract OCR installation. With the default install location, that is C:\Program Files\Tesseract-OCR\tessdata.
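To confirm that a language file ended up in the right place, a small check like the one below may help. It is a minimal sketch that assumes the default install path mentioned above; the helper name `find_traineddata` is made up for this example and is not part of Tesseract.

```python
from pathlib import Path
from typing import Optional

# Assumption: Tesseract was installed to the default location described above.
# Change this if you picked a different directory during installation.
DEFAULT_TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def find_traineddata(lang: str, tessdata_dir: Path = DEFAULT_TESSDATA_DIR) -> Optional[Path]:
    """Return the path to <lang>.traineddata if the file exists, else None."""
    candidate = Path(tessdata_dir) / f"{lang}.traineddata"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    # Example language codes; replace with the ones you downloaded.
    for lang in ("eng", "jpn"):
        found = find_traineddata(lang)
        status = f"installed at {found}" if found else "missing"
        print(f"{lang}: {status}")
```

If Tesseract is on your PATH, running `tesseract --list-langs` in a terminal reports the same information from Tesseract's own point of view.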
tessdata: https://github.com/tesseract-ocr/tessdata
Speed: Faster than tessdata-best. Accuracy: Slightly less accurate than tessdata-best.
tessdata-best (Recommended for video games): https://github.com/tesseract-ocr/tessdata_best
Speed: Slowest. Accuracy: Most accurate.
tessdata-fast: https://github.com/tesseract-ocr/tessdata_fast
Speed: Fastest. Accuracy: Least accurate.
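If you prefer to fetch a language file from a script, the sketch below builds a direct download link for each of the three repositories. It assumes the files sit at the repository root on a branch named `main`, which is worth checking in your browser first; the function names `traineddata_url` and `download_traineddata` are hypothetical helpers for this example.

```python
import urllib.request
from pathlib import Path

# Repository URLs as listed above.
REPOS = {
    "tessdata": "https://github.com/tesseract-ocr/tessdata",
    "tessdata-best": "https://github.com/tesseract-ocr/tessdata_best",
    "tessdata-fast": "https://github.com/tesseract-ocr/tessdata_fast",
}


def traineddata_url(variant: str, lang: str, branch: str = "main") -> str:
    """Build a raw-file URL for <lang>.traineddata.

    Assumption: the file lives at the repo root on the given branch.
    """
    if variant not in REPOS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(REPOS)}")
    return f"{REPOS[variant]}/raw/{branch}/{lang}.traineddata"


def download_traineddata(variant: str, lang: str, dest_dir: Path) -> Path:
    """Download the file into dest_dir and return the local path."""
    target = Path(dest_dir) / f"{lang}.traineddata"
    urllib.request.urlretrieve(traineddata_url(variant, lang), str(target))
    return target


if __name__ == "__main__":
    print(traineddata_url("tessdata-best", "eng"))
```

Writing into C:\Program Files usually requires an elevated prompt, so it may be simpler to download to a user folder and copy the file over by hand.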
The page segmentation mode (PSM) lets you select a segmentation method suited to your particular image and the environment in which it was captured. Tesseract numbers the modes from 0 to 13:
0: Orientation and script detection (OSD) only.
1: Automatic page segmentation with OSD.
2: Automatic page segmentation, but no OSD, or OCR. (not implemented)
3: Fully automatic page segmentation, but no OSD. (Default)
4: Assume a single column of text of variable sizes.
5: Assume a single uniform block of vertically aligned text.
6: Assume a single uniform block of text.
7: Treat the image as a single text line.
8: Treat the image as a single word.
9: Treat the image as a single word in a circle.
10: Treat the image as a single character.
11: Sparse text. Find as much text as possible in no particular order.
12: Sparse text with OSD.
13: Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
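On the command line, the mode is passed with the `--psm` flag, and Tesseract's CLI numbers the modes from 0 to 13 (`tesseract --help-psm` prints the list for your version). The sketch below builds such a command and runs it; it assumes the `tesseract` executable is on your PATH, and `build_tesseract_command` and `run_ocr` are illustrative names, not Tesseract APIs.

```python
import subprocess


def build_tesseract_command(image_path, psm, lang="eng"):
    """Return the argument list for a Tesseract run that prints text to stdout."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    # "stdout" as the output base tells Tesseract to print the result
    # instead of writing a .txt file.
    return ["tesseract", str(image_path), "stdout", "--psm", str(psm), "-l", lang]


def run_ocr(image_path, psm, lang="eng"):
    """Run Tesseract and return the recognized text (requires tesseract on PATH)."""
    result = subprocess.run(
        build_tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # "line.png" is a placeholder file name for a cropped single line of text.
    print(" ".join(build_tesseract_command("line.png", psm=7)))
```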
The number one reason I see budding OCR practitioners fail to obtain the correct OCR result is that they are using the incorrect page segmentation mode. To quote the Tesseract documentation, by default, Tesseract expects a page of text when it segments an input image (Improving the quality of the output).
That “page of text” assumption is so incredibly important. If you’re OCR’ing a scanned chapter from a book, the default Tesseract PSM may work well for you. But if you’re trying to OCR only a single line, a single word, or maybe even a single character, then this default mode will result in either an empty string or nonsensical results.
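One way to avoid falling back on the page-of-text default by accident is to state what the image contains and look the mode up. The mapping below is a rough sketch using Tesseract's own mode numbers (3 for a full page, 7 for a single line, 8 for a single word, 10 for a single character); the layout labels and the `psm_config` helper are invented for this example.

```python
# Hypothetical layout labels mapped to Tesseract page segmentation modes.
PSM_FOR_LAYOUT = {
    "page": 3,   # fully automatic page segmentation (Tesseract's default)
    "block": 6,  # a single uniform block of text
    "line": 7,   # a single text line
    "word": 8,   # a single word
    "char": 10,  # a single character
}


def psm_config(layout: str) -> str:
    """Return a '--psm N' option string for the given layout label."""
    try:
        return f"--psm {PSM_FOR_LAYOUT[layout]}"
    except KeyError:
        raise ValueError(
            f"unknown layout {layout!r}; expected one of {sorted(PSM_FOR_LAYOUT)}"
        ) from None


if __name__ == "__main__":
    for layout in ("page", "line", "word", "char"):
        print(layout, "->", psm_config(layout))
```

The resulting string can be appended to a `tesseract` command line, or, if you happen to use the pytesseract wrapper, passed as the `config` argument of `image_to_string`.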