Tesseract OCR
Visit the Tesseract at UB Mannheim download page
Select the tesseract-ocr-w64-setup-v5.3.x.exe (64 bit) file to download the Tesseract executable installer
Once downloaded, open the executable file and follow the installation prompts
Make sure the 64-bit version of Tesseract is installed in C:\Program Files\Tesseract-OCR
You can download the .traineddata file for the language you need and place it in the tessdata folder of the Tesseract OCR installation directory, C:\Program Files\Tesseract-OCR\tessdata (this should be the same location where the tessdata directory was installed).
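To confirm that a language file ended up in the right place, a small check like the following can help. This is a minimal sketch, assuming the default install path from the steps above; the helper names `traineddata_file` and `is_language_installed` are made up for illustration, and `eng` and `jpn` are only example language codes.

```python
from pathlib import Path

# Assumed default install location from the steps above; change it if you
# installed Tesseract somewhere else.
DEFAULT_TESSDATA_DIR = Path(r"C:\Program Files\Tesseract-OCR\tessdata")


def traineddata_file(lang, tessdata_dir=DEFAULT_TESSDATA_DIR):
    """Return the path where <lang>.traineddata is expected to live."""
    return Path(tessdata_dir) / f"{lang}.traineddata"


def is_language_installed(lang, tessdata_dir=DEFAULT_TESSDATA_DIR):
    """Return True if the traineddata file for `lang` exists in tessdata_dir."""
    return traineddata_file(lang, tessdata_dir).is_file()


if __name__ == "__main__":
    # Example language codes only; use the ones you actually downloaded.
    for lang in ("eng", "jpn"):
        status = "found" if is_language_installed(lang) else "missing"
        print(f"{lang}.traineddata: {status}")
```

If a file is reported as missing, check that it was copied into the tessdata folder itself and not into a subfolder.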
Model set | Repository | Speed | Accuracy |
---|---|---|---|
tessdata | https://github.com/tesseract-ocr/tessdata | Faster than tessdata-best | Slightly less accurate than tessdata-best |
tessdata-best (Recommended for video games) | https://github.com/tesseract-ocr/tessdata_best | Slowest | Most accurate |
tessdata-fast | https://github.com/tesseract-ocr/tessdata_fast | Fastest | Least accurate |
The page segmentation mode (PSM) lets you select a segmentation method suited to your particular image and the environment in which it was captured.
Mode | Page segmentation mode |
---|---|
0 | Orientation and script detection (OSD) only. |
1 | Automatic page segmentation with OSD. |
2 | Automatic page segmentation, but no OSD, or OCR. (not implemented) |
3 | Fully automatic page segmentation, but no OSD. (Default) |
4 | Assume a single column of text of variable sizes. |
5 | Assume a single uniform block of vertically aligned text. |
6 | Assume a single uniform block of text. |
7 | Treat the image as a single text line. |
8 | Treat the image as a single word. |
9 | Treat the image as a single word in a circle. |
10 | Treat the image as a single character. |
11 | Sparse text. Find as much text as possible in no particular order. |
12 | Sparse text with OSD. |
13 | Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific. |
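On the command line, the mode is passed with the `--psm` option, and Tesseract numbers the modes from 0 to 13. The sketch below shows one way to call the `tesseract` executable from Python with a chosen mode. The helper names `build_tesseract_command` and `run_ocr` are hypothetical, and the code assumes `tesseract` is on your PATH.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, psm=3, lang="eng"):
    """Build the tesseract command line for one image.

    Using "stdout" as the output base makes tesseract print the
    recognized text instead of writing it to a .txt file.
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return ["tesseract", str(image_path), "stdout", "--psm", str(psm), "-l", lang]


def run_ocr(image_path, psm=3, lang="eng"):
    """Run tesseract and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_ocr("subtitle.png", psm=7)` would treat a hypothetical image `subtitle.png` as a single text line.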
The number one reason I see budding OCR practitioners fail to obtain the correct OCR result is that they are using the incorrect page segmentation mode. To quote the Tesseract documentation, by default, Tesseract expects a page of text when it segments an input image (Improving the quality of the output).
That “page of text” assumption is so incredibly important. If you’re OCR’ing a scanned chapter from a book, the default Tesseract PSM may work well for you. But if you’re trying to OCR only a single line, a single word, or maybe even a single character, then this default mode will result in either an empty string or nonsensical results.
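One way to avoid relying on the default is to pick the mode from what the image is expected to contain. The mapping below is only an illustrative sketch: the content labels and the `psm_for` and `psm_args` helpers are made up for this example, while the numbers are the values Tesseract's `--psm` option uses (modes are numbered from 0).

```python
# Illustrative mapping from the kind of content in the image to a PSM value.
PSM_BY_CONTENT = {
    "page": 3,        # fully automatic page segmentation (the default)
    "block": 6,       # a single uniform block of text
    "line": 7,        # a single text line
    "word": 8,        # a single word
    "character": 10,  # a single character
}


def psm_for(content):
    """Return the PSM value for a content label such as "line" or "word"."""
    try:
        return PSM_BY_CONTENT[content]
    except KeyError:
        known = ", ".join(sorted(PSM_BY_CONTENT))
        raise ValueError(f"unknown content type {content!r}; expected one of: {known}")


def psm_args(content):
    """Return the command-line arguments that select the matching PSM."""
    return ["--psm", str(psm_for(content))]
```

A cropped line of in-game dialogue would use `psm_args("line")`, while a full scanned page can stay on the default.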