pytesseract results different from tesseract command line results

tesseract python-tesseract pytesser


I am trying to convert a scanned page to text using both pytesseract and the tesseract command line on Ubuntu. The results are remarkably different (pytesseract performs far better than the tesseract command line), and I cannot understand why. I looked at the default parameter values and tried altering some of them on the command line (such as psm), but I cannot reproduce the pytesseract result. Because pytesseract lacks proper documentation, I cannot figure out which default parameter values it uses.

Here is my pytesseract code:

import pytesseract
print(pytesseract.image_to_string('test.tiff'))

Author: randomSampling, posted December 27, 2017

1 Answer



Looking at the source code of pytesseract, it seems the image is always converted to a .bmp file before it is passed to Tesseract. Running Tesseract from the command line on a .bmp file with a psm of 6 gives the same result as pytesseract. Note that, for BMP input, Tesseract can read only uncompressed files. Hence, if ImageMagick is used to convert the .pdf to .bmp, the following will work:

convert -density 300 -quality 100 mypdf.pdf BMP3:mypdf.bmp
tesseract mypdf.bmp -psm 6 mypdf txt
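
The same two steps can also be driven from Python, which makes the parameters explicit on both sides of the comparison. This is only a minimal sketch, not pytesseract's actual implementation: the helper names `build_tesseract_command` and `ocr_via_bmp` are hypothetical, Pillow is assumed to be installed for the BMP conversion, and the `tesseract` binary is assumed to be on the PATH. The `legacy_flag` switch is there because Tesseract 3.x spells the option `-psm` (as above) while Tesseract 4 and later spell it `--psm`. The output base is placed before the options here, which is the order given in Tesseract's usage text.

```python
import shlex
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image_path, output_base, psm=6, legacy_flag=True):
    """Return the argv list for a tesseract run with an explicit page segmentation mode.

    legacy_flag=True emits "-psm" (Tesseract 3.x, as in the command above);
    legacy_flag=False emits "--psm" (Tesseract 4 and later).
    """
    flag = "-psm" if legacy_flag else "--psm"
    # The trailing "txt" config name is copied from the command above.
    return ["tesseract", str(image_path), str(output_base), flag, str(psm), "txt"]


def ocr_via_bmp(source_path, psm=6, legacy_flag=True):
    """Convert an image to BMP with Pillow, run tesseract on it, and return the text."""
    # Imported lazily so the command builder can be used without Pillow installed.
    from PIL import Image

    with tempfile.TemporaryDirectory() as tmp:
        bmp_path = Path(tmp) / "page.bmp"
        output_base = Path(tmp) / "page"  # tesseract appends ".txt" itself

        # For multi-frame files, Pillow opens the first frame.
        with Image.open(source_path) as img:
            img.convert("RGB").save(bmp_path, format="BMP")

        command = build_tesseract_command(bmp_path, output_base, psm, legacy_flag)
        subprocess.run(command, check=True)
        return (Path(tmp) / "page.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Show the command that would be run, without requiring tesseract to be installed.
    print(shlex.join(build_tesseract_command("mypdf.bmp", "mypdf")))
```

Calling `ocr_via_bmp('test.tiff')` should then give text that can be compared directly with the output of `pytesseract.image_to_string`, provided both run against the same Tesseract installation.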
Author: randomSampling, posted December 31, 2017