Posted by: Dhammañāṇa
« on: February 02, 2019, 12:12:07 PM »
OCR PDF into text (Khmer)
http://rnd.niptict.edu.kh/ocr/index.php#
(it seems to work fine for certain fonts, not all)
Quote from: About
Khmer OCR is an Optical Character Recognition system for Khmer fonts. It is a research project of the Research and Development Center at the National Institute of Posts, Telecoms, and ICT (NIPTICT). OCR is an important software tool for the Khmer language when collecting and compiling legacy documents, most of which were printed without digital files or have lost them, and some of which are handwritten. OCR can be used to convert such documents into digital files that can be archived for any future purpose.
Many different Khmer fonts are in everyday use. We have developed Khmer OCR so that it can recognize several fonts, such as:
Khmer Angulileka
Khmer S1 or Limon S1
Khmer M1 or Limon R1
Khmer Kep
Khmer OS Battambang
Khmer OS Siemreap
Approach
We used the open source Tesseract OCR Engine for training (https://code.google.com/p/tesseract-ocr/). We applied our rule-based cropping method for the Khmer language.
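For anyone who wants to try Tesseract locally instead of the web page, here is a minimal sketch of calling its command-line tool from Python. It assumes Tesseract is installed with the stock Khmer language data (`khm`), not the project's own trained models, and that the PDF pages have already been rendered to images. The function names and file names are made up for illustration, and the project's rule-based cropping step is not reproduced here.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="khm"):
    """Build the argument list for: tesseract <image> <output_base> -l <lang>.

    "khm" is the stock Khmer language code; using it here is an assumption,
    since the project's own trained models are not described in detail.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, output_base, lang="khm"):
    """Run Tesseract on one page image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract writes its plain-text result to <output_base>.txt
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Usage would look like `text = run_ocr("page-001.png", "page-001")`, where `page-001.png` is a hypothetical scanned page.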
Related Work
"Khmer OCR" is the first published web-based application, which was developed by Mr. Danh Hong, who has worked hard on Khmer language computing, for example by developing Khmer Unicode fonts. Furthermore, without his guidance, our research team would not have started this project.
Project Team
Mr. Rapid Sun, Director of Research and Development Center
Mr. Vichet Chea, Chief office of Research and Development
Mr. Nan Mech, Project assistant
Mr. Reaksa Tep, Project assistant
Mr. Kea Sorn, Project assistant
Miss Sreyhuy Leng, Project assistant
Mr. Vanna Chuon, Project assistant
Mr. Chheang Chorng Loem, Project assistant
Experiment and Evaluation
We used the ISRI toolkit (http://isri-ocr-evaluation-tools.googlecode.com/files/ftk-1.0.tar.gz) to evaluate the accuracy of each OCR model.
Here are the closed-test evaluation results:
No  Font                    Accuracy (%)
1   Khmer Angulileka        76.49
2   Khmer S1 or Limon S1    93.33
3   Khmer Kep               96.57
4   Khmer M1 or Limon R1    94.30
5   Khmer OS Battambang     93.96
6   Khmer OS Siemreap       89.25
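To give an idea of what an accuracy percentage like the ones above measures, here is a rough sketch of character accuracy computed from edit distance: the number of reference characters minus the insertions, deletions, and substitutions needed to correct the OCR output, divided by the number of reference characters. This is an illustration written from scratch, not the ISRI toolkit's code, and the function names are made up. It counts Unicode code points, which may differ from how the toolkit counts Khmer characters.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Character accuracy in percent, relative to the reference text length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, hypothesis)
    return 100.0 * (len(reference) - errors) / len(reference)
```

With this definition, a hypothesis containing many spurious extra characters can score below zero, so the figure is best read as "errors relative to the ground truth length".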
Contact us
Mr. Vichet Chea
Researcher at Research and Development Center, NIPTICT
Tel: (+855) 77-657-007
Email:
Website: www.niptict.edu.