CSC/ECE 517 Spring 2015/oss S1501 OA

S1501: Integrating OCR

This page describes a project to integrate optical character recognition (OCR) into Sahana Eden, so that Eden can detect and identify typewritten data on forms and store it in the database.

Introduction to Sahana

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions that enable organizations to prepare for and respond to disasters more effectively. The software was originally developed by members of the Sri Lankan IT community who wanted to apply their skills to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from around the world, and the community around it evolved into the present Sahana Software Foundation.

Eden

Sahana Emergency Development Environment (Eden) is a flexible humanitarian platform built specifically for disaster management. It can be deployed to manage critical humanitarian needs either before or during a crisis. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to fit existing processes, and can be integrated with existing systems. It can be accessed over the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

  • Organization Registry
  • Shelter
  • Project Tracking
  • Inventory
  • Assets
  • Assessments
  • Scenarios & Events
  • Mapping
  • Messaging

Eden is easy to install and maintain. This is probably the main reason it has been deployed successfully during many disasters and has been of great help to disaster management authorities.

What is OCR?

Wikipedia defines OCR as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the capability of identifying and storing data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computer receipts, passports, bank statements and other documents. It is a field of research in pattern recognition, artificial intelligence and computer vision.

As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected areas to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database.

Handwritten entries may not be recognized reliably because handwriting varies from person to person. This problem can be addressed in two ways. One is to use typewritten text, and the other is to represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be collected as follows:

 
  Severity of Damage (1-lowest 5-highest):
    1: □   2: □   3: □   4: □   5: □

If one of the boxes is ticked, the computer can detect the tick and store the corresponding value according to the database schema. Because it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.
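The tick detection described above can be illustrated with a minimal sketch. It is not taken from the Sahana code: the function names, the thresholds and the pure-list image representation are all assumptions chosen to keep the example self-contained. The idea is to count dark pixels inside each box and to accept an answer only when exactly one box is filled.

```python
def is_ticked(gray, box, dark_below=128, min_fill=0.15):
    """Return True if enough pixels inside `box` are dark.

    gray: 2-D list of grayscale values (0 = black, 255 = white).
    box:  (left, top, right, bottom) in pixels, right/bottom exclusive.
    The thresholds are illustrative guesses, not tuned values. On a real
    scan the box should be inset so the printed outline is not counted.
    """
    left, top, right, bottom = box
    total = (right - left) * (bottom - top)
    dark = sum(
        1
        for y in range(top, bottom)
        for x in range(left, right)
        if gray[y][x] < dark_below
    )
    return total > 0 and dark / total >= min_fill


def read_scale(gray, boxes):
    """Return the label of the single ticked box, or None.

    boxes: dict mapping an option label (e.g. 1..5) to its pixel box.
    None is returned when no box or more than one box is ticked, so that
    ambiguous forms can be flagged for manual review.
    """
    ticked = [label for label, box in boxes.items() if is_ticked(gray, box)]
    return ticked[0] if len(ticked) == 1 else None
```

For the "Severity of Damage" field, `boxes` would hold five entries keyed 1 to 5, and the returned label is the value to store.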

OCR in Sahana

OCR is implemented in Sahana using Tesseract, a cross-platform OCR engine; the implementation is in modules/s3/s3pdf.py. OCR has a few dependencies. The following packages and command-line tools must be installed:

  • python-lxml
  • python-imaging (PIL)
  • python-reportlab
  • Imagemagick 'convert'
  • Tesseract 3.00-1
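Whether these dependencies are available can be checked before OCR is used. The following is a minimal sketch rather than part of Sahana: the function name is hypothetical, and it assumes the packages are importable as `lxml`, `PIL` and `reportlab` and that the tools are on the PATH as `convert` and `tesseract`. It does not check the Tesseract version.

```python
import importlib.util
import shutil

# Import names assumed for python-lxml, python-imaging and python-reportlab.
PYTHON_MODULES = ("lxml", "PIL", "reportlab")
# Executable names assumed for ImageMagick and Tesseract.
COMMANDS = ("convert", "tesseract")


def missing_ocr_dependencies(modules=PYTHON_MODULES, commands=COMMANDS):
    """Return the names of modules and commands that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [c for c in commands if shutil.which(c) is None]
    return missing


if __name__ == "__main__":
    problems = missing_ocr_dependencies()
    if problems:
        print("Missing OCR dependencies: " + ", ".join(problems))
    else:
        print("All OCR dependencies found.")
```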

The OCR functionality in Sahana can be divided broadly into three parts:

Form Generation

In this step a PDF document is generated that presents most of the general data as fields with corresponding check boxes. For example, a form can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary” and “Morrisville”.
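One way to think about form generation is as a layout step that records where each check box is placed, so that the recognition step can later look at the same positions. The sketch below only computes such positions; it does not draw a PDF and is not how s3pdf.py is known to work. The function name, the default coordinates and the spacing are assumptions for illustration.

```python
def layout_checkboxes(options, x0=72, y=700, box=12, gap=90):
    """Place one check box per option on a single row.

    Returns a list of (label, (left, bottom, right, top)) tuples in PDF
    points (1/72 inch), with the origin at the bottom-left of the page.
    All default values are hypothetical.
    """
    row = []
    for i, label in enumerate(options):
        left = x0 + i * gap
        row.append((label, (left, y, left + box, y + box)))
    return row


if __name__ == "__main__":
    # The "Lives at" field from the example above.
    for label, rect in layout_checkboxes(
        ["Raleigh", "Charlotte", "Cary", "Morrisville"]
    ):
        print(label, rect)
```

A drawing library such as ReportLab could then draw a rectangle at each recorded position, and the same list could be saved alongside the PDF for use when the scanned form is read back.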

Implementation Training Module

The generated form is then filled in by the user. Automated matching is performed against the generated box file rather than against the image, so no human involvement is needed.

OCR

This is the final stage of the project, in which OCR is performed using Tesseract. The choice of OCR software depends mainly on its recognition success rate, which is why Tesseract was chosen.
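As a rough illustration of this stage, Tesseract can be run from Python through its command-line interface, which takes an input image, an output base name and an optional `-l` language code, and writes the recognized text to `<output base>.txt`. This is a minimal sketch, not the code in s3pdf.py; the function names and file names are hypothetical, and it assumes the `tesseract` executable is on the PATH.

```python
import subprocess


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for a basic Tesseract run."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Raises subprocess.CalledProcessError if Tesseract exits with an error.
    """
    subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

For example, `run_ocr("scan.tif", "scan_out")` would read a hypothetical `scan.tif` and return the contents of `scan_out.txt`.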