CSC/ECE 517 Spring 2015/oss S1501 OA: Difference between revisions

<font size="5"><b>S1501: Integrating OCR</b></font>
 
This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] into [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify type-written data on forms and store it in the database.
 
__TOC__
 
 
== Introduction to Sahana ==
 
The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents toward helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is hosted on [https://github.com/flavour/eden GitHub].
 
== Eden ==
Sahana Emergency Development Environment (EDEN)<ref>[http://eden.sahanafoundation.org/ Sahana Eden]</ref> is a flexible humanitarian platform that can be deployed for critical humanitarian needs management either prior to or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:
 
* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging
Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.
 
== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the capability of identifying and storing data that was hand-written or type-written on paper. OCR is widely used for data entry from printed paper records such as invoices, computer receipts, passports and bank statements. It is a field of research in pattern recognition, artificial intelligence and computer vision.

As mentioned earlier, the main purpose of Sahana Eden is to provide technical support in disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database.

Handwritten text can be hard to recognize because handwriting varies from person to person. This problem can be addressed in two ways: use type-written text, or represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be collected as follows.
 
<pre>
  Severity of Damage (1-lowest 5-highest):
    1: □  2: □  3: □  4: □  5: □
</pre>
If one of the boxes is ticked, the computer can identify the tick and store the corresponding value according to the database schema. Because it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.
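
As a rough illustration of how a ticked box could be detected, the sketch below treats a scanned form as a grid of grayscale pixel values and picks the box region with the highest share of dark pixels. The function names, the threshold values and the box coordinates are assumptions made for this example; they are not taken from the Eden code.

```python
def dark_ratio(image, box, threshold=128):
    """Return the fraction of pixels inside `box` darker than `threshold`.

    `image` is a list of rows of grayscale values (0 = black, 255 = white).
    `box` is (left, top, right, bottom) with right and bottom exclusive.
    """
    left, top, right, bottom = box
    total = (right - left) * (bottom - top)
    dark = sum(
        1
        for y in range(top, bottom)
        for x in range(left, right)
        if image[y][x] < threshold
    )
    return dark / total if total else 0.0


def ticked_option(image, boxes, min_ratio=0.2):
    """Return the label of the most heavily marked box, or None when no
    box contains at least `min_ratio` dark pixels."""
    best_label, best_ratio = None, 0.0
    for label, box in boxes.items():
        ratio = dark_ratio(image, box)
        if ratio >= min_ratio and ratio > best_ratio:
            best_label, best_ratio = label, ratio
    return best_label
```

For the “Severity of Damage” row, `boxes` would map the labels "1" to "5" to the pixel regions of the five check boxes, and the returned label is the value to store.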
 
== Goals of Project ==
 
* Integrate the existing OCR implementation into the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.
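
The second goal can be pictured as a codec object that pairs form generation (encode) with OCR import (decode). The sketch below is only a hypothetical outline: the class name `PDFCodec`, its method signatures and the injected `recognise` callable are assumptions for illustration and do not reproduce Eden's actual codec interface.

```python
class PDFCodec:
    """Hypothetical codec pairing form generation with OCR import."""

    def __init__(self, recognise):
        # `recognise` is any callable that turns one scanned page into text;
        # in a real deployment it would wrap the OCR engine.
        self._recognise = recognise

    def encode(self, fields):
        """Render field names as a plain-text form (a stand-in for a PDF)."""
        return "\n".join("%s: ______" % name for name in fields)

    def decode(self, scanned_pages):
        """Run the recogniser over each scanned page and return the text."""
        return [self._recognise(page) for page in scanned_pages]
```

Keeping both directions behind one object means callers only need to know about the codec, not about where the OCR code lives.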
 
= OCR in Sahana =
 
OCR is implemented in Sahana using Tesseract, which is cross-platform software; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies that have to be installed. The required libraries and command line tools are:
* python-lxml<ref>[http://lxml.de/ python-lxml]</ref>
* python-imaging (PIL)<ref>[http://www.pythonware.com/products/pil/ python-imaging (PIL)]</ref>
* python-reportlab<ref>[https://pypi.python.org/pypi/reportlab python-reportlab]</ref>
* Imagemagick 'convert'<ref>[http://www.imagemagick.org/script/convert.php Imagemagick]</ref>
* Tesseract 3.00-1<ref>[http://en.wikipedia.org/wiki/Tesseract_%28software%29 Tesseract]</ref>
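
A quick way to see which of these dependencies are missing on a machine is sketched below. The function name `missing_dependencies` is made up for this example, and the import names (`lxml`, `PIL`, `reportlab`) and executable names (`convert`, `tesseract`) are assumed to match a typical installation; this is not the check that Eden itself performs.

```python
import importlib.util
import shutil


def missing_dependencies(modules, commands):
    """Return the names of Python modules and command line tools that
    cannot be found on this machine."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in commands:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_dependencies(
        modules=["lxml", "PIL", "reportlab"],
        commands=["convert", "tesseract"],
    ))
```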
 
The functionality of OCR in Sahana can be broadly divided into two main parts.
 
==Form Generation==
In this step we generate a PDF document that presents most of the common data fields with corresponding check boxes. For example, a form for North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” and “Others”. This makes the data easier for the OCR to interpret.
 
<p>
[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]
</p>
 
The image above shows the control flow for generating a PDF from forms.
 
<pre>
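# Excerpt: start of the form generation method
# (StringIO and etree are imported elsewhere in the module)
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")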
</pre>
 
The code above is the start of the form generation method. It uses the Adapter pattern because there are many different types of forms: whenever a form has to be generated, the generation is adapted to the form in question.
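
To make the adapter idea concrete, here is a minimal sketch in which differently shaped field descriptions are wrapped so that they all expose one common method before being written to a layout tree. It uses the standard library's `xml.etree.ElementTree` in place of lxml, and apart from the root tag `s3ocrlayout`, every class, tag and attribute name is an assumption made for this example.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Two unrelated field descriptions with different shapes (the adaptees).
StringField = namedtuple("StringField", "label length")
OptionField = namedtuple("OptionField", "label choices")


class StringFieldAdapter:
    """Exposes a StringField through the common to_element() interface."""

    def __init__(self, field):
        self.field = field

    def to_element(self):
        # One character box per expected letter.
        return ET.Element("textfield", name=self.field.label,
                          boxes=str(self.field.length))


class OptionFieldAdapter:
    """Exposes an OptionField through the same to_element() interface."""

    def __init__(self, field):
        self.field = field

    def to_element(self):
        element = ET.Element("optionfield", name=self.field.label)
        for choice in self.field.choices:
            # One check box per choice.
            ET.SubElement(element, "checkbox", label=choice)
        return element


def build_layout(adapters):
    """Collect every adapted field under a single layout root."""
    root = ET.Element("s3ocrlayout")
    for adapter in adapters:
        root.append(adapter.to_element())
    return root


layout = build_layout([
    StringFieldAdapter(StringField("Name", 20)),
    OptionFieldAdapter(OptionField("Lives at", ["Raleigh", "Charlotte", "Cary"])),
])
print(ET.tostring(layout, encoding="unicode"))
```

The layout builder only ever calls `to_element()`, so supporting a new kind of field means writing one more adapter rather than changing the builder.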
 
==Implementation Training Module and OCR==
The generated form is then filled in by the user. Automated matching is done against the generated box file rather than against the image, so that it needs no human involvement.
After this, OCR is run using the Tesseract software. Tesseract was chosen mainly for its recognition success rate.
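
Tesseract training works with box files, plain-text files that list each character together with its bounding box. The sketch below parses such lines, assuming the common layout of symbol, left, bottom, right, top and an optional page number. The `Box` type and the function name are made up for this example, and the exact layout should be checked against the Tesseract version in use.

```python
from collections import namedtuple

Box = namedtuple("Box", "symbol left bottom right top page")


def parse_box_lines(lines):
    """Parse box file lines into Box records, skipping malformed lines."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue  # blank or incomplete line
        symbol = parts[0]
        left, bottom, right, top = (int(value) for value in parts[1:5])
        # The page column is treated as optional and defaults to 0.
        page = int(parts[5]) if len(parts) > 5 else 0
        boxes.append(Box(symbol, left, bottom, right, top, page))
    return boxes
```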
 
<p>
[[File:Importflow.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]
</p>
 
The image above shows the control flow when the image data is converted to text data and then stored in the database.
 
<pre>
class S3OCRImageParser(object):
    """
        Image Parsing and OCR Utility
    """
 
    def __init__(self, s3method, r):
        """
            Initialise class instance with environment variables and functions
        """
 
        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>
 
This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern, because data from the different types of form fields in the image has to be converted to text data.
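
One way to picture this conversion step is a small table of converters, one per field type, each turning raw recognised output into a value ready for the database. The field type names and converter functions below are assumptions for illustration and are not the ones used by `S3OCRImageParser`.

```python
def to_text(raw):
    """Collapse runs of whitespace left over from recognition."""
    return " ".join(raw.split())


def to_integer(raw):
    """Keep only ASCII digits; return None when no digit was recognised."""
    digits = "".join(ch for ch in raw if ch in "0123456789")
    return int(digits) if digits else None


def to_option(marks):
    """Return the label of the single ticked box, or None when zero or
    several boxes are ticked. `marks` maps label -> bool."""
    ticked = [label for label, is_ticked in marks.items() if is_ticked]
    return ticked[0] if len(ticked) == 1 else None


# Each field type is handled through the same call signature.
CONVERTERS = {"string": to_text, "integer": to_integer, "option": to_option}


def convert_field(field_type, raw):
    """Convert raw recognised data using the converter for `field_type`."""
    return CONVERTERS[field_type](raw)
```

Treating an ambiguous check box row (no tick or several ticks) as `None` leaves the decision to a human reviewer instead of guessing.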
 
= OCR module Integrated =
 
<p>
[[File:OCRModuleInclusion.png|center]]
</p>
 
The implementation was included in the project by making changes to models/000_config.py and modules/templates/<template>/config.py.
 
= References =
<references/>

Latest revision as of 01:13, 24 March 2015
