Expertiza_Wiki - User contributions [en]

CSC/ECE 517 Spring 2015 E1527 SWAR

2015-04-08T20:06:05Z

Abhanda3: /* Scope */

E1527. Refactor Autometareviews gem and migration to Web-Service 
= Introduction to Autometareview project <ref>https://github.com/lramach/autometareviews0.1</ref>=
This project is developed as part of the Expertiza project <ref>http://wikis.lib.ncsu.edu/index.php/Expertiza</ref>. 
The automated metareview tool assesses the quality of a review using natural language processing and machine learning techniques, in a completely automated way. Feedback is provided to reviewers on the following metrics:
<ol>
<li>Review relevance: This metric tells the reviewer how relevant the review is to the content of the author's submission. Numeric feedback on a scale of 0 to 1 is provided to indicate a review's relevance. </li>
<li>Review Content Type: This metric identifies whether the review contains 'summative content' (positive feedback), 'problem detection content' (problems identified by reviewers in the author's work) or 'advisory content' (suggestions or advice provided by reviewers). Numeric feedback on a scale of 0 to 1 is provided for each content type to indicate whether the review contains that type of content. </li>
<li>Review Coverage: This metric indicates the extent to which a review covers the main points of a submission. A numeric value in the range of 0 to 1 indicates the coverage of a review. </li>
<li>Plagiarism<ref>http://www.plagiarism.org/plagiarism-101/what-is-plagiarism/</ref>: Indicates the presence of plagiarism in the review text.</li>
<li>Tone: This metric indicates whether a review has a positive, negative or neutral tone. </li>
<li>Quantity: Indicates the number of unique words used by the reviewer in the review. </li>
</ol>
 
__TOC__

= Problem Statement =

Currently, the Autometareviews project is used as a gem<ref>http://guides.rubygems.org/what-is-a-gem/</ref> in the Expertiza project. The purpose of this project is to migrate this gem to a web service<ref>http://en.wikipedia.org/wiki/Web_service</ref> and expose its methods on the web, so that any application can consume them. The older gem depended on old libraries<ref>http://en.wikipedia.org/wiki/Library_%28computing%29</ref> such as Stanford-core-nlp and rwordnet. We will migrate them to newer libraries without breaking the existing feature set. We will also refactor the source code of the gem to improve readability and to reduce complexity and code redundancy. We will fix any bug or bottleneck that we find in order to improve the performance of the service. We will not add any new features to the existing feature set provided by the gem. Before modifying any existing feature, we will present the proposed change to Dr. Gehringer and his Expertiza team.

= Scope =
This project has three separate scope items:
:* Migration of the existing gem application to a web service
:* Refactoring the existing Ruby classes
:* Migrating to newer libraries, wherever possible

The classes that we propose to refactor are:
:* tone.rb
:* degree_of_relevance.rb
:* wordnet_based_similarity.rb
:* sentence_state.rb
:* cluster_generation.rb
:* plagiarism_check.rb
:* graph_generator.rb
:* predict_class.rb
:* review_coverage.rb

No new features will be developed as part of this project. Any major code change caused by the move to newer libraries will be communicated to the Expertiza project team. Existing code will be tested to ensure that its functionality does not change.

= Standards =
All developed code will adhere to the Ruby on Rails coding guidelines<ref>https://docs.google.com/document/d/1qQD7fcypFk77nq7Jx7ZNyCNpLyt1oXKaq5G-W7zkV3k/edit</ref>.

= List of Tasks =
The tasks we will perform as part of this project are listed below.<ref>https://docs.google.com/document/d/10JTdEjCiRTre3nO4j_czBzhcxkqOSzmAT8jyiJHNz4c/edit#</ref>

The system is still at a nascent stage and has many performance-related issues. It takes a long time (about 2 minutes) to generate a single meta-review, which is unacceptable for Expertiza. We propose to refactor the code and identify the areas that affect the overall performance of the system. A few areas we identified in a preliminary review are: 
:* Reading seed data from CSV in each pass takes up a lot of time. We can move this data into MySQL and use ActiveRecord to speed up data fetching.
:* WordNet-based semantic matching takes a lot of time. We will review the method used and present our findings about areas of concern.
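
The first bottleneck does not depend on the language: if the seed file is parsed on every pass, the parsing cost is paid again for every review. As a rough illustration only, the sketch below (in Python, although the gem itself is Ruby) parses a CSV file once and serves later requests from memory. The column names <code>word</code> and <code>label</code>, the sample rows, and the function names are assumptions made for this example, and it uses a simple in-memory cache rather than the MySQL and ActiveRecord approach proposed above.

```python
import csv
import functools
import os
import tempfile


@functools.lru_cache(maxsize=None)
def load_seed_data(path):
    """Parse the seed CSV once per path; repeated calls reuse the cached result."""
    with open(path, newline="") as handle:
        # A tuple of tuples is immutable, so the shared cached value cannot be
        # modified by one caller and then observed in that state by another.
        return tuple((row["word"], row["label"]) for row in csv.DictReader(handle))


def write_example_seed_file():
    """Create a tiny, made-up seed file so the sketch can run on its own."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
    with handle:
        writer = csv.writer(handle)
        writer.writerow(["word", "label"])      # hypothetical column names
        writer.writerow(["good", "positive"])   # hypothetical seed rows
        writer.writerow(["poor", "negative"])
    return handle.name


if __name__ == "__main__":
    seed_path = write_example_seed_file()
    try:
        first = load_seed_data(seed_path)    # reads and parses the file
        second = load_seed_data(seed_path)   # served from the cache, no file access
        print(first)
        print(load_seed_data.cache_info())
    finally:
        os.remove(seed_path)
```

A cache of this kind only helps while the process stays alive; moving the data into a database, as proposed, also covers the case where the service is restarted or runs in several processes.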

== 1. Refactor Code==
<ol>
<li>Efficient loop constructs 
Description: Many loops over models are implemented using generic “for” loops.
Solution: As recommended by the Ruby guidelines, we plan to use idiomatic Ruby iterators such as “each”, and “find_each” for batched iteration over models.
</li><li>Very large methods 
Description: Several methods contain a large amount of code, which makes them difficult to understand and debug.
Solution: In most cases, large methods can be shortened by extracting smaller helper methods. Such helpers can be reused across different components.
</li><li>Ambiguous method names 
Description: Many method names do not clearly reflect the feature the method implements.
Solution: We will rename such methods so that each name clearly states the feature implemented.
</li><li>Legacy code 
Description: As the system has been modified for bug fixes and enhancements, unnecessary code has accumulated.
Solution: Isolate and remove all dead code.
</li><li>Code beautification 
Description: The coding style used in the gem does not follow Ruby on Rails conventions, which makes the code difficult for a Ruby programmer to read.
Solution: Beautify the code by applying a consistent standard of documentation and style.
</li></ol>

== 2. Upgrade system to use latest dependent ruby gems ==
The libraries used by the gem are very old. We plan to migrate the dependent libraries to their latest versions. 
The libraries we have identified are: 
:* stanford-core-nlp <ref>http://nlp.stanford.edu/software/corenlp.shtml</ref>
:* rwordnet <ref>https://rubygems.org/gems/rwordnet</ref>
:* rjb <ref>https://rubygems.org/gems/rjb</ref>
:* bind-it<ref>https://rubygems.org/gems/bind-it</ref>

We will also migrate the project to use Java 8<ref>http://java.com/en/download/whatis_java.jsp</ref>.

== 3. Migrate gem to Web service ==
The Expertiza system evaluates each review using an automated meta-review system, which is currently packaged as a library. The automated metareview system is an independent entity and can be used by other peer review systems as well. It is a natural language processing based system that accepts the original article, the review written for that article, and the rubric used during the review. Many other peer review systems could benefit from this system if it were available to them to evaluate their rubrics. We are therefore migrating it from a library to a web service.

The web service will expose an "AutomatedMetareview" method, which will accept the three inputs mentioned above as JSON and return the meta-review as a JSON object. 

The response JSON object will have the following fields: 
:* plagiarism
:* relevance
:* content_summative
:* content_problem
:* content_advisory
:* coverage
:* tone_positive
:* tone_negative
:* tone_neutral
:* quantity

[[File:Workflow.jpg|frame|center|Interaction between Client and Web Service]]
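
To make the planned request and response shapes more concrete, here is a small sketch in Python of what a client might send and check. The response field names are the ones listed above; the request key names (<code>submission</code>, <code>review</code>, <code>rubric</code>) and the helper function names are assumptions for illustration, since the proposal only states that three inputs are passed as JSON. No network call is made, and the sample response values are placeholders.

```python
import json

# Response fields listed in the proposal.
RESPONSE_FIELDS = (
    "plagiarism", "relevance",
    "content_summative", "content_problem", "content_advisory",
    "coverage",
    "tone_positive", "tone_negative", "tone_neutral",
    "quantity",
)


def build_request(submission, review, rubric):
    """Serialise the three inputs as a JSON string (the key names are assumed)."""
    return json.dumps({"submission": submission, "review": review, "rubric": rubric})


def parse_response(body):
    """Decode a JSON response and fail loudly if a listed field is missing."""
    metareview = json.loads(body)
    missing = [name for name in RESPONSE_FIELDS if name not in metareview]
    if missing:
        raise ValueError("response is missing fields: " + ", ".join(missing))
    return metareview


if __name__ == "__main__":
    print(build_request("Text of the article.", "Text of the review.", "Rubric text."))
    # Placeholder values only; the real service defines the actual types and ranges.
    sample = json.dumps({name: 0 for name in RESPONSE_FIELDS})
    print(parse_response(sample)["relevance"])
```

Checking the field list on the client side is one possible design choice: it lets a consumer notice early if the service and the client disagree about the response format.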

===3.1 Assumptions===
For this project, the code being modified is assumed to be correct and to meet all feature requirements of the system. Interactions modified during refactoring will not change the underlying system definitions.

== 4. Testing==
We will use the gem's existing test suite to test every code modification. We will write new test cases for the web service implementation and for any new public method exposed by existing classes.

= References =
<references/>

CSC/ECE 517 Spring 2015 E1527 SWAR

2015-04-08T19:37:37Z

Abhanda3: Undo revision 96549 by Abhanda3 (talk)


CSC/ECE 517 Spring 2015 E1527 SWAR

2015-04-08T19:36:28Z

Abhanda3: /* Introduction to Autometareview project https://github.com/lramach/autometareviews0.1 */


CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T01:13:03Z

Abhanda3: /* OCR in Sahana */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] in [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify the type-written data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to prepare for and respond to disasters more effectively. Sahana was originally developed by members of the Sri Lankan IT community who wanted to apply their talents to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN)<ref>[http://eden.sahanafoundation.org/ Sahana Eden]</ref> is a flexible humanitarian platform that can be deployed to manage critical humanitarian needs either prior to or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
The wiki definition of optical character recognition is: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was hand-written or type-written on paper. OCR is widely used for data entry from printed paper records such as invoices, computer receipts, passports and bank statements. It is a field of research in pattern recognition, artificial intelligence and computer vision.

As mentioned earlier, the main purpose of Sahana Eden is to provide technical support in disaster management. During disasters it may not always be possible for the authorities to carry a computer to the affected areas to enter the data manually, and the area may not have internet access. Instead, forms can be printed and filled in on site, and then scanned so that the data can be stored in the database. Handwritten text may be hard to recognize because handwriting differs from person to person. This problem can be overcome in two ways: use type-written text, or represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be collected by presenting the field as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can identify the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.
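
Reading such a row of check boxes reduces to a small decision once the OCR step has reported which boxes are ticked. The sketch below assumes that step yields one boolean per box, ordered from 1 to 5. The function name <code>severity_from_boxes</code> and the choice to reject rows with no tick or with several ticks are illustrative assumptions, not a description of Sahana Eden's actual behaviour.

```python
def severity_from_boxes(ticked):
    """Map five tick flags (box 1 first) to a severity value from 1 to 5.

    Returns None when the row cannot be read unambiguously, that is, when
    no box or more than one box is ticked, so the form can be checked by hand.
    """
    if len(ticked) != 5:
        raise ValueError("expected exactly five boxes")
    positions = [index + 1 for index, is_ticked in enumerate(ticked) if is_ticked]
    return positions[0] if len(positions) == 1 else None


if __name__ == "__main__":
    print(severity_from_boxes([False, False, True, False, False]))
    print(severity_from_boxes([False, True, False, True, False]))
```

Returning a distinct "unreadable" result instead of guessing keeps ambiguous forms out of the database until a person has looked at them.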

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation already present in the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, which is cross-platform software; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml<ref>[http://lxml.de/ python-lxml]</ref>
* python-imaging (PIL)<ref>[http://www.pythonware.com/products/pil/ python-imaging (PIL)]</ref>
* python-reportlab<ref>[https://pypi.python.org/pypi/reportlab python-reportlab]</ref>
* Imagemagick 'convert'<ref>[http://www.imagemagick.org/script/convert.php Imagemagick]</ref>
* Tesseract 3.00-1<ref>[http://en.wikipedia.org/wiki/Tesseract_%28software%29 Tesseract]</ref>

The functionality of OCR in Sahana can be divided broadly into two main parts:

==Form Generation==
In this step we generate a PDF document that presents most of the general data as fields with corresponding check boxes. For example, a form pertaining to North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The image above shows the control flow for generating a PDF from forms.

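<pre>
# Excerpt of a method; StringIO and etree are imported elsewhere in the module.
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []                                # start with an empty content list
    self.output = StringIO()                         # in-memory buffer for the output
    self.layoutEtree = etree.Element("s3ocrlayout")  # root element of the layout XML
</pre>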

The code above is part of form generation. It uses the Adapter pattern because there are many different types of forms; whenever a form has to be generated, the generation is adapted to that form type.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done with the generated box file rather than the image from human involvement.
After this, OCR is performed using the Tesseract software. The choice of software depends mainly on its success rate, and Tesseract was chosen for this reason.


[[File:Importflow.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The image above shows the control flow when the image data is converted to text data and then stored in the database.

<pre>
# Excerpt; current and checkDependencies are defined or imported elsewhere in the module.
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern in its design, because data from different types of form fields in the image has to be converted to text data.

= OCR module Integrated =


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

= References =
<references/>

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T01:09:12Z

Abhanda3: /* OCR in Sahana */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] into [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents towards helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN)<ref>[http://eden.sahanafoundation.org/ Sahana Eden]</ref> is a flexible humanitarian platform that can be deployed to manage critical humanitarian needs either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computerized receipts, passports and bank statements. It is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry computers to the affected areas to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database. Handwritten text may be hard to recognize because handwriting differs from person to person. This problem can be addressed in two ways: use typewritten text, or represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be obtained by presenting the field as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can identify the tick and store the value according to the database schema. Because it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.
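
As a rough illustration of how a tick could be detected, the sketch below treats each check box as a small grid of grayscale pixel values and counts how many of them are dark. The function names, the pixel representation and both thresholds are assumptions made for this example; this is not the detection logic used in Sahana Eden.

```python
def is_ticked(box_pixels, dark_threshold=128, fill_ratio=0.15):
    """Return True if enough pixels inside a check box region are dark.

    box_pixels: rows of grayscale values (0 = black, 255 = white).
    Both thresholds are illustrative assumptions, not values from Sahana Eden.
    """
    total = sum(len(row) for row in box_pixels)
    if total == 0:
        return False
    dark = sum(1 for row in box_pixels for value in row if value < dark_threshold)
    return dark / total >= fill_ratio


def read_severity(boxes):
    """Map five check box regions (in order 1..5) to the ticked value, or None."""
    ticked = [index + 1 for index, box in enumerate(boxes) if is_ticked(box)]
    # Exactly one tick is a valid answer; zero or several ticks are ambiguous.
    return ticked[0] if len(ticked) == 1 else None
```

Returning None for zero or multiple ticks lets the caller flag the form for manual review instead of storing a guessed value.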

== Goals of Project ==

* Integrate the existing OCR implementation into the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, which is cross-platform software; the implementation is in modules/s3/s3pdf.py.
The OCR module has a few dependencies. The following libraries and command-line tools have to be installed:
* python-lxml<ref>http://lxml.de/</ref>
* python-imaging (PIL)<ref>http://www.pythonware.com/products/pil/</ref>
* python-reportlab<ref>https://pypi.python.org/pypi/reportlab</ref>
* Imagemagick 'convert'<ref>http://www.imagemagick.org/script/convert.php</ref>
* Tesseract 3.00-1<ref>https://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-3.00.1.exe.zip&can=2&q</ref>
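
A quick way to see which of these dependencies are present on a machine is sketched below. The import names (lxml, PIL, reportlab) and executable names (convert, tesseract) are assumptions about a typical installation, and the function is hypothetical; it is not the checkDependencies function from Sahana Eden.

```python
import importlib.util
import shutil

# Assumed import names: python-imaging provides the "PIL" package.
PYTHON_MODULES = ("lxml", "PIL", "reportlab")
# Assumed executable names: ImageMagick provides "convert".
COMMAND_LINE_TOOLS = ("convert", "tesseract")


def check_ocr_dependencies():
    """Return a dict mapping each dependency name to True if it was found."""
    status = {}
    for module in PYTHON_MODULES:
        # find_spec returns None when a top-level module cannot be located.
        status[module] = importlib.util.find_spec(module) is not None
    for tool in COMMAND_LINE_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, found in check_ocr_dependencies().items():
        print(name, "found" if found else "MISSING")
```

The check only reports presence; it does not verify versions such as Tesseract 3.00.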

The OCR functionality in Sahana can be broadly divided into two parts:

==Form Generation==
In this step we generate a PDF document that presents most of the general data as fields with corresponding check boxes. For example, a form pertaining to North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.
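
To make the idea concrete, the sketch below builds a small XML description of the “Lives at” field using Python's standard library. Only the root tag s3ocrlayout appears in the Sahana code excerpt on this page; the field and option tags, their attributes and the build_layout function are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET


def build_layout(field_label, options):
    """Build an XML layout describing one check box field.

    The "field" and "option" tags and their attributes are hypothetical;
    only the root tag name comes from the Sahana excerpt.
    """
    root = ET.Element("s3ocrlayout")
    field = ET.SubElement(root, "field", {"label": field_label, "type": "checkbox"})
    for option in options:
        ET.SubElement(field, "option", {"value": option})
    return root


layout = build_layout(
    "Lives at", ["Raleigh", "Charlotte", "Cary", "Morrisville", "Others"]
)
print(ET.tostring(layout, encoding="unicode"))
```

A layout description of this kind records where each check box belongs, so that a ticked box on the scanned form can later be mapped back to a field value.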


[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above handles form generation. It uses the Adapter pattern because many different types of forms exist: whenever a form has to be generated, the output is adapted to the type of form being generated.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. Automated matching is done against the generated box file rather than against an image that involves human input.
OCR is then carried out with Tesseract, which was chosen mainly because of its recognition success rate.
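
For readers unfamiliar with Tesseract, a minimal sketch of calling its command-line tool from Python follows. It relies on the standard invocation tesseract <image> <outputbase> -l <lang>, which writes the recognised text to <outputbase>.txt. The function names and file paths are hypothetical, and this is not how modules/s3/s3pdf.py invokes Tesseract.

```python
import subprocess


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", language]


def run_tesseract(image_path, output_base, language="eng"):
    """Run Tesseract and return the recognised text from <output_base>.txt.

    Requires the tesseract executable to be installed and on PATH.
    """
    command = build_tesseract_command(image_path, output_base, language)
    subprocess.run(command, check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.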


[[File:Importflow.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The image above shows the control flow when image data is converted to text data and then stored in the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This class implements the parsing and OCR utility. It again uses the Adapter pattern, because data from different types of form fields in the image has to be converted to text data.

= OCR module Integrated =


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

= References =
<references/>

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T01:06:13Z

Abhanda3: /* References */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] into [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents towards helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN)<ref>[http://eden.sahanafoundation.org/ Sahana Eden]</ref> is a flexible humanitarian platform that can be deployed to manage critical humanitarian needs either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computerized receipts, passports and bank statements. It is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry computers to the affected areas to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database. Handwritten text may be hard to recognize because handwriting differs from person to person. This problem can be addressed in two ways: use typewritten text, or represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be obtained by presenting the field as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can identify the tick and store the value according to the database schema. Because it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* Integrate the existing OCR implementation into the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, which is cross-platform software; the implementation is in modules/s3/s3pdf.py.
The OCR module has a few dependencies. The following libraries and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The OCR functionality in Sahana can be broadly divided into two parts:

==Form Generation==
In this step we generate a PDF document that presents most of the general data as fields with corresponding check boxes. For example, a form pertaining to North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above handles form generation. It uses the Adapter pattern because many different types of forms exist: whenever a form has to be generated, the output is adapted to the type of form being generated.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. Automated matching is done against the generated box file rather than against an image that involves human input.
OCR is then carried out with Tesseract, which was chosen mainly because of its recognition success rate.


[[File:Importflow.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The image above shows the control flow when image data is converted to text data and then stored in the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This class implements the parsing and OCR utility. It again uses the Adapter pattern, because data from different types of form fields in the image has to be converted to text data.

= OCR module Integrated =


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

= References =
<references/>

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T01:05:53Z

Abhanda3: /* OCR module Integrated */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] into [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents towards helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN)<ref>[http://eden.sahanafoundation.org/ Sahana Eden]</ref> is a flexible humanitarian platform that can be deployed to manage critical humanitarian needs either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computerized receipts, passports and bank statements. It is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry computers to the affected areas to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database. Handwritten text may be hard to recognize because handwriting differs from person to person. This problem can be addressed in two ways: use typewritten text, or represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be obtained by presenting the field as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can identify the tick and store the value according to the database schema. Because it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* Integrate the existing OCR implementation into the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, which is cross-platform software; the implementation is in modules/s3/s3pdf.py.
The OCR module has a few dependencies. The following libraries and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The OCR functionality in Sahana can be broadly divided into two parts:

==Form Generation==
In this step we generate a PDF document that presents most of the general data as fields with corresponding check boxes. For example, a form pertaining to North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above handles form generation. It uses the Adapter pattern because many different types of forms exist: whenever a form has to be generated, the output is adapted to the type of form being generated.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. Automated matching is done against the generated box file rather than against an image that involves human input.
OCR is then carried out with Tesseract, which was chosen mainly because of its recognition success rate.


[[File:Importflow.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The image above shows the control flow when image data is converted to text data and then stored in the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This class implements the parsing and OCR utility. It again uses the Adapter pattern, because data from different types of form fields in the image has to be converted to text data.

= OCR module Integrated =


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

== References ==
<references/>

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T01:05:34Z

Abhanda3: /* OCR in Sahana */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] into [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents towards helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN)<ref>[http://eden.sahanafoundation.org/ Sahana Eden]</ref> is a flexible humanitarian platform that can be deployed to manage critical humanitarian needs either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computerized receipts, passports and bank statements. It is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry computers to the affected areas to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database. Handwritten text may be hard to recognize because handwriting differs from person to person. This problem can be addressed in two ways: use typewritten text, or represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be obtained by presenting the field as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can identify the tick and store the value according to the database schema. Because it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* Integrate the existing OCR implementation into the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, which is cross-platform software; the implementation is in modules/s3/s3pdf.py.
The OCR module has a few dependencies. The following libraries and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The OCR functionality in Sahana can be broadly divided into two parts:

==Form Generation==
In this step we generate a PDF document that presents most of the general data as fields with corresponding check boxes. For example, a form pertaining to North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above handles form generation. It uses the Adapter pattern because many different types of forms exist: whenever a form has to be generated, the output is adapted to the type of form being generated.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. Automated matching is done against the generated box file rather than against an image that involves human input.
OCR is then carried out with Tesseract, which was chosen mainly because of its recognition success rate.


[[File:Importflow.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The image above shows the control flow when image data is converted to text data and then stored in the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This class implements the parsing and OCR utility. It again uses the Adapter pattern, because data from different types of form fields in the image has to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

== References ==
<references/>

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T01:05:08Z

Abhanda3: /* OCR in Sahana */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] into [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents towards helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN)<ref>[http://eden.sahanafoundation.org/ Sahana Eden]</ref> is a flexible humanitarian platform which can be deployed for critical humanitarian needs management either prior to or during a crisis. This is specifically built for the disaster management. The mission of Sahana Eden is “ To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards. ” Eden has a rich feature set and be rapidly customized to adapt to existing processes and can also be integrated with existing systems. It can be accessed from the web or locally from a flash drive which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules which can be configured to provide a wide range of functionality. Some of its main capabilities are :

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computer receipts, passports, bank statements and other documents. It is a field of research in pattern recognition, artificial intelligence and computer vision.

As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that their contents can be stored in the database. Handwritten entries may be hard to recognize because handwriting differs from person to person. This problem can be addressed in two ways: by using typewritten entries, or by representing the data to be filled in as check boxes. For example, instead of providing a free-text field “Severity of damage”, the required answer can be collected as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can detect the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.
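
The mapping from a ticked box to a stored value can be sketched as follows. This is only an illustration: the function name, the list-of-booleans input and the field name <code>severity_of_damage</code> are assumptions made for the example and are not taken from the Eden code.

```python
def decode_checkbox_row(ticked, options):
    """Return the option whose box is ticked, or None if the row is blank or ambiguous."""
    if len(ticked) != len(options):
        raise ValueError("one tick flag is required per option")
    selected = [opt for opt, flag in zip(options, ticked) if flag]
    # Exactly one tick is a valid answer; zero or several ticks are left for manual review.
    return selected[0] if len(selected) == 1 else None


# Hypothetical scan result for the "Severity of Damage" row: the third box is ticked.
severity = decode_checkbox_row([False, False, True, False, False], [1, 2, 3, 4, 5])
record = {"severity_of_damage": severity}
print(record)
```

Returning <code>None</code> for a blank or doubly ticked row keeps doubtful answers out of the database instead of guessing a value.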

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation that is already present in the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

== OCR in Sahana ==

OCR is implemented in Sahana using Tesseract, which is cross-platform software; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* ImageMagick 'convert'
* Tesseract 3.00-1
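
Whether these dependencies are present on a machine can be checked with a short script. The sketch below is not Eden's own <code>checkDependencies</code>; it assumes that the three Python packages are importable as <code>lxml</code>, <code>PIL</code> and <code>reportlab</code> and that the command-line tools are named <code>convert</code> and <code>tesseract</code>. It does not check the Tesseract version.

```python
import importlib.util
import shutil

# Import names of the Python packages and names of the command-line tools listed above.
PYTHON_MODULES = ["lxml", "PIL", "reportlab"]
COMMAND_LINE_TOOLS = ["convert", "tesseract"]


def missing_dependencies(modules=PYTHON_MODULES, tools=COMMAND_LINE_TOOLS):
    """Return the names of modules and tools that cannot be found on this machine."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [t for t in tools if shutil.which(t) is None]
    return missing


print(missing_dependencies())
```

An empty list means that every package and tool was found.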

The functionality of OCR in Sahana can be divided broadly into two parts:
==Form Generation==
In this step a PDF document is generated that presents most of the common data as fields with corresponding check boxes. For example, a form for North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the control flow for generating a PDF from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    # Start with empty content, an in-memory output buffer and an empty OCR layout tree
    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is the entry point for form generation. It uses the Adapter pattern because there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form type in question.
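
One plausible use of such a layout tree is to record where each check box is printed, so that a scanned form can be read back later. The sketch below uses the standard library's <code>xml.etree.ElementTree</code> rather than lxml to stay self-contained. Only the root tag <code>s3ocrlayout</code> comes from the code above; the <code>field</code> and <code>box</code> elements, their attributes and the coordinates are hypothetical.

```python
import xml.etree.ElementTree as etree


def build_layout(form_uuid, fields):
    """Describe where each check box sits on the page so a scan can be read back later."""
    layout = etree.Element("s3ocrlayout", {"uuid": form_uuid})
    for name, boxes in fields.items():
        field = etree.SubElement(layout, "field", {"name": name})
        for value, (x, y) in boxes.items():
            # Attribute values must be strings in ElementTree.
            etree.SubElement(field, "box", {"value": str(value), "x": str(x), "y": str(y)})
    return layout


# Hypothetical form with one field and two check boxes.
layout = build_layout("form-001", {"lives_at": {"Raleigh": (50, 65), "Cary": (50, 85)}})
print(etree.tostring(layout, encoding="unicode"))
```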

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done with the generated box file rather than with the image, without human involvement.
After this, OCR is performed using Tesseract. The choice of OCR software depends mainly on its recognition success rate, which is why Tesseract was chosen.


[[File:Importflow.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the control flow when the image data is converted to text data and then stored in the database.

<pre>
class S3OCRImageParser(object):
    """
        Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
            Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern, because data from different types of form fields in the image have to be converted to text data.
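
The idea of adapting different field types to one common interface can be sketched as follows. None of these classes exist in Eden; the class names, the <code>to_text()</code> method and the raw input formats are assumptions chosen to illustrate the Adapter pattern.

```python
class CheckboxReader:
    """Reads a row of tick marks (hypothetical raw format: list of booleans)."""

    def __init__(self, options):
        self.options = options

    def ticked_options(self, marks):
        return [opt for opt, mark in zip(self.options, marks) if mark]


class TextReader:
    """Reads recognised characters (hypothetical raw format: list of strings)."""

    def join_characters(self, chars):
        return "".join(chars).strip()


class CheckboxAdapter:
    """Adapts CheckboxReader to the common to_text() interface."""

    def __init__(self, reader):
        self.reader = reader

    def to_text(self, raw):
        return ", ".join(self.reader.ticked_options(raw))


class TextAdapter:
    """Adapts TextReader to the common to_text() interface."""

    def __init__(self, reader):
        self.reader = reader

    def to_text(self, raw):
        return self.reader.join_characters(raw)


# The import step only ever calls to_text(), whatever the field type is.
adapters = {
    "lives_at": CheckboxAdapter(CheckboxReader(["Raleigh", "Charlotte", "Cary"])),
    "name": TextAdapter(TextReader()),
}
scanned = {"lives_at": [False, False, True], "name": list(" Alice ")}
record = {field: adapters[field].to_text(raw) for field, raw in scanned.items()}
print(record)
```

With this arrangement a new field type only needs a new adapter; the code that builds the database record does not change.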

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

== References ==
<references/>

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T01:04:30Z

Abhanda3:

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] into [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify the typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. Sahana was originally developed by members of the Sri Lankan IT community who wanted to apply their talent to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is hosted on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN)<ref>[http://eden.sahanafoundation.org/ Sahana Eden]</ref> is a flexible humanitarian platform that can be deployed to manage critical humanitarian needs either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computer receipts, passports, bank statements and other documents. It is a field of research in pattern recognition, artificial intelligence and computer vision.

As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that their contents can be stored in the database. Handwritten entries may be hard to recognize because handwriting differs from person to person. This problem can be addressed in two ways: by using typewritten entries, or by representing the data to be filled in as check boxes. For example, instead of providing a free-text field “Severity of damage”, the required answer can be collected as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can detect the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation that is already present in the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, which is cross-platform software; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* ImageMagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly into two parts:
==Form Generation==
In this step a PDF document is generated that presents most of the common data as fields with corresponding check boxes. For example, a form for North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the control flow for generating a PDF from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    # Start with empty content, an in-memory output buffer and an empty OCR layout tree
    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is the entry point for form generation. It uses the Adapter pattern because there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form type in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done with the generated box file rather than with the image, without human involvement.
After this, OCR is performed using Tesseract. The choice of OCR software depends mainly on its recognition success rate, which is why Tesseract was chosen.


[[File:Importflow.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the control flow when the image data is converted to text data and then stored in the database.

<pre>
class S3OCRImageParser(object):
    """
        Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
            Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern, because data from different types of form fields in the image have to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

== References ==
<references/>

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T01:01:03Z

Abhanda3: /* OCR module Integrated */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] into [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify the typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. Sahana was originally developed by members of the Sri Lankan IT community who wanted to apply their talent to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is hosted on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform that can be deployed to manage critical humanitarian needs either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computer receipts, passports, bank statements and other documents. It is a field of research in pattern recognition, artificial intelligence and computer vision.

As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that their contents can be stored in the database. Handwritten entries may be hard to recognize because handwriting differs from person to person. This problem can be addressed in two ways: by using typewritten entries, or by representing the data to be filled in as check boxes. For example, instead of providing a free-text field “Severity of damage”, the required answer can be collected as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can detect the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation that is already present in the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, which is cross-platform software; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* ImageMagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly into two parts:
==Form Generation==
In this step a PDF document is generated that presents most of the common data as fields with corresponding check boxes. For example, a form for North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the control flow for generating a PDF from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    # Start with empty content, an in-memory output buffer and an empty OCR layout tree
    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is the entry point for form generation. It uses the Adapter pattern because there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form type in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done with the generated box file rather than with the image, without human involvement.
After this, OCR is performed using Tesseract. The choice of OCR software depends mainly on its recognition success rate, which is why Tesseract was chosen.


[[File:Importflow.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the control flow when the image data is converted to text data and then stored in the database.

<pre>
class S3OCRImageParser(object):
    """
        Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
            Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern, because data from different types of form fields in the image have to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

== References ==
<references/>

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T01:00:32Z

Abhanda3: /* Implementation Training Module and OCR */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] into [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify the typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. Sahana was originally developed by members of the Sri Lankan IT community who wanted to apply their talent to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is hosted on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform that can be deployed to manage critical humanitarian needs either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computer receipts, passports, bank statements and other documents. It is a field of research in pattern recognition, artificial intelligence and computer vision.

As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that their contents can be stored in the database. Handwritten entries may be hard to recognize because handwriting differs from person to person. This problem can be addressed in two ways: by using typewritten entries, or by representing the data to be filled in as check boxes. For example, instead of providing a free-text field “Severity of damage”, the required answer can be collected as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can detect the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation that is already present in the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, which is cross-platform software; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* ImageMagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly into two parts:
==Form Generation==
In this step a PDF document is generated that presents most of the common data as fields with corresponding check boxes. For example, a form for North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the control flow for generating a PDF from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    # Start with empty content, an in-memory output buffer and an empty OCR layout tree
    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is the entry point for form generation. It uses the Adapter pattern because there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form type in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done with the generated box file rather than with the image, without human involvement.
After this, OCR is performed using Tesseract. The choice of OCR software depends mainly on its recognition success rate, which is why Tesseract was chosen.


[[File:Importflow.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the control flow when the image data is converted to text data and then stored in the database.

<pre>
class S3OCRImageParser(object):
    """
        Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
            Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern, because data from different types of form fields in the image have to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:59:56Z

Abhanda3: /* Form Generation */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] into [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify the typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. Sahana was originally developed by members of the Sri Lankan IT community who wanted to apply their talent to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.
The code for the Sahana Eden project is hosted on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform that can be deployed to manage critical humanitarian needs either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computer receipts, passports, bank statements and other documents. It is a field of research in pattern recognition, artificial intelligence and computer vision.

As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that their contents can be stored in the database. Handwritten entries may be hard to recognize because handwriting differs from person to person. This problem can be addressed in two ways: by using typewritten entries, or by representing the data to be filled in as check boxes. For example, instead of providing a free-text field “Severity of damage”, the required answer can be collected as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can detect the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation that is already present in the project.
* Move the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, a cross-platform OCR engine; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1
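
A quick way to see which of these are missing on a machine is sketched below. It is written for Python 3 and is not the <code>checkDependencies</code> function from the Sahana source; the importable module names and executable names are assumptions about how the packages listed above appear once installed.

```python
import importlib.util
import shutil

# Importable module names and executables assumed to correspond to the
# packages listed above (an assumption, not taken from the Sahana source).
PYTHON_MODULES = ["lxml", "PIL", "reportlab"]
COMMAND_LINE_TOOLS = ["convert", "tesseract"]


def missing_dependencies(modules=PYTHON_MODULES, tools=COMMAND_LINE_TOOLS):
    """Return the names of modules and tools that cannot be found."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [t for t in tools if shutil.which(t) is None]
    return missing


if __name__ == "__main__":
    print("Missing:", missing_dependencies() or "nothing")
```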

The functionality of OCR in Sahana can be broadly divided into two parts:
==Form Generation==
In this step we generate a PDF document that presents most of the general data as fields with corresponding check boxes. For example, a form pertaining to North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File: Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is the entry point for form generation. It uses the Adapter pattern, since there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form type in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done with the generated box file rather than with the image, without human involvement.
After this, OCR is performed using the Tesseract software. Tesseract was chosen mainly because of its recognition success rate.
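
As a rough sketch of this step, Tesseract can be run from Python through its command-line interface, which takes an input image and an output base name (<code>tesseract imagename outputbase</code>). The helper names and file names below are hypothetical, and this is not how modules/s3/s3pdf.py invokes Tesseract.

```python
import subprocess


def tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for the tesseract command-line tool."""
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base):
    """Run tesseract; it writes the recognised text to <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


print(tesseract_command("scanned_form.png", "scanned_form"))
```

Keeping the command construction separate from the call that runs it makes the command easy to inspect on machines where Tesseract is not installed.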


[[File:Importflow.png|center]]


The image above shows the control flow when the image data is converted to text data and then populated into the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the image parsing and OCR utility is implemented. It again uses the Adapter pattern, since data from different types of form fields in the image has to be converted to text data.
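
To illustrate what the Adapter pattern mentioned above could look like here, the following is a minimal sketch in which each kind of form field gets a small adapter exposing the same <code>to_text()</code> method. All class names and the shape of the raw field data are hypothetical; this is not the structure of <code>S3OCRImageParser</code>.

```python
class CheckboxFieldAdapter:
    """Adapts a list of (label, ticked) pairs to a text value."""

    def __init__(self, options):
        self.options = options

    def to_text(self):
        # Join the labels of all ticked boxes.
        return ", ".join(label for label, ticked in self.options if ticked)


class TextFieldAdapter:
    """Adapts raw recognised text by trimming surrounding whitespace."""

    def __init__(self, raw_text):
        self.raw_text = raw_text

    def to_text(self):
        return self.raw_text.strip()


def fields_to_record(fields):
    """Convert a dict of field name -> adapter into plain text values."""
    return {name: adapter.to_text() for name, adapter in fields.items()}


record = fields_to_record({
    "lives_at": CheckboxFieldAdapter([("Raleigh", False), ("Cary", True)]),
    "name": TextFieldAdapter("  Jane Doe \n"),
})
print(record)  # prints {'lives_at': 'Cary', 'name': 'Jane Doe'}
```

Because every adapter offers the same method, the code that writes to the database does not need to know which kind of field produced a value.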

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:58:29Z

Abhanda3: /* Form Generation */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] in [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved into the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform that can be deployed for critical humanitarian needs management either prior to or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computerized receipts, passports, bank statements and other documents. It is also a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier, the main purpose of Sahana Eden is to provide technical support in disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database. Handwritten text may be hard to recognize because handwriting varies from person to person. This problem can be overcome in two ways: one is to use typewritten text, and the other is to represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be collected by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can identify the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected area, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation that is already present in the project.
* This involves moving the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, a cross-platform OCR engine; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be broadly divided into two parts:
==Form Generation==
In this step we generate a PDF document that presents most of the general data as fields with corresponding check boxes. For example, a form pertaining to North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File:Generated.png|frame|Source: http://eden.sahanafoundation.org/wiki/BluePrint/OCRIntegration#no1|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is the entry point for form generation. It uses the Adapter pattern, since there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form type in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done with the generated box file rather than with the image, without human involvement.
After this, OCR is performed using the Tesseract software. Tesseract was chosen mainly because of its recognition success rate.


[[File:Importflow.png|center]]


The image above shows the control flow when the image data is converted to text data and then populated into the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the image parsing and OCR utility is implemented. It again uses the Adapter pattern, since data from different types of form fields in the image has to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:56:37Z

Abhanda3: /* Form Generation */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] in [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved into the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform that can be deployed for critical humanitarian needs management either prior to or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computerized receipts, passports, bank statements and other documents. It is also a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier, the main purpose of Sahana Eden is to provide technical support in disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database. Handwritten text may be hard to recognize because handwriting varies from person to person. This problem can be overcome in two ways: one is to use typewritten text, and the other is to represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be collected by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can identify the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected area, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation that is already present in the project.
* This involves moving the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, a cross-platform OCR engine; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be broadly divided into two parts:
==Form Generation==
In this step we generate a PDF document that presents most of the general data as fields with corresponding check boxes. For example, a form pertaining to North Carolina can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, “Morrisville” or “Others”. This makes the data easier for the OCR to interpret.


[[File:Generated.png|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is the entry point for form generation. It uses the Adapter pattern, since there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form type in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done with the generated box file rather than with the image, without human involvement.
After this, OCR is performed using the Tesseract software. Tesseract was chosen mainly because of its recognition success rate.


[[File:Importflow.png|center]]


The image above shows the control flow when the image data is converted to text data and then populated into the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the image parsing and OCR utility is implemented. It again uses the Adapter pattern, since data from different types of form fields in the image has to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:54:28Z

Abhanda3: /* Introduction to Sahana */

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] in [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. Sahana Eden is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved into the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform that can be deployed for critical humanitarian needs management either prior to or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computerized receipts, passports, bank statements and other documents. It is also a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier, the main purpose of Sahana Eden is to provide technical support in disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database. Handwritten text may be hard to recognize because handwriting varies from person to person. This problem can be overcome in two ways: one is to use typewritten text, and the other is to represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be collected by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can identify the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected area, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation that is already present in the project.
* This involves moving the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, a cross-platform OCR engine; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be broadly divided into two parts:
==Form Generation==
In this step we generate a PDF document that presents most of the general data as fields with corresponding check boxes. For example, a form can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary” and “Morrisville”.


[[File:Generated.png|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is the entry point for form generation. It uses the Adapter pattern, since there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form type in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done with the generated box file rather than with the image, without human involvement.
After this, OCR is performed using the Tesseract software. Tesseract was chosen mainly because of its recognition success rate.


[[File:Importflow.png|center]]


The image above shows the control flow when the image data is converted to text data and then populated into the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the image parsing and OCR utility is implemented. It again uses the Adapter pattern, since data from different types of form fields in the image has to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:54:00Z

Abhanda3:

S1501: Integrating OCR

This page describes the project to integrate [http://en.wikipedia.org/wiki/Optical_character_recognition/ Optical Character Recognition (OCR)] in [http://eden.sahanafoundation.org/ Sahana Eden], which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. [http://eden.sahanafoundation.org/ Sahana Eden] is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved into the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform that can be deployed for critical humanitarian needs management either prior to or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computerized receipts, passports, bank statements and other documents. It is also a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier, the main purpose of Sahana Eden is to provide technical support in disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database. Handwritten text may be hard to recognize because handwriting varies from person to person. This problem can be overcome in two ways: one is to use typewritten text, and the other is to represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be collected by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can identify the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected area, designing forms that rely heavily on check boxes can be a better solution.

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation that is already present in the project.
* This involves moving the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, a cross-platform OCR engine; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be broadly divided into two parts:
==Form Generation==
In this step we generate a PDF document that presents most of the general data as fields with corresponding check boxes. For example, a form can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary” and “Morrisville”.


[[File:Generated.png|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is the entry point for form generation. It uses the Adapter pattern, since there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form type in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done with the generated box file rather than with the image, without human involvement.
After this, OCR is performed using the Tesseract software. Tesseract was chosen mainly because of its recognition success rate.


[[File:Importflow.png|center]]


The image above shows the control flow when the image data is converted to text data and then populated into the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the image parsing and OCR utility is implemented. It again uses the Adapter pattern, since data from different types of form fields in the image has to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:51:36Z

Abhanda3: /* Introduction to Sahana */

S1501: Integrating OCR

This page describes the project to integrate optical character recognition (OCR) in Sahana Eden, which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. [http://eden.sahanafoundation.org/ Sahana Eden] is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. It was originally developed by members of the Sri Lankan IT community who wanted to apply their talents to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved into the present Sahana Software Foundation.
The code for the Sahana Eden project is available on [https://github.com/flavour/eden GitHub].

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform that can be deployed to manage critical humanitarian needs either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
Wikipedia defines optical character recognition as follows: “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computerized receipts, passports and bank statements. It is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier, the main purpose of Sahana Eden is to provide technical support in disaster management. During a disaster it may not always be possible for the authorities to carry a computer to the affected area to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and later scanned so that their contents can be stored in the database. Handwritten entries may be hard to recognize because handwriting varies from person to person. This problem can be overcome in two ways: use typewritten entries, or represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be collected as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can detect the tick and store the corresponding value according to the database schema. Since it may not always be possible to carry a computer or laptop to the affected areas, designing forms that rely heavily on check boxes can be a better solution.
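The mapping from a ticked box to a stored value can be sketched as follows. This is only an illustration, not the Eden implementation: the function name <code>decode_checkbox_field</code> and the list-of-booleans input are assumptions, standing in for whatever the scanner and OCR step actually report for each box.

```python
def decode_checkbox_field(marks, options):
    """Return the option whose box is ticked.

    marks   -- one boolean per box, True if the box was detected as ticked
               (hypothetical input format, assumed for this sketch)
    options -- the value each box stands for, in the same order

    Returns None when no box or more than one box is ticked, so that the
    ambiguous answer can be reviewed by a person instead of being stored.
    """
    if len(marks) != len(options):
        raise ValueError("one mark per option is required")
    ticked = [option for option, mark in zip(options, marks) if mark]
    return ticked[0] if len(ticked) == 1 else None


# "Severity of Damage" with the third box ticked
severity = decode_checkbox_field([False, False, True, False, False],
                                 [1, 2, 3, 4, 5])
```

Returning <code>None</code> for ambiguous input is a design choice of this sketch; a real form importer might instead flag the field for manual correction.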

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation that is already present in the project.
* This involves moving the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, a cross-platform OCR engine; the implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies. The following packages and command-line tools have to be installed:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1
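
Whether these dependencies are available on a machine can be checked with a short script. The sketch below is not Eden's own <code>checkDependencies</code>; the function name <code>missing_dependencies</code> is hypothetical, the import names (<code>lxml</code>, <code>PIL</code>, <code>reportlab</code>) and command names (<code>convert</code>, <code>tesseract</code>) are the usual ones for the packages listed above, and the code assumes Python 3.

```python
import importlib.util
import shutil

# Import names of the Python packages listed above (assumed mapping:
# python-lxml -> lxml, python-imaging -> PIL, python-reportlab -> reportlab)
PYTHON_MODULES = ("lxml", "PIL", "reportlab")
# Command-line tools: ImageMagick's 'convert' and the Tesseract binary
COMMANDS = ("convert", "tesseract")


def missing_dependencies(modules=PYTHON_MODULES, commands=COMMANDS):
    """Return the names of modules and commands that cannot be found."""
    # find_spec returns None when a top-level module is not installed
    missing = [name for name in modules
               if importlib.util.find_spec(name) is None]
    # shutil.which returns None when a command is not on the PATH
    missing += [name for name in commands if shutil.which(name) is None]
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing OCR dependencies: " + ", ".join(missing))
    else:
        print("All OCR dependencies found")
```

The script does not check versions, so an installed Tesseract older or newer than 3.00-1 would still be reported as found.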

The functionality of OCR in Sahana can be broadly divided into two parts:
==Form Generation==
In this step a PDF document is generated that contains most of the general data fields with corresponding check boxes. For example, a form can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary” and “Morrisville”.


[[File:Generated.png|center]]


The image above shows the control flow for generating a PDF from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is used for form generation. It uses the Adapter pattern because there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form in question.
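
To illustrate how a layout tree rooted at <code>s3ocrlayout</code> could describe a check-box field, here is a minimal sketch. Only the root element name comes from the code above; the child elements <code>field</code> and <code>option</code>, the <code>name</code> attribute and the function <code>build_layout</code> are hypothetical, and the standard-library <code>xml.etree.ElementTree</code> is used in place of lxml so that the example runs without extra packages.

```python
import xml.etree.ElementTree as etree


def build_layout(fields):
    """Build a layout tree from (field name, list of options) pairs.

    The element and attribute names below the root are assumptions
    made for this sketch, not the schema used by Sahana Eden.
    """
    root = etree.Element("s3ocrlayout")
    for name, options in fields:
        field = etree.SubElement(root, "field", name=name)
        for option in options:
            # one <option> element per check box on the printed form
            etree.SubElement(field, "option").text = option
    return root


layout = build_layout(
    [("Lives at", ["Raleigh", "Charlotte", "Cary", "Morrisville"])]
)
print(etree.tostring(layout, encoding="unicode"))
```

Keeping the layout as a tree alongside the PDF means the same description can later tell the import step where each check box is expected.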

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done against the generated box file rather than the image, so no human involvement is needed.
After this, OCR is performed using the Tesseract software. Tesseract was chosen mainly because of its recognition success rate.


[[File:Importflow.png|center]]


The image above shows the control flow when the image data is converted to text data and then populated in the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern, because data from the different types of form fields in the image has to be converted to text data.
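
The idea of adapting each field type to a common interface can be sketched as below. None of these classes exist in Sahana Eden; <code>TextField</code>, <code>IntegerField</code>, <code>CheckboxField</code> and <code>parse_form</code> are hypothetical names, and the raw inputs stand in for whatever the OCR step returns for each region of the scanned form.

```python
class TextField:
    """Adapter for free-text boxes: the OCR output is already text."""

    def to_value(self, raw):
        return raw.strip()


class IntegerField:
    """Adapter for numeric boxes: keep the digits, drop OCR noise."""

    def to_value(self, raw):
        digits = "".join(ch for ch in raw if ch.isdigit())
        return int(digits) if digits else None


class CheckboxField:
    """Adapter for check boxes: raw is one boolean per box."""

    def __init__(self, options):
        self.options = options

    def to_value(self, raw):
        ticked = [opt for opt, mark in zip(self.options, raw) if mark]
        return ticked[0] if len(ticked) == 1 else None


def parse_form(raw_fields, adapters):
    """Convert every raw field through the adapter registered for it."""
    return {name: adapters[name].to_value(raw)
            for name, raw in raw_fields.items()}


record = parse_form(
    {"name": "  Jane Doe \n", "age": "3 4", "severity": [False, True, False]},
    {"name": TextField(), "age": IntegerField(),
     "severity": CheckboxField([1, 2, 3])},
)
```

Because every adapter exposes the same <code>to_value</code> method, the importing code does not need to know which kind of field it is handling.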

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:49:38Z

Abhanda3: /* Introduction to Sahana */

S1501: Integrating OCR

This page describes the project to integrate optical character recognition (OCR) into Sahana Eden, which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The [http://sahanafoundation.org/ Sahana Software Foundation] was established as a non-profit organization in 2009. [http://eden.sahanafoundation.org/ Sahana Eden] is dedicated to saving lives by providing information management solutions which can enable organizations to prepare and respond better to disasters. It was originally developed by Sri Lankan based IT Community who wanted to apply their talent towards helping their country recover from the 2004 Indian Ocean earthquake and Tsunami. It grew into a global open source project which is now supported by hundreds of volunteer contributions from different parts of the world. The community evolved and emerged to become the present Sahana Software Foundation.

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform which can be deployed for critical humanitarian needs management either prior to or during a crisis. This is specifically built for the disaster management. The mission of Sahana Eden is “ To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards. ” Eden has a rich feature set and be rapidly customized to adapt to existing processes and can also be integrated with existing systems. It can be accessed from the web or locally from a flash drive which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules which can be configured to provide a wide range of functionality. Some of its main capabilities are :

* Organization Registry
* Shelter
* Project Tracking.
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. Probably this is the main reason it has been successfully deployed during many disasters and have been of great help to the disaster management authorities.

== What is OCR? ==
The wiki definition of the optical character recognition is “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to provide the computer the capability of identifying and storing the data which was hand-written/type-written on a paper. It found wide applications in data entry from printed paper data records such as invoices, computer receipts, passports, bank statements or from any other document. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier the main purpose of Sahana Eden is to provide technical support in disaster management. During disasters it may not be always be possible for the authorities to carry computer to the affected areas to manually enter the data and also the area may not have the internet facility. Instead the forms can be printed and can be made to be filled in the area and then scan the images to store in the database. It may not be possible to identify the handwritten images because different people have different fonts. This problem can be overcome in two ways. One way is to use the type-written images and the second one is trying to represent the data that has to be filled in the form of check boxes. For example instead of putting a field “Severity of damage”, we can get the required answer by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of them is ticked, the computer can identify the tick and can store the data referring to database schema. It may not always be possible to carry a computer/laptop to the affected areas. Hence designing forms which highly rely on check boxes can be a better solution.

== Goals of Project ==

* The Goal of project is to integrate OCR implementation present in the project.
* Moving the implementation from modules/s3/s3pdf.py, to s3codecs/pdf.py , as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using tesseract which is a cross platform compatible software and the implementation is present in modules/s3/s3pdf.py .
OCR has few dependencies that have to be installed. The various dependencies and command line tools that have to be installed are
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly in two important parts
==Form Generation==
In this step we generate a pdf document that contains most of the general data with corresponding checkboxes. For example a form can contain the field “Lives at” with check boxes for “Raleigh” , “Charlotte”, “Cary”, “Morrisville”.


[[File:Generated.png|center]]


The image above shows the control flow for generating a PDF from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is used for form generation. It uses the Adapter pattern because there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done against the generated box file rather than the image, so no human involvement is needed.
After this, OCR is performed using the Tesseract software. Tesseract was chosen mainly because of its recognition success rate.


[[File:Importflow.png|center]]


The image above shows the control flow when the image data is converted to text data and then populated in the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern, because data from the different types of form fields in the image has to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:47:35Z

Abhanda3: /* OCR module Integrated */

S1501: Integrating OCR

This page describes the project to integrate optical character recognition (OCR) into Sahana Eden, which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions which can enable organizations to prepare and respond better to disasters. It was originally developed by Sri Lankan based IT Community who wanted to apply their talent towards helping their country recover from the 2004 Indian Ocean earthquake and Tsunami. It grew into a global open source project which is now supported by hundreds of volunteer contributions from different parts of the world. The community evolved and emerged to become the present Sahana Software Foundation.

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform which can be deployed for critical humanitarian needs management either prior to or during a crisis. This is specifically built for the disaster management. The mission of Sahana Eden is “ To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards. ” Eden has a rich feature set and be rapidly customized to adapt to existing processes and can also be integrated with existing systems. It can be accessed from the web or locally from a flash drive which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules which can be configured to provide a wide range of functionality. Some of its main capabilities are :

* Organization Registry
* Shelter
* Project Tracking.
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. Probably this is the main reason it has been successfully deployed during many disasters and have been of great help to the disaster management authorities.

== What is OCR? ==
The wiki definition of the optical character recognition is “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to provide the computer the capability of identifying and storing the data which was hand-written/type-written on a paper. It found wide applications in data entry from printed paper data records such as invoices, computer receipts, passports, bank statements or from any other document. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier the main purpose of Sahana Eden is to provide technical support in disaster management. During disasters it may not be always be possible for the authorities to carry computer to the affected areas to manually enter the data and also the area may not have the internet facility. Instead the forms can be printed and can be made to be filled in the area and then scan the images to store in the database. It may not be possible to identify the handwritten images because different people have different fonts. This problem can be overcome in two ways. One way is to use the type-written images and the second one is trying to represent the data that has to be filled in the form of check boxes. For example instead of putting a field “Severity of damage”, we can get the required answer by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of them is ticked, the computer can identify the tick and can store the data referring to database schema. It may not always be possible to carry a computer/laptop to the affected areas. Hence designing forms which highly rely on check boxes can be a better solution.

== Goals of Project ==

* The Goal of project is to integrate OCR implementation present in the project.
* Moving the implementation from modules/s3/s3pdf.py, to s3codecs/pdf.py , as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using tesseract which is a cross platform compatible software and the implementation is present in modules/s3/s3pdf.py .
OCR has few dependencies that have to be installed. The various dependencies and command line tools that have to be installed are
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly in two important parts
==Form Generation==
In this step we generate a pdf document that contains most of the general data with corresponding checkboxes. For example a form can contain the field “Lives at” with check boxes for “Raleigh” , “Charlotte”, “Cary”, “Morrisville”.


[[File:Generated.png|center]]


The image above shows the control flow for generating a PDF from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is used for form generation. It uses the Adapter pattern because there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done against the generated box file rather than the image, so no human involvement is needed.
After this, OCR is performed using the Tesseract software. Tesseract was chosen mainly because of its recognition success rate.


[[File:Importflow.png|center]]


The image above shows the control flow when the image data is converted to text data and then populated in the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern, because data from the different types of form fields in the image has to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]


The implementation was included in the project by making changes to models/000_config.py & modules/templates/<template>/config.py.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:45:50Z

Abhanda3:

S1501: Integrating OCR

This page describes the project to integrate optical character recognition (OCR) into Sahana Eden, which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions which can enable organizations to prepare and respond better to disasters. It was originally developed by Sri Lankan based IT Community who wanted to apply their talent towards helping their country recover from the 2004 Indian Ocean earthquake and Tsunami. It grew into a global open source project which is now supported by hundreds of volunteer contributions from different parts of the world. The community evolved and emerged to become the present Sahana Software Foundation.

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform which can be deployed for critical humanitarian needs management either prior to or during a crisis. This is specifically built for the disaster management. The mission of Sahana Eden is “ To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards. ” Eden has a rich feature set and be rapidly customized to adapt to existing processes and can also be integrated with existing systems. It can be accessed from the web or locally from a flash drive which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules which can be configured to provide a wide range of functionality. Some of its main capabilities are :

* Organization Registry
* Shelter
* Project Tracking.
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. Probably this is the main reason it has been successfully deployed during many disasters and have been of great help to the disaster management authorities.

== What is OCR? ==
The wiki definition of the optical character recognition is “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to provide the computer the capability of identifying and storing the data which was hand-written/type-written on a paper. It found wide applications in data entry from printed paper data records such as invoices, computer receipts, passports, bank statements or from any other document. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier the main purpose of Sahana Eden is to provide technical support in disaster management. During disasters it may not be always be possible for the authorities to carry computer to the affected areas to manually enter the data and also the area may not have the internet facility. Instead the forms can be printed and can be made to be filled in the area and then scan the images to store in the database. It may not be possible to identify the handwritten images because different people have different fonts. This problem can be overcome in two ways. One way is to use the type-written images and the second one is trying to represent the data that has to be filled in the form of check boxes. For example instead of putting a field “Severity of damage”, we can get the required answer by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of them is ticked, the computer can identify the tick and can store the data referring to database schema. It may not always be possible to carry a computer/laptop to the affected areas. Hence designing forms which highly rely on check boxes can be a better solution.

== Goals of Project ==

* The Goal of project is to integrate OCR implementation present in the project.
* Moving the implementation from modules/s3/s3pdf.py, to s3codecs/pdf.py , as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using tesseract which is a cross platform compatible software and the implementation is present in modules/s3/s3pdf.py .
OCR has few dependencies that have to be installed. The various dependencies and command line tools that have to be installed are
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly in two important parts
==Form Generation==
In this step we generate a pdf document that contains most of the general data with corresponding checkboxes. For example a form can contain the field “Lives at” with check boxes for “Raleigh” , “Charlotte”, “Cary”, “Morrisville”.


[[File:Generated.png|center]]


The image above shows the control flow for generating a PDF from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is used for form generation. It uses the Adapter pattern because there are many different types of forms: whenever a form has to be generated, the generator is adapted to the form in question.

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done against the generated box file rather than the image, so no human involvement is needed.
After this, OCR is performed using the Tesseract software. Tesseract was chosen mainly because of its recognition success rate.


[[File:Importflow.png|center]]


The image above shows the control flow when the image data is converted to text data and then populated in the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern, because data from the different types of form fields in the image has to be converted to text data.

== OCR module Integrated ==


[[File:OCRModuleInclusion.png|center]]

File:OCRModuleInclusion.png

2015-03-24T00:44:24Z

Abhanda3:

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:41:22Z

Abhanda3: /* Implementation Training Module and OCR */

S1501: Integrating OCR

This page describes the project to integrate optical character recognition (OCR) into Sahana Eden, which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions which can enable organizations to prepare and respond better to disasters. It was originally developed by Sri Lankan based IT Community who wanted to apply their talent towards helping their country recover from the 2004 Indian Ocean earthquake and Tsunami. It grew into a global open source project which is now supported by hundreds of volunteer contributions from different parts of the world. The community evolved and emerged to become the present Sahana Software Foundation.

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform which can be deployed for critical humanitarian needs management either prior to or during a crisis. This is specifically built for the disaster management. The mission of Sahana Eden is “ To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards. ” Eden has a rich feature set and be rapidly customized to adapt to existing processes and can also be integrated with existing systems. It can be accessed from the web or locally from a flash drive which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules which can be configured to provide a wide range of functionality. Some of its main capabilities are :

* Organization Registry
* Shelter
* Project Tracking.
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. Probably this is the main reason it has been successfully deployed during many disasters and have been of great help to the disaster management authorities.

== What is OCR? ==
The wiki definition of the optical character recognition is “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to provide the computer the capability of identifying and storing the data which was hand-written/type-written on a paper. It found wide applications in data entry from printed paper data records such as invoices, computer receipts, passports, bank statements or from any other document. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier the main purpose of Sahana Eden is to provide technical support in disaster management. During disasters it may not be always be possible for the authorities to carry computer to the affected areas to manually enter the data and also the area may not have the internet facility. Instead the forms can be printed and can be made to be filled in the area and then scan the images to store in the database. It may not be possible to identify the handwritten images because different people have different fonts. This problem can be overcome in two ways. One way is to use the type-written images and the second one is trying to represent the data that has to be filled in the form of check boxes. For example instead of putting a field “Severity of damage”, we can get the required answer by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of them is ticked, the computer can identify the tick and can store the data referring to database schema. It may not always be possible to carry a computer/laptop to the affected areas. Hence designing forms which highly rely on check boxes can be a better solution.

== Goals of Project ==

* The Goal of project is to integrate OCR implementation present in the project.
* Moving the implementation from modules/s3/s3pdf.py, to s3codecs/pdf.py , as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using tesseract which is a cross platform compatible software and the implementation is present in modules/s3/s3pdf.py .
OCR has few dependencies that have to be installed. The various dependencies and command line tools that have to be installed are
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly in two important parts
==Form Generation==
In this step we generate a pdf document that contains most of the general data with corresponding checkboxes. For example a form can contain the field “Lives at” with check boxes for “Raleigh” , “Charlotte”, “Cary”, “Morrisville”.


[[File:Generated.png|center]]


The image above shows the control flow for generating a PDF from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

The code above is for form generation. It uses the Adapter pattern because there are many different types of forms: whenever a form has to be generated, the generator is adapted to that form type.
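
The Adapter pattern mentioned above can be sketched as follows. This is a generic illustration, not code from s3pdf.py: all class and method names are hypothetical, and the fields are rendered as plain text rather than onto a PDF.

```python
class TextField:
    """A field type with its own drawing interface (hypothetical)."""

    def __init__(self, label):
        self.label = label

    def draw_text_line(self):
        return "%s: ____________" % self.label


class ChoiceField:
    """A second field type with a different interface (hypothetical)."""

    def __init__(self, label, options):
        self.label = label
        self.options = options

    def draw_boxes(self):
        boxes = " ".join("[ ] " + option for option in self.options)
        return "%s: %s" % (self.label, boxes)


class TextFieldAdapter:
    """Expose TextField through the common render() interface."""

    def __init__(self, field):
        self.field = field

    def render(self):
        return self.field.draw_text_line()


class ChoiceFieldAdapter:
    """Expose ChoiceField through the common render() interface."""

    def __init__(self, field):
        self.field = field

    def render(self):
        return self.field.draw_boxes()


def render_form(adapters):
    """The generator only relies on render(), whatever the field type."""
    return "\n".join(adapter.render() for adapter in adapters)
```

With this structure, supporting a new field type means adding one adapter while render_form stays unchanged.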

==Implementation Training Module and OCR==
The generated form is then filled in by the user. The automated matching is done against the generated box file rather than against the image, without human involvement.
After this, OCR is performed using the Tesseract software. The choice of software depends mainly on its success rate, and Tesseract was chosen for this reason.
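
The Tesseract step can be sketched with the command-line tool, which takes an input image and an output base name and writes the recognized text to a file with the base name plus .txt. The helper names below are hypothetical and the language code is an assumption; this is not how s3pdf.py invokes Tesseract.

```python
import subprocess


def tesseract_command(image_path, output_base, language="eng"):
    """Build the command line: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base, language="eng"):
    """Run Tesseract and return the recognized text.

    Requires the tesseract executable on PATH; raises CalledProcessError
    if the command fails.
    """
    subprocess.run(tesseract_command(image_path, output_base, language),
                   check=True)
    # Tesseract writes its result to <output_base>.txt.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Building the command in a separate function makes it possible to inspect or log the exact invocation without running the tool.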


[[File:Importflow.png|center]]


The image above shows the control flow when the image data is converted to text data and then stored in the database.

<pre>
class S3OCRImageParser(object):
    """
    Image Parsing and OCR Utility
    """

    def __init__(self, s3method, r):
        """
        Initialise class instance with environment variables and functions
        """

        self.r = r
        self.request = current.request
        checkDependencies(r)
</pre>

This is the class in which the parsing and OCR utility is implemented. It again uses the Adapter pattern, as data from different types of form fields in the image have to be converted to text data.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:35:27Z

Abhanda3: /* OCR in Sahana */

S1501: Integrating OCR

This page describes the project to integrate optical character recognition (OCR) into Sahana Eden, which enables Eden to detect and identify typewritten data on forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions that enable organizations to prepare for and respond better to disasters. It was originally developed by a Sri Lanka-based IT community whose members wanted to apply their talent to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project which is now supported by hundreds of volunteer contributors from different parts of the world. The community evolved to become the present Sahana Software Foundation.

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform which can be deployed for critical humanitarian needs management either prior to or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can also be integrated with existing systems. It can be accessed from the web or locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules which can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
The wiki definition of the optical character recognition is “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to provide the computer the capability of identifying and storing the data which was hand-written/type-written on a paper. It found wide applications in data entry from printed paper data records such as invoices, computer receipts, passports, bank statements or from any other document. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier the main purpose of Sahana Eden is to provide technical support in disaster management. During disasters it may not be always be possible for the authorities to carry computer to the affected areas to manually enter the data and also the area may not have the internet facility. Instead the forms can be printed and can be made to be filled in the area and then scan the images to store in the database. It may not be possible to identify the handwritten images because different people have different fonts. This problem can be overcome in two ways. One way is to use the type-written images and the second one is trying to represent the data that has to be filled in the form of check boxes. For example instead of putting a field “Severity of damage”, we can get the required answer by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of them is ticked, the computer can identify the tick and can store the data referring to database schema. It may not always be possible to carry a computer/laptop to the affected areas. Hence designing forms which highly rely on check boxes can be a better solution.

== Goals of Project ==

* The Goal of project is to integrate OCR implementation present in the project.
* Moving the implementation from modules/s3/s3pdf.py, to s3codecs/pdf.py , as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using tesseract which is a cross platform compatible software and the implementation is present in modules/s3/s3pdf.py .
OCR has few dependencies that have to be installed. The various dependencies and command line tools that have to be installed are
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly in two important parts
==Form Generation==
In this step we generate a pdf document that contains most of the general data with corresponding checkboxes. For example a form can contain the field “Lives at” with check boxes for “Raleigh” , “Charlotte”, “Cary”, “Morrisville”.


[[File:Generated.png|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

This above code is for form generation. It uses Adapter pattern in its implementation as there are many different types of forms present. So whenever a form has to be generated, it is accordingly tweaked and adapted to generate those.

==Implementation Training Module and OCR==
The generated form is then filled by the user. The automated matching is done with the box generated file rather than the image from human involvement.
Post this OCR is deployed using the Tesseract software. The choice of software is mainly dependent on the success rate of the software and hence Tesseract was chosen for this purpose.


[[File:Importflow.png|center]]

File:Importflow.png

2015-03-24T00:34:45Z

Abhanda3: This image shows the flow control when the image data is stored as text data.

This image shows the flow control when the image data is stored as text data.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:33:30Z

Abhanda3: /* OCR in Sahana */

S1501: Integrating OCR

This page explains about the project to integrate the optical character recognition (OCR) in Sahana Eden, which enables the Eden to detect and identify the type-written data on the forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions which can enable organizations to prepare and respond better to disasters. It was originally developed by Sri Lankan based IT Community who wanted to apply their talent towards helping their country recover from the 2004 Indian Ocean earthquake and Tsunami. It grew into a global open source project which is now supported by hundreds of volunteer contributions from different parts of the world. The community evolved and emerged to become the present Sahana Software Foundation.

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform which can be deployed for critical humanitarian needs management either prior to or during a crisis. This is specifically built for the disaster management. The mission of Sahana Eden is “ To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards. ” Eden has a rich feature set and be rapidly customized to adapt to existing processes and can also be integrated with existing systems. It can be accessed from the web or locally from a flash drive which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules which can be configured to provide a wide range of functionality. Some of its main capabilities are :

* Organization Registry
* Shelter
* Project Tracking.
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. Probably this is the main reason it has been successfully deployed during many disasters and have been of great help to the disaster management authorities.

== What is OCR? ==
The wiki definition of the optical character recognition is “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to provide the computer the capability of identifying and storing the data which was hand-written/type-written on a paper. It found wide applications in data entry from printed paper data records such as invoices, computer receipts, passports, bank statements or from any other document. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier the main purpose of Sahana Eden is to provide technical support in disaster management. During disasters it may not be always be possible for the authorities to carry computer to the affected areas to manually enter the data and also the area may not have the internet facility. Instead the forms can be printed and can be made to be filled in the area and then scan the images to store in the database. It may not be possible to identify the handwritten images because different people have different fonts. This problem can be overcome in two ways. One way is to use the type-written images and the second one is trying to represent the data that has to be filled in the form of check boxes. For example instead of putting a field “Severity of damage”, we can get the required answer by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of them is ticked, the computer can identify the tick and can store the data referring to database schema. It may not always be possible to carry a computer/laptop to the affected areas. Hence designing forms which highly rely on check boxes can be a better solution.

== Goals of Project ==

* The Goal of project is to integrate OCR implementation present in the project.
* Moving the implementation from modules/s3/s3pdf.py, to s3codecs/pdf.py , as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using tesseract which is a cross platform compatible software and the implementation is present in modules/s3/s3pdf.py .
OCR has few dependencies that have to be installed. The various dependencies and command line tools that have to be installed are
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly in two important parts
==Form Generation==
In this step we generate a pdf document that contains most of the general data with corresponding checkboxes. For example a form can contain the field “Lives at” with check boxes for “Raleigh” , “Charlotte”, “Cary”, “Morrisville”.


[[File:Generated.png|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

This above code is for form generation. It uses Adapter pattern in its implementation as there are many different types of forms present. So whenever a form has to be generated, it is accordingly tweaked and adapted to generate those.

==Implementation Training Module and OCR==
The generated form is then filled by the user. The automated matching is done with the box generated file rather than the image from human involvement.
Post this OCR is deployed using the Tesseract software. The choice of software is mainly dependent on the success rate of the software and hence Tesseract was chosen for this purpose.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:32:54Z

Abhanda3: /* Implementation Training Module */

S1501: Integrating OCR

This page explains about the project to integrate the optical character recognition (OCR) in Sahana Eden, which enables the Eden to detect and identify the type-written data on the forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions which can enable organizations to prepare and respond better to disasters. It was originally developed by Sri Lankan based IT Community who wanted to apply their talent towards helping their country recover from the 2004 Indian Ocean earthquake and Tsunami. It grew into a global open source project which is now supported by hundreds of volunteer contributions from different parts of the world. The community evolved and emerged to become the present Sahana Software Foundation.

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform which can be deployed for critical humanitarian needs management either prior to or during a crisis. This is specifically built for the disaster management. The mission of Sahana Eden is “ To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards. ” Eden has a rich feature set and be rapidly customized to adapt to existing processes and can also be integrated with existing systems. It can be accessed from the web or locally from a flash drive which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules which can be configured to provide a wide range of functionality. Some of its main capabilities are :

* Organization Registry
* Shelter
* Project Tracking.
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. Probably this is the main reason it has been successfully deployed during many disasters and have been of great help to the disaster management authorities.

== What is OCR? ==
The wiki definition of the optical character recognition is “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to provide the computer the capability of identifying and storing the data which was hand-written/type-written on a paper. It found wide applications in data entry from printed paper data records such as invoices, computer receipts, passports, bank statements or from any other document. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier the main purpose of Sahana Eden is to provide technical support in disaster management. During disasters it may not be always be possible for the authorities to carry computer to the affected areas to manually enter the data and also the area may not have the internet facility. Instead the forms can be printed and can be made to be filled in the area and then scan the images to store in the database. It may not be possible to identify the handwritten images because different people have different fonts. This problem can be overcome in two ways. One way is to use the type-written images and the second one is trying to represent the data that has to be filled in the form of check boxes. For example instead of putting a field “Severity of damage”, we can get the required answer by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of them is ticked, the computer can identify the tick and can store the data referring to database schema. It may not always be possible to carry a computer/laptop to the affected areas. Hence designing forms which highly rely on check boxes can be a better solution.

== Goals of Project ==

* The Goal of project is to integrate OCR implementation present in the project.
* Moving the implementation from modules/s3/s3pdf.py, to s3codecs/pdf.py , as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using tesseract which is a cross platform compatible software and the implementation is present in modules/s3/s3pdf.py .
OCR has few dependencies that have to be installed. The various dependencies and command line tools that have to be installed are
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly in three important parts
==Form Generation==
In this step we generate a pdf document that contains most of the general data with corresponding checkboxes. For example a form can contain the field “Lives at” with check boxes for “Raleigh” , “Charlotte”, “Cary”, “Morrisville”.


[[File:Generated.png|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

This above code is for form generation. It uses Adapter pattern in its implementation as there are many different types of forms present. So whenever a form has to be generated, it is accordingly tweaked and adapted to generate those.

==Implementation Training Module and OCR==
The generated form is then filled by the user. The automated matching is done with the box generated file rather than the image from human involvement.
Post this OCR is deployed using the Tesseract software. The choice of software is mainly dependent on the success rate of the software and hence Tesseract was chosen for this purpose.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:31:15Z

Abhanda3: /* OCR */

S1501: Integrating OCR

This page explains about the project to integrate the optical character recognition (OCR) in Sahana Eden, which enables the Eden to detect and identify the type-written data on the forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions which can enable organizations to prepare and respond better to disasters. It was originally developed by Sri Lankan based IT Community who wanted to apply their talent towards helping their country recover from the 2004 Indian Ocean earthquake and Tsunami. It grew into a global open source project which is now supported by hundreds of volunteer contributions from different parts of the world. The community evolved and emerged to become the present Sahana Software Foundation.

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform which can be deployed for critical humanitarian needs management either prior to or during a crisis. This is specifically built for the disaster management. The mission of Sahana Eden is “ To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards. ” Eden has a rich feature set and be rapidly customized to adapt to existing processes and can also be integrated with existing systems. It can be accessed from the web or locally from a flash drive which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules which can be configured to provide a wide range of functionality. Some of its main capabilities are :

* Organization Registry
* Shelter
* Project Tracking.
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. Probably this is the main reason it has been successfully deployed during many disasters and have been of great help to the disaster management authorities.

== What is OCR? ==
The wiki definition of the optical character recognition is “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to provide the computer the capability of identifying and storing the data which was hand-written/type-written on a paper. It found wide applications in data entry from printed paper data records such as invoices, computer receipts, passports, bank statements or from any other document. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier the main purpose of Sahana Eden is to provide technical support in disaster management. During disasters it may not be always be possible for the authorities to carry computer to the affected areas to manually enter the data and also the area may not have the internet facility. Instead the forms can be printed and can be made to be filled in the area and then scan the images to store in the database. It may not be possible to identify the handwritten images because different people have different fonts. This problem can be overcome in two ways. One way is to use the type-written images and the second one is trying to represent the data that has to be filled in the form of check boxes. For example instead of putting a field “Severity of damage”, we can get the required answer by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of them is ticked, the computer can identify the tick and can store the data referring to database schema. It may not always be possible to carry a computer/laptop to the affected areas. Hence designing forms which highly rely on check boxes can be a better solution.

== Goals of Project ==

* The Goal of project is to integrate OCR implementation present in the project.
* Moving the implementation from modules/s3/s3pdf.py, to s3codecs/pdf.py , as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using tesseract which is a cross platform compatible software and the implementation is present in modules/s3/s3pdf.py .
OCR has few dependencies that have to be installed. The various dependencies and command line tools that have to be installed are
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly in three important parts
==Form Generation==
In this step we generate a pdf document that contains most of the general data with corresponding checkboxes. For example a form can contain the field “Lives at” with check boxes for “Raleigh” , “Charlotte”, “Cary”, “Morrisville”.


[[File:Generated.png|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

This above code is for form generation. It uses Adapter pattern in its implementation as there are many different types of forms present. So whenever a form has to be generated, it is accordingly tweaked and adapted to generate those.

==Implementation Training Module==
The generated form is then filled by the user. The automated matching is done with the box generated file rather than the image from human involvement.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:26:53Z

Abhanda3: /* Form Generation */

S1501: Integrating OCR

This page explains about the project to integrate the optical character recognition (OCR) in Sahana Eden, which enables the Eden to detect and identify the type-written data on the forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions which can enable organizations to prepare and respond better to disasters. It was originally developed by Sri Lankan based IT Community who wanted to apply their talent towards helping their country recover from the 2004 Indian Ocean earthquake and Tsunami. It grew into a global open source project which is now supported by hundreds of volunteer contributions from different parts of the world. The community evolved and emerged to become the present Sahana Software Foundation.

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform which can be deployed for critical humanitarian needs management either prior to or during a crisis. This is specifically built for the disaster management. The mission of Sahana Eden is “ To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards. ” Eden has a rich feature set and be rapidly customized to adapt to existing processes and can also be integrated with existing systems. It can be accessed from the web or locally from a flash drive which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules which can be configured to provide a wide range of functionality. Some of its main capabilities are :

* Organization Registry
* Shelter
* Project Tracking.
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. Probably this is the main reason it has been successfully deployed during many disasters and have been of great help to the disaster management authorities.

== What is OCR? ==
The wiki definition of the optical character recognition is “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to provide the computer the capability of identifying and storing the data which was hand-written/type-written on a paper. It found wide applications in data entry from printed paper data records such as invoices, computer receipts, passports, bank statements or from any other document. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. As mentioned earlier the main purpose of Sahana Eden is to provide technical support in disaster management. During disasters it may not be always be possible for the authorities to carry computer to the affected areas to manually enter the data and also the area may not have the internet facility. Instead the forms can be printed and can be made to be filled in the area and then scan the images to store in the database. It may not be possible to identify the handwritten images because different people have different fonts. This problem can be overcome in two ways. One way is to use the type-written images and the second one is trying to represent the data that has to be filled in the form of check boxes. For example instead of putting a field “Severity of damage”, we can get the required answer by representing the data as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of them is ticked, the computer can identify the tick and can store the data referring to database schema. It may not always be possible to carry a computer/laptop to the affected areas. Hence designing forms which highly rely on check boxes can be a better solution.

== Goals of Project ==

* The Goal of project is to integrate OCR implementation present in the project.
* Moving the implementation from modules/s3/s3pdf.py, to s3codecs/pdf.py , as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using tesseract which is a cross platform compatible software and the implementation is present in modules/s3/s3pdf.py .
OCR has few dependencies that have to be installed. The various dependencies and command line tools that have to be installed are
* python-lxml
* python-imaging (PIL)
* python-reportlab
* Imagemagick 'convert'
* Tesseract 3.00-1

The functionality of OCR in Sahana can be divided broadly in three important parts
==Form Generation==
In this step we generate a pdf document that contains most of the general data with corresponding checkboxes. For example a form can contain the field “Lives at” with check boxes for “Raleigh” , “Charlotte”, “Cary”, “Morrisville”.


[[File:Generated.png|center]]


The above image shows the flow control that takes place to generate a pdf from forms.

<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    self.content = []
    self.output = StringIO()
    self.layoutEtree = etree.Element("s3ocrlayout")
</pre>

This above code is for form generation. It uses Adapter pattern in its implementation as there are many different types of forms present. So whenever a form has to be generated, it is accordingly tweaked and adapted to generate those.

==Implementation Training Module==
The generated form is then filled by the user. The automated matching is done with the box generated file rather than the image from human involvement.
==OCR==
This is the final stage of the project, in which OCR is performed using the Tesseract software. The choice of software depends mainly on its success rate, which is why we chose Tesseract.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:21:52Z

Abhanda3: /* What is OCR? */

S1501: Integrating OCR

This page explains about the project to integrate the optical character recognition (OCR) in Sahana Eden, which enables the Eden to detect and identify the type-written data on the forms and store it in the database.

__TOC__

== Introduction to Sahana ==

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. The software was originally developed by members of the Sri Lankan IT community who wanted to apply their talents to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from around the world, and the community evolved into the present Sahana Software Foundation.

== Eden ==
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform that can be deployed for critical humanitarian needs management either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:

* Organization Registry
* Shelter
* Project Tracking
* Inventory
* Assets
* Assessments
* Scenarios & Events
* Mapping
* Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

== What is OCR? ==
The Wikipedia definition of optical character recognition is “Optical character recognition (OCR) is the mechanical or electronic conversion of images of typewritten or printed text into machine-encoded text.” The main idea is to give the computer the ability to identify and store data that was handwritten or typewritten on paper. OCR is widely used for data entry from printed paper records such as invoices, computer receipts, passports, and bank statements. It is a field of research in pattern recognition, artificial intelligence and computer vision.

As mentioned earlier, the main purpose of Sahana Eden is to provide technical support for disaster management. During a disaster it may not always be possible for the authorities to carry computers to the affected areas to enter data manually, and the area may not have internet access. Instead, forms can be printed, filled in on site, and then scanned so that the data can be stored in the database. Handwritten entries may be hard to recognize because handwriting varies from person to person. This problem can be addressed in two ways: use typewritten entries, or represent the data to be filled in as check boxes. For example, instead of a free-text field “Severity of damage”, the required answer can be collected as follows.

<pre>
Severity of Damage (1-lowest 5-highest):
1: □ 2: □ 3: □ 4: □ 5: □
</pre>

If one of the boxes is ticked, the computer can identify the tick and store the corresponding value according to the database schema. Because it may not always be possible to carry a computer or laptop to the affected areas, forms that rely heavily on check boxes can be a better solution.
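
One simple way to detect a tick is to measure how much of each box is dark. The sketch below (Python 3) illustrates that idea under stated assumptions: each box has already been cropped from the scan into a grid of grayscale values from 0 (black) to 255 (white), and the threshold values are arbitrary. It is not the method used in s3pdf.py, and all names are hypothetical.

```python
def dark_fraction(region, threshold=128):
    """Return the fraction of pixels in a box darker than the threshold.

    region is a list of rows, each a list of grayscale values 0..255.
    """
    total = sum(len(row) for row in region)
    if total == 0:
        return 0.0
    dark = sum(1 for row in region for pixel in row if pixel < threshold)
    return dark / total


def read_severity(boxes, min_fill=0.2):
    """Return the 1-based number of the ticked box.

    Returns None when no box or more than one box is ticked, so that
    ambiguous forms can be sent for manual review.
    """
    ticked = [index + 1 for index, box in enumerate(boxes)
              if dark_fraction(box) >= min_fill]
    return ticked[0] if len(ticked) == 1 else None


if __name__ == "__main__":
    empty = [[255] * 4 for _ in range(4)]
    marked = [[255, 0, 0, 255],
              [255, 0, 0, 255],
              [255, 255, 255, 255],
              [255, 255, 255, 255]]
    # Third box ticked on the "Severity of Damage" row.
    print(read_severity([empty, empty, marked, empty, empty]))
```

Returning None for zero or multiple ticks is a deliberate choice in this sketch: a single-answer field with an ambiguous mark is safer to flag than to guess.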

== Goals of Project ==

* The goal of the project is to integrate the OCR implementation already present in the project.
* This involves moving the implementation from modules/s3/s3pdf.py to s3codecs/pdf.py, as the decode() part of the codec.

= OCR in Sahana =

OCR is implemented in Sahana using Tesseract, a cross-platform OCR engine. The implementation is in modules/s3/s3pdf.py.
OCR has a few dependencies that must be installed. The required packages and command-line tools are:
* python-lxml
* python-imaging (PIL)
* python-reportlab
* ImageMagick 'convert'
* Tesseract 3.00-1

The OCR functionality in Sahana can be broadly divided into three parts:
==Form Generation==
In this step we generate a PDF document that contains most of the general data with corresponding check boxes. For example, a form can contain the field “Lives at” with check boxes for “Raleigh”, “Charlotte”, “Cary”, and “Morrisville”.


[[File:Generated.png|center]]


The above image shows the control flow for generating a PDF from forms.

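<pre>
def newOCRForm(self,
               formUUID,
               pdfname="ocrform.pdf",
               top=65,
               left=50,
               bottom=None,
               right=None,
               **args):

    # Excerpt: state initialised before the form is laid out
    self.content = []          # accumulates the form content
    self.output = StringIO()   # in-memory output buffer
    self.layoutEtree = etree.Element("s3ocrlayout")  # root of the layout XML tree
</pre>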

==Implementation Training Module==
The generated form is then filled in by the user. Automated matching is performed against the generated box file rather than against the image, which would require human involvement.
==OCR==
This is the final stage of the project, in which OCR is performed using the Tesseract software. The choice of OCR software depends mainly on its recognition success rate, and we chose Tesseract on that basis.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:16:45Z

Abhanda3: /* Form Generation */


CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:14:19Z

Abhanda3: /* Form Generation */


CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:12:21Z

Abhanda3: /* Form Generation */


File:Generated.png

2015-03-24T00:10:42Z

Abhanda3: This picture shows the flow control for creating the pdf from forms

This picture shows the flow control for creating the pdf from forms

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-24T00:01:14Z

Abhanda3: /* OCR in Sahana */


CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-23T17:53:28Z

Abhanda3:

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-23T17:27:30Z

Abhanda3: /* What is OCR? */

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-23T16:35:02Z

Abhanda3:

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-23T16:29:30Z

Abhanda3:

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-23T16:22:54Z

Abhanda3:

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-23T16:20:13Z

Abhanda3:

'''S1501: Integrating OCR'''

This page describes the project to integrate optical character recognition (OCR) into Sahana Eden, which enables Eden to detect and identify typewritten data on forms and store it in the database.

'''Introduction to Sahana''' 

The Sahana Software Foundation was established as a non-profit organization in 2009. Sahana is dedicated to saving lives by providing information management solutions that enable organizations to better prepare for and respond to disasters. The software was originally developed by members of the Sri Lankan IT community who wanted to apply their talents to helping their country recover from the 2004 Indian Ocean earthquake and tsunami. It grew into a global open source project that is now supported by hundreds of volunteer contributors from around the world, and the community evolved into the present Sahana Software Foundation.

''' Eden ''' 
Sahana Emergency Development Environment (EDEN) is a flexible humanitarian platform that can be deployed for critical humanitarian needs management either before or during a crisis. It is built specifically for disaster management. The mission of Sahana Eden is “To help alleviate human suffering by giving emergency managers, disaster response professionals and communities access to the information that they need to better prepare for and respond to disasters through the development and promotion of free and open source software and open standards.” Eden has a rich feature set, can be rapidly customized to adapt to existing processes, and can be integrated with existing systems. It can be accessed from the web or run locally from a flash drive, which allows it to be used even in areas with poor internet connectivity. Sahana Eden contains a number of modules that can be configured to provide a wide range of functionality. Some of its main capabilities are:
1. Organization Registry
2. Shelter
3. Project Tracking
4. Inventory
5. Assets
6. Assessments
7. Scenarios & Events
8. Mapping
9. Messaging

Eden is easy to install and maintain. This is probably the main reason it has been successfully deployed during many disasters and has been of great help to disaster management authorities.

CSC/ECE 517 Spring 2015/oss S1501 OA

2015-03-23T16:10:38Z

Abhanda3: Created page with "Integrating OCR"

Integrating OCR

CSC/ECE 517 Spring 2015

2015-03-23T16:10:08Z

Abhanda3:

==Writing Assignment 1==
*[[CSC/ECE 517 Spring 2015/ch1a 17 WL]]
*[[CSC/ECE 517 Spring 2015/ch1a 5 ZX]]
*[[CSC/ECE 517 Spring 2015/ch1a 6 TZ]]
*[[CSC/ECE 517 Spring 2015/ch1a 4 RW]]
*[[CSC/ECE 517 Spring 2015/ch1a 7 SA]]
*[[CSC/ECE 517 Spring 2015/ch1a 9 RA]]
*[[CSC/ECE 517 Spring 2015/ch1a 14 RI]]
*[[CSC/ECE 517 Spring 2015/ch1a 1 DZ]]
*[[CSC/ECE 517 Spring 2015/ch1a 20 HA]]
*[[CSC/ECE 517 Spring 2015/ch1a 3 RF]]
*[[CSC/ECE 517 Spring 2015/ch1a 12 LS]]
*[[CSC/ECE 517 Spring 2015/ch1a 13 MA]]
*[[CSC/ECE 517 Spring 2015/ch1a 2 WA]]
*[[CSC/ECE 517 Spring 2015/ch1b 21 QW]]
*[[CSC/ECE 517 Spring 2015/ch1b 23 MS]]
*[[CSC/ECE 517 Spring 2015/ch1b 10 GL]]
*[[CSC/ECE 517 Spring 2015/ch1b 27 VC]]
*[[CSC/ECE 517 Spring 2015/ch1b 22 SF]]
*[[CSC/ECE 517 Spring 2015/ch1b 15 SH]]
*[[CSC/ECE 517 Spring 2015/ch1b 18 AS]]

==Writing Assignment 2==
*[[CSC/ECE 517 Spring 2015/oss E1502 wwj]]
*[[CSC/ECE 517 Fall 2014/oss E1508 MRS]]
*[[CSC/ECE 517 Spring 2015/oss E1504 IMV]]
*[[CSC/ECE 517 Spring 2015/oss E1505 xzl]]
*[[CSC/ECE 517 Spring 2015/oss E1509 lds]]
*[[CSC/ECE 517 Spring 2015/oss E1510 FLP]]
*[[CSC/ECE 517 Spring 2015/oss E1506 SYZ]]
*[[CSC/ECE 517 Spring 2015/oss S1504 AAC]]
*[[CSC/ECE 517 Spring 2015/oss E1507 DGO]]
*[[CSC/ECE 517 Spring 2015/oss M1502 GVJ]]
*[[CSC/ECE 517 Spring 2015/oss M1503 EDT]]
*[[CSC/ECE 517 Spring 2015/oss E1503 RSA]]
*[[CSC/ECE 517 Spring 2015/oss E1501 YWS]]
*[[CSC/ECE 517 Spring 2015/oss S1501 OA]]

CSC/ECE 517 Spring 2015/ch1a 2 WA

2015-02-17T03:55:26Z

Abhanda3:

Knife
[[File:workstation.png|frame|Source: https://docs.chef.io/chef_quick_overview.html|right]]
Knife is the command-line tool for managing Chef nodes. Put simply, Chef distributes server environments across many different servers (called '''nodes'''). Any changes to the primary Chef server (called the '''chef repo''') are distributed to all the other nodes, while individual nodes can have their own recipes and send them back to the chef repo. Knife handles the communication between the nodes and the chef repo. For example, if a node needs an object that lives on the chef repo, Knife provides the tools to download that object to the node. Knife also supports setting up a node, installing necessary packages, managing users, and much more.

The topic writeup for this page can be found [https://docs.google.com/document/d/1TgBtp7flIPKJwkkShgtcIkt--mtHuwVHsQX6Tpzj1rc/edit here].

== Background ==

Chef streamlines the task of configuring and maintaining a company's servers, and can integrate with cloud-based platforms. Chef is an IT and environment distribution system, built to be robust and scalable. Imagine a large-scale, internet-based company. Depending on its needs, the company may require development workstations, servers for purchasing products, servers for hosting games, and even clients for games or for accessing databases. Each of these machines is a ''chef client node'', and each distinct setup is an ''environment''. For example, you might have ten chef client nodes set up with the development environment and one hundred chef client nodes set up with the server environment. The Chef server contains all of the ''recipes'', reusable code used both to set up and to run the different chef client nodes, for all of the environments that have been defined. In addition, the Chef server works like version control software, which means you can conveniently roll out updates to all of the chef client nodes. This allows the entire collection of recipes, environments, and nodes, called the ''infrastructure'', to be managed simply and in defined, scalable versions.
Chef is more complex than this overview suggests. If you want to learn more about Chef itself, check out the comparison between Chef and Puppet here: [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_517_Fall_2013/ch1_1w10_ga] or take a look at the official Chef website here: <ref>[https://www.chef.io/ Chef]</ref>. Both are the sources for the information in this background. This description only introduces the concept of Chef so that Knife's purpose can be understood; those pages cover the ideas behind Chef, and how to use it effectively, in more detail. The main Chef site even has an interactive tutorial on setting up a Chef server without having to use a personal server or computer. The rest of this article covers Knife.

== What is Knife? ==

Chef includes two important command-line tools.
*Knife command-line tool
*Chef command-line tool

Knife is the command-line tool that provides an interface between a local chef client node and the Chef server. Whereas the Chef command-line tool is used for working with the Chef server and repo, Knife helps manage the following:

* Nodes
* JSON data stores
* Environments
* chef-client installations on management workstations

Basically, the Chef command-line tool deals with the Chef installation itself, wherever it is: either a local installation or one on the server. Knife, however, handles the setup of all communication and objects between the Chef server and a chef client node. The figure shows how Knife provides an interface between the local repository and the Chef server at the workstation.

[[File:chef_knife server.png|frame|Source: http://dev.classmethod.jp/server-side/chef-server-install|center]]

== Using Knife ==

Using Knife is fairly straightforward, with usage following this syntax:

<pre> knife [verb] [object] [options] </pre>

The different verbs, or subcommands, for Knife are as follows, from the Using Knife page<ref> [https://docs.chef.io/knife_using.html Knife Documents]</ref>: bootstrap, client, configure, cookbook, cookbook site, data bag, delete, deps, diff, download, edit, environment, exec, index rebuild, list, node, recipe list, role, search, show, ssh, status, tag, upload, user, and xargs.

'''Warning:''' Before you run many of these commands, you should have the knife editor set correctly. In the chef-repo, there are multiple ruby files that set the configuration of the environment. Inside of knife.rb, you need to add or set the following line.

<syntaxhighlight lang="ruby">
knife[:editor] = "/usr/bin/vim"
</syntaxhighlight>

The path can be set to the text editor of your choice; the line above sets the editor to vim.

Our first example, the bootstrap command, is also one that is useful for setting up nodes. To set up a node, go to the command line on the chef repo and execute the following command:

<pre>knife bootstrap ip_address_of_node -x username -P password --sudo </pre>

First, the verb used is '''bootstrap'''. Bootstrap is the subcommand that installs a chef client on a targeted node.
Second, the object is ip_address_of_node. Every subcommand takes an object, and since Chef builds on many Ruby principles and scripts, the object can be almost anything. For bootstrap, the object is the IP address of the target node. For other commands, it might be something different.
Third, we have the options for the bootstrap command and the object. For bootstrap, these include the username and password for the node, and the --sudo option, which runs the bootstrap operation with sudo so that commands requiring elevated privileges succeed.
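
To make the verb/object/options breakdown concrete, here is a minimal Ruby sketch that splits a command line into those three parts. The helper name <code>split_knife_command</code> and the example IP address are illustrative assumptions, not part of Knife; the sketch also assumes a single-word verb, so two-word subcommands such as <code>cookbook site</code> would need extra handling.

```ruby
require "shellwords"

# Hypothetical helper: split a knife command line into the three parts
# described above. Assumes a single-word verb.
def split_knife_command(command_line)
  tokens = Shellwords.split(command_line)
  raise ArgumentError, "not a knife command" unless tokens.first == "knife"

  { verb: tokens[1], object: tokens[2], options: tokens.drop(3) }
end

parts = split_knife_command("knife bootstrap 192.0.2.10 -x username -P password --sudo")
puts "verb:    #{parts[:verb]}"
puts "object:  #{parts[:object]}"
puts "options: #{parts[:options].join(' ')}"
```

Splitting with <code>Shellwords</code> rather than on plain spaces keeps quoted arguments, such as a password containing a space, in one piece.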

Before we do the next example, let's cover an important part of object strings and how they are treated by knife and the chef repo.

'''Wildcards'''
Wildcards can be used in object strings much as in ordinary pattern matching, with one notable difference. The wildcard character itself '''must be escaped using the \ character''' so that the shell does not expand it before Knife sees it. Here is an example from the "Using Knife" page on the Chef website. Let's say the following was used to search for an object:

<pre>data_bags/a\*</pre>

This returns all of the objects whose paths start with data_bags/a on the Chef server. However, suppose we were to run the following instead:

<pre>data_bags/a*</pre>

This only searches the chef repo for objects corresponding to objects whose paths start with data_bags/a on the node, because the * was applied '''before''' being sent to the chef repo, instead of after. As an example, let us list all of the data bags on the server that start with 'a'. Our object, then, is the string above, <code>data_bags/a\*</code>. The verb that lists objects on the chef repo is <code>list</code>. The command we would run is this:

<pre>knife list data_bags/a\*</pre>

This prints to the console all of the objects in the data_bags directory whose names start with a.
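
The difference between the escaped and unescaped forms can be simulated with a short Ruby sketch. The data bag names below are made up for illustration, and the two helper functions are hypothetical stand-ins for what the shell and Knife do; this is not how Knife is implemented. As a simplification, the sketch ignores the case where the shell finds no local match and passes the pattern through unchanged.

```ruby
# Hypothetical data: what exists in the local directory and on the server.
LOCAL_DATA_BAGS  = ["data_bags/admins"].freeze
SERVER_DATA_BAGS = ["data_bags/admins", "data_bags/apps", "data_bags/users"].freeze

# Unescaped (data_bags/a*): the shell expands the pattern against local
# files first, so only names that already exist locally are looked up.
def list_unescaped(pattern)
  expanded = LOCAL_DATA_BAGS.select { |path| File.fnmatch(pattern, path) }
  SERVER_DATA_BAGS & expanded
end

# Escaped (data_bags/a\*): the pattern reaches knife intact and is
# matched against the objects on the server.
def list_escaped(pattern)
  SERVER_DATA_BAGS.select { |path| File.fnmatch(pattern, path) }
end

puts "unescaped: #{list_unescaped('data_bags/a*').inspect}"
puts "escaped:   #{list_escaped('data_bags/a*').inspect}"
```

With this sample data the unescaped form finds only <code>data_bags/admins</code>, while the escaped form also finds <code>data_bags/apps</code>, which exists on the server but not locally.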

== Bootstrapping A Node ==

One of the key uses of Knife is "bootstrapping" a node <ref>[http://docs.chef.io/open_source/install_bootstrap.html Install Bootstrap]</ref>. Bootstrapping allows someone using the Chef server to set up a remote node, or client. Knife not only sets up all of the relevant recipes, environments, and other items related to the chef repo itself, it also installs all of the tools necessary for interacting with the Chef server and configuring the chef client. Because of this, a basic understanding of how to bootstrap a node, and some knowledge of the available options, is important.

Fortunately, bootstrapping a node is simple. The general command looks like this; note that it is run from the chef server.

<pre>knife bootstrap ip_address_of_node -x username -P password --sudo --node-name name_of_node</pre>

This process creates a chef client at the designated IP address. The username and password are those of a user with root access on the remote node. The name of the node can also be set here with --node-name, instead of being derived automatically from other settings; the option is shown here to make the rest of the example easier to follow. The knife bootstrap command uses an omnibus installer that automatically detects the OS of the target machine and installs all of the command-line tools and dependencies, such as Ruby, that the chef client needs to function. Once the bootstrap command is complete, the following message will be displayed.
<pre>INFO: Report handlers complete</pre>
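
When several nodes need to be bootstrapped, the command above can be assembled programmatically. The following Ruby sketch is one way to do that; the helper name <code>bootstrap_command</code> and the sample host, user, and node name are assumptions made for illustration. It uses only the flags shown in this article (-x, -P, --sudo, --node-name).

```ruby
require "shellwords"

# Hypothetical helper: build the bootstrap command line from its parts.
def bootstrap_command(host:, user:, password:, node_name: nil, sudo: true)
  args = ["knife", "bootstrap", host, "-x", user, "-P", password]
  args << "--sudo" if sudo
  args += ["--node-name", node_name] if node_name
  Shellwords.join(args) # escapes arguments that contain shell-special characters
end

puts bootstrap_command(host: "192.0.2.10", user: "deploy",
                       password: "secret", node_name: "web01")
```

Joining with <code>Shellwords.join</code> means a password containing spaces or other special characters is still passed to Knife as a single argument.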

However, before continuing, confirmation of the remote node is necessary. To confirm that the remote node was installed correctly and is running, run the following command.

<pre>knife client show name_of_node</pre>

Here, name_of_node is the name given to the node, for example through the --node-name option used above. When this command is run, the node's information should be displayed, such as whether the node is an admin, the name of the node, and the JSON class it was created from.

<pre>admin: false
chef_type: client
json_class: Chef::ApiClient
name: name_of_node
public_key:</pre>
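
The confirmation step can also be checked in a script. This Ruby sketch parses output in the "key: value" shape shown above into a hash; the sample text is copied from the example, and the helper name <code>parse_client_show</code> is an assumption made for illustration. The actual output of Knife may contain additional fields.

```ruby
# Sample output copied from the example above.
SAMPLE_OUTPUT = <<~TEXT
  admin: false
  chef_type: client
  json_class: Chef::ApiClient
  name: name_of_node
  public_key:
TEXT

# Hypothetical helper: turn "key: value" lines into a hash. Splitting on
# the first colon only keeps values such as Chef::ApiClient intact.
def parse_client_show(text)
  text.each_line.with_object({}) do |line, fields|
    key, value = line.chomp.split(":", 2)
    next if key.nil? || key.strip.empty?

    fields[key.strip] = value.to_s.strip
  end
end

fields = parse_client_show(SAMPLE_OUTPUT)
registered = fields["chef_type"] == "client" && fields["name"] == "name_of_node"
puts "node registered as client: #{registered}"
```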

== Bootstrap Options and Knife Settings ==
As one of the most powerful tools Knife has to offer, the bootstrap subcommand has many options. The "Bootstrapping A Node" section covered four of them (-x, -P, --sudo, and --node-name), but many more are available. The ones below stand out as useful to someone using knife bootstrap for the first time, or to people dealing with something outside the normal conditions for installing and using knife bootstrap. Adding settings to the knife.rb file is also covered, for people looking to expand their use of Knife and knife bootstrap. This information, along with all the other options, is covered in more detail on the main knife bootstrap page.<ref>[http://docs.chef.io/knife_bootstrap.html Knife Bootstrap]</ref>

'''--bootstrap-curl-options OPTIONS, --bootstrap-install-command COMMAND, --bootstrap-install-sh URL'''

These options customize the installation that bootstrap performs. They are mutually exclusive, but each offers a different way to perform the customization. The first, --bootstrap-curl-options, allows additional cURL options to be applied alongside the bootstrap installation. To learn more about cURL, see its main web page: ([http://curl.haxx.se/ cURL Home Page]). The second, --bootstrap-install-command, allows another install command to be executed on the target node alongside the bootstrap. The third, --bootstrap-install-sh, runs a custom install script located at the designated URL.
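
Because these three options are mutually exclusive, a wrapper script may want to reject a command line that supplies more than one of them before calling Knife. The Ruby sketch below shows one way to check this; the helper names and the example URL are assumptions made for illustration, and only the option names come from the text above.

```ruby
# The three install-customization options described above.
EXCLUSIVE_INSTALL_OPTIONS = %w[
  --bootstrap-curl-options
  --bootstrap-install-command
  --bootstrap-install-sh
].freeze

# Hypothetical helper: which of the exclusive options appear in the arguments?
def install_options_used(args)
  args.select { |arg| EXCLUSIVE_INSTALL_OPTIONS.include?(arg) }.uniq
end

# Hypothetical helper: true when at most one of them is present.
def valid_install_options?(args)
  install_options_used(args).size <= 1
end

args = ["192.0.2.10", "--bootstrap-install-sh", "https://example.com/install.sh", "--sudo"]
puts valid_install_options?(args)
```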

'''--environment ENVIRONMENT, --bootstrap-template TEMPLATE'''

The --environment and --bootstrap-template options are similar in that both customize the target node. The first sets the node up in a target environment, which is useful if your Chef server has multiple environments available. The second allows different node setups to be saved as templates, which can be used to set up additional similar nodes quickly, especially when the additional installation options for bootstrap are long.

'''-V'''

A small option for bootstrap, -V forces the initial chef-client run on the target node to use the debug log level, following the bootstrap installation. This allows almost complete remote setup of new nodes, helping to confirm that the node is ready to be used and that the chef-client is fully operational. In addition, the debug log records any errors in the chef client, in case something goes wrong with the node. The exact command run is shown below:
<pre>chef-client -l debug</pre>

'''Custom settings and adding them to the knife.rb file'''

For some settings, including settings related to bootstrap, the options need to be added to the knife.rb file. This file holds the configuration for Knife's commands and their options. For example, to use the --bootstrap-template and --sudo options, the following settings need to be added to knife.rb:
<pre>knife[:template_file]
knife[:use_sudo]</pre>
Setting template_file and use_sudo in knife.rb makes those options available to the bootstrap command. More options can be set through knife.rb; the full Knife documentation is here: ([https://docs.chef.io/chef/knife.html# Knife Reference]).
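
As a rough sketch of what such entries might look like with values assigned, consider the fragment below. The values are assumptions: the template path is hypothetical, and <code>use_sudo</code> is assumed to take a boolean. The first line is only a stand-in so that the fragment runs on its own; in a real knife.rb, Chef provides <code>knife</code> and that line should be left out.

```ruby
# Stand-in so this sketch runs outside Chef; omit this line in a real
# knife.rb, where Chef itself provides the `knife` object.
knife = {}

knife[:editor] = "/usr/bin/vim"                          # from the warning earlier in this article
knife[:use_sudo] = true                                  # assumed to be a boolean
knife[:template_file] = "/path/to/custom_bootstrap.erb"  # hypothetical path

knife.each { |setting, value| puts "#{setting}: #{value}" }
```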

== References ==
<references/>

CSC/ECE 517 Spring 2015/ch1a 2 WA

2015-02-17T03:49:48Z

Abhanda3: /* What is Knife? */

Knife
[[File:workstation.png|frame|Source: https://docs.chef.io/chef_quick_overview.html|right]]
Knife is the command-line tool for managing Chef nodes. Put simply, Chef distributes server environments across many different servers (called '''nodes'''). Any changes to the primary Chef server (called the '''chef repo''') are distributed to all the other nodes, while individual nodes can have their own recipes and send them back to the chef repo. Knife handles the communication between the nodes and the chef repo. For example, if a node needs an object that lives on the chef repo, Knife provides the tools to download that object to the node. Knife also supports setting up a node, installing necessary packages, managing users, and much more.

The topic writeup for this page can be found [https://docs.google.com/document/d/1TgBtp7flIPKJwkkShgtcIkt--mtHuwVHsQX6Tpzj1rc/edit here].

== Background ==

Chef streamlines the task of configuring and maintaining a company's servers, and can integrate with cloud-based platforms. Chef is an IT and environment distribution system, built to be robust and scalable. Imagine a large-scale, internet-based company. Depending on its needs, the company may require development workstations, servers for purchasing products, servers for hosting games, and even clients for games or for accessing databases. Each of these machines is a ''chef client node'', and each distinct setup is an ''environment''. For example, you might have ten chef client nodes set up with the development environment and one hundred chef client nodes set up with the server environment. The Chef server contains all of the ''recipes'', reusable code used both to set up and to run the different chef client nodes, for all of the environments that have been defined. In addition, the Chef server works like version control software, which means you can conveniently roll out updates to all of the chef client nodes. This allows the entire collection of recipes, environments, and nodes, called the ''infrastructure'', to be managed simply and in defined, scalable versions.
Chef is more complex than this overview suggests. If you want to learn more about Chef itself, check out the comparison between Chef and Puppet here: [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_517_Fall_2013/ch1_1w10_ga] or take a look at the official Chef website here: <ref>[https://www.chef.io/ Chef]</ref>. Both are the sources for the information in this background. This description only introduces the concept of Chef so that Knife's purpose can be understood; those pages cover the ideas behind Chef, and how to use it effectively, in more detail. The main Chef site even has an interactive tutorial on setting up a Chef server without having to use a personal server or computer. The rest of this article covers Knife.

== What is Knife? ==

Chef includes two important command-line tools.
*Knife command-line tool
*Chef command-line tool

Knife is the command-line tool that provides an interface between a local chef client node and the Chef server. Whereas the Chef command-line tool is used for working with the Chef server and repo, Knife helps manage the following:

* Nodes
* JSON data stores
* Environments
* chef-client installations on management workstations

Basically, the Chef command-line tool deals with the Chef installation itself, wherever it is: either a local installation or one on the server. Knife, however, handles the setup of all communication and objects between the Chef server and a chef client node. The figure shows how Knife provides an interface between the local repository and the Chef server at the workstation.

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"> [[File:chef_knife server.png]] <ref>[http://dev.classmethod.jp/server-side/chef-server-install/ Chef Server]</ref></div>

== Using Knife ==

Using Knife is fairly straightforward, with usage following this syntax:

<pre> knife [verb] [object] [options] </pre>

The different verbs, or subcommands, for Knife are as follows, from the Using Knife page<ref> [https://docs.chef.io/knife_using.html Knife Documents]</ref>: bootstrap, client, configure, cookbook, cookbook site, data bag, delete, deps, diff, download, edit, environment, exec, index rebuild, list, node, recipe list, role, search, show, ssh, status, tag, upload, user, and xargs.

'''Warning:''' Before you run many of these commands, you should have the knife editor set correctly. In the chef-repo, there are multiple ruby files that set the configuration of the environment. Inside of knife.rb, you need to add or set the following line.

<syntaxhighlight lang="ruby">
knife[:editor] = "/usr/bin/vim"
</syntaxhighlight>

The path can be set to the text editor of your choice; the line above sets the editor to vim.

The example we will use is also one that is useful for setting up nodes, and that is the usage of the bootstrap command. To set up a node, go to the command line on the chef repo and execute the following command:

<pre>knife bootstrap ip_address_of_node -x username -P password --sudo </pre>

First, the verb used is '''bootstrap'''. Bootstrap is the subcommand that installs a chef client on a targeted node.
Second, the object is ip_address_of_node. Every subcommand takes an object, and since Chef builds on many Ruby principles and scripts, the object can be almost anything. For bootstrap, the object is the IP address of the target node. For other commands, it might be something different.
Third, we have the options for the bootstrap command and the object. For bootstrap, these include the username and password for the node, and the --sudo option, which runs the bootstrap operation with sudo so that commands requiring elevated privileges succeed.

Before we do the next example, let's cover an important part of object strings and how they are treated by knife and the chef repo.

'''Wildcards'''
Wildcards can be used in object strings much as in ordinary pattern matching, with one notable difference. The wildcard character itself '''must be escaped using the \ character''' so that the shell does not expand it before Knife sees it. Here is an example from the "Using Knife" page on the Chef website. Let's say the following was used to search for an object:

<pre>data_bags/a\*</pre>

This returns all of the objects whose paths start with data_bags/a on the Chef server. However, suppose we were to run the following instead:

<pre>data_bags/a*</pre>

This only searches the chef repo for objects corresponding to objects whose paths start with data_bags/a on the node, because the * was applied '''before''' being sent to the chef repo, instead of after. As an example, let us list all of the data bags on the server that start with 'a'. Our object, then, is the string above, <code>data_bags/a\*</code>. The verb that lists objects on the chef repo is <code>list</code>. The command we would run is this:

<pre>knife list data_bags/a\*</pre>

This prints to the console all of the objects in the data_bags directory whose names start with a.

== Bootstrapping A Node ==

One of the key uses of Knife is "bootstrapping" a node <ref>[http://docs.chef.io/open_source/install_bootstrap.html Install Bootstrap]</ref>. Bootstrapping allows someone using the Chef server to set up a remote node, or client. Knife not only sets up all of the relevant recipes, environments, and other items related to the chef repo itself, it also installs all of the tools necessary for interacting with the Chef server and configuring the chef client. Because of this, a basic understanding of how to bootstrap a node, and some knowledge of the available options, is important.

Luckily, bootstrapping a node is very simple. To bootstrap a node, the general command would look like this. Be aware, this is being run from the chef server.

<pre>knife bootstrap ip_address_of_node -x username -P password --sudo --node-name name_of_node</pre>

This process creates a chef client at the designated IP address. The username and password are those of a user with root access on the remote node. The name of the node can also be set here with --node-name, instead of being derived automatically from other settings; the option is shown here to make the rest of the example easier to follow. The knife bootstrap command uses an omnibus installer that automatically detects the OS of the target machine and installs all of the command-line tools and dependencies, such as Ruby, that the chef client needs to function. Once the bootstrap command is complete, the following message will be displayed.
<pre>INFO: Report handlers complete</pre>

However, before continuing, confirmation of the remote node is necessary. To confirm that the remote node was installed correctly and is running, run the following command.

<pre>knife client show name_of_node</pre>

Here, name_of_node is the name given to the node, for example through the --node-name option used above. When this command is run, the node's information should be displayed, such as whether the node is an admin, the name of the node, and the JSON class it was created from.

<pre>admin: false
chef_type: client
json_class: Chef::ApiClient
name: name_of_node
public_key:</pre>

== Bootstrap Options and Knife Settings ==
As one of the most powerful tools Knife has to offer, the bootstrap subcommand has many options. The "Bootstrapping A Node" section covered four of them (-x, -P, --sudo, and --node-name), but many more are available. The ones below stand out as useful to someone using knife bootstrap for the first time, or to people dealing with something outside the normal conditions for installing and using knife bootstrap. Adding settings to the knife.rb file is also covered, for people looking to expand their use of Knife and knife bootstrap. This information, along with all the other options, is covered in more detail on the main knife bootstrap page.<ref>[http://docs.chef.io/knife_bootstrap.html Knife Bootstrap]</ref>

'''--bootstrap-curl-options OPTIONS, --bootstrap-install-command COMMAND, --bootstrap-install-sh URL'''

These options customize the installation that bootstrap performs. They are mutually exclusive, but each offers a different way to perform the customization. The first, --bootstrap-curl-options, allows additional cURL options to be applied alongside the bootstrap installation. To learn more about cURL, see its main web page: ([http://curl.haxx.se/ cURL Home Page]). The second, --bootstrap-install-command, allows another install command to be executed on the target node alongside the bootstrap. The third, --bootstrap-install-sh, runs a custom install script located at the designated URL.

'''--environment ENVIRONMENT, --bootstrap-template TEMPLATE'''

The --environment and --bootstrap-template options are similar in that both customize the target node. The first sets the node up in a target environment, which is useful if your Chef server has multiple environments available. The second allows different node setups to be saved as templates, which can be used to set up additional similar nodes quickly, especially when the additional installation options for bootstrap are long.

'''-V'''

A small option for bootstrap, -V forces the initial chef-client run on the target node to use the debug log level, following the bootstrap installation. This allows almost complete remote setup of new nodes, helping to confirm that the node is ready to be used and that the chef-client is fully operational. In addition, the debug log records any errors in the chef client, in case something goes wrong with the node. The exact command run is shown below:
<pre>chef-client -l debug</pre>

'''Custom settings and adding them to the knife.rb file'''

For some settings, including settings related to bootstrap, the options need to be added to the knife.rb file. This file holds the configuration for Knife's commands and their options. For example, to use the --bootstrap-template and --sudo options, the following settings need to be added to knife.rb:
<pre>knife[:template_file]
knife[:use_sudo]</pre>
Setting template_file and use_sudo in knife.rb makes those options available to the bootstrap command. More options can be set through knife.rb; the full Knife documentation is here: ([https://docs.chef.io/chef/knife.html# Knife Reference]).

== References ==
<references/>

CSC/ECE 517 Spring 2015/ch1a 2 WA

2015-02-09T20:59:19Z

Abhanda3: /* Examples */

CSC/ECE 517 Spring 2015/ch1a 2 WA

2015-02-09T20:57:52Z

Abhanda3:

File:Workstation.png

2015-02-09T20:57:20Z

Abhanda3:

CSC/ECE 517 Spring 2015/ch1a 2 WA

2015-02-09T20:56:59Z

Abhanda3:

Knife
[[File:workstation.png|frame|Source: https://docs.chef.io/chef_quick_overview.html/|right]]
Knife is the command-line tool for managing Chef nodes. Put simply, Chef distributes server environments across many different servers (called '''nodes'''). Any changes to the primary Chef server (called the '''chef repo''') are distributed to all the other nodes, while individual nodes can have their own recipes and send them back to the chef repo. Knife handles the communication between the nodes and the chef repo. For example, if a node needs an object that lives on the chef repo, Knife provides the tools to download that object to the node. Knife also supports setting up a node, installing necessary packages, managing users, and much more.

== Background ==

Chef streamlines the task of configuring and maintaining a company's servers, and can integrate with cloud-based platforms.
If you are unfamiliar with Chef and how it works, check out the comparison between Chef and Puppet here: [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_517_Fall_2013/ch1_1w10_ga] or take a look at the official Chef website here: <ref>[https://www.chef.io/ Chef]</ref>. There, a deeper understanding of Chef can be attained.

Chef includes two important command-line tools.
*Knife command-line tool
*Chef command-line tool

Knife is the command-line tool that provides an interface between a local chef-repo and the Chef server, whereas the Chef command-line tool is used for working with the chef repo. Knife helps manage the following:

* Nodes
* JSON data stores
* Environments
* chef-client installations on management workstations

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"> [[File:chef_knife server.png]] <ref>[http://dev.classmethod.jp/server-side/chef-server-install/ Chef Server]</ref></div>

The figure shows how Knife provides an interface between the local repository and the Chef server at the workstation.

== Examples ==

Using Knife is fairly straightforward, with usage following this syntax:

<pre> knife [verb] [object] [options] </pre>

The different verbs, or subcommands, for Knife are as follows, from the Using Knife page<ref> [https://docs.chef.io/knife_using.html Knife Documents]</ref>: bootstrap, client, configure, cookbook, cookbook site, data bag, delete, deps, diff, download, edit, environment, exec, index rebuild, list, node, recipe list, role, search, show, ssh, status, tag, upload, user, and xargs.

'''Warning:''' Before you run many of these commands, you should have the knife editor set correctly. In the chef-repo, there are multiple ruby files that set the configuration of the environment. Inside of knife.rb, you need to add or set the following line.

<syntaxhighlight lang="ruby">
knife[:editor] = "/usr/bin/vim"
</syntaxhighlight>

The path can be set to the text editor of your choice; the line above sets the editor to vim.

The example we will use is also one that is useful for setting up nodes, and that is the usage of the bootstrap command. To set up a node, go to the command line on the chef repo and execute the following command:

<pre>knife bootstrap ip_address_of_node -x username -P password --sudo </pre>

First, the verb used is '''bootstrap'''. Bootstrap is the subcommand that installs a chef client on a targeted node.
Second, the object is ip_address_of_node. Every subcommand takes an object, and since Chef builds on many Ruby principles and scripts, the object can be almost anything. For bootstrap, the object is the IP address of the target node. For other commands, it might be something different.
Third, we have the options for the bootstrap command and the object. For bootstrap, these include the username and password for the node, and the --sudo option, which runs the bootstrap operation with sudo so that commands requiring elevated privileges succeed.

Before we do the next example, let's cover an important part of object strings and how they are treated by knife and the chef repo.

'''Wildcards'''
Wildcards can be used in object strings much as in ordinary pattern matching, with one notable difference. The wildcard character itself '''must be escaped using the \ character''' so that the shell does not expand it before Knife sees it. Here is an example from the "Using Knife" page on the Chef website. Let's say the following was used to search for an object:

<pre>data_bags/a\*</pre>

This returns all of the objects whose paths start with data_bags/a on the Chef server. However, suppose we were to run the following instead:

<pre>data_bags/a*</pre>

This only searches the chef repo for objects corresponding to objects whose paths start with data_bags/a on the node, because the * was applied '''before''' being sent to the chef repo, instead of after. As an example, let us list all of the data bags on the server that start with 'a'. Our object, then, is the string above, <code>data_bags/a\*</code>. The verb that lists objects on the chef repo is <code>list</code>. The command we would run is this:

<pre>knife list data_bags/a\*</pre>

This prints to the console all of the objects in the data_bags directory whose names start with a.

== Bootstrapping A Node ==

One of the key uses of Knife is "bootstrapping" a node <ref>[http://docs.chef.io/open_source/install_bootstrap.html Install Bootstrap]</ref>. Bootstrapping allows someone using the Chef server to set up a remote node, or client. Knife not only sets up all of the relevant recipes, environments, and other items related to the chef repo itself, it also installs all of the tools necessary for interacting with the Chef server and configuring the chef client. Because of this, a basic understanding of how to bootstrap a node, and some knowledge of the available options, is important.

Luckily, bootstrapping a node is very simple. The general command looks like this. Be aware that it is run from the workstation where knife is configured, not on the node being set up.

<pre>knife bootstrap ip_address_of_node -x username -P password --sudo --node-name name_of_node</pre>

This process will create a chef client at the designated IP address. The username and password are those of a user with root access on the remote node. The --node-name option sets the name of the node explicitly instead of letting it be derived from other settings; it is shown here to make the rest of the example easier to understand. The knife bootstrap command uses an omnibus installer that automatically detects the OS of the target machine and installs the command line tools and dependencies, like ruby, that the chef client needs to function. Once the bootstrap command is complete, the following message will be displayed.
<pre>INFO: Report handlers complete</pre>
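When several nodes need to be bootstrapped, it can help to assemble the command from variables and review it before running it. The following is a minimal sketch under that assumption; <code>build_bootstrap_command</code> is a hypothetical helper, not part of knife, and it uses only the options shown in the command above.

```shell
# Placeholder values; replace them with real settings before use.
node_ip="ip_address_of_node"
ssh_user="username"
ssh_password="password"
node_name="name_of_node"

# Hypothetical helper: assemble the bootstrap command from its four
# arguments (IP address, username, password, node name).
build_bootstrap_command() {
    printf 'knife bootstrap %s -x %s -P %s --sudo --node-name %s\n' \
        "$1" "$2" "$3" "$4"
}

# Dry run: print the command for review instead of executing it.
build_bootstrap_command "$node_ip" "$ssh_user" "$ssh_password" "$node_name"
```

The sketch only prints the command, so nothing is installed until the printed line is run by hand.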

However, before continuing, confirmation of the remote node is necessary. To confirm that the remote node was installed correctly and is running, run the following command.

<pre>knife client show name_of_node</pre>

Here, name_of_node is the name that was given to the node, for example with the --node-name option used above. When this is run, the node's information should be displayed, such as whether the node is an admin, the name of the node, and the JSON class it was created from.

<pre>admin: false
chef_type: client
json_class: Chef::ApiClient
name: name_of_node
public_key:</pre>
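The check can also be scripted by reading individual fields out of that output. The sketch below is one possible approach, assuming the output keeps the "key: value" layout shown above; <code>get_field</code> is a hypothetical helper, and the sample text is copied from the example rather than produced by a live server.

```shell
# Sample output copied from the example above; in practice it would come
# from: knife client show name_of_node
sample_output='admin: false
chef_type: client
json_class: Chef::ApiClient
name: name_of_node
public_key:'

# Hypothetical helper: print the value of one "key: value" field.
# $1 is the full output text, $2 is the key to look up.
get_field() {
    printf '%s\n' "$1" | awk -F': ' -v key="$2" '$1 == key { print $2 }'
}

expected_name="name_of_node"
actual_name=$(get_field "$sample_output" "name")

# Report whether the client name matches the one used during bootstrap.
if [ "$actual_name" = "$expected_name" ]; then
    echo "client $actual_name is registered"
else
    echo "client $expected_name not found" >&2
fi
```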

== Bootstrap Options and Knife Settings ==
As one of the most powerful tools knife has to offer, the bootstrap function has many options available. The "Bootstrapping A Node" section covered three such options (-x, -P, and --node-name), but there are many more. Here are a few that stand out as useful to someone using knife bootstrap for the first time, or to people dealing with something outside of the normal conditions for installing and using knife bootstrap. In addition, adding different settings to the knife.rb file will be covered, for people looking to expand their usage of knife and knife bootstrap. Most of this information, along with all the other options, is covered in greater detail on the main knife bootstrap page, which you can find using the reference here.<ref>[http://docs.chef.io/knife_bootstrap.html Knife Bootstrap]</ref>

'''--bootstrap-curl-options OPTIONS, --bootstrap-install-command COMMAND, --bootstrap-install-sh URL'''

These options allow customization of the installation that bootstrap performs. They are all mutually exclusive, but offer different ways to perform this customization. The first, --bootstrap-curl-options, allows additional cURL options to be passed along with the bootstrap installation. To learn more about cURL, check out their main web page [http://curl.haxx.se/ here (cURL Home Page)].

== References ==
<references/>

CSC/ECE 517 Spring 2015/ch1a 2 WA

2015-02-09T20:44:52Z

Abhanda3: /* Background */

Knife

Knife is the command line tool for managing Chef nodes. Put simply, Chef allows the distribution of server environments between many different servers (called '''nodes'''). Any changes to the primary chef server (called the '''chef repo''') are distributed to all the other nodes, while individual nodes can have other recipes and send them back to the chef repo. Knife, then, handles the communication between nodes and the chef repo. For example, let's say that there is an object on the chef repo that a node needs. Knife provides the tools to download that object to the node. Knife also allows setting up a node, installing necessary packages, managing users, and much more.

== Background ==

Chef streamlines the task of configuring and maintaining a company's servers, and can integrate with cloud-based platforms.
If you are unfamiliar with Chef and how it works, check out the comparison between Chef and Puppet here: [http://wiki.expertiza.ncsu.edu/index.php/CSC/ECE_517_Fall_2013/ch1_1w10_ga] or take a look at the official Chef website<ref>[https://www.chef.io/ Chef]</ref> for a deeper understanding of Chef.

Chef includes two important command-line tools.
*Knife command-line tool
*Chef command-line tool

Knife is the command-line tool that provides an interface between a local chef-repo and the Chef server, whereas the Chef command-line tool is used while working with the chef repo. Knife helps us manage the following.

* Nodes
* JSON data stores
* Environments
* chef-client installations on management workstations

<div class="center" style="width: auto; margin-left: auto; margin-right: auto;"> [[File:chef_knife server.png]] <ref>[http://dev.classmethod.jp/server-side/chef-server-install/ Chef Server]</ref></div>

In the figure, we can see how knife provides an interface between the local repository and the Chef server at the workstation.

== Examples ==

Using Knife is fairly straightforward, with usage following this syntax:

<pre> knife [verb] [object] [options] </pre>

The different verbs, or subcommands, for Knife are as follows, from the Using Knife page<ref> [https://docs.chef.io/knife_using.html Knife Documents]</ref>: bootstrap, client, configure, cookbook, cookbook site, data bag, delete, deps, diff, download, edit, environment, exec, index rebuild, list, node, recipe list, role, search, show, ssh, status, tag, upload, user, and xargs.

'''Warning:''' Before you run many of these commands, you should have the knife editor set correctly. In the chef-repo, there are multiple ruby files that set the configuration of the environment. Inside of knife.rb, you need to add or set the following line.

<syntaxhighlight lang="ruby">
knife[:editor] = "/usr/bin/vim"
</syntaxhighlight>

The path can be set to the text editor of choice; the line above sets the editor to vim.

The example we will use is one that is also useful for setting up nodes: the bootstrap command. To set up a node, go to the command line on the chef repo and execute the following command:

<pre>knife bootstrap ip_address_of_node -x username -P password --sudo </pre>

First, the verb used is '''bootstrap'''. Bootstrap is the subcommand that installs a chef client onto a targeted node.
Second, the object is the ip_address_of_node. Every subcommand takes an object, and since chef is built on many ruby principles and scripts, the object can be almost anything. For bootstrap, the object is the IP address of the target server. For other commands, it might be something different.
Third, we have the options for the bootstrap command and the object. For bootstrap, these include the username and password for the node, and the option --sudo, which runs the bootstrap with sudo privileges so that the installation commands can succeed.
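The verb, object, and options breakdown described above can be sketched as a small script that labels each part of a command line. <code>describe_knife_command</code> is a hypothetical helper that only prints text and never runs knife, and it assumes a one-word verb, so multi-word subcommands such as "data bag" are outside its scope.

```shell
# Hypothetical helper: label the parts of a knife command line.
# Assumes a one-word verb; it only prints text and never runs knife.
describe_knife_command() {
    # $1 is "knife" itself, $2 the verb, $3 the object; the rest are options.
    verb=$2
    object=$3
    shift 3
    echo "verb:    $verb"
    echo "object:  $object"
    echo "options: $*"
}

describe_knife_command knife bootstrap ip_address_of_node -x username -P password --sudo
```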

Before we do the next example, let's cover an important part of object strings and how they are treated by knife and the chef repo.

'''Wildcards'''
A wildcard search works much like shell globbing, with one notable difference: for the pattern to reach the Chef server intact, the wildcard character itself '''must be escaped using the \ character.''' Here is an example from the "Using Knife" page on the chef website. Let's say the following was used to search for an object:

<pre>data_bags/a\*</pre>

This returns all of the objects on the Chef server whose paths start with data_bags/a. However, if we were to run the following:

<pre>data_bags/a*</pre>

This only searches the chef repo for objects that correspond to local files starting with data_bags/a, because the shell on the local machine expands the unescaped * first. In other words, the * is applied '''before''' the request is sent to the chef repo, instead of after. So, as an example, let us list all of the data bags on the server that start with 'a'. Our object, then, is the escaped string above, <code>data_bags/a\*</code>. The verb that lists different objects on the chef repo is called <code>list</code>. The command we would run is this:

<pre>knife list data_bags/a\*</pre>

This prints to the console all of the objects in the data_bags directory whose names start with a.

== Bootstrapping A Node ==

One of the key uses of knife is "bootstrapping" a node <ref>[http://docs.chef.io/open_source/install_bootstrap.html Install Bootstrap]</ref>. Bootstrapping allows someone using the Chef server to set up a remote node, or client. Knife not only sets up the relevant recipes, environments, and other items related to the chef repo, it also installs the tools necessary for interacting with the chef server and configuring the chef client. Because of this, a basic understanding of how to bootstrap a node, along with some knowledge of the available options, is important.

Luckily, bootstrapping a node is very simple. The general command looks like this. Be aware that it is run from the workstation where knife is configured, not on the node being set up.

<pre>knife bootstrap ip_address_of_node -x username -P password --sudo --node-name name_of_node</pre>

This process will create a chef client at the designated IP address. The username and password are those of a user with root access on the remote node. The --node-name option sets the name of the node explicitly instead of letting it be derived from other settings; it is shown here to make the rest of the example easier to understand. The knife bootstrap command uses an omnibus installer that automatically detects the OS of the target machine and installs the command line tools and dependencies, like ruby, that the chef client needs to function. Once the bootstrap command is complete, the following message will be displayed.
<pre>INFO: Report handlers complete</pre>

However, before continuing, confirmation of the remote node is necessary. To confirm that the remote node was installed correctly and is running, run the following command.

<pre>knife client show name_of_node</pre>

Here, name_of_node is the name that was given to the node, for example with the --node-name option used above. When this is run, the node's information should be displayed, such as whether the node is an admin, the name of the node, and the JSON class it was created from.

<pre>admin: false
chef_type: client
json_class: Chef::ApiClient
name: name_of_node
public_key:</pre>

== Bootstrap Options and Knife Settings ==
As one of the most powerful tools knife has to offer, the bootstrap function has many options available. The "Bootstrapping A Node" section covered three such options (-x, -P, and --node-name), but there are many more. Here are a few that stand out as useful to someone using knife bootstrap for the first time, or to people dealing with something outside of the normal conditions for installing and using knife bootstrap. In addition, adding different settings to the knife.rb file will be covered, for people looking to expand their usage of knife and knife bootstrap. Most of this information, along with all the other options, is covered in greater detail on the main knife bootstrap page, which you can find using the reference here.<ref>[http://docs.chef.io/knife_bootstrap.html Knife Bootstrap]</ref>

== References ==
<references/>