Monday, April 26, 2010

Searching 2.0 ---- new frontiers

Searching has entered a new domain with the introduction of Google's "Goggles" service. Goggles is a comprehensive service that lets us search based on images of landmarks, books, logos and so on, but here we will concentrate on book title searching (or any other search that extracts text from an image and then searches for it).

Searching for a book title or a textual logo can be implemented easily: use OCR to extract the text from the image, then search for the extracted text.

OCR stands for "Optical character recognition". It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

In this blog post, we will discuss how to extract text from images ourselves and then implement our own visual search service.

To extract text from the image we will use the Tesseract OCR engine. Tesseract was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. It reads a binary, grey or color image and outputs text. A TIFF reader is built in that will read uncompressed TIFF images, or libtiff can be added to read compressed images.

As a wrapper for this engine we will write a script in Python using an open source Python project called "PyTesser".

PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.

PyTesser uses the Tesseract OCR engine (an Open Source project at Google), converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in Linux as well.

The example we are going to discuss here was tested on Python 2.5. It should work with newer 2.x releases, but not with Python 3.

Another component that needs to be added to Python 2.5 is PIL (the Python Imaging Library). It is required to work with images in memory. PyTesser has been tested with Python 2.5 on Windows XP.

So, now moving on to the more exciting part, i.e. actually getting something done. Download PyTesser and extract it into the Scripts folder of the Python25 directory. The Tesseract OCR engine's binary is included in the PyTesser archive.

We have used the IDLE editor for Python coding, but any other editor can be used. Eclipse can also be used for Python development with the "PyDev" plugin installed.

After setting up the environment, paste this code into the newly created project/file.

from pytesser import * # import pytesser's names, including PIL's Image module
image = Image.open('fnord.tif') # open the sample image as a PIL image object
print image_to_string(image) # run tesseract.exe on the in-memory image
print image_file_to_string('fnord.tif') # or run it directly on the image file


We have used the example image provided with the PyTesser package here. Running this code analyzes the image, extracts the text embedded in it, and prints that text to the console.
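The text that comes back from the OCR step often carries line breaks and stray spaces, so it may help to tidy it up before searching with it. The snippet below is a minimal sketch of that idea. Note that clean_ocr_text is a hypothetical helper of our own (it is not part of PyTesser), and the sample string is made up to stand in for real OCR output.

```python
import re


def clean_ocr_text(raw_text):
    """Collapse runs of whitespace (including newlines) into single spaces.

    Hypothetical helper for illustration; not part of PyTesser.
    """
    return re.sub(r"\s+", " ", raw_text).strip()


# Made-up stand-in for OCR output: a title split across lines.
sample = "The  Art of\nComputer Programming\n\n"
print(clean_ocr_text(sample))
```

In the PyTesser script, the string returned by image_to_string(image) would be passed in instead of the sample.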

So, now we have to use the extracted text for searching.
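One simple way to do that is to turn the extracted text into a search URL. The sketch below is only an illustration under a few assumptions: build_search_url and SEARCH_URL are names we made up, the Google query URL is just one possible target (any search engine that accepts the query in the URL would do), and the import fallback is there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2

# Assumed search endpoint; swap in any engine that takes the query in the URL.
SEARCH_URL = "http://www.google.com/search?q="


def build_search_url(text, base_url=SEARCH_URL):
    """Turn OCR text into a search URL (hypothetical helper)."""
    # Join the words with single spaces, dropping newlines and extra blanks.
    query = " ".join(text.split())
    # quote_plus escapes special characters and turns spaces into '+'.
    return base_url + quote_plus(query)


print(build_search_url("fnord"))
```

The resulting URL can then be opened in the default browser with the standard library's webbrowser.open(url), or fetched from the script if we want to process the results ourselves.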

This relatively simple technique can be used in many scenarios, for example getting information about the latest offer for a product, or finding the different branches of a chain of stores. Let's discuss the endless possibilities and unleash our creativity here.


Download Links:

Python Imaging Library (PIL) : http://www.pythonware.com/products/pil/
PyTesser : http://code.google.com/p/pytesser/
Tesseract OCR engine : http://code.google.com/p/tesseract-ocr/

Saturday, April 17, 2010

Structures and Classes in C++

The struct keyword is losing out to class because of its strong association with legacy C code.
But most people ignore the fact that structs have been totally reinvented in C++.

The only difference between struct and class in C++ is the default access: members are public by default in a struct and private by default in a class. The same holds for inheritance, which is public by default for a struct and private by default for a class.

That is all the difference there is between structs and classes, so why have structs been almost totally ignored in recent programming practice?

Structs can be safely initialized using constructors, can behave polymorphically and support late binding. In short, they offer everything you need to write good OO code.


Check out the code given below.

#include <iostream>
using namespace std;

struct Whatever
{
    virtual void foo() { }
    virtual ~Whatever() { } // virtual destructor, so delete through a base pointer is safe
};

struct Derived : Whatever // public inheritance by default for a struct
{
    void foo() { cout << "value=" << value; }
    Derived(int n) : Whatever(), value(n) { }

private:
    int value; // private data member
};

int main()
{
    Whatever *pW = new Derived(10);
    pW->foo(); // virtual call
    delete pW;
    return 0;
}


IMHO structs are fighting a losing battle because of the mindset of developers. We have been taught that C++ means classes; C++ even started out as "C with Classes".
We learn that anything which needs a plain aggregate data type can be implemented using structs, while a complex data type that adheres to OOP principles and represents some real-world object needs to be implemented as a class.

There is no problem in using either classes or structs, for that matter. We just need to open our minds to the realities of C++ rather than following and believing in myths.