Monday, April 26, 2010

Searching 2.0 ---- new frontiers

Searching has entered a new domain with the introduction of Google's "Goggles" service. Goggles is a comprehensive service that lets us search based on images of landmarks, books, logos and so on, but here we will concentrate on book title searching (or any other search that relies on extracting text from an image and then searching for that text).

Searching for a book title or a textual logo can be implemented fairly easily: use OCR to extract the text from the image, then run a search on the extracted text.

OCR stands for "Optical character recognition". It is widely used to convert books and documents into electronic files, to computerize a record-keeping system in an office, or to publish the text on a website. OCR makes it possible to edit the text, search for a word or phrase, store it more compactly, display or print a copy free of scanning artifacts, and apply techniques such as machine translation, text-to-speech and text mining to it. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

In this blog post, we will discuss how to extract text from images ourselves and then implement our own visual search service.

To extract text from the image we will use the "Tesseract OCR engine". Tesseract was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. It reads a binary, grey or color image and outputs text. A built-in TIFF reader handles uncompressed TIFF images, and libtiff can be added to read compressed ones.

As a wrapper for this engine we will write a Python script using an open source Python project called "PyTesser".

PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.

PyTesser uses the Tesseract OCR engine (an Open Source project at Google), converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in Linux as well.
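
To make the "external script" idea concrete, here is a rough sketch of what calling Tesseract from Python can look like. It is not PyTesser's actual code: the helper names build_tesseract_command and ocr_file are hypothetical, the sketch assumes a tesseract executable is reachable on the PATH, and it skips the image conversion step that PyTesser performs first. It relies only on Tesseract's basic command line form, which takes an input image and an output base name and writes the text to that base name plus ".txt".

```python
import os
import subprocess


def build_tesseract_command(image_path, output_base, tesseract_exe="tesseract"):
    # Basic Tesseract invocation: input image, then output base name.
    # Tesseract writes the recognized text to <output_base>.txt.
    return [tesseract_exe, image_path, output_base]


def ocr_file(image_path, output_base="ocr_output", tesseract_exe="tesseract"):
    """Run the external Tesseract program and return the text it produced."""
    command = build_tesseract_command(image_path, output_base, tesseract_exe)
    exit_code = subprocess.call(command)
    if exit_code != 0:
        raise RuntimeError("tesseract exited with status %d" % exit_code)
    text_path = output_base + ".txt"
    result_file = open(text_path)
    try:
        return result_file.read()
    finally:
        # Close and delete the temporary text file Tesseract created.
        result_file.close()
        os.remove(text_path)
```

With Tesseract installed, ocr_file('fnord.tif') would return the recognized text as a string. PyTesser wraps this same kind of call and adds the format conversion, which is why we use it rather than calling the executable by hand.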

The example discussed here was tested on Python 2.5. It should work with newer 2.x versions, but not with Python 3, since the example uses the Python 2 print statement.

Another component that needs to be added to Python 2.5 is PIL (the Python Imaging Library), which is required to work with images in memory. PyTesser has been tested with Python 2.5 on Windows XP.

Now for the more exciting part: actually getting something done. Download PyTesser and extract it into the scripts folder of the python25 directory. The Tesseract OCR engine's binary is included in the PyTesser archive.

We used the IDLE editor for the Python code, but any other editor will do. Eclipse can also be used for Python development with the "pydev" plugin installed.

After setting up the environment, paste this code into a new project or file.

from pytesser import *                   # pytesser's names, including PIL's Image module
image = Image.open('fnord.tif')          # open the sample image as a PIL image object
print image_to_string(image)             # OCR the in-memory image (runs tesseract.exe)
print image_file_to_string('fnord.tif')  # OCR directly from the image file on disk


We have used the sample image that ships with the PyTesser package. Running this code analyzes the image, extracts the text embedded in it, and prints that text to the console.

So, now we have to use the extracted text for searching.
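
As a minimal sketch of this step, the snippet below turns raw OCR output into a search URL. The helper names clean_ocr_text and build_search_url are hypothetical, and the Google query URL used as the default is an assumption; any search engine that accepts the query in its URL would work the same way. OCR output usually contains stray line breaks and extra spaces, so the text is tidied before it is URL-encoded.

```python
import re

try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2


def clean_ocr_text(raw_text):
    # Collapse runs of whitespace (including newlines) into single spaces
    # and trim the ends, so the text reads as one search phrase.
    return re.sub(r"\s+", " ", raw_text).strip()


def build_search_url(raw_text, base_url="http://www.google.com/search?q="):
    # URL-encode the cleaned text and append it to the (assumed) query URL.
    return base_url + quote_plus(clean_ocr_text(raw_text))


# Stand-in for the string returned by image_to_string(image).
sample_ocr_output = "  The Art of\nComputer   Programming \n\n"
print(build_search_url(sample_ocr_output))
```

The resulting URL can be opened in a browser, for example with the standard webbrowser module, or fetched and parsed by the script itself.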

This relatively simple technique can be used in many scenarios: getting information about the latest offer on a product, for example, or finding the different branches of a chain of stores. Let's discuss the endless possibilities and unleash our creativity here.


Download Links:

Python Imaging Library (PIL) : http://www.pythonware.com/products/pil/
PyTesser : http://code.google.com/p/pytesser/
Tesseract OCR engine : http://code.google.com/p/tesseract-ocr/
