Mobile OCR, Face and Object Recognition for the Blind
The main goal of The vOICe vision technology is to offer an equivalent of "raw" visual input to blind people, via complex soundscapes, thus leaving the recognition tasks to the human brain. As a complement to that, however, it would be useful to have options for automatic recognition through computer vision technology. This page challenges object recognition engine developers to demonstrate the applicability of computer vision on mobile devices in real-life situations. It is an open invitation - with an open interface - to deliver convincing demonstrations for use with The vOICe for Windows. (Note that The vOICe for Android nowadays includes live mobile OCR for short texts.)
In walking around while wearing The vOICe with a head-mounted camera and stereo headphones - preferably all integrated in video sunglasses - it would be convenient if the blind user could occasionally have any text in the camera view automatically recognized and spoken by The vOICe, using speech recognition and speech synthesis for the user interface. The vOICe now supports all of this functionality by invoking an external OCR engine for optical character recognition (OCR) of any text in the camera view. This could in principle help with reading large print (headlines), street signs (names), name tags and labels on or beside office doors, LCD and LED displays of digital clocks, calculators, VCRs, microwave ovens, elevators, etcetera.
It must be stressed that right now we only present the proof of concept by integrating free OCR engines with The vOICe: the actual text recognition results with most OCR engines will generally prove to be very poor, because they were not yet designed for use with low resolution live video from a PC camera. More development in this area is needed to arrive at more robust text recognition.
In other words, if you are a blind user hoping to find a reliable way of reading text with a wearable camera, you will most likely be very disappointed by what can be achieved right now, but we have to start somewhere or else there will be no progress.
Installation involves the following steps, assuming that you have already downloaded the latest version of The vOICe for Windows executable (voice.exe):
- Download the zipped DLL file vOICeJPG.zip (100 K), and unzip this file to obtain the vOICeJPG.dll file.
- Download the image format converter program djpeg.exe (60 K).
- Download the GOCR version 0.50 Windows executable gocr.exe (about 159 K).
- Move the three files (vOICeJPG.dll, djpeg.exe and gocr.exe) to the same directory where The vOICe for Windows executable (voice.exe) is stored.
Purpose: the DLL file vOICeJPG.dll allows The vOICe program to save images in JPEG format (in addition to the BMP format), while the djpeg.exe program will be used to convert the JPEG image files to the PNM format image files as supported by the GOCR program gocr.exe (Win32 binary). The vOICeJPG.dll and djpeg.exe modules are based on the work of the Independent JPEG Group. The free OCR engine GOCR is an open source OCR project, with the GOCR homepage at jocr.sourceforge.net and www.gocr.de.
Now while running The vOICe, press Control R. The vOICe will then save your current view as an image snapshot file vOICe.jpg (as well as vOICe.bmp), and run the batch file recognize.bat. If this batch file does not exist (in the same directory as voice.exe), it will first be automatically created by The vOICe. By default, this batch file will just contain the two command lines
Note that any non-console programs must be prefixed by "start /w" to ensure that Windows waits for a program to finish before starting the next program in the batch file, or else crashes may result if the next program attempts to read results that the previous program has not yet finished writing. Sometimes it may also be useful to put a small delay in between commands. This can be done with extra command lines like "ping -n 2 127.0.0.1>NUL" that use a dummy timed ping to cause a delay. Windows waits about one second between successive pings, so two pings give roughly a one second delay.
So the JPEG snapshot will be converted to PNM format by the djpeg.exe program, and the resulting vOICe.pnm file will form the input for the OCR engine GOCR. The plain text results from GOCR are saved in an ASCII file vOICe.ocr. Once the batch job has finished, The vOICe will take control again and print the (filtered) contents of the vOICe.ocr file to a dialog. When done, The vOICe will resume normal live soundscape generation. In case the batch file window does not automatically disappear once the batch job has finished, check the "Close on exit" checkbox in the Properties | Program tab of the recognize.bat batch file. This should make the window automatically disappear in all later runs.
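The two-stage flow described above (JPEG to PNM, then PNM to text) can also be sketched outside a batch file. The Python below is only an illustration and not the batch file that The vOICe generates: it assumes that djpeg.exe and gocr.exe are reachable from the working directory, and it uses the standard djpeg switches `-pnm` and `-outfile` and the GOCR switches `-i` and `-o`. The function names are made up for this example.

```python
import subprocess
from pathlib import Path


def build_pipeline(workdir="."):
    """Return the two commands, as argument lists, in the order they must run."""
    workdir = Path(workdir)
    jpg = workdir / "vOICe.jpg"   # snapshot saved by The vOICe
    pnm = workdir / "vOICe.pnm"   # intermediate file read by GOCR
    ocr = workdir / "vOICe.ocr"   # text file that The vOICe reads back
    return [
        ["djpeg", "-pnm", "-outfile", str(pnm), str(jpg)],
        ["gocr", "-i", str(pnm), "-o", str(ocr)],
    ]


def run_pipeline(workdir="."):
    """Run both stages in sequence and return the recognized text."""
    for command in build_pipeline(workdir):
        # subprocess.run blocks until the program exits, so the second
        # stage never starts before the first has finished writing its file.
        subprocess.run(command, check=True)
    result = Path(workdir) / "vOICe.ocr"
    return result.read_text(encoding="ascii", errors="replace")
```

Calling `run_pipeline()` in the directory that holds voice.exe would then mirror what the batch job does. With `check=True`, a failing stage raises an error, so the next stage is not run on a stale or missing file.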
Simple command line based interfaces like the one used above are also commonly used in benchmarking studies and in competitions, e.g., ICDAR, such that efficient reuse is easily accomplished through very minor modifications, while file I/O is rarely a performance bottleneck as compared to the CPU time spent on the recognition.
Moreover, in case you would want to keep a growing history of snapshots that are not overwritten, you can add lines like
to automatically add a timestamp to a copy of every saved vOICe.jpg file (e.g., "vOICe 2006-07-25 21h12m30s.jpg"). One can view this as related to Microsoft's SenseCam (MyLifeBits) project.
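The naming scheme of such timestamped copies can be reproduced in a few lines of Python. This is a sketch of the idea rather than the batch lines themselves, and the function names are illustrative only.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_name(moment):
    """Build a file name such as 'vOICe 2006-07-25 21h12m30s.jpg'."""
    return moment.strftime("vOICe %Y-%m-%d %Hh%Mm%Ss.jpg")


def archive_snapshot(snapshot="vOICe.jpg", moment=None):
    """Copy the latest snapshot next to itself under a timestamped name."""
    snapshot = Path(snapshot)
    moment = moment or datetime.now()
    target = snapshot.with_name(archive_name(moment))
    shutil.copy2(snapshot, target)  # copy2 also preserves the file times
    return target
```

Because the timestamp has a resolution of one second, two snapshots taken within the same second would map to the same file name, and the later copy would overwrite the earlier one.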
Mobile face and object recognition
Third-party developers can simply modify the contents of the batch file recognize.bat, because The vOICe will not overwrite it once it exists. This open interface makes it very easy to replace the invoked OCR engine and to include other types of visual object and visual pattern recognition engines for use in a cognitive vision system (to implement an auto-tagging "virtual commentator", "virtual reporter" or "virtual sighted guide"). Artificial cognitive systems in general need to be fed with real-life data for training purposes.
The vOICe saves a camera snapshot on request, and the third-party recognition engine then processes the image file and writes a text file. One can think of integration with other more or less specialized applications for image analysis, such as face recognition, object recognition and object categorization, and automatic interpretation of bar codes, currency, signs, or logos. For instance, one may consider using the Foveola command line tools for shape recognition, as developed by Patrick Andrews of Break-Step Productions Ltd, forming the basis of the SceneReader sign reader for extracting text from images, or face recognition and sign reading technology from Riya, or the Microsoft Photo2Search photograph-based search project: approaches that make adaptive (trainable) vision systems through the use of many visual training examples. Startups such as Numenta, founded by Jeff Hawkins, Dileep George and Donna Dubinsky, can test and demonstrate their artificial intelligence capabilities on practical real-world recognition tasks that would supplement The vOICe's direct visual mapping approach. One may also consider building a database of feature signatures for everyday visual object views based on David Lowe's SIFT approach (Scale-Invariant Feature Transform), which formed the basis of the ViPR (visual pattern recognition) technology of Evolution Robotics. Another method is SURF (Speeded Up Robust Features, by Herbert Bay and others). A starting point for testing can be the use of public image databases such as COIL-20 or ETH-80.
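The file-based contract at the start of this section (snapshot file in, text file out) is small enough to sketch as a skeleton for a third-party engine. The Python below is only an illustration: the `recognize` function is a hypothetical placeholder that does no actual recognition, and the file names vOICe.jpg and vOICe.ocr are the ones used elsewhere on this page.

```python
from pathlib import Path


def recognize(image_bytes):
    """Hypothetical stand-in for a real engine: reports only the snapshot size."""
    return f"snapshot of {len(image_bytes)} bytes, no recognizer attached"


def process_snapshot(workdir="."):
    """Read the snapshot saved by The vOICe and write the result text file."""
    workdir = Path(workdir)
    image_bytes = (workdir / "vOICe.jpg").read_bytes()
    text = recognize(image_bytes)
    # The result file is described as plain ASCII, so any character
    # outside ASCII is replaced rather than written as-is.
    (workdir / "vOICe.ocr").write_text(text, encoding="ascii", errors="replace")
    return text
```

A real face, object or sign recognizer would replace the body of `recognize`. The rest stays the same, because the only coupling with The vOICe is through the two files.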
Related projects and comments
Related "Mobile OCR for the Blind" projects exist in the form of the knfbReader Mobile from K-NFB Reading Technology, codeveloped by Kurzweil and NFB (KNFB, a Nokia N82 with OCR engine and TTS), the ITEX SiSystem SiRecognizer netbook and the AdvantEdge Reader. The Kurzweil reader is a handheld Pocket PC based device under development in a cooperation between Kurzweil Technologies, Inc. (KTI) and the National Federation of the Blind (NFB). When used with the same OCR engine and resolution settings, recognition performance of The vOICe should be similar to that of commercial portable readers. Other related projects include:
- the Google-sponsored OCRopus OCR project of the IUPR research group;
- the Sypole project of the Faculté Polytechnique de Mons (FPM), TCTS and Université Libre de Bruxelles in Belgium;
- kooaba AG;
- the Trinetra project of Priya Narasimhan of Carnegie Mellon University;
- the "DORA project" (Digital Object Recognition Audio-Assistant) for the visually impaired by Wolfgang Fink and Mark Tarbell of California Institute of Technology and James Weiland and Mark Humayun of Doheny Eye Institute, University of Southern California;
- the pedestrian crossing "electronic eye" project by Tadayoshi Shioyama and Mohammad Shorif Uddin at Kyoto Institute of Technology, Japan, on single camera detection of pedestrian crosswalks and traffic lights;
- the machine vision work by Mark Nitzberg, and Alan Yuille and others at Blindsight Corporation;
- Simon Thorpe's SpikeNet Technology approach of biologically inspired object recognition (SNVision) through neural networks consisting of asynchronous spiking neurons.
NEC and NAIST are working on OCR for mobile camera phones, according to the New Scientist article "Camera phones will be high-precision scanners". Google is working on Google Goggles, targeting the possibility of using a camera phone or equivalent to recognize objects and texts from the environment in order to search for information (Google Visual Search).
Reports from participants in The vOICe project suggest that even if automatic visual recognition becomes technically feasible and reliable, many blind people would whenever possible still prefer to learn to "see for themselves" through a more direct non-interpreted visual view such as provided by The vOICe. A robust recognition engine could then serve a useful secondary role as a training tool or assist with special types of patterns.
If your camera supports it, The vOICe automatically temporarily switches to a higher resolution (up to VGA) when taking a snapshot, such that recognition engines can work with that higher resolution snapshot rather than the default 176 by 144 pixel view. The vOICe can also acquire images from a TWAIN compliant flatbed scanner or digital still image camera (Control Q) for subsequent OCR analysis (Control R). Better still, The vOICe can directly acquire a high resolution image from a TWAIN compliant device and apply OCR when pressing Control Alt R (or using the spoken "recognize" command when no video capture device is connected to the computer).
Open Source OCR 1: GOCR project
The GOCR project is seeking volunteers for the further development of the GOCR engine and software library. For use of GOCR with The vOICe, it would be particularly welcome if work started on image preprocessing to improve the accuracy in extracting text embedded in video scenes (including
Also note that new command line driven and file-based image processing engines can, if desired, be first developed and tested under Linux, and subsequently ported to Microsoft Windows for combination with The vOICe. This is in fact what happened with the GOCR engine.
GOCR can also recognize barcodes, and Rob Fugina's Internet UPC Database (upcdatabase.com) can be used to retrieve product information associated with recognized barcode numbers.
Open Source OCR 2: Tesseract OCR
Another OCR project is Tesseract OCR. Microsoft Windows executables for Tesseract are available as free downloads at code.google.com. Due to lack of documentation, it is still somewhat unclear what types of BMP files Tesseract supports, but it does appear to support greyscale BMP files. Tesseract is run through command lines like "tesseract phototest.tif output", usually applied in a batch file. One can easily integrate it with The vOICe (which does not generate greyscale BMP files itself) via its open interface, by using the following command lines in recognize.bat,
where the JPEG output from The vOICe is first converted to greyscale BMP, after which the Tesseract OCR engine is invoked to yield a plain text output file vOICe.txt, which is finally moved to a vOICe.ocr file as expected by The vOICe for further processing (dialog popup or synthetic speech output). For the examples on this web page, GOCR appears to perform slightly better than Tesseract.
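The three steps just described (convert to greyscale BMP, run Tesseract, move the text file) can be sketched in Python as well. This is an illustration under stated assumptions, not the batch lines referred to above: it assumes djpeg.exe and a tesseract executable are reachable, uses the standard djpeg switches `-grayscale`, `-bmp` and `-outfile`, and writes the greyscale image to a hypothetical file vOICe_grey.bmp so that the vOICe.bmp snapshot saved by The vOICe is left untouched.

```python
import subprocess
from pathlib import Path


def tesseract_commands(workdir="."):
    """Return the conversion and recognition commands as argument lists."""
    workdir = Path(workdir)
    jpg = workdir / "vOICe.jpg"
    grey = workdir / "vOICe_grey.bmp"  # hypothetical intermediate file name
    base = workdir / "vOICe"           # Tesseract appends .txt to this base
    return [
        ["djpeg", "-grayscale", "-bmp", "-outfile", str(grey), str(jpg)],
        ["tesseract", str(grey), str(base)],
    ]


def publish_result(workdir="."):
    """Move vOICe.txt to vOICe.ocr, replacing any earlier result."""
    workdir = Path(workdir)
    target = workdir / "vOICe.ocr"
    (workdir / "vOICe.txt").replace(target)
    return target


def run_tesseract(workdir="."):
    """Run both commands in order, then hand the text file to The vOICe."""
    for command in tesseract_commands(workdir):
        subprocess.run(command, check=True)  # waits for each program to exit
    return publish_result(workdir)
```

The move is kept as a separate last step so that The vOICe only ever sees a complete vOICe.ocr file, never one that Tesseract is still writing.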
Yet another open source OCR project is GNU Ocrad, but its use in combination with The vOICe for Windows has not yet been investigated.
Commercial OCR: TopOCR
The TopOCR product now includes a command line interface and one can very easily integrate it with The vOICe via its open interface by using the following single command line in recognize.bat,
Of course one must apply an appropriate path change for the TopOCR executable when using a version other than the free trial version 2.4 that was used here (which yielded "Neural Device _!~a^1" for the first test image on this web page): the executable path may be "C:\Program Files\TopOCR\topocr.exe" for the fully registered version. The TopOCR command line parameters may also need adjustment for best results.