OCR for FME: “Now I know my ABC”
Searching through long-forgotten stuff hidden for years in the attic can sometimes bring unexpected treasures. Look at this rare issue of “FME Insider” from pre-cloud days, maybe even from pre-digital times!
Having such a priceless object, we take all the necessary measures to preserve this precious piece of history, but wouldn’t it be great to share it with all — across space and time?
We can do it easily by taking a picture of the artefact (this is what I did for this article), but how can we make the text more usable for generations to come? How can we make it searchable, indexable, and quotable to simplify the life of researchers and all FME fans?
The answer is, we have to convert the image to text. The technology for such a transformation does exist. It is called OCR, or Optical Character Recognition, and now it is available in FME through an integration with Tesseract.
Below I will talk about how FME and Tesseract work together, and how they make each other more powerful.
How TesseractCaller works
Tesseract part: Raster to plain text / xhtml
Tesseract is an open-source command line tool. It takes rasters as input, performs optical character recognition, and outputs either plain text or hOCR, an xhtml code that preserves text, style, layout, and other information about the scanned material.
Tesseract must be downloaded and installed before FME can use it. The tool can recognize over 100 languages, but each language must be installed before using it.
Fun fact: A Tesseract is also
secret weapon a 4-dimensional cube.
FME part: Pre- and Post-processing
FME prepares rasters for Tesseract and processes the output. If the user decides to output text as xhtml, FME analyzes the code and outputs separately: words, words assembled into lines, lines assembled into paragraphs, and paragraphs assembled into pages. Also, FME outputs bounding boxes with associated text for each object.
Preparing rasters for Tesseract
TesseractCaller works with rasters having RGB24 interpretation (three 8-bit color bands). FME can process any raster to match this requirement. For example, if a source image has an alpha band (that is, its interpretation is RGBA32), RasterSelector and RasterBandRemover can remove it from the image.
A more advanced manipulation involves raster expressions. Tesseract produces the best results when there is high contrast between the text and background. By applying raster expressions, we can make dark colors representing text really black, and light tones of the background really white.
The following expression applied to each band sets the pixel color to 0 (black) if its original values are below 100, and to 255 (white) otherwise:
which yields the following result:
TesseractCaller allows recognition of text written in two languages. The users can specify primary and secondary languages as well as character encoding.
Processing the results
When the user chooses the “Text” option for the output, TesseractCaller simply outputs the original feature with a new attribute fme_text_string that contains the whole text recognized on a page:
The “hOCR (xhtml)” option produces an XML output that can look as follows:
The XML code represents a hierarchy of elements, which begins from a page (class=’ocr_page’) down to a single word (class=’ocrx_word’). Throughout the whole hierarchy, all elements get a unique ID. The recognized text is located on the lowest level of the hierarchy — that is, on the word level. Each element also has a parameter that defines its bounding box.
TesseractCaller goes through the XML and, using lots of XMLXQueryExtractors, extracts all elements, passes all the IDs down, and all the words — up to the top of the hierarchy. This means words are combined into lines, lines are combined into paragraphs, and paragraphs are assembled into pages.
Each hierarchy level has its own output port through which the elements leave the transformer:
- Page (contains all text found on each page, has its ID)
- Paragraph (contains all text found in each paragraph, has its ID and the ID of the page it belongs to)
- Line (contains all text found in each line, has its ID and the IDs of the paragraph and page it belongs to)
- Word (contains a single word, has its ID and the IDs of the line, paragraph and page it belongs to)
Depending on the final purpose, the user can use any port and get the recognized text in its entirety or split it into pieces of different sizes.
For example, it makes sense to use the whole page when we process a scanned book. If we photographed lots of business cards, it might be better to use lines (names are usually placed on their own line), which we can then try to merge with the existing contact database:
Unlike the human eye, for which only doctors’ handwriting might be beyond recognition, OCR programs are pretty demanding as to the quality of the text.
For best results, Tesseract should receive the text in a pretty high resolution.
- Tesseract Tip #1: The height of a single character should be about 20 or more pixels.
Also, rotation may lead to less satisfactory results or no results at all.
- Tesseract Tip #2: Supply the text as horizontally as possible.
This means Tesseract is good with materials like scanned books, documents, and business cards, but it is not really a tool that would allow extracting text features from maps.
Tesseract does not guarantee that the text will be converted with 100% accuracy. Some extra processing can help with this.
- Tesseract Tip #3: Post-processing with FME or other external tools can help correct spelling and/or formatting problems.
Try it yourself
TesseractCaller is another example of how FME can be integrated with the third-party command line tools by building a workspace without any programming. The workspace behind the transformer is quite interesting and it’s worth having a closer look — it shows a few interesting tricks and advanced techniques.
Check, for example, how the command line is created with a series of AttributeCreators with conditional input, or how bad features are being rejected by the workflow. You can even learn a bit of XQuery if you feel really adventurous. Feel free to modify the transformer to exclude the parts that are not required in your workflow — for example, keep only the Paragraph output port and remove the parts processing other elements of the text.
TesseractCaller is a transformer for mostly non-spatial purposes, where FME by itself and especially when empowered by all kinds of external tools, can perform really well.
What piece of history would you like to preserve today?