Most current text recognition (OCR, optical character recognition) programs run only on Windows and cost around 100 euros or more.
Processing thousands or millions of pages, however, requires expensive volume licenses that are priced per scanned page.
The ArchivistaBox is a web-based DMS (document management system) that can be installed on any commercially available computer. Depending on the hardware used, the volume processed can range from several thousand to several million pages per day.
Release 2008/IX marks the launch of the first open source text recognition system that is able to generate searchable PDF files directly from scanned pages. More than 20 languages are available, and the recognition quality is comparable with that of commercial systems (over 99 percent).
Sensitive data can be encrypted before being made available. If required, the ArchivistaBox can create complete DVD publications.
All of the source code used in the ArchivistaBox is released under the GPLv2 license. The OCR engines used for text recognition are Tesseract (including Fraktur, i.e. blackletter, recognition) and the Linux port of Cuneiform (BSD license).
The hocr2pdf module (see http://www.exactcode.de) is used to generate the searchable PDF files.
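
The chain described above (an OCR engine produces recognized text with positions, and hocr2pdf combines it with the scanned image into a searchable PDF) can be sketched as follows. This is only an illustration, not the ArchivistaBox implementation: the helper names `build_ocr_commands` and `run_ocr` are hypothetical, the sketch assumes a Tesseract version that supports the `hocr` output configuration, and the name of the hOCR file Tesseract writes (`.hocr` here, `.html` in some older versions) depends on the installed version.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(image, lang="eng"):
    """Build the two command lines for one scanned page (hypothetical helper).

    Returns (tesseract_cmd, hocr2pdf_cmd, hocr_path).
    """
    image = Path(image)
    base = image.with_suffix("")            # e.g. scan.tif -> scan
    hocr_path = Path(str(base) + ".hocr")   # assumed output name, version dependent
    pdf_path = Path(str(base) + ".pdf")

    # Tesseract: recognize the page and write hOCR (text plus coordinates).
    tesseract_cmd = ["tesseract", str(image), str(base), "-l", lang, "hocr"]
    # hocr2pdf: place the hOCR text behind the image; hOCR is read from stdin.
    hocr2pdf_cmd = ["hocr2pdf", "-i", str(image), "-o", str(pdf_path)]
    return tesseract_cmd, hocr2pdf_cmd, hocr_path


def run_ocr(image, lang="eng"):
    """Run both steps; requires tesseract and hocr2pdf to be installed."""
    tesseract_cmd, hocr2pdf_cmd, hocr_path = build_ocr_commands(image, lang)
    subprocess.run(tesseract_cmd, check=True)
    with open(hocr_path, "rb") as hocr_file:
        subprocess.run(hocr2pdf_cmd, stdin=hocr_file, check=True)
```

Calling `run_ocr("scan.tif", lang="deu")` would, under these assumptions, leave a searchable `scan.pdf` next to the scanned image. Keeping the command construction separate from execution makes it easy to inspect the command lines before running them on a large batch.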
