Swiss Firm Launches World's First Open Source Text Recognition With Searchable PDF Files Functionality

With their launch of the ArchivistaBox 2008/IX, Archivista, a Swiss open source software company, has released the only open source text recognition software worldwide that can create searchable PDF files.

Most current text recognition, or OCR (optical character recognition), programs run only on Windows systems and cost from around 100 euros upwards.

When thousands or millions of pages have to be processed, however, expensive volume licenses priced per scanned page are required.

The ArchivistaBox is a web-based DMS (document management system) that can be installed on any commercially available computer. Depending on the hardware used, the volume processed ranges from several thousand to several million pages per day.

The 2008/IX release marks the launch of the first open source text recognition system that is able to generate searchable PDF files directly from scanned pages. More than 20 languages are available, and the recognition quality (>99 percent) is comparable with that of commercial systems.

PDF files generated with the ArchivistaBox are stored in an Archivista database and automatically indexed, so that the whole document stock can be searched. Scanned documents can be called up in a web browser at any time.

Sensitive data can be encrypted before being made available. If required, the ArchivistaBox can create complete DVD publications.

All of the source code used in the ArchivistaBox is licensed under the GPLv2. The OCR engines used for text recognition are Tesseract (including Fraktur / blackletter recognition) and the Linux port of Cuneiform (BSD licence).

The hocr2pdf module (see http://www.exactcode.de) is used to generate the searchable PDF files.
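
The OCR-to-searchable-PDF chain described above can be sketched roughly as follows. This is a minimal illustration, not how the ArchivistaBox itself wires the tools together: the helper names (build_ocr_commands, run_pipeline) are hypothetical, and the command-line flags follow the usage documented for the Linux Cuneiform port and for hocr2pdf as far as is known here, so they should be checked against the installed versions. The OCR engine writes hOCR (recognised text plus word positions), and hocr2pdf reads that hOCR on standard input and lays it invisibly over the scanned image.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(scan, lang="ger"):
    """Return (ocr_cmd, pdf_cmd, hocr_path) for one scanned page.

    Hypothetical helper: it only builds argument lists and runs nothing.
    Assumed flags: cuneiform -l <lang> -f hocr -o <out> <image>
                   hocr2pdf -i <image> -o <pdf>   (hOCR is read from stdin)
    """
    scan = Path(scan)
    hocr = scan.with_suffix(".hocr")
    pdf = scan.with_suffix(".pdf")
    ocr_cmd = ["cuneiform", "-l", lang, "-f", "hocr", "-o", str(hocr), str(scan)]
    pdf_cmd = ["hocr2pdf", "-i", str(scan), "-o", str(pdf)]
    return ocr_cmd, pdf_cmd, hocr


def run_pipeline(scan, lang="ger"):
    """Run OCR, then build the searchable PDF (requires both tools installed)."""
    ocr_cmd, pdf_cmd, hocr = build_ocr_commands(scan, lang)
    subprocess.run(ocr_cmd, check=True)
    # hocr2pdf takes the hOCR text layer on standard input.
    with open(hocr, "rb") as hocr_file:
        subprocess.run(pdf_cmd, stdin=hocr_file, check=True)
```

Calling run_pipeline("page.tif") would, under these assumptions, leave page.hocr and a searchable page.pdf next to the scan.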

Posted by Desire Athow on 19 Sept. 2008

Désiré Athow is the Content Editor for ITProPortal.com and has been writing tech articles for nearly a decade. You can follow him on Twitter.

Tags: Hardware, scanner