OCR and Electronic Text Collection

Who said you can't teach an old dog new tricks? We've taken dated OCR technology and brought it into modern times. Grooper's patented Synthetic OCR generates the most accurate text from images and electronic files, regardless of which OCR engine you use.

It all starts with image quality

Before any OCR action takes place, you'll want to make sure you're handing the OCR activity an image that is straight and free of artifacts. The key is to remove everything from the page that isn't text. Grooper lets you process images through a growing arsenal of exclusive tools and out-of-the-box profiles specifically designed for this task. The best part is these tools won't alter the original version of the image you want to permanently retain.

Examples

  • Remove lines
  • Ensure edges are clean
  • Remove small specks
  • Remove large non-text objects
  • Invert white-on-black zones
  • Remove hole punches

If at first you don't succeed...

Use Synthetic OCR

No matter how clean and pristine your images may appear, outdated OCR engines still have a difficult time collecting accurate text from images with multiple columns, different font sizes, and image shear. Grooper's patented OCR synthesis engine intelligently performs multiple passes of OCR on different portions of the image and Groops the results together as a single unit, keeping only the most accurate text results.

Iterative OCR

Iterative OCR is a technique we've developed to capture text that OCR engines simply miss the first time around. The idea is that we run a pass of OCR on the entire document, drop out any portions of the page where we were able to obtain text, then run another OCR iteration on the new image. Because the new image has far fewer distractions, OCR can more clearly find text it missed during the previous passes.
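
The loop described above can be sketched roughly as follows. This is only an illustration of the idea, not Grooper's implementation: the `ocr` callable, the `Word` tuple, the list-of-rows image format, and the `max_passes` limit are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from typing import Callable, List, NamedTuple

Image = List[List[int]]  # hypothetical format: rows of pixels, 0 = white, 1 = ink


class Word(NamedTuple):
    text: str
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def drop_out(image: Image, words: List[Word]) -> Image:
    """Return a copy of the image with every recognized word's box painted white."""
    cleaned = [row[:] for row in image]
    for w in words:
        for y in range(w.top, w.bottom):
            for x in range(w.left, w.right):
                cleaned[y][x] = 0
    return cleaned


def iterative_ocr(
    image: Image,
    ocr: Callable[[Image], List[Word]],
    max_passes: int = 3,  # assumed safety limit, not a documented setting
) -> List[Word]:
    """Run OCR, erase what was found, and repeat until a pass finds nothing."""
    found: List[Word] = []
    current = image
    for _ in range(max_passes):
        words = ocr(current)
        if not words:
            break
        found.extend(words)
        current = drop_out(current, words)
    return found
```

Any engine wrapper that returns words with bounding boxes could be passed in as `ocr`. The original image is never modified, because each pass works on a copy.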

Cellular Validation

Multi-column layouts present a unique challenge for OCR. Text on each side of a document may have different font sizes, or the lines of text may be slightly offset from each other. A standard OCR process can suffer a complete breakdown of accuracy on one of the two sides. Grooper's Cellular Validation OCR splits the image into a grid of multiple areas and OCRs them independently. The result is industry-leading accuracy when it comes to OCRing your documents.
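
As a rough sketch of the grid idea, the page can be divided into cells that are each handed to OCR separately. The function names, the fixed `cols` and `rows` parameters, and the `ocr_region` callable below are assumptions made for illustration. How Grooper chooses the grid and how it validates or merges the per-cell results is not described here.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def grid_cells(width: int, height: int, cols: int, rows: int) -> List[Box]:
    """Split a width x height page into rows x cols boxes that cover every pixel."""
    cells: List[Box] = []
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            cells.append((left, top, right, bottom))
    return cells


def ocr_by_cell(
    width: int,
    height: int,
    cols: int,
    rows: int,
    ocr_region: Callable[[Box], str],  # hypothetical: OCRs one region of the page
) -> List[str]:
    """OCR each cell on its own so one region's layout cannot disturb another's."""
    return [ocr_region(box) for box in grid_cells(width, height, cols, rows)]
```

For a two-column page, `grid_cells(width, height, cols=2, rows=1)` yields one box per column, so each column is recognized with its own font size and line spacing.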

Segment Reprocessing

The final synthesis task is to perform segment analysis. A "segment" is a small block or line of text on a page. If any segment gets a low OCR confidence score, Grooper independently re-runs OCR on that segment to obtain optimum quality.
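
A minimal sketch of this step, under stated assumptions: the `Segment` tuple, the 0.0 to 1.0 confidence scale, the `threshold` default, and the `reocr` callable are hypothetical, and keeping whichever result scores higher is one plausible policy rather than Grooper's documented behavior.

```python
from typing import Callable, Iterable, List, NamedTuple


class Segment(NamedTuple):
    text: str
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)


def reprocess_low_confidence(
    segments: Iterable[Segment],
    reocr: Callable[[Segment], Segment],  # hypothetical: re-runs OCR on one segment
    threshold: float = 0.8,               # assumed cutoff, not a documented value
) -> List[Segment]:
    """Re-run OCR on segments below the threshold and keep the better result."""
    result: List[Segment] = []
    for seg in segments:
        if seg.confidence < threshold:
            retry = reocr(seg)
            # Only accept the retry if it actually improved the confidence.
            if retry.confidence > seg.confidence:
                seg = retry
        result.append(seg)
    return result
```

Segments that already meet the threshold are passed through untouched, so the extra OCR work is spent only where the first result was doubtful.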

And when all else fails

We've got spell-correction

Powered by our Atomic RegEx engine, Grooper can perform OCR correction to fix some pretty ugly stuff.

Examples

  • Correct simple OCR mistakes in strings that don't match words in a language of your choice.
  • Fix existing, human-generated typos on documents.
  • Re-insert spaces where OCR falsely jammed multiple words together.
  • Delete strings of non-alphanumeric characters that resemble somebody's attempt at censorship, like "$#@! ^&*".
  • Repair numeric values where overly aggressive image cleanup has inadvertently removed commas and periods.
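
Two of the corrections listed above can be illustrated with ordinary regular expressions. The sketch below uses Python's standard `re` module, not Grooper's Atomic RegEx engine, and the pattern, the confusion table, and the function names are assumptions made for this example.

```python
import re

# Runs of three or more symbol characters, optionally chained by single spaces,
# such as "$#@! ^&*". Shorter runs (like "$" or "," in "$1,250.00") are left alone.
SYMBOL_RUN = re.compile(r"[^\w\s]{3,}(?:\s[^\w\s]{3,})*")

# Assumed table of digits that OCR commonly confuses with letters.
CONFUSIONS = str.maketrans({"0": "o", "1": "l", "5": "s"})


def strip_symbol_runs(text: str) -> str:
    """Delete long runs of symbols, then collapse the whitespace left behind."""
    cleaned = SYMBOL_RUN.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def fix_word(word: str, lexicon: set) -> str:
    """Swap look-alike digits for letters, but only if that yields a known word."""
    if word.lower() in lexicon:
        return word
    candidate = word.translate(CONFUSIONS)
    return candidate if candidate.lower() in lexicon else word
```

With a lexicon containing "total", `fix_word("t0tal", lexicon)` returns "total", while a genuine number such as "2019" is returned unchanged because the substituted form is not a known word. Note that this simple pattern would also strip an ellipsis, so a production rule set would need more care.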

Performance Balancing

Grooper's new “Run Speed” option gives you control to achieve an ideal balance between accuracy and performance.

Language Support

Grooper supports 35 languages, which can be individually enabled or disabled. It also performs automatic language detection.

Electronic Text

Grooper avoids OCR altogether when dealing with original text-based files like Word, Excel, and Text PDFs. Instead, Grooper pulls complete and perfect text directly out of the file.

Our secret blend of

PDF Text Extraction

PDF has become the most widely used document standard in the world. With that adoption comes a variety of challenges you'll have to face in order to get the best text from every page. Some PDFs are purely text-based, others are just images repackaged in PDF format, and still others have combinations of the two scattered throughout their pages.

Our hybrid approach

  • Grooper examines each page within a PDF to place the page into one of three categories: image-based, text-based, or mixed-content. Then each page is handled accordingly.
  • If a PDF page contains a single image which covers the entire page, it is considered an image-based page, and is processed using OCR.
  • If a PDF page contains no images, we extract only the raw text behind the page.
  • For mixed-content pages, each image on the page is extracted to a temporary image. Each temporary image is processed through OCR. Then the OCR results are merged with the native text.
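
The three-way decision above can be sketched as a small classifier. The `Page` structure, the box coordinates, and the `tolerance` used to decide whether an image "covers the entire page" are assumptions made for illustration; a real implementation would read image placements from the PDF itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in page units


@dataclass
class Page:
    width: float
    height: float
    image_boxes: List[Box] = field(default_factory=list)  # one box per placed image


def classify_page(page: Page, tolerance: float = 0.01) -> str:
    """Return 'text-based', 'image-based', or 'mixed-content' for one PDF page."""
    if not page.image_boxes:
        return "text-based"  # no images: extract the native text only
    if len(page.image_boxes) == 1:
        left, top, right, bottom = page.image_boxes[0]
        covers_page = (
            left <= page.width * tolerance
            and top <= page.height * tolerance
            and right >= page.width * (1 - tolerance)
            and bottom >= page.height * (1 - tolerance)
        )
        if covers_page:
            return "image-based"  # a single full-page image: OCR the whole page
    # Anything else: OCR each image and merge the results with the native text.
    return "mixed-content"
```

A scanned letter-size page would be `Page(612, 792, [(0, 0, 612, 792)])` and classify as image-based, while the same page with only a small logo image would fall into the mixed-content branch.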