Typically an institution will store executed contracts in image form. The text versions (e.g. MS Word) are either not retained, or there is not enough confidence that the available text version is the final executed version of the contract rather than a slightly earlier draft.
It is hard to harness the power of machine search and data extraction without first converting large portfolios of legacy documentation into machine-readable form. This can be done using OCR (Optical Character Recognition) technology. Essentially, an OCR system analyses the structure of a document image and divides each page into paragraphs and tables. These in turn are broken down into their constituent lines, words and then characters. At this point, each isolated character can be compared against a set of pattern images using sophisticated algorithms. A vast number of probabilistic hypotheses are computed before the system presents the statistically most likely character stream.
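The final step described above, choosing the most likely character stream from competing hypotheses, can be illustrated with a minimal Python sketch. It is only a toy: the function name `most_likely_stream`, the data layout and the probability values are assumptions made for illustration, and a real OCR engine weighs far more evidence than a single score per candidate.

```python
from typing import Dict, List

# Hypothetical layout: for each character position, a mapping from
# candidate character to the probability that it matches the image.
CharHypotheses = Dict[str, float]


def most_likely_stream(positions: List[CharHypotheses]) -> str:
    """Pick the highest-probability candidate at each character position."""
    return "".join(max(h, key=h.get) for h in positions)


# Made-up scores for a scanned word; similar-looking glyphs compete.
scanned_word = [
    {"L": 0.90, "I": 0.07, "1": 0.03},
    {"o": 0.55, "0": 0.40, "c": 0.05},
    {"a": 0.80, "o": 0.15, "e": 0.05},
    {"n": 0.70, "r": 0.20, "m": 0.10},
]

print(most_likely_stream(scanned_word))  # prints "Loan"
```

Choosing each character independently is the simplest possible decision rule. It ignores context, which is why visually similar glyphs such as "o" and "0" remain a common source of error.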
By applying certain processing steps and taking into account the language and drafting style typically used in legal contracts, it is possible to convert image-based documents into machine-readable form in a cost-effective and accurate manner.
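The text does not say which steps are meant. One plausible way to take legal language into account is to favour candidate words that appear in a vocabulary of contract terms. The Python sketch below assumes this approach; the lexicon contents, the `boost` factor and the function name `best_word` are all hypothetical.

```python
from itertools import product
from math import prod
from typing import Dict, List, Set

# Hypothetical vocabulary of terms common in legal drafting.
LEGAL_LEXICON: Set[str] = {"lessee", "lessor", "hereby", "covenant"}


def best_word(positions: List[Dict[str, float]],
              lexicon: Set[str],
              boost: float = 5.0) -> str:
    """Score every candidate spelling; favour words found in the lexicon."""
    best, best_score = "", 0.0
    # Enumerate every combination of one candidate per character position.
    for combo in product(*(h.items() for h in positions)):
        word = "".join(ch for ch, _ in combo)
        score = prod(p for _, p in combo)  # joint probability of the spelling
        if word.lower() in lexicon:
            score *= boost  # assumed weighting for known legal terms
        if score > best_score:
            best, best_score = word, score
    return best


# Made-up scores: the digit "1" narrowly beats the letter "l" on shape alone.
scanned = [
    {"1": 0.55, "l": 0.45},
    {"e": 1.0},
    {"s": 1.0},
    {"s": 1.0},
    {"o": 0.9, "0": 0.1},
    {"r": 1.0},
]

print(best_word(scanned, LEGAL_LEXICON))  # prints "lessor"
print(best_word(scanned, set()))          # prints "1essor"
```

Enumerating every combination is only practical for short words with few candidates per position. It is used here to keep the idea visible, not as a recommendation for production use.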