
16.2.5

📅 2025-03-07

This release introduces a new page segmentation algorithm that improves on the previous one in several ways and raises the overall quality of text and layout detection.

The change is fully transparent. It is not necessary to update your integration to benefit from it.

Detection of documents with complex layouts


The new page segmentation is expected to perform generally better than the previous one on documents with complex, unstructured or semi-structured layouts.

This typically covers the following types of documents:

  • Invoices
  • Camera images
  • Flyers
  • Magazines

The new algorithm improves detection of the structure of tables and of content organized as a table. Any type of document can benefit from this improvement, although it will be most visible on invoices, which usually follow an implicit grid-like layout.

Detection of inverted text (i.e. light text on dark background) is also improved. This type of pattern was a known weakness of the previous segmentation algorithm, and is now properly supported by the new one.

Increased maximum page size supported by OCR engine


The new page segmentation implementation made it possible to raise the maximum page size supported by the OCR engine from 75 to 559 million pixels. The new limit is enough, for instance, to recognize A0 pages scanned at 600 dpi.

From an integration point of view, the class CImageLimits has been updated to reflect this new threshold.
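As a rough check of the A0 figure above, the pixel count of a scanned page can be estimated from its physical size and scan resolution. The sketch below is plain Python and independent of the iDRS API. The function names are hypothetical, and treating the limits as exactly 75,000,000 and 559,000,000 pixels is an assumption, since these notes only give rounded figures.

```python
MM_PER_INCH = 25.4

# Assumed exact values for the rounded limits quoted in these notes.
OLD_LIMIT_PIXELS = 75_000_000
NEW_LIMIT_PIXELS = 559_000_000


def scanned_pixel_count(width_mm: float, height_mm: float, dpi: int) -> int:
    """Estimate the pixel count of a page of the given size scanned at `dpi`."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px


def fits(width_mm: float, height_mm: float, dpi: int, limit_pixels: int) -> bool:
    """Return True if the estimated pixel count does not exceed the limit."""
    return scanned_pixel_count(width_mm, height_mm, dpi) <= limit_pixels


# ISO A0 is 841 mm x 1189 mm.
print(scanned_pixel_count(841, 1189, 600))     # 557976342
print(fits(841, 1189, 600, NEW_LIMIT_PIXELS))  # True
print(fits(841, 1189, 600, OLD_LIMIT_PIXELS))  # False
```

Under these assumptions, an A0 page at 600 dpi comes to roughly 558 million pixels, just under the new limit and far above the previous one.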

Lastly, unlike the previous algorithm, the new segmentation can detect overlapping content. Such overlaps are fairly common, for instance on magazine pages where text is printed on top of a picture, or in tables with pictures inside cells.

The output module of iDRS has been adapted to properly support overlapping elements in all affected output formats.

In a nutshell, this additional capability yields a better layout decomposition and especially improves the visual quality of word processor outputs (docx, rtf, …).
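To make "overlapping" concrete, the sketch below models zones as axis-aligned rectangles and tests whether two of them intersect. It is only an illustration: the `Zone` class and its fields are hypothetical and do not reflect the iDRS data model or the way the segmentation itself detects overlaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Hypothetical rectangular zone in pixel coordinates (right/bottom exclusive)."""
    left: int
    top: int
    right: int
    bottom: int


def overlaps(a: Zone, b: Zone) -> bool:
    """Return True if the two rectangles share at least one pixel."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


picture = Zone(0, 0, 1000, 800)
headline = Zone(100, 600, 900, 700)  # text printed on top of the picture
footer = Zone(0, 900, 1000, 950)     # separate zone below the picture

print(overlaps(picture, headline))  # True
print(overlaps(picture, footer))    # False
```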

The previous iDRS HTML output has been fully replaced by a new XHTML output. The new XHTML writer provides the following improvements over the previous one:

  • Compliance with XHTML standard
  • Support of overlapping zones
  • More precise positioning of elements
  • Optimized CSS management
  • Better handling of UTF-8 characters

Performance of High Quality OCR for Japanese language


The Japanese High Quality OCR network has been updated to consume less memory and run faster. We measured time savings of up to 40% on internal test sets.

This version includes an updated Korean engine that supports Hanja characters. Hanja are Chinese characters that were used as the writing script for the Korean language before the widespread adoption of Hangul. They are still used in modern Korean, for example, to represent names.
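When checking whether recognized Korean text contains Hanja, characters can be classified by Unicode block. The sketch below is plain Python and does not use the iDRS API; the function names are hypothetical. It assumes that only the Hangul Syllables block (U+AC00 to U+D7A3) and the main CJK Unified Ideographs block (U+4E00 to U+9FFF) matter, so ideographs in extension or compatibility blocks would be reported as "other".

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character as 'hangul', 'hanja' or 'other' by code point."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3:   # Hangul Syllables block
        return "hangul"
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs (main block only)
        return "hanja"
    return "other"


def count_scripts(text: str) -> Counter:
    """Count how many characters of each script a string contains."""
    return Counter(script_of(ch) for ch in text)


# The same name written in Hanja and in Hangul, separated by a space.
counts = count_scripts("大韓民國 대한민국")
print(counts["hanja"], counts["hangul"], counts["other"])  # 4 4 1
```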

This iDRS release includes a significant number of fixes and fine-tuning for DOCX output, especially for the Editable and Exact layouts.

As a result, thanks to this fine-tuning effort and the new segmentation update, the quality of the DOCX conversion performed by iDRS has been significantly improved.

The behavior of the charset limitation feature has changed:

The OCR engine now interprets an entire line using only the characters included in the charset, instead of recognizing the line first and then replacing excluded characters with a reject character.

As a result:

  • Lines may or may not contain reject characters.
  • A line may be erased if the OCR engine cannot interpret a significant portion of it because the characters it needs are excluded from the charset.
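The difference between the two behaviors can be illustrated with a toy model. The sketch below is not the iDRS implementation: it assumes each character position comes with a ranked list of candidates, uses `~` as a hypothetical reject character, and erases a line when more than a hypothetical `max_reject_ratio` of its positions cannot be read. All names are illustrative.

```python
REJECT = "~"  # hypothetical reject character, chosen for this sketch only


def old_behavior(candidates, charset):
    """Keep the top candidate per position, then mask characters outside the charset."""
    return "".join(
        ranked[0] if ranked[0] in charset else REJECT for ranked in candidates
    )


def new_behavior(candidates, charset, max_reject_ratio=0.5):
    """Pick the best candidate that belongs to the charset; erase unreadable lines."""
    out = []
    for ranked in candidates:
        allowed = [c for c in ranked if c in charset]
        out.append(allowed[0] if allowed else REJECT)
    line = "".join(out)
    # Assumed rule: drop the line when too many positions have no allowed reading.
    if line and line.count(REJECT) / len(line) > max_reject_ratio:
        return ""
    return line


digits = set("0123456789")

# Each inner list holds the ranked candidates for one character position.
ambiguous = [["1", "l"], ["O", "0"], ["5", "S"]]
print(old_behavior(ambiguous, digits))  # 1~5
print(new_behavior(ambiguous, digits))  # 105

mostly_letters = [["A"], ["B"], ["7", "T"]]
print(old_behavior(mostly_letters, digits))        # ~~7
print(repr(new_behavior(mostly_letters, digits)))  # ''
```

In this model, the first line no longer contains a reject character because the engine falls back to an allowed reading, while the second line is erased because most of it has no reading within the charset.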
Internal ID | Description | Service desk IDs
IDRSRD-9741 | word spacing of DOCX output can be improved for Thai justified text |
IDRSRD-9727 | Docx conversion results on customer test set should be improved |
IDRSRD-9726 | Text lines sometimes are incorrectly merged as paragraphs in DOCX output |
IDRSRD-9723 | Zonal OCR of a specific image may return empty results depending on the zone size |
IDRSRD-9711 | Crash when loading a corrupted jpg image |
IDRSRD-9705 | Page analysis allowed languages are not taken into account when language detection is turned off, resulting in reduced orientation detection accuracy |
IDRSRD-9698 | The iDRS creates DOCX outputs with incorrect URI links |
IDRSRD-9695 | The iDRS generates overlapping text results on a specific Japanese image |
IDRSRD-9688 | The iDRS sometimes sets incorrect textbox right indent for Docx Editable output |
IDRSRD-9683 | The iDRS XHTML NoLayout output can be improved |
IDRSRD-9681 | OCR accuracy on 100dpi images is degraded with new page segmentation |
IDRSRD-9679 | OCR engine freeze on an Arabic image |
IDRSRD-9677 | The iDRS creates incorrect DOCX output when containing Top to Bottom text |
IDRSRD-9656 | The iDRS new segmentation find text columns with zones going upwards |
IDRSRD-9653 | The iDRS SDK crashes when running OCR on a specific document |
IDRSRD-9647 | Crash when running OCR intel arch on arm macOS |
IDRSRD-9640 | The new segmentation crashes when running OCR on chinese followed by japanese |
IDRSRD-9634 | OCR engine returns some 0-sized elements when recognizing Arabic and Farsi documents |
IDRSRD-9614 | CPageProcessing must be optimized |
IDRSRD-9604 | iDRS16 .NET generates new object ids for provided array elements |
IDRSRD-9593 | Update the iDRS to have new segmentation and overlapping zones activated by default |
IDRSRD-9587 | The iDRS uses incorrect indentation when creating DOCX output with right-to-left text |
IDRSRD-9585 | The iDRS should expose an option to downscale input if needed, when outputting Word document |
IDRSRD-9580 | Text display of DOCX created with iDRS can be improved |
IDRSRD-9569 | The new segmentation doesn’t recognize underscore symbol | ISD-35641
IDRSRD-9551 | Japanese HQOCR misses several characters with inverted colors on a specific image |
IDRSRD-9536 | The new segmentation considers isolated dash characters as graphics |
IDRSRD-9535 | Detection of table header row is incorrect on a specific image |
IDRSRD-9526 | The new segmentation often misses comma signs | ISD-35429
IDRSRD-9521 | The iDRS cannot load large pdf document | ISD-34080
IDRSRD-9495 | Header row of clear table is not properly recognized |
IDRSRD-9487 | The new page segmentation crash when processing a specific image | ISD-35253
IDRSRD-9486 | String class should support conversion from/to utf16-encoded strings using char16_t and wchar_t data types |
IDRSRD-9469 | Detection of graphic lines is inaccurate |
IDRSRD-9458 | Memory consumption of HQOCR Japanese is huge on a specific image |
IDRSRD-9398 | Re-introduce support of Hanja in Korean OCR | ISD-36142
IDRSRD-9358 | Memory consumption of iDRS PDF loading can be improved |
IDRSRD-9336 | Processing time for Japanese language is degraded with iDRS 16, compared to iDRS 15 |
IDRSRD-9323 | iDRS detects justified Korean text as several paragraphs containing a single character | ISD-34474
IDRSRD-9301 | The new page segmentation makes substitution errors between ‘O’ and ‘0’ | ISD-34256
IDRSRD-9280 | The iDRS doesn’t properly write tabulation entries in DOCX Editable and Exact layouts |
IDRSRD-9190 | The iDRS does not recognize clear text next to graphic zone |
IDRSRD-9179 | The iDRS fails to detect columns on specific Korean image | ISD-33936
IDRSRD-8301 | The iDRS misrecognizes . (dot) in specific Japanese documents |
IDRSRD-6495 | The iDRS orientation detection gives wrong results on specific files. |
IDRSRD-6450 | The iDRS 16 should be updated to zlib 1.3.1 | ISD-35377
IDRSRD-6433 | The iDRS incorrectly detects vertical text with zonal OCR, on a specific image |
IDRSRD-6378 | Layout of docx output created by iDRS can be improved, when the document contains narrow columns of text |
IDRSRD-6354 | Creation of XLSX with layout RecreateInput fails for a specific image when using new page segmentation |
IDRSRD-6352 | The iDRS new page segmentation outputs extremely small font size for Hebrew characters |
IDRSRD-6350 | The iDRS new page segmentation doesn’t handle a clearscan Hebrew document |
IDRSRD-6345 | The iDRS detects non-existing I2OF5 barcodes on a specific document | ISD-9456
IDRSRD-6335 | The iDRS wrongly detects the text of a specific table cell | ISD-31823
IDRSRD-6331 | The iDRS does not respect the tabulation when processing a pdf into docx format |
IDRSRD-6320 | Paragraph spacing of DOCX created by iDRS are not correct when converting a specific image |
IDRSRD-6292 | The iDRS should support images larger than 75M pixels | ISD-31121
IDRSRD-6165 | A full page table is causing an unexpected page break when converting specific image to DOCX output |
IDRSRD-6079 | Alignment of bullet lists in iDRS DOCX output is incorrect |
IDRSRD-5933 | Hanja characters no longer part of the Korean charset with latest Korean OCR engine | ISD-36198, ISD-36142
IDRSRD-2988 | The iDRS does not detect the border line correctly when converting a TIF to docx. | ISD-8043
Internal ID | Description | Service desk IDs
IDRSRD-9628 | Language detection feature requires really unexpected resources |
IDRSRD-9392 | The new page segmentation breaks down clear pictures |
IDRSRD-9754 | The iDRS is not compatible with VirtualBox VMs running on Windows Hosts |

OCR resources required by language detection feature


Currently, the language detection feature requires the OCR lexicon files (.ilex extension) for all languages included in the allowed list (see property CPageAnalysisParams.AllowedLanguages). This issue will be fixed in the next iDRS release.

Note that if the allowed languages list is empty (default behavior), then all languages allowed by licensing are considered allowed.

The new page segmentation tends to create graphic zones with non-rectangular boundaries around pictures, although the output would look better with rectangular zones.

Detection of graphic zone boundaries will be reworked and improved in a future release.

This release has a compatibility issue with the Oracle VirtualBox virtualization software, which prevents it from running properly on Windows host systems, whatever the guest system.

VMware virtualization software is not affected by this issue.

This will be addressed in the next release.