16.2.0
📅 2024-04-09
New features
N/A
Improvements
Improved Excel output
With this release, the layout and content of XLSX documents reach a higher level of quality.
This is made possible by:
New OCR property: TableDetectionMode
A new property has been implemented: COcrPageParams.TableDetectionMode.
This property lets you change the way tables are detected during the OCR step.
There are three possible behaviors:
- TableDetectionMode.Automatic: the system analyses the contents of the page and decides whether or not there are tables on the page. This is the default behavior, similar to the previous release.
- TableDetectionMode.ForceSingleTable: the system tries to interpret the entire page as a single table.
  - This option is very useful if you already know at the time of OCR that you are going to convert your document to XLSX.
  - If you need to convert to formats other than XLSX, it may be preferable to perform two OCR operations, one with TableDetectionMode.ForceSingleTable (for XLSX) and the other with TableDetectionMode.Automatic (for other output formats).
  - Note that this mode may fail to compute a single grid for the content of the input page. This should not happen for documents structured as tables (which is the target use case for XLSX conversion), but may occur for more complex documents or if the page was captured at a perspective angle. If the computation of a single grid fails on a given page, the system will fall back to the Automatic mode for that page.
  - A known limitation of the single table mode is that Asian text written in a top-to-bottom direction will not be detected as such, and will therefore be split over several cells. We will investigate how to remove this limitation in the future.
- TableDetectionMode.Disabled: the system does not detect tables in documents.
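The mode selection and the per-page fallback described above can be sketched as follows. This is a small Python model of the decision logic only, not the iDRS API: the enum is a hypothetical stand-in that reuses the three mode names from this release, and the helpers `modes_for_outputs` and `effective_mode` are illustrative names that do not exist in the SDK.

```python
from enum import Enum


class TableDetectionMode(Enum):
    """Hypothetical stand-in mirroring the three modes named in these notes."""
    AUTOMATIC = "Automatic"
    FORCE_SINGLE_TABLE = "ForceSingleTable"
    DISABLED = "Disabled"


def modes_for_outputs(output_formats):
    """Group the requested output formats by the OCR mode suggested for them.

    XLSX is paired with ForceSingleTable and every other format with
    Automatic, so a job mixing XLSX with other formats needs two OCR passes.
    (Illustrative helper, not part of the SDK.)
    """
    plan = {}
    for fmt in output_formats:
        if fmt.lower() == "xlsx":
            mode = TableDetectionMode.FORCE_SINGLE_TABLE
        else:
            mode = TableDetectionMode.AUTOMATIC
        plan.setdefault(mode, []).append(fmt)
    return plan


def effective_mode(requested, single_grid_found):
    """Model the per-page fallback: ForceSingleTable degrades to Automatic
    when no single grid can be computed for the page.
    (Illustrative helper, not part of the SDK.)
    """
    if requested is TableDetectionMode.FORCE_SINGLE_TABLE and not single_grid_found:
        return TableDetectionMode.AUTOMATIC
    return requested


plan = modes_for_outputs(["xlsx", "docx", "pdf"])
for mode, formats in plan.items():
    print(mode.value, "->", formats)
```

In this model, each key of `plan` corresponds to one OCR pass, and `effective_mode` is evaluated page by page, which matches the statement that the fallback applies only to the page on which the single grid could not be computed.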
Improved presentation of “XLSX RecreateInput”
In addition, several fixes and fine-tuning adjustments have been made to the SpreadsheetLayout.RecreateInput layout to improve visual quality.
In summary, the following changes have been implemented:
- For cells:
  - The computation of text alignment and indentation has been optimized.
  - Numeric values (including amounts) are correctly detected and the cell format is adjusted accordingly.
  - Text wrapping has been disabled to avoid potential hidden content.
- For textboxes:
  - Positioning and dimensions have been improved to better match the input file.
  - Text positioning in the textbox has been corrected to fit correctly, avoiding text breaks.
  - Text indentation in the textboxes has been reviewed.
  - The background color of textboxes will be applied unless it is likely to obscure other elements of the page.
Fixed bugs
| ID | Description |
|---|---|
| IDRSRD-9227 | IDRSRD-8280 The cell’s left/right padding computed during OCR should be taken into account for XLSX output |
| IDRSRD-9199 | The iDRS doesn’t properly recognize specific words on several documents |
| IDRSRD-9188 | The iDRS always runs full page barcode detection when a work image is set in the page |
| IDRSRD-9177 | idrspdf16.dll is missing its version information |
| IDRSRD-9174 | The iDRS should take paragraph margins into account for textbox dimensioning and positioning in XLSX output |
| IDRSRD-9171 | The iDRS generates office documents with incorrect textbox positioning |
| IDRSRD-9168 | The iDRS throws an exception when converting specific documents to DOCX using the new segmentation |
| IDRSRD-9158 | The iDRS throws an exception during OCR when processing a specific image |
| IDRSRD-8310 | The iDRS takes a very long time to detect QR codes on a specific image |
| IDRSRD-8306 | The iDRS throws an exception when recognizing a specific image |
| IDRSRD-8304 | The iDRS throws an exception when creating a PDF with document separation criteria, from a specific set of pages |
| IDRSRD-8298 | The iDRS throws an exception when recognizing specific images with new segmentation |
| IDRSRD-8296 | The iDRS throws an exception when converting a specific image to XLSX |
| IDRSRD-8284 | The iDRS recognizes ‘q’ instead of ‘g’ if the descender touches an underline, when recognizing a specific image |
| IDRSRD-8281 | Content of textboxes in XLSX output may spill onto the next line |
| IDRSRD-7467 | The documentation page describing the set of files needed for the language detection feature is incorrect |
| IDRSRD-7094 | XLSX cells containing a value + amount should be registered as numeric content |
| IDRSRD-7023 | The iDRS should keep binarized image after OCR whenever possible |
| IDRSRD-6853 | The iDRS detects graphic shapes with a zero pixels height on a specific image |
| IDRSRD-6786 | The iDRS crashes when recognizing a specific image |
| IDRSRD-6504 | The iDRS freezes when running OCR on a small image with ThreadingMode activated |
| IDRSRD-6465 | Polygon inspection helpers are not exposed anymore in CPolygon |
| IDRSRD-6462 | iDRS new page segmentation provides incorrect line coordinates and takes a long time |
| IDRSRD-6457 | The iDRS with new page segmentation doesn’t detect a specific zone |
| IDRSRD-6449 | iDRS OCR results vary when CPageRecognition is reused |
| IDRSRD-6442 | The iDRS crashes when reading a 1px height zonal barcode |
| IDRSRD-6415 | .NET CIDRSException should hold relevant info in its Message property |
| IDRSRD-6384 | Reaching iDRS maximum memory limit during Arabic OCR causes a memory leak |
| IDRSRD-6382 | The iDRS throws an exception when processing a specific image |