
Tasks

Analyze Document

The Analyze Document task provides various methods of extracting information from documents, such as Optical Character Recognition (OCR), entity extraction, ID proofing, and classification.

Use Cases

  • Use this task to extract text from a document
  • Use it to extract a structured set of all the entities in a filled-in form
  • Use it to extract data from a South African ID document
  • Use it to classify a proof of payment document as belonging to a specific bank in South Africa
NOTE

If you specifically want to extract financial information from an invoice or receipt, use the Analyze Financial File task instead.

Basic usage

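The original example is not reproduced here, but a minimal sketch of the task input for the ocr processor type is shown below. The parameter names come from the Parameters section that follows; the nesting of files inside the ocr object and the placeholder fileuuid are assumptions.

    # Minimal sketch of an Analyze Document input using the ocr processor type.
    # Only the parameter names are taken from this page; the nesting of "files"
    # inside the "ocr" object and the placeholder UUID are assumptions.
    analyze_document_input = {
        "document_processor_type": "ocr",
        "ocr": {
            "params": {},  # optional processor-specific options
            "files": [
                {"fileuuid": "00000000-0000-0000-0000-000000000000"},  # placeholder UUID
            ],
        },
    }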

Parameters

document_processor_type
required
string

The type of analysis that you would like to perform on the file.

Current Options are:

  • ocr
    Extract text from a document with optical character recognition. The document can be a pdf or an image.
  • entity_extraction
    Extract entities and form fields from a document. The document can be a pdf or an image.
  • id_proofing
    Perform ID proofing on a document, returning verification signals for image manipulation, verification of the ID document, and checks for suspicious words. The document can be a pdf or an image.
  • classifier
    Classify a proof of payment document as belonging to a specific bank in South Africa. The document can be a pdf or an image.
  • parse_contents
    Parse the contents of a document. The document can be a pdf, pptx, docx, html, or xml file.

[document_processor_type]
required
object

The key of this object must match the value of document_processor_type.
This object contains all the options and params for the selected document_processor_type.

For example, if the document_processor_type is ocr, then there must be an ocr object with the required options for the ocr processor type.

Child attributes:

params

optional
object

This object contains all the options and params for the selected document_processor_type.

files

required
object[]

An array of file objects. Each object contains file metadata.
The most important key for these objects is the fileuuid.

Child attributes:
fileuuid
required
string

The unique identifier of the file that should be analyzed.


Result

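As a rough guide, a hedged sketch of the shape of an ocr result is shown below. The property names match the Properties list that follows; the nesting under payload and all values are illustrative assumptions.

    # Hedged sketch of an ocr result. Property names follow the Properties list
    # below; the nesting under "payload", the per-page object shape inside
    # "detected_languages", and all values are assumptions.
    ocr_result = {
        "payload": {
            "total_pages": 1,
            "ocr": {
                "full_text": "First line of extracted text ...",
                "detected_languages": [
                    {"page": 1, "languages": ["en"]},  # one entry per page (shape assumed)
                ],
                "all_languages": ["en"],
            },
        },
    }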

Properties

payload

object

The returned payload


total_pages

integer

Number of pages analyzed


ocr

json object

Contains all the analysis data for the ocr document_processor_type. This object will change to correspond to the document_processor_type.


ocr.full_text

string

Contains all the text that the analysis could extract from the file.


ocr.detected_languages

json array

An array of objects. Each object in the array is a page of the file and indicates which languages were detected on that page of the document.


ocr.all_languages

json array

An array containing language codes for all languages detected in the document.


OCR

The OCR processor type is shown in the Basic usage example above.

Entity Extraction

This document_processor_type is used to extract entities and form fields from a document.

An example would be a document that has an organization name and fields containing information, such as a receipt, invoice, or application form.

It will attempt to identify the name and value of each field and create a key-value pair. It will do this for every form field that it detects.

Here is an example of a form: it's a fax cover sheet with a set of fields and an entry for each field. In the Form Extraction Example image, the extracted key-value pairs are shown on the left and the source image on the right.

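A hedged sketch of what the entity_extraction input might look like; as with the OCR example, the nesting of files and the placeholder fileuuid are assumptions.

    # Sketch of an entity_extraction input; the parameter names are documented on
    # this page, the nesting of "files" and the placeholder UUID are assumptions.
    analyze_document_input = {
        "document_processor_type": "entity_extraction",
        "entity_extraction": {
            "files": [
                {"fileuuid": "00000000-0000-0000-0000-000000000000"},  # placeholder UUID
            ],
        },
    }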

You will notice the document_processor_type is now entity_extraction, so there should be a corresponding object with the same name referring to the file it should analyze.

Result

The returned result would look something like this. The full response is quite large, so only a sample of the important data is shown.

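A hedged sketch of the shape of an entity_extraction result is shown below; the entity and form-field object shapes are assumptions, while the example values (Staples, Date, From) come from the property descriptions that follow.

    # Hedged sketch of an entity_extraction result. The exact shapes of the
    # entity and form-field objects are assumptions; the example values are
    # taken from the property descriptions below.
    entity_extraction_result = {
        "payload": {
            "total_pages": 1,
            "entity_extraction": {
                "full_text": "Fax Cover Sheet ...",
                "detected_languages": [{"page": 1, "languages": ["en"]}],  # shape assumed
                "entities": [
                    {"type": "organization", "value": "Staples"},  # shape assumed
                ],
                "form_fields": {
                    "pages": [
                        {"Date": "04/12/2021", "From": "Jane Doe"},  # shape assumed
                    ],
                },
            },
        },
    }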

The result payload is similar to the first example with a few additions:

entity_extraction

json object

Contains all the analysis data for the entity_extraction document_processor_type. This object will change to correspond to the document_processor_type.

Please see the OCR example above for explanations of the detected_languages and full_text objects and key values.


entity_extraction.entities

array

Contains an object for each entity detected, e.g. organization: Staples.


entity_extraction.form_fields.pages

array

Contains an object for each page.

Each page is an object in the array, and each page object contains an object for each form field detected, e.g. Date: 04/12/2021, From: Jane Doe.


ID Proofing

This is used to extract all the information from South African identity documents, i.e. the green barcoded ID book, Smart ID card, or passport.

It does not only extract the information; it also performs fraud checks to indicate whether the document is valid or fraudulent.

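A hedged sketch of what the id_proofing input might look like; the nesting of files and the placeholder fileuuid are assumptions.

    # Sketch of an id_proofing input; the nesting of "files" and the placeholder
    # UUID are assumptions.
    analyze_document_input = {
        "document_processor_type": "id_proofing",
        "id_proofing": {
            "files": [
                {"fileuuid": "00000000-0000-0000-0000-000000000000"},  # placeholder UUID
            ],
        },
    }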

You will notice the document_processor_type is now id_proofing, so there should be a corresponding object with the same name referring to the file it should analyze.

Result

The returned result would look something like this. The full response is quite large, so only a sample of the important data is shown.

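A hedged sketch of the shape of an id_proofing result; the fraud_signals_suspicious_words check comes from the property descriptions below, while the overall nesting and the remaining values are assumptions.

    # Hedged sketch of an id_proofing result. The fraud_signals_suspicious_words
    # entry is documented below; the nesting and all other values are assumptions.
    id_proofing_result = {
        "payload": {
            "total_pages": 1,
            "id_proofing": {
                "full_text": "Extracted ID document text ...",
                "id_proofing_signals": {
                    "fraud_signals_suspicious_words": "PASS",
                    # further checks omitted
                },
                "all_signals_passed": True,
            },
        },
    }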

The result payload is similar to the first example with a few additions:

id_proofing

json object

Contains all the analysis data for the id_proofing document_processor_type. This object will change to correspond to the document_processor_type.

Please see the OCR example above for explanations of the detected_languages and full_text objects and key values.


id_proofing.id_proofing_signals

object

Contains an entry for each ID proofing check performed and its result, e.g. fraud_signals_suspicious_words: PASS.

This indicates that the suspicious words check passed inspection: no suspicious words were found.


id_proofing.all_signals_passed

boolean

Indicates whether all the ID proofing checks passed. If any check fails, this will be false.


Parse Contents

This document_processor_type is used to parse the text from a document. The document can be a pdf, pptx, docx, html, or xml file.

It uses a Large Language Model (LLM) to parse the text from the document.

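A minimal hedged sketch of a parse_contents input is shown below; a fuller example using the optional params follows the parameter descriptions. The nesting of files and the placeholder fileuuid are assumptions.

    # Minimal sketch of a parse_contents input; the nesting of "files" and the
    # placeholder UUID are assumptions.
    analyze_document_input = {
        "document_processor_type": "parse_contents",
        "parse_contents": {
            "files": [
                {"fileuuid": "00000000-0000-0000-0000-000000000000"},  # placeholder UUID
            ],
        },
    }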

Parameters

These parameters are specific to the parse_contents processor type.

parse_contents.sites

optional
object[]

An array of site objects. Each object contains site metadata.
The most important key for these objects is the url.

Only sites or files can be used, not both.

Child attributes:
url
required
string

The URL of the site that should be parsed. Will only parse this specific page.

parse_contents.params

optional
object

Options that control how the document is parsed.

Child attributes:

parse_mode

optional
string

The mode of parsing to use. One of fast, accurate, or premium.

Default: accurate

  • fast: Best for text-only PDFs.
  • accurate: Performs OCR, image extraction, and table/heading identification. Ideal for complex documents with images.
  • premium: Same features as accurate, and additionally outputs equations in LaTeX format and diagrams in Mermaid format.

result_type

optional
string

The type of result to return. One of text, markdown, json.

Default: markdown

markdown and json are only available in accurate and premium mode.

parsing_instruction

optional
string

The prompt to provide to the LLM to help it understand the context of the document. Applied to each page of the parsed document.

disable_ocr

optional
boolean

Skip the OCR step. Useful for text-only PDFs as it will speed up the parsing process.

skip_diagonal_text

optional
boolean

Skip any text that is not horizontal or vertical.

do_not_unroll_columns

optional
boolean

By default multiple columns of text are parsed as a single continuous block. Set this to true to keep the columns separate.
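Putting the parameters above together, a hedged sketch of a parse_contents input that parses a single site with explicit params might look like this. The URL and parsing_instruction values are hypothetical; the nesting mirrors the parse_contents.sites and parse_contents.params prefixes used above.

    # Hedged sketch combining the parse_contents parameters documented above.
    # The URL and parsing_instruction values are hypothetical placeholders.
    parse_contents_input = {
        "document_processor_type": "parse_contents",
        "parse_contents": {
            "sites": [
                {"url": "https://example.com/page"},  # use sites or files, not both
            ],
            "params": {
                "parse_mode": "accurate",      # one of: fast, accurate, premium
                "result_type": "markdown",     # one of: text, markdown, json
                "parsing_instruction": "This document is a product brochure.",
                "disable_ocr": False,
                "skip_diagonal_text": False,
                "do_not_unroll_columns": False,
            },
        },
    }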

Result

The returned result would look something like this. The full response is quite large, so only a sample of the important data is shown.

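A hedged sketch of the shape of a parse_contents result; only full_text is documented below, so the surrounding nesting and the sample value are assumptions.

    # Hedged sketch of a parse_contents result; only "full_text" is documented
    # on this page, the nesting under "payload" and the sample value are assumptions.
    parse_contents_result = {
        "payload": {
            "full_text": "# Document title\n\nParsed document text in markdown ...",
        },
    }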

full_text

string

Contains all the parsed text from the document, typically in the format specified by result_type.