How to Train Your Own Extraction Model

Even though we already offer extraction models for many different document types, we realise that many work processes involve other document types as well. That is why you can now create your own extraction model. You decide which document types are involved and which contents are extracted from them.

At natif.ai, we have many models that extract relevant information from documents such as invoices, receipts, vehicle registrations, and delivery notes. For each of these document types, our models extract the specific information needed for further processing, such as the tax number or the bank account.
However, we understand that your business is not limited to these document types, and that extracting information from other documents will allow you to automate many more business processes. So instead of waiting for us to build a model for every document type – which may or may not include all the fields you want – you can get in the driver’s seat and build your own model!
Our new Train-Your-Own-Extraction-Model feature allows you to build a model that extracts the information you specify from the documents you want. In this post, we’ll walk you through the steps and show you how easy extracting information from your documents can be!

Let’s Start

We start in the Workflow Overview of our platform and choose “Train Your Own Model Now”.

Select Your Workflow

Here you can find all our Custom AI Workflows. For our Custom Extraction Workflow, we select “Create Custom Extraction”.

Describe Your Custom Extraction Workflow

We start by giving our workflow a name and a short description. You can also upload an image, which helps you distinguish this workflow from the others.
In our case, we want to create an extraction workflow for bank statements.

Define The Data Fields

Now we need to define all the data fields we want to extract from our documents.
This is the foundation of our model! The workflow will only extract the fields we create in this step.
You can choose between different kinds of data fields, such as text and numbers, or advanced fields like tables.

All different data fields are explained in this tutorial.
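
To make the idea of a field definition concrete, here is a minimal sketch of how the fields for a bank statement workflow could be written down before entering them in the platform. The `DataField` class, the kind names and the field names are hypothetical and chosen for illustration only; they are not the platform's own format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical kinds, mirroring the ones named in the text (text, numbers, tables).
FIELD_KINDS = {"text", "number", "table"}


@dataclass
class DataField:
    name: str
    kind: str
    # Column definitions; only meaningful for table fields.
    columns: List["DataField"] = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in FIELD_KINDS:
            raise ValueError(f"unknown field kind: {self.kind}")
        if self.kind != "table" and self.columns:
            raise ValueError("only table fields can have columns")


# Illustrative field list for bank statements (names are assumptions).
bank_statement_fields = [
    DataField("account_holder_name", "text"),
    DataField("iban", "text"),
    DataField("closing_balance", "number"),
    DataField("transactions", "table", columns=[
        DataField("date", "text"),
        DataField("description", "text"),
        DataField("amount", "number"),
    ]),
]


def field_names(fields):
    """Return a flat list of everything the workflow would be asked to extract."""
    names = []
    for f in fields:
        names.append(f.name)
        names.extend(f"{f.name}.{c.name}" for c in f.columns)
    return names
```

Writing the list down like this before you start makes it easier to check that nothing is missing, since the workflow only extracts the fields created in this step.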

Specify Your Documents

Now we have to give the AI some information about our documents so it knows which tasks need to be done. This also improves the accuracy of your workflow.

For a Custom Extraction Workflow the AI needs to know:
– Are the documents always perfectly cropped or should they be cropped in the workflow?
– Is the content of the documents written in Latin or Japanese script?
– Is the text printed or handwritten? Or can it be both?
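
The three questions above amount to a small set of settings. As a rough sketch, they could be noted down like this; the `DocumentSettings` class, its attribute names and the example values are assumptions made for illustration, not the platform's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentSettings:
    needs_cropping: bool  # False if the documents are always perfectly cropped
    script: str           # "latin" or "japanese"
    text_style: str       # "printed", "handwritten" or "both"

    def __post_init__(self):
        if self.script not in ("latin", "japanese"):
            raise ValueError(f"unsupported script: {self.script}")
        if self.text_style not in ("printed", "handwritten", "both"):
            raise ValueError(f"unsupported text style: {self.text_style}")


# Example values only: scanned, printed bank statements in Latin script.
bank_statement_settings = DocumentSettings(
    needs_cropping=True,
    script="latin",
    text_style="printed",
)
```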

Your Workflow Is Created

Your model is ready but it’s not trained yet. It still needs your guidance to excel!
We now need to upload some training documents so the AI learns where to find which data fields. For this we select “Upload Training Data”.

Upload Your Training Data

You are now ready to upload your own data for the training. You can create several templates and upload your documents to them. In our case, we create one template per bank and upload each bank’s statements to its own template.
If your documents are very diverse and look different every time, you can also upload them without assigning a known template and the AI will learn to generalize across all layouts.

Hint: Uploading template by template will later give you template-specific metrics, so you can investigate what works well and what does not, which in turn helps you to further optimise your extraction model.
Please upload a minimum of 5 documents per template or 50 diverse documents. It’s very important to select documents that are very similar to the documents that the model will process later.
This will help the AI get a full understanding of your documents and provide high-accuracy processing.
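
Before uploading, it can help to count what you have. The sketch below checks a planned upload against the minimums stated above, reading them as "at least 5 documents for each template, and at least 50 for documents uploaded without a template". That reading, the function name and the input format are assumptions.

```python
from typing import Dict, List, Optional

MIN_PER_TEMPLATE = 5   # from the text: minimum of 5 documents per template
MIN_UNTEMPLATED = 50   # from the text: 50 diverse documents


def check_training_set(docs_by_template: Dict[Optional[str], int]) -> List[str]:
    """Return a list of problems; an empty list means the counts look sufficient.

    Keys are template names. The key None holds the number of documents
    uploaded without a template.
    """
    problems = []
    for template, count in docs_by_template.items():
        if template is None:
            if 0 < count < MIN_UNTEMPLATED:
                problems.append(
                    f"only {count} documents without a template, "
                    f"need at least {MIN_UNTEMPLATED}"
                )
        elif count < MIN_PER_TEMPLATE:
            problems.append(
                f"template '{template}' has {count} documents, "
                f"need at least {MIN_PER_TEMPLATE}"
            )
    return problems
```

For example, `check_training_set({"Bank A": 7, "Bank B": 3})` reports that the template for Bank B still needs more documents.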

Annotate Your Training Documents

Now we have to annotate our uploaded documents. That means we have to teach the AI where to find which data field in our documents.

On the left, you can see all the data fields you defined for the extraction model. To assign the text boxes, select the data field on the left and then click on the appropriate text box in the document. The matching colours show you which data field is assigned to which text box.
Data fields that belong to one group, such as the Account Holder, must be grouped together. You can group them easily by clicking the black button with the plus sign. The group will then be highlighted in the same colour.
To annotate large tables like the transactions, we start with matching all corresponding text boxes to each of the data fields. In our case this means that we start column by column.

Once they are all assigned, we can group the transactions.
We create one group per transaction, that is, one group for each row of our table. To create multiple groups, select the green button with the plus sign at the bottom. Using drag and drop, we can now simply select all the data fields of the first row, and then do the same for each remaining row.
Once all data fields are assigned, you can save the annotations and repeat this step for all other uploaded documents.
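
The two-step table annotation (assign boxes column by column, then group them row by row) can be pictured as a simple data transformation. The following sketch only illustrates the resulting structure; in the platform the grouping is done by drag and drop. The `Annotation` type, the `row` attribute and the sample values are made up for this example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Annotation(NamedTuple):
    field: str  # data field the text box was assigned to
    text: str   # text inside the box
    row: int    # table row the box sits in (hypothetical attribute)


def group_rows(annotations: List[Annotation]) -> List[Dict[str, str]]:
    """Turn column-wise assignments into one group (dict) per table row."""
    rows = defaultdict(dict)
    for a in annotations:
        rows[a.row][a.field] = a.text
    return [rows[r] for r in sorted(rows)]


# Step 1: boxes assigned column by column (made-up sample values).
annotations = [
    Annotation("date", "01.03.", 0), Annotation("date", "02.03.", 1),
    Annotation("description", "Rent", 0), Annotation("description", "Salary", 1),
    Annotation("amount", "-950.00", 0), Annotation("amount", "2800.00", 1),
]

# Step 2: one group per transaction, i.e. per row.
transactions = group_rows(annotations)
```

Each entry in `transactions` now holds the date, description and amount of exactly one transaction, which is what a group represents in the annotation view.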

Start The AI Training

Once you are done with annotating the documents, you can start the training.
This means the AI now learns how to process your documents.
You will receive an email once the training is completed, which is normally within 24 hours!

Integrate Your API

You do not have to wait for the training to finish: your workflow API is already available and can be integrated right away. You can find all information, such as code snippets and JSON response examples, in the workflow documentation.

That’s It!

Your API will automatically be adjusted when the training is completed! The training metrics will provide you with more information about the accuracy of your AI Workflow.

If you need support for creating your own extraction model, just contact us and let us know how we can help you.