Using Azure AI Document Intelligence with Microsoft Fabric

In this post, you'll learn how to create a Document Intelligence solution using Microsoft Fabric. The solution uses Azure AI to extract text from scanned documents. Using a Spark notebook, we can import the data extracted from scanned documents into a data lake solution.

Azure AI Document Intelligence (previously known as Form Recognizer) is a service that reads documents and forms. It employs machine learning to evaluate documents stored in various formats, such as JPEG and PDF, and extracts structured data from them.

In a previous post, we covered how to use Azure AI Document Intelligence by calling the Azure REST API directly. In this post, we'll use Document Intelligence through the Python SDK from a Jupyter notebook.

Unlike previous posts on language translation and sentiment analysis in Fabric notebooks, in this example we'll call Azure AI directly, without the help of Synapse ML.

About Document Intelligence

Azure AI Document Intelligence accepts documents in a number of formats (JPEG, PDF, TIFF, and others) and uses trained models to evaluate the incoming images, identify data fields within them, and return the data extracted from the images as JSON objects.
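
To make that concrete, here's a heavily abridged, illustrative sketch of what an analyze result for a W-2 can look like (written as a Python dict; the actual field names and structure depend on the model and API version, and the values here are examples only):

```python
# Abridged, illustrative shape of a Document Intelligence analyze result
# for the prebuilt W-2 model. Field names and values are examples only.
analyze_result = {
    "modelId": "prebuilt-tax.us.w2",
    "documents": [{
        "docType": "tax.us.w2",
        "fields": {
            "TaxYear": {"type": "string", "valueString": "2022"},
            "WagesTipsAndOtherCompensation": {"type": "number", "valueNumber": 37160.56},
        },
    }],
}
```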

Document Intelligence makes use of machine learning models that have been trained to "look" for specific types of data, such as the well-known IRS W-2 form format. Custom Document Intelligence models can be trained to recognize data from custom business forms.


The Input Data


The input data for this solution is a collection of US IRS W-2 forms in JPEG format. W-2 forms are issued by US-based employers to report employee wages, tax withholdings, and other financial information used in calculating final income tax liabilities to both employees and the federal government.

Here's an example of a W-2, which we'll use as input for the solution:

[Image: sample W-2 form]

Create an Azure AI Service

We start by creating an Azure AI Document Intelligence resource (or a multi-service Azure AI services resource) in Azure.

From the Azure AI service, we need to record the endpoint and one of the key values. We'll use these values in the Python code in the Jupyter notebook.

Store Keys in Key Vault

As a best practice, keep your keys in Azure Key Vault. Key Vault releases the underlying keys to requesting processes based on the authorization granted through Microsoft Entra ID.

In the case of Fabric, a user can be granted the authorization to use a secret key within a notebook without ever knowing the actual value of the key.

Authoring the Jupyter Notebook

With the Key Vault and Azure AI service in place, we can move on to writing the code in the Jupyter notebook.

Fetching the Azure AI Key from Azure Key Vault

Once keys have been placed in the Key Vault, they can be retrieved into the notebook session using the Fabric PyTridentTokenLibrary package.
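
A minimal sketch of that cell, assuming placeholder vault and secret names, looks like this:

```python
from trident_token_library_wrapper import PyTridentTokenLibrary as tl

# Placeholders -- substitute your own Key Vault and secret names.
key_vault_url = "https://<your-key-vault>.vault.azure.net/"
secret_name = "<your-secret-name>"

# Acquire a Key Vault access token for the signed-in Fabric user, then
# use it to read the secret without ever displaying the key itself.
access_token = tl.get_access_token("keyvault")
ai_services_key = tl.get_secret_with_token(key_vault_url, secret_name, access_token)
```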


When this cell completes, the Azure AI key is stored in the ai_services_key session variable, assigned from the result of the get_secret_with_token(...) call.

The Main Processing Loop


The primary processing loop for this solution is shown below, along with a description of the major portions of the code.
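
The full cell appears in the video and in the notebook source on GitHub, and the line numbers cited in the walkthrough below refer to that cell. Here is a condensed sketch of its structure; the folder and table names are illustrative, and spark and mssparkutils are the objects Fabric provides to every notebook session:

```python
import base64

# Load every scanned W-2 (JPEG) from the Lakehouse Files area as binary
# files; each row holds file metadata plus the raw bytes in "content".
images_df = spark.read.format("binaryFile").load("Files/w2-input/*.jpg")

for row in images_df.collect():
    file_path = row["path"]   # Lakehouse location of this image
    blob = row["content"]     # the image bytes

    # Base64-encode the bytes and wrap them in the payload format the
    # Document Intelligence analyze endpoint expects.
    payload = {"base64Source": base64.b64encode(blob).decode("ascii")}

    # Send the image to Azure AI and get back the extracted form fields
    # (the analyze_tax_us_w2 function is described in the next section).
    w2_rows = analyze_tax_us_w2(payload)

    # Append the extracted fields to a Delta table in the Lakehouse.
    save_batch_to_table(w2_rows, "w2_forms")

    # Archive the processed image so it isn't picked up on the next run.
    mssparkutils.fs.mv(file_path, file_path.replace("w2-input", "w2-archive"))
```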


On line 4, we load all of the W-2 document images (JPEG files) from the source folder in the Data Lake Files section. The load returns a DataFrame with one row per image; each row carries several metadata columns, plus one column containing the file contents as a byte array.

Line 6 iterates over the DataFrame, processing one image at a time.

Lines 8 and 9 retrieve the file location and byte array (blob) from the DataFrame row.

Lines 14-17 convert the byte array (blob) to a base64 ASCII string, which is then embedded in a JSON object under the key base64Source. This is the payload format required by the Azure AI Document Intelligence analyze endpoint.

Line 20 invokes a function (described below) that submits the JSON payload to Azure AI Document Intelligence for processing, waits for a response, and returns the form data found in the image.

Line 23 saves the form data discovered by Document Intelligence to the Data Lake as a Delta table.

Lines 25-36 clean up the processed image files in the Data Lake by moving them to an "archive" folder.

This process is implemented as a single-threaded, synchronous flow to keep the example clear and easy to follow. In a production-scale solution, submissions to Azure AI Services and ingestion of the analyze results should be done asynchronously to make efficient use of Spark cluster compute resources.
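
As a sketch of one such pattern (using the client object introduced in the next section, and an assumed payloads list), the analyze operations could all be started before any results are collected:

```python
# Start an analyze operation per payload without blocking, then collect
# the results afterward; each poller tracks one in-flight request.
pollers = [
    client.begin_analyze_document("prebuilt-tax.us.w2", payload)
    for payload in payloads
]
results = [poller.result() for poller in pollers]
```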


How We Call Azure AI Services


In the main loop (above), we call analyze_tax_us_w2, a function we wrote to do the following:

  1. Call Azure AI Services with the key (from Key Vault), the file (as a base64 string), and the identifier of the Document Intelligence model we want the image processed with (prebuilt-tax.us.w2).
  2. Read the resulting JSON response payload.
  3. Extract the fields from the JSON payload that we targeted for saving in the Data Lake.


The source is somewhat lengthy, so I'll just highlight the most essential conceptual points. If you want to review the complete source file, watch the video or read the notebook source on GitHub.


The top part of the analyze_tax_us_w2 function calls Azure AI at lines 7-11 (a condensed sketch follows the list below). There is a lot going on in three lines of code.

  • Line 7 creates a client used to make API calls to Azure AI Services. Note that the endpoint and credential are provided; the credential was fetched from Azure Key Vault.
  • Line 10 invokes begin_analyze_document to make a POST call to Azure AI Services. Its parameters are (1) the name of the model to use when evaluating the image (prebuilt-tax.us.w2), and (2) the blob, in base64 string format.
  • The return from begin_analyze_document is a poller, which is used to poll the Azure AI GET endpoint until the image is analyzed and a final result of the call is received.
  • Line 11 is a synchronous wait until a final response to the request is received from Azure AI Services. The return (w2s) is a list of documents found in the image. Note that while the images in this example have only one W-2 form each, an image could have more than one form.
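
Putting those pieces together, a condensed sketch of this part of the function might look like the following, assuming the azure-ai-documentintelligence preview package (the exact client class and parameter names vary by SDK version; ai_services_endpoint is a placeholder for the endpoint recorded earlier, and payload is the JSON object built in the main loop):

```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

# Client used to make API calls; the key was fetched from Azure Key Vault.
client = DocumentIntelligenceClient(
    endpoint=ai_services_endpoint,
    credential=AzureKeyCredential(ai_services_key))

# POST the base64Source payload for analysis with the prebuilt W-2 model;
# the call returns a poller that tracks the long-running operation.
poller = client.begin_analyze_document("prebuilt-tax.us.w2", payload)

# Synchronously wait for the final response; documents holds the list of
# forms found in the image (typically one W-2 per image here).
w2s = poller.result().documents
```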

It should be noted that polling for the result of a request is optional. Azure AI Services retains request output for a retention period, so a highly scalable architecture could submit many requests and retrieve the results later, rather than waiting in real time for each request to complete.

After the JSON response is received, the function completes by parsing the nested payload for the fields we'll store in the Delta table created later in the process.
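
As an illustration, extraction of a couple of fields could look like this; the field names follow the prebuilt W-2 model's schema, but the actual set of fields kept is the notebook's choice:

```python
# Pull a couple of typed values out of each analyzed W-2 document;
# the value_string/value_number attributes follow the preview SDK's
# DocumentField model.
rows = []
for w2 in w2s:
    fields = w2.fields
    rows.append({
        "tax_year": fields["TaxYear"].value_string
                    if "TaxYear" in fields else None,
        "wages": fields["WagesTipsAndOtherCompensation"].value_number
                 if "WagesTipsAndOtherCompensation" in fields else None,
    })
```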


When the response is parsed into row format, the main loop calls our save_batch_to_table function to write the data extracted from the JPEG form to a Delta table in the Data Lake.
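
A minimal sketch of such a function, assuming the rows are the dictionaries produced above (the real function may add schema handling):

```python
from pyspark.sql import Row

def save_batch_to_table(rows, table_name):
    # Build a Spark DataFrame from the extracted-field dictionaries and
    # append it to a Delta table, creating the table on the first write.
    df = spark.createDataFrame([Row(**r) for r in rows])
    df.write.format("delta").mode("append").saveAsTable(table_name)
```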


Writing the Output Table

Once the data from all of the images has been received from Azure AI, extracted into our target format, and appended to the Delta table, we can query and analyze it as we would any other data in the Lakehouse.
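
For example, a quick aggregation over the new table (the table and column names here match the illustrative sketches above):

```python
# Count forms and total reported wages by tax year from the Delta table.
spark.sql("""
    SELECT tax_year, COUNT(*) AS form_count, SUM(wages) AS total_wages
    FROM w2_forms
    GROUP BY tax_year
    ORDER BY tax_year
""").show()
```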