**London**: Snowflake’s Document AI combines OCR and large language models to automate extraction of healthcare data from scanned documents, significantly reducing manual workloads and costs. IntelyCare’s recent flu vaccine documentation project highlights practical benefits and challenges of this innovative data digitisation technology.
Document Artificial Intelligence (AI) tools are emerging as a way to convert physical documents into structured, digital data. Snowflake’s Document AI platform combines Optical Character Recognition (OCR) with Large Language Models (LLMs) to extract information from scanned documents and load it into databases.
Document AI is designed to process “documents” broadly defined as images containing text. This capability extends beyond traditional handling of tabular data, JSON, XML feeds, and images, by enabling users to digitise information trapped in paper forms. The integration of OCR and LLM technologies allows a user to craft specific prompts and apply them to large document collections simultaneously. Snowflake’s implementation optimises these models by allowing users to fine-tune the LLM through iterative review and training cycles, making the development process accessible even to those who may not consider themselves expert data scientists.
The process involves uploading scanned documents, fine-tuning model prompts, running the model across thousands of documents, and then saving the extracted information into structured tables. This approach offers a streamlined alternative to traditional, complex data extraction workflows which typically require extensive coding and multiple steps such as data cleaning, feature engineering, and model training.
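The final step, saving extracted information into structured tables, can be sketched in Python. This is a minimal illustration, not Snowflake's implementation: it assumes each document's extraction arrives as a JSON-like dictionary in which every prompted field maps to a list of candidate answers with a `score` and a `value`, and metadata keys start with `__`. The function name `flatten_extraction` and the field names are hypothetical.

```python
from typing import Any


def flatten_extraction(file_name: str, result: dict[str, Any]) -> dict[str, Any]:
    """Turn one per-document extraction result into a flat, table-ready row.

    Assumed input shape (an assumption for this sketch):
      {"field": [{"score": 0.98, "value": "..."}], "__documentMetadata": {...}}
    """
    row: dict[str, Any] = {"file_name": file_name}
    for field, answers in result.items():
        if field.startswith("__"):
            continue  # skip metadata entries; only prompted fields become columns
        # Keep the highest-scoring candidate, or None if nothing was extracted.
        best = max(answers, key=lambda a: a.get("score", 0.0), default=None)
        row[field] = best.get("value") if best is not None else None
        row[f"{field}_score"] = best.get("score") if best is not None else None
    return row


example = {
    "__documentMetadata": {"ocrScore": 0.95},
    "patient_name": [{"score": 0.98, "value": "Jane Doe"}],
    "vaccination_date": [{"score": 0.91, "value": "2023-10-02"}],
    "lot_number": [],
}
row = flatten_extraction("flu_shot_001.pdf", example)
```

Keeping the score next to each value lets downstream logic decide how much to trust each field before anything is approved automatically.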
A practical application of Snowflake’s Document AI was demonstrated by IntelyCare, a healthcare staffing company that manages clinician documentation, including flu vaccination records. In the 2023 flu season, IntelyCare handled over 10,000 flu-shot documents, all of which had previously been reviewed by hand. Using Document AI, the company automated the review of approximately half of the flu-shot submissions, significantly reducing the manual review workload within a matter of weeks.
The workflow used by IntelyCare involved several stages:
– Uploading a large volume of flu-shot documents to Snowflake’s platform.
– Refining prompts and iteratively training the LLM for optimal accuracy.
– Developing logic to cross-check extracted data against clinician profiles, such as matching names and verifying dates.
– Implementing decision logic to automatically approve valid documents or route uncertain cases back to human reviewers.
– Testing the system rigorously to minimise false positives (invalid documents approved automatically), which pose a business risk, while accepting false negatives (valid documents not approved automatically), since these are simply flagged for manual follow-up without adverse effects.
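The cross-checking and routing stages above can be sketched as a small decision function. This is a hedged illustration rather than IntelyCare's actual logic: the `ExtractedFluShot` type, the `review_decision` function, the 0.9 score threshold and the season dates are all assumptions. Any doubt sends the document to a human, which reflects the stated preference for false negatives over false positives.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ExtractedFluShot:
    """Fields pulled from one document (hypothetical structure)."""
    clinician_name: Optional[str]
    vaccination_date: Optional[date]
    lowest_field_score: float  # weakest confidence among the extracted fields


def normalise_name(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def review_decision(doc: ExtractedFluShot, profile_name: str,
                    season_start: date, season_end: date,
                    min_score: float = 0.9) -> str:
    """Return "approve" only when every check passes; otherwise "manual_review"."""
    if doc.clinician_name is None or doc.vaccination_date is None:
        return "manual_review"  # a required field was not extracted
    if doc.lowest_field_score < min_score:
        return "manual_review"  # the model was not confident enough
    if normalise_name(doc.clinician_name) != normalise_name(profile_name):
        return "manual_review"  # name does not match the clinician profile
    if not (season_start <= doc.vaccination_date <= season_end):
        return "manual_review"  # date falls outside the accepted window
    return "approve"


# Illustrative season window, not taken from the source.
season = (date(2023, 8, 1), date(2024, 3, 31))
good = ExtractedFluShot("  Jane  DOE ", date(2023, 10, 2), 0.97)
unsure = ExtractedFluShot("Jane Doe", date(2023, 10, 2), 0.55)
print(review_decision(good, "Jane Doe", *season))
print(review_decision(unsure, "Jane Doe", *season))
```

Raising `min_score` trades more manual review for fewer wrongly approved documents.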
Throughout the process, IntelyCare’s team found that the Document AI model performs best when focused solely on extracting data visible on the documents, rather than making interpretative decisions or calculations. For example, instead of asking if a document was expired, the model was prompted to simply extract the expiration date, leaving any validity checks to be processed downstream. This refined use of prompt design improved the reliability of the system and minimised errors.
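A minimal sketch of that split, assuming the model returns the expiration date as a raw string: parsing and the expiry comparison happen in ordinary code after extraction. The function names and the list of accepted date formats are illustrative assumptions.

```python
from datetime import date, datetime
from typing import Optional

# Date formats the downstream check will accept (an assumption for this sketch).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def parse_extracted_date(raw: str) -> Optional[date]:
    """Try each known format; return None if the string is not a recognisable date."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


def is_expired(raw_expiration: str, today: date) -> Optional[bool]:
    """Validity check done in code, not by the model.

    Returns None when the date cannot be parsed, so the caller can route
    the document to a human instead of guessing.
    """
    expiration = parse_extracted_date(raw_expiration)
    if expiration is None:
        return None
    return expiration < today


today = date(2024, 1, 1)
print(is_expired("10/02/2023", today))
print(is_expired("2025-01-01", today))
print(is_expired("sometime next year", today))
```

The model only has to read what is printed on the page, while the comparison against today's date is deterministic and easy to test.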
The team also noted that training data had to be selected carefully to prevent overfitting. Repeated examples from the same clinician led to biased outputs, a problem the team alleviated by curating the training set. Additionally, while the model excelled at interpreting clear, typed text, it could be misled by handwritten notes, including ones written on surfaces like napkins. Current limitations include the lack of built-in support for document image embeddings within Snowflake, which would help identify documents that deviate significantly from normal submissions and flag them for human inspection.
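One simple way to curate a training set along these lines is to cap how many documents any single clinician contributes. The sketch below is an assumption about how that could be done, not a description of IntelyCare's method. The `clinician_id` key, the cap of two and the function name are hypothetical.

```python
import random
from collections import defaultdict
from typing import Any


def curate_training_set(documents: list[dict[str, Any]],
                        max_per_clinician: int = 2,
                        seed: int = 0) -> list[dict[str, Any]]:
    """Keep at most `max_per_clinician` randomly chosen documents per clinician."""
    rng = random.Random(seed)  # seeded so the selection is reproducible
    by_clinician: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for doc in documents:
        by_clinician[doc["clinician_id"]].append(doc)

    curated: list[dict[str, Any]] = []
    for clinician_id in sorted(by_clinician):
        docs = by_clinician[clinician_id]
        rng.shuffle(docs)  # avoid always picking the earliest uploads
        curated.extend(docs[:max_per_clinician])
    return curated


docs = [{"clinician_id": "A", "file": f"a{i}.pdf"} for i in range(5)]
docs.append({"clinician_id": "B", "file": "b0.pdf"})
training = curate_training_set(docs, max_per_clinician=2)
print(len(training))  # 3: two documents from clinician A and one from clinician B
```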
Despite initial concerns about cost, the team reported that using Snowflake’s platform for Document AI was more affordable than anticipated. Training and deploying the model to process thousands of documents incurred expenses below $100 per week, a figure that is offset by savings from reduced manual review labour. Overall, the automated review process halved the human workload and cut associated costs by around 40%, demonstrating a compelling business case for the technology’s adoption.
The rapid development timeline, with a project of this scale taking days rather than potentially months, shows how Document AI can accelerate data digitisation efforts. Following the successful flu-shot review initiative, IntelyCare has applied similar models to other healthcare documents with promising results.
Snowflake’s Document AI thus represents a significant step forward in bridging physical document archives with digital data ecosystems. While some advanced features, such as document image embeddings, are not yet built in, the platform already offers practical and cost-effective capabilities for industries managing large volumes of paperwork.
Source: Noah Wire Services