DocuLite

  • Synthetic Data Boosts Financial Document Parsing


    We trained a 7B model (OpenChat) on synthetic OCR data to beat public dataset benchmarks on financial docs. (Paper + Method inside)

    Researchers have tackled the Privacy Paradox in Financial Document Understanding (FDU) by building synthetic data generators, so that models can be trained without using real client data. Their framework, DocuLite, has two components, InvoicePy and TemplatePy, which generate complex synthetic OCR text and HTML-based invoice templates. Models such as OpenChat-3.5 and InternVL-2 trained on these synthetic datasets achieved significantly higher F1 scores than models trained on conventional public datasets. The results suggest that, in sensitive domains like finance and healthcare, investing in synthetic data generation can be a more effective way to build document parsers than relying on public datasets. This matters because it offers a privacy-compliant way to improve machine learning models for financial document processing.
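To make the idea concrete, here is a minimal sketch of what a synthetic invoice generator could look like: it renders a random invoice as HTML and returns the matching ground-truth labels, so no real client data is involved. This is not the InvoicePy or TemplatePy API. The function `make_invoice`, the `VENDORS` and `ITEMS` pools, and the label schema are all hypothetical names chosen for illustration.

```python
import random
from html import escape

# Hypothetical pools for illustration; a real generator would use far larger ones.
VENDORS = ["Acme Supplies Ltd", "Northwind Traders", "Globex Corp"]
ITEMS = [("Paper A4", 4.50), ("Toner cartridge", 62.00), ("Stapler", 12.75)]


def make_invoice(rng):
    """Return (html, labels) for one synthetic invoice.

    `rng` is a random.Random instance, so a fixed seed reproduces the sample.
    """
    vendor = rng.choice(VENDORS)
    number = f"INV-{rng.randint(10000, 99999)}"

    # Pick a random subset of items and a random quantity for each.
    line_items = []
    for name, unit_price in rng.sample(ITEMS, k=rng.randint(1, len(ITEMS))):
        qty = rng.randint(1, 10)
        line_items.append({
            "item": name,
            "qty": qty,
            "unit_price": unit_price,
            "amount": round(qty * unit_price, 2),
        })
    total = round(sum(li["amount"] for li in line_items), 2)

    # Ground-truth labels: the target a parser would be trained to extract.
    labels = {
        "vendor": vendor,
        "invoice_number": number,
        "line_items": line_items,
        "total": total,
    }

    # Render the same values into a simple HTML invoice template.
    rows = "".join(
        f"<tr><td>{escape(li['item'])}</td><td>{li['qty']}</td>"
        f"<td>{li['unit_price']:.2f}</td><td>{li['amount']:.2f}</td></tr>"
        for li in line_items
    )
    html = (
        f"<html><body><h1>{escape(vendor)}</h1>"
        f"<p>Invoice No: {number}</p>"
        f"<table><tr><th>Item</th><th>Qty</th><th>Unit price</th><th>Amount</th></tr>"
        f"{rows}</table>"
        f"<p>Total: {total:.2f}</p></body></html>"
    )
    return html, labels


if __name__ == "__main__":
    page, truth = make_invoice(random.Random(0))
    print(truth["invoice_number"], truth["total"])
```

Because the labels are produced alongside the document, every sample comes with exact ground truth. In a fuller pipeline the HTML would presumably be rendered to an image and passed through OCR to obtain the noisy text the model is trained on; that step is omitted here.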
