OCR

  • HuggingFace’s FinePDFs Dataset Release


    The FinePDFs 📄 Book

    HuggingFace has released the FinePDFs dataset, a 3-trillion-token resource aimed at the open-source community. The release includes insights into building state-of-the-art PDF datasets, the relevance of older internet content, and the choice of RolmOCR for optical character recognition. It also discusses which open-source model is the most Claude-like and the surprising prominence of a horse racing site in the dataset's URL list. This matters because it makes PDF data processing better understood and more accessible to developers and researchers in the open-source community.


  • API for Local Video Indexing in RAG Setups


    Built an API to index videos into embeddings, optimized for running RAG locally

    A new API simplifies video indexing for people running Retrieval-Augmented Generation (RAG) setups locally, so video content can be indexed without relying on cloud services. It automates video preprocessing by extracting transcripts, sampling frames, performing OCR, and creating embeddings, and it outputs clean JSON ready for local vector stores such as Milvus or Weaviate. Key features include capturing both speech and visual content, timestamped chunks that make it easy to point back to a moment in the video, and minimal dependencies to keep processing lightweight. The tool is particularly useful for indexing internal or private videos, running semantic search over video archives, and building local RAG agents that draw on video content, all while keeping data private and under the user's control. Why this matters: it offers a practical way to manage and search video content locally, which helps anyone using local LLMs who needs to keep data private.

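    The preprocessing steps described above (transcript, sampled frames, OCR, embeddings, timestamped JSON) can be sketched roughly as follows. This is not the API from the article, whose interface is not shown here. The names `Chunk`, `build_chunks`, and `toy_embedding`, the 30-second window, and the sample data are all assumptions made for illustration. Transcription and OCR are assumed to have already run upstream, and the hashed bag-of-words vector only stands in for a real embedding model.

```python
import hashlib
import json
import math
from dataclasses import dataclass, asdict


@dataclass
class Chunk:
    start: float      # chunk start time in seconds
    end: float        # chunk end time in seconds
    speech: str       # transcript text whose segment starts inside the window
    on_screen: str    # OCR text from frames sampled inside the window
    embedding: list   # placeholder vector (see toy_embedding)


def toy_embedding(text, dim=8):
    """Placeholder for a real embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_chunks(transcript, frame_text, window=30.0):
    """Group speech and on-screen text into fixed-length, timestamped chunks.

    transcript: list of (start_seconds, end_seconds, text) segments
    frame_text: list of (timestamp_seconds, ocr_text) for sampled frames
    """
    buckets = {}
    for start, _end, text in transcript:
        idx = int(start // window)
        buckets.setdefault(idx, {"speech": [], "ocr": []})["speech"].append(text)
    for ts, text in frame_text:
        idx = int(ts // window)
        seen = buckets.setdefault(idx, {"speech": [], "ocr": []})["ocr"]
        # Consecutive frames often show the same slide; keep each text once.
        if text and text not in seen:
            seen.append(text)

    chunks = []
    for idx in sorted(buckets):
        speech = " ".join(buckets[idx]["speech"])
        on_screen = " ".join(buckets[idx]["ocr"])
        combined = (speech + " " + on_screen).strip()
        chunks.append(
            Chunk(idx * window, (idx + 1) * window, speech, on_screen,
                  toy_embedding(combined))
        )
    return chunks


# Made-up sample inputs, standing in for real transcription and OCR output.
transcript = [
    (0.0, 4.2, "Welcome to the quarterly review."),
    (31.5, 36.0, "Revenue grew in the third quarter."),
]
frame_text = [(2.0, "Q3 Review"), (12.0, "Q3 Review"), (33.0, "Revenue +12%")]

chunks = build_chunks(transcript, frame_text)
print(json.dumps([asdict(c) for c in chunks], indent=2))
```

    Each JSON record carries its own `start` and `end`, so a search hit in a vector store can be mapped back to a position in the video. Fixed windows are the simplest choice; splitting on sentence or scene boundaries would be a reasonable alternative.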

  • Comparing OCR Outputs: Unstructured, LlamaParse, Reducto


    Agentically compare OCR outputs of Unstructured, LlamaParse, Reducto, etc. side-by-side

    High-quality OCR and document parsing are crucial for building agents that can reason over unstructured data, and no single parser fits every scenario. To address this, an AI Engineering agent has been extended to call document parsing models such as Unstructured, LlamaParse, and Reducto and to render their outputs side by side in a readable form. This makes it easier to choose the most suitable OCR provider for a specific task. The agent can also run batch jobs, demonstrated by processing 30 invoices in under a minute. This matters because it streamlines selecting and using the best OCR tool for the job, improving the efficiency and accuracy of data processing tasks.

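    As a rough illustration of comparing parser outputs side by side, the sketch below scores pairwise text similarity and prints two outputs in aligned columns. It does not call Unstructured, LlamaParse, or Reducto, and it is not the agent from the article. The provider names `parser_a`, `parser_b`, `parser_c`, the helper names, and the sample invoice text are hypothetical. It assumes each provider's output has already been collected as plain text.

```python
import difflib
from itertools import combinations


def compare_outputs(outputs):
    """Return a similarity ratio in [0, 1] for every pair of providers.

    outputs: dict mapping a provider name to the text it extracted.
    """
    scores = {}
    for a, b in combinations(sorted(outputs), 2):
        matcher = difflib.SequenceMatcher(None, outputs[a], outputs[b])
        scores[(a, b)] = matcher.ratio()
    return scores


def side_by_side(outputs, a, b, width=40):
    """Render two outputs line by line; '*' marks lines that differ."""
    left = outputs[a].splitlines()
    right = outputs[b].splitlines()
    rows = [f"{a:<{width}} | {b}"]
    for i in range(max(len(left), len(right))):
        l_line = left[i] if i < len(left) else ""
        r_line = right[i] if i < len(right) else ""
        marker = " " if l_line == r_line else "*"
        rows.append(f"{l_line:<{width}} {marker} {r_line}")
    return "\n".join(rows)


# Hypothetical outputs for one invoice, standing in for real parser results.
outputs = {
    "parser_a": "Invoice 001\nTotal: 42.00",
    "parser_b": "Invoice 001\nTotal: 42,00",
    "parser_c": "Invoice 001\nTotal: 42.00",
}

scores = compare_outputs(outputs)
for pair, ratio in scores.items():
    print(pair, round(ratio, 3))

table = side_by_side(outputs, "parser_a", "parser_b")
print(table)
```

    Character-level similarity only shows where providers disagree, not which one is right. Picking a provider still needs a check against the source document or a labeled sample.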