Getting Started with Databricks
Datavolo helps teams build multimodal data pipelines to support their organization’s AI initiatives.
Every organization has its own private datasets, but extracting value from unstructured data has historically been difficult. The rise of AI applications and of a new pattern, retrieval augmented generation (RAG), provides a way to unlock the hidden value within these datasets. RAG offers a robust way to feed LLMs with up-to-date, reliable, and pertinent information, giving AI applications the context they need to mitigate hallucinations and increase the accuracy and relevance of responses. By employing RAG, users can ask complex questions about their data and instantly receive high quality responses.
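To make the pattern concrete, the sketch below shows the retrieve-then-generate loop at the heart of RAG. The vector_index and llm objects are placeholders for whatever vector store and model an application uses, not specific Datavolo or Databricks APIs.

```python
# Minimal retrieve-then-generate sketch of the RAG pattern.
# vector_index and llm are placeholders, not specific Datavolo or
# Databricks components.
def answer_with_rag(question: str, vector_index, llm, top_k: int = 5) -> str:
    # 1. Retrieve the chunks most relevant to the question.
    chunks = vector_index.similarity_search(question, k=top_k)

    # 2. Ground the prompt in the retrieved context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate a response that is grounded in the retrieved chunks.
    return llm.generate(prompt)
```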
Datavolo Integration with Databricks
Databricks has cemented itself as a leader in the AI space by offering multiple integrations with LLMs and vector embeddings. At Datavolo, we are excited to partner with Databricks and give customers a clear path for ingesting their data into Databricks in a secure, scalable, and reliable way. For our integration with Databricks, we have created multiple custom processors that work hand in hand with Unity Catalog and Databricks SQL. These processors, along with our advanced PDF parsing and chunking processors, provide an end-to-end process for ingesting unstructured data into the Databricks platform. The example below shows how to pre-process and load financial documents into Databricks Delta Tables on Unity Catalog. Landing the financial data in Delta Tables makes it queryable and lets users run Databricks' AI models and applications on it. The example has three main phases: unstructured document parsing, chunking and pushing data to Databricks, and finally orchestrating Databricks SQL operations. Together these demonstrate the key components of the Datavolo-Databricks partnership.
Unstructured Data Processing
Overview
As previously mentioned, Datavolo is optimized to process unstructured documents such as 10-K and 10-Q reports. This example uses PDFs, a common format for these documents. The processing workflow involves several stages: document ingestion, pre-processing, loading metadata, parsing, structuring, and transferring data to Databricks for further analysis.
Data Ingestion and Pre-Processing
- Document Ingestion:
In the data ingestion phase, we focus on efficiently and securely retrieving financial documents stored in an S3 bucket. These documents, in this case 10-K and 10-Q reports, are ingested using NiFi’s FetchS3Object processor, which handles authentication and download so that each document is available for further processing. This step sets the foundation for subsequent processing by ensuring that all relevant documents are gathered from a central repository. Using S3 as the data source allows for scalable and reliable storage, which is essential for handling large volumes of financial documents. The ingestion step can also be configured to pick up new documents as they are uploaded to S3, allowing the pipeline to operate in a streaming fashion and process data the moment it becomes available.
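The flow itself handles this with the S3 processors; purely as a conceptual illustration, the Python sketch below (using boto3) shows the same list-new-objects-then-fetch pattern. The bucket name and prefix are placeholders.

```python
import boto3

# Conceptual sketch of the ingest step: poll a bucket for newly arrived
# reports and download each one. In the actual flow this is done by the
# S3 processors; the bucket and prefix here are placeholders.
s3 = boto3.client("s3")
BUCKET = "financial-reports"   # hypothetical bucket
PREFIX = "filings/"            # hypothetical prefix
seen_keys: set[str] = set()

def poll_for_new_documents() -> list[bytes]:
    new_docs = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key in seen_keys or not key.endswith(".pdf"):
                continue
            body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            new_docs.append(body)
            seen_keys.add(key)
    return new_docs
```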
- Initial Processing Steps:
File Validation and Routing:
After ingestion, the next step is to collect metadata and to validate and route the files. The RouteOnAttribute processor filters out empty or corrupted files, ensuring that only valid documents proceed to the next stage of processing. The UpdateAttribute processor records valuable metadata about each document, such as its original source, collection method, and time of ingest. Valid files are then routed to the appropriate processing paths based on predefined criteria, and the raw financial documents are archived to Unity Catalog in case the data needs to be reprocessed later.
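Conceptually, the validation and metadata steps amount to something like the sketch below; the attribute names are illustrative rather than the exact attributes used in the flow.

```python
from datetime import datetime, timezone

def validate_and_annotate(content: bytes, source_uri: str) -> dict | None:
    # Mirrors what RouteOnAttribute and UpdateAttribute do in the flow;
    # the attribute names here are illustrative.
    # Reject empty files and files that do not look like PDFs.
    if not content or not content.startswith(b"%PDF"):
        return None

    return {
        "source": source_uri,
        "collection.method": "s3-fetch",
        "ingest.timestamp": datetime.now(timezone.utc).isoformat(),
    }
```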
PDF Parsing:
Having raw financial PDF files is not very useful on its own. The value within the data is still masked behind a complex and unstructured PDF format, and parsing this format is no trivial task. To address this, Datavolo provides a ParsePdfDocument processor to extract various types of content from the PDF documents, including text, images, and tables. Each type of content is processed separately:
- Text Extraction: Extracts textual content for further text-based analysis.
- Image Extraction: Extracts images, which are then processed to generate descriptive text.
- Table Extraction: Extracts tables from the documents, converting them into a structured format like CSV.
PDF parsing is complex under the hood. Embedded tables often contain critical details, but extracting those details from financial documents is difficult when each report uses several different table formats. Valuable details also appear in accompanying images, and conventional document processing methods drop them. Fortunately, Datavolo provides a simple interface to this complex step. The ParsePdfDocument processor splits each document into its text, table, and image elements, a ParseTableImage processor extracts details from complex table layouts, and the PromptOpenAI processor describes the details within images. Both the ParsePdfDocument and ParseTableImage processors connect to Datavolo's microservice architecture, which provides a flexible and scalable way to process documents.
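The snippet below is only a rough, open-source approximation of the kind of per-page extraction involved, using the pdfplumber library; it is not the ParsePdfDocument implementation, which runs in Datavolo's parsing microservices, and it omits image extraction and LLM-based descriptions entirely.

```python
import pdfplumber

def parse_pdf(path: str) -> dict:
    # Rough approximation of splitting a PDF into elements. The actual
    # ParsePdfDocument processor runs in Datavolo's microservices; this
    # only illustrates per-page text and table extraction.
    elements = {"text": [], "tables": []}
    with pdfplumber.open(path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            elements["text"].append({"page": page_number, "content": text})
            for table in page.extract_tables():
                # Each table is a list of rows (lists of cell strings).
                elements["tables"].append({"page": page_number, "rows": table})
    # Image extraction and image/table description (ParseTableImage,
    # PromptOpenAI) are handled elsewhere in the flow and omitted here.
    return elements
```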
- Data Structuring:
After parsing, the extracted components (text, images, tables) are merged into a single JSON document. This merging is vital for maintaining the context and relationships between different parts of the document; for instance, textual references to tables or images must be preserved to ensure accurate data representation. The JSON document is uploaded to Unity Catalog so that other teams can access the parsed documents. The textual content within the JSON documents still needs to be chunked into manageable sizes to support RAG, and for this step Datavolo's ChunkDocument processor is used. Chunking is essential for detailed processing and analysis, allowing for more precise retrieval during RAG. The processor supports advanced chunking strategies such as Section (which includes a Subsection strategy), Paragraph, and Sentence, as well as a naive strategy that chunks by a fixed number of tokens.
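To make the naive strategy concrete, here is a minimal token-count chunker; it uses whitespace tokens as a stand-in for real tokenization, and the structure-aware Section, Paragraph, and Sentence strategies in ChunkDocument are considerably more sophisticated than this.

```python
def chunk_by_tokens(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    # Naive fixed-size chunker using whitespace "tokens" as a stand-in for
    # real tokenization. Overlapping chunks help preserve context across
    # chunk boundaries during retrieval.
    tokens = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + max_tokens]
        if window:
            chunks.append(" ".join(window))
    return chunks
```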
- Financial Metadata:
The pipeline is also configured to extract two critical details from the documents: the company's stock ticker symbol and the document's fiscal year. These two attributes are critical when querying the data. The ExtractText processor runs regular expressions to find the ticker symbol and fiscal year, and PromptOpenAI is used to extract these values from the first few pages of a document when the regular expressions do not match.
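As an illustration, a regex-first, LLM-fallback extraction might look like the sketch below; the patterns are hypothetical and would need to be tuned to the actual filings.

```python
import re

# Hypothetical patterns; real filings vary, which is why the flow falls
# back to PromptOpenAI when a pattern does not match.
TICKER_RE = re.compile(r"Trading Symbol[:\s]+([A-Z]{1,5})")
FISCAL_YEAR_RE = re.compile(
    r"fiscal year ended\s+\w+\s+\d{1,2},\s+(\d{4})", re.IGNORECASE
)

def extract_financial_metadata(text: str) -> dict:
    ticker = TICKER_RE.search(text)
    year = FISCAL_YEAR_RE.search(text)
    return {
        "ticker": ticker.group(1) if ticker else None,
        "fiscal_year": int(year.group(1)) if year else None,
    }
```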
- Data Transfer to Databricks:
The final step in the data ingestion and pre-processing phase inserts document metadata and the processed chunks into Databricks Delta Tables. The PutDatabricksSQL processor pushes records into these tables: a documents table stores a record for each financial document ingested, including its stock ticker symbol, fiscal year, source location, and ingestion time, while a chunks table holds the text chunks and the metadata associated with them from the data structuring step. This seamless integration is crucial for enabling advanced analytics and insights on the ingested data.
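Inside the flow this is handled by PutDatabricksSQL; for readers who want to see the equivalent inserts by hand, the sketch below uses the databricks-sql-connector package. The connection details, catalog, schema, and table names are placeholders, and the :name parameter style assumes a recent connector version.

```python
from databricks import sql  # pip install databricks-sql-connector

# Hand-rolled equivalent of the PutDatabricksSQL step. All connection
# details and table names below are placeholders.
with sql.connect(
    server_hostname="<workspace-host>",
    http_path="<sql-warehouse-http-path>",
    access_token="<personal-access-token>",
) as conn, conn.cursor() as cursor:
    cursor.execute(
        """INSERT INTO finance.silver.documents
           (doc_id, ticker, fiscal_year, source, ingest_time)
           VALUES (:doc_id, :ticker, :fiscal_year, :source, current_timestamp())""",
        {"doc_id": "doc-001", "ticker": "ACME", "fiscal_year": 2023,
         "source": "s3://financial-reports/filings/acme-10k.pdf"},
    )
    cursor.execute(
        """INSERT INTO finance.silver.chunks
           (chunk_id, doc_id, page_number, text)
           VALUES (:chunk_id, :doc_id, :page_number, :text)""",
        {"chunk_id": "doc-001-0001", "doc_id": "doc-001",
         "page_number": 12, "text": "Net revenue increased ..."},
    )
```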
Processing in Databricks
Datavolo provides a way to pre-process data and ingest it into Databricks Unity Catalog. Below we summarize how our demo pushes data into each layer of Databricks' medallion architecture.
Landing Data in Databricks
- Bronze Layer:
In the Bronze layer of Delta Lake, we store raw, unfiltered data, including the original PDF documents and the initial structured JSON documents. The primary purpose of this layer is to serve as an archival point for all ingested data, preserving the original documents so that any reprocessing or validation can refer back to the source data.
Transformations and Enrichment
- Silver Layer:
In the Silver layer, the focus shifts to refining and enriching the data, which also needs to land in a structured, queryable format. Two tables are set up to hold the data: a documents table holds metadata about each financial document, and a chunks table holds the text chunks and metadata extracted from each document. The chunks table contains a foreign key to the documents table to link the datasets together.
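As a rough sketch, the Silver layer definitions might look like the following Databricks SQL, shown here as Python strings; the column names and types are assumptions rather than the exact schema used in the demo, and the key constraints in Unity Catalog are informational.

```python
# Illustrative Delta table definitions for the Silver layer. Catalog,
# schema, column names, and types are assumptions, not the demo's exact
# schema. Unity Catalog key constraints are informational only.
DOCUMENTS_DDL = """
CREATE TABLE IF NOT EXISTS finance.silver.documents (
    doc_id       STRING NOT NULL,
    ticker       STRING,
    fiscal_year  INT,
    source       STRING,
    doc_type     STRING,
    ingest_time  TIMESTAMP,
    CONSTRAINT documents_pk PRIMARY KEY (doc_id)
)
"""

CHUNKS_DDL = """
CREATE TABLE IF NOT EXISTS finance.silver.chunks (
    chunk_id     STRING NOT NULL,
    doc_id       STRING NOT NULL,
    page_number  INT,
    text         STRING,
    ingest_time  TIMESTAMP,
    CONSTRAINT chunks_pk PRIMARY KEY (chunk_id),
    CONSTRAINT chunks_doc_fk FOREIGN KEY (doc_id)
        REFERENCES finance.silver.documents (doc_id)
)
"""
```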
Populating these tables involves a few steps:
Metadata Registration: A record is inserted into the documents table for each financial document encountered. The record includes the company's extracted stock ticker and fiscal year, which are important for identifying the document and filtering queries, along with the document's source, type, and ingestion time, which are crucial for maintaining data lineage and traceability so that every piece of data can be traced back to its source.
Parsed Data Storage: The JSON documents are further processed to extract text chunks, which are then stored separately in the chunks table. Each chunks record includes the extracted text, the source document id, the page number in the document from which the chunk was derived, and other metadata such as ingestion time.
This step ensures that the data is cleaned, standardized, and enriched, making it ready for more detailed analysis.
Final Data Structuring
- Gold Layer:
The final stage of data processing occurs in the Gold layer. Here, data from the Silver layer is aggregated into a comprehensive table that represents the fully processed and enriched data, ready for advanced analytics and reporting. This table also leverages a Databricks vector search index to create embeddings over the chunked text column; embeddings on the chunked text are required to employ RAG.
The final table includes several key fields:
- Chunk ID: A unique identifier for each text chunk.
- Document ID: Links the chunk back to the original document.
- Document Source: Indicates where the document originated.
- Page Number: Identifies the page from which the chunk was extracted.
- Text: The actual content of the text chunk.
- Embedding: The generated embedding for advanced search capabilities and RAG.
- Fiscal Year: The fiscal year the document pertains to.
- Ticker Symbol: The ticker symbol belonging to the company.
This structured data allows for detailed and accurate analysis, providing valuable insights from the processed documents.
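A sketch of wiring up the vector index with the Databricks Vector Search SDK is shown below; the endpoint, table, index, and embedding model names are placeholders to adapt to your workspace.

```python
from databricks.vector_search.client import VectorSearchClient

# Create a Delta Sync vector index over the Gold table's text column.
# Endpoint, table, index, and embedding model names are placeholders.
client = VectorSearchClient()

index = client.create_delta_sync_index(
    endpoint_name="financial-docs-endpoint",
    index_name="finance.gold.chunks_index",
    source_table_name="finance.gold.chunks_enriched",
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="text",
    embedding_model_endpoint_name="databricks-bge-large-en",
)
```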
Practical Applications
Querying Financial Data
Leveraging the structured data in the Gold layer, users can perform complex queries on the financial documents. For instance, queries can be run to determine the total revenue or sales of a company for a specified fiscal year. This capability is particularly valuable for financial analysts and researchers who need to extract precise information from large volumes of data.
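As a hypothetical example, retrieving the context for a question like "What was ACME's total revenue in fiscal year 2023?" could look like the following, reusing the placeholder names from the index sketch above; the retrieved chunks and their page numbers would then be passed to an LLM as context, as in the RAG sketch earlier in this post.

```python
from databricks.vector_search.client import VectorSearchClient

# Hypothetical retrieval for a revenue question, filtered to one company
# and fiscal year. Names mirror the placeholder index created earlier.
client = VectorSearchClient()
index = client.get_index(
    endpoint_name="financial-docs-endpoint",
    index_name="finance.gold.chunks_index",
)

results = index.similarity_search(
    query_text="total revenue for the fiscal year",
    columns=["chunk_id", "doc_id", "page_number", "text"],
    filters={"ticker": "ACME", "fiscal_year": 2023},
    num_results=5,
)
# The response contains the matching chunks (text, page numbers, source
# document ids) to ground the LLM's final answer.
```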
Ensuring Data Traceability
Every piece of data processed through this flow maintains lineage information. This ensures that users can trace back to the original document and verify the source of any extracted information. This traceability is crucial for maintaining data integrity and trust, especially when dealing with financial documents where accuracy is paramount.
Supported Cloud Providers
Datavolo currently offers support for multiple cloud platforms, providing flexible options for unstructured data ingestion and processing. Our support includes:
- Container Orchestration Platforms:
- Amazon Web Services (AWS) Elastic Container Service (ECS) - Currently supported
- Google Cloud Platform (GCP) Google Kubernetes Engine (GKE) - Planned support by end of Q2 2024
- Microsoft Azure Kubernetes Service (AKS) - Planned support following GKE integration
- Infrastructure as a Service (IaaS) Platforms: We currently support the following IaaS providers:
- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
- Microsoft Azure
Our multi-cloud support strategy ensures that customers have the flexibility to choose the cloud environment that best suits their needs and existing infrastructure. This approach allows organizations to leverage their preferred cloud provider while benefiting from Datavolo's powerful unstructured data processing capabilities. For customers with specific cloud requirements or those operating in multi-cloud environments, we recommend contacting our support team to discuss the best deployment options for your use case. Please note that while we are actively expanding our cloud provider support, the availability of certain features may vary across platforms. We continuously update our offerings to provide a consistent experience across all supported cloud environments.
Conclusion
We have provided a demo showing how to process and unlock value from complex financial data. We reviewed the steps necessary to pre-process, parse, and extract details from complex PDF documents, and we described how Datavolo integrates with Databricks' Unity Catalog and Delta Tables. Datavolo, powered by Apache NiFi, combined with Databricks' data processing capabilities, provides a robust solution for handling unstructured documents. This integration not only streamlines the data processing workflow but also enriches the data to support advanced analytics and queries. Together, Datavolo and Databricks provide a complete end-to-end system for ingesting, processing, and making unstructured data accessible to end users.