Extracting Insights from Unstructured Financial Documents with LLMs (2024)

Jun 5, 2024

Process hundreds of thousands of financial statements, such as 10-K and 10-Q reports, and deliver financial insights to your customers.

Extracting data from large volumes of SEC forms and other financial reports can be highly lucrative. This article explains the different types of reports and the technologies and tools we can use to extract insights from them.

Types of financial reports and what you can extract

The Securities and Exchange Commission (SEC) requires public companies to publish various documents that analysts and investors can use to evaluate a company and its growth. 

Form 10-K

Form 10-K is an annual report that provides a comprehensive overview of a company's financial performance. It includes audited financial statements, such as the income statement, balance sheet, and cash flow statement. These are typically presented in a clear, tabular format.

Beyond the tables, the report contains narrative sections like Management’s Discussion and Analysis (MD&A), where executives discuss the fiscal year in detail. These sections provide insights into the company’s operations, liquidity, and future outlook.

Legacy technology like OCR (Optical Character Recognition) can help extract simple tabular data from financial documents. With these methods, however, valuable information remains hidden within the narrative sections of the report. Later in this article, we’ll explain how tools like Trellis allow you to use LLMs to extract complex insights from these sections.

Form 10-Q

You can think of the Form 10-Q report as the quarterly, less comprehensive version of the 10-K. It reports updated financial statements and a written overview of the company’s performance over the past quarter. 

Like the 10-K, it includes tabular data such as quarterly financial results and textual sections that cover management's perspective on the financial results, including potential risks and uncertainties facing the company.

Form 8-K

Form 8-K reports significant events or corporate changes that could be important to shareholders or the SEC. These changes include acquisition announcements, changes in leadership, and other significant operational shifts.

The data in Form 8-K is not typically structured in standard tables but can include financial data related to specific events, usually detailed in written form.

Other Report Types

Other SEC filings include proxy statements (DEF 14A), which provide disclosures on executive compensation, shareholder proposals, and corporate governance. These documents often contain tabular data regarding compensation packages and written descriptions of corporate governance policies.

Additionally, specialized reports like Form 20-F for foreign companies and Form S-1 for initial public offerings contain a mix of tabular financial data and extensive explanations of business operations, strategies, and risk factors.

Techniques for data extraction from financial reports 

OCR: The Brittle Incumbent

Optical Character Recognition (OCR) and its slightly improved variant, Intelligent Character Recognition (ICR), are techniques for digitizing text by identifying characters in scanned images. The technology's origins are often traced to 1914, when the physicist Emanuel Goldberg developed a machine that read characters and converted them into telegraph code.

This technology is effective for structured data like financial tables but has three significant limitations:

  1. Accuracy: OCR can struggle with text clarity, especially from low-quality scans or unusual fonts. This leads to errors in the digitized output.

  2. Rigidity: OCR-based extraction pipelines typically rely on predefined rules that assign values to specific fields, which works well for documents with uniform layouts. They often struggle when the layout varies. For example, if column headers and their corresponding data are spaced far apart or aligned vertically, the pipeline might misinterpret or overlook the connection between them. As a result, data can be misclassified or missed altogether, particularly when document formats are diverse.

  3. Contextual Understanding: OCR cannot comprehend the context or meaning behind the text, making it inadequate for analyzing narrative sections or extracting nuanced information.
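The rigidity problem can be illustrated with a minimal sketch. It assumes a hypothetical rule-based extractor that reads each field from a fixed row and character range in the recognized text. The sample lines, field names, and the `row` helper are invented for illustration and do not come from any real filing or OCR product.

```python
def row(label: str, value: str) -> str:
    """Build one line of fake OCR output: label in columns 0-19, value in 20-24."""
    return f"{label:<20}{value:>5}"


def extract_with_template(lines, template):
    """Read each field from a fixed (row, start, end) position in the text."""
    result = {}
    for field, (row_index, start, end) in template.items():
        line = lines[row_index] if row_index < len(lines) else ""
        result[field] = line[start:end].strip()
    return result


# The template encodes where each value sits in the layout it was written for.
TEMPLATE = {
    "total_revenue": (0, 20, 25),
    "net_income": (1, 20, 25),
}

# Layout A: the layout the template expects.
layout_a = [
    row("Total revenue", "1,250"),
    row("Net income", "310"),
]

# Layout B: the same figures, but with an extra header row on top.
layout_b = [
    "(in millions)",
    row("Revenues, net", "1,250"),
    row("Net income", "310"),
]

result_a = extract_with_template(layout_a, TEMPLATE)
result_b = extract_with_template(layout_b, TEMPLATE)

print(result_a)  # values land in the right fields
print(result_b)  # revenue is empty and net income holds the revenue figure
```

A single extra header row is enough to shift every value into the wrong field, and the extractor reports no error. Each new layout needs its own template.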

AI and Large Language Models

Large Language Models (LLMs) represent a significant leap over traditional OCR. These models power applications like Trellis for data extraction and ChatGPT for chatbots, and they are built on deep learning, a branch of artificial intelligence (AI) loosely inspired by how the human brain processes data. As a result, LLMs can interpret and generate text with close to human-level fluency, greatly improving data extraction from unstructured documents.

Flexibility: Unlike OCR, which struggles with slight variations in document layouts, LLMs excel in handling diversity. They can accurately extract information from documents with varied formats, reporting standards, and presentation styles. This ability extends to complex layouts without the need for predefined rules. 

Contextual Understanding: LLMs can grasp the text, its surrounding context, and how they relate to each other. This helps them understand complex narratives within documents. For example, when analyzing financial reports, you can use Trellis to run queries like:

  1. Numeric classification: “Quantify geopolitical supply chain risks related to China's position on Taiwan, expressed as a risk level between 1 and 10.”

  2. Text Description: “Describe cybersecurity threats and attacks in the last quarter.”

  3. Text Classification: “Classify the categories of the company (Technology, media, oil and gas, financial services, etc.)” 

  4. Numeric Extraction: “Extract basic shares outstanding from the Consolidated Statements of Income from the latest fiscal year-end date.”

  5. Numeric Extraction: “Extract diluted earnings per share from the Consolidated Statements of Income for the latest fiscal year-end.”

These advancements make LLM-based tools like Trellis invaluable for companies seeking deep, actionable insights from complex financial documents. Additionally, the ability to process large volumes of diverse, unstructured documents in parallel turns previously unusable data into insights.
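To make the numeric classification query above more concrete, here is a minimal sketch of how such a question could be turned into a prompt and how the answer could be validated. This is not how Trellis works internally. The prompt wording, the JSON reply format, the sample document text, and the `fake_llm` stand-in (which replaces a real model call) are all assumptions made for illustration.

```python
import json


def build_prompt(question: str, document_text: str) -> str:
    """Combine the question and the document into a single prompt."""
    return (
        "Answer the question using only the document below.\n"
        'Reply with JSON of the form {"answer": <value>}.\n\n'
        f"Question: {question}\n\nDocument:\n{document_text}"
    )


def parse_risk_level(raw_response: str) -> int:
    """Check that the reply contains an integer between 1 and 10."""
    value = json.loads(raw_response)["answer"]
    is_int = isinstance(value, int) and not isinstance(value, bool)
    if not is_int or not 1 <= value <= 10:
        raise ValueError(f"Expected an integer from 1 to 10, got {value!r}")
    return value


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns a canned reply."""
    return '{"answer": 7}'


question = (
    "Quantify geopolitical supply chain risks related to China's position "
    "on Taiwan, expressed as a risk level between 1 and 10."
)
# Made-up excerpt, not taken from a real filing.
document_text = "Risk Factors: many of our components come from suppliers in Taiwan."

prompt = build_prompt(question, document_text)
risk_level = parse_risk_level(fake_llm(prompt))
print(risk_level)
```

Validating the reply matters because a model can return out-of-range or non-numeric answers. Rejecting those early keeps bad values out of downstream analysis.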

Extracting Complex Data from Thousands of Reports

An Overwhelming Amount of Data

There are approximately 55,214 public companies worldwide. Of these, 5,704 are listed on the New York Stock Exchange (NYSE) and Nasdaq[1].

With traditional OCR and AI data extraction techniques, we typically extract all information from all documents. Most documents differ slightly, so extracting a specific data point often means writing a custom extractor, which requires expensive engineering work.

Even if we only want to extract and analyze the data from a few financial reports, this leaves us with the engineering work of making sense of this data and organizing it in a database.

Solution: Making Unstructured Data SQL-Queryable with Trellis

In most cases, we don’t want to extract all the data; we only want to answer a few questions about a set of documents. We also want to avoid spending precious engineering resources on custom extractors, which take a lot of time to maintain and don’t generalize to new extractions we might need in the future.

Instead of extracting all raw data from documents, Trellis allows users to answer a set of particular questions they have about them. These data extractions are called Transformations. You can create and run transformations via our dashboard or our API. Here’s how it works:

  1. Upload Your Documents: Upload the documents to be analyzed to the Trellis platform.

  2. Define Transformations: Next, you define transformations. Each transformation consists of:

    1. Column Type: Defines the SQL data type of the column in which Trellis will store the extracted data.

    2. Transformation Type: Specifies whether the operation is an extraction, classification, or generation. This setting guides our engine in selecting the appropriate model for data processing.

    3. Description: A plain-English description that tells the Trellis AI models what data to extract from the document.

  3. Run Queries: Once you run your transformations, Trellis applies them across all your documents simultaneously, extracting the answers into a SQL-queryable format.
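The three parts of a transformation can be pictured as a small data structure. The sketch below is not the Trellis API. The `Transformation` class, its field names, and the `create_table_sql` helper are hypothetical and only mirror the concepts described above (column type, transformation type, description), plus an assumed column name.

```python
from dataclasses import dataclass

ALLOWED_TYPES = {"extraction", "classification", "generation"}


@dataclass(frozen=True)
class Transformation:
    column_name: str          # assumed: name of the resulting SQL column
    column_type: str          # SQL data type of the column
    transformation_type: str  # extraction, classification, or generation
    description: str          # plain-English instruction for the model

    def __post_init__(self):
        if self.transformation_type not in ALLOWED_TYPES:
            raise ValueError(f"Unknown type: {self.transformation_type!r}")


def create_table_sql(table: str, transformations) -> str:
    """Build the table that would hold one row per document."""
    columns = ["document_id TEXT PRIMARY KEY"]
    columns += [f"{t.column_name} {t.column_type}" for t in transformations]
    return f"CREATE TABLE {table} ({', '.join(columns)})"


transformations = [
    Transformation(
        "diluted_eps", "REAL", "extraction",
        "Extract diluted earnings per share for the latest fiscal year-end.",
    ),
    Transformation(
        "sector", "TEXT", "classification",
        "Classify the company as Technology, Media, Oil and Gas, etc.",
    ),
    Transformation(
        "cyber_threats", "TEXT", "generation",
        "Describe cybersecurity threats and attacks in the last quarter.",
    ),
]

ddl = create_table_sql("filings", transformations)
print(ddl)
```

Each transformation maps to exactly one typed column, which is what makes the results queryable with ordinary SQL afterwards.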

Example Use Case: Imagine you want to analyze operational costs from various financial reports. You could configure a set of transformations, each populating a column with a different aspect of operational costs, such as total cost, types of costs, and cost changes over time. A few months later, you realize your model would become more precise with an additional data point. With Trellis, you can add this new extraction in just a few minutes and improve your model’s accuracy.
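Once the answers sit in SQL columns, adding a data point later amounts to adding a column. The sketch below uses Python's built-in `sqlite3` module with made-up figures. The table name, column names, and document IDs are hypothetical, and the values are inserted by hand here, whereas in practice they would come from the extraction step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Initial extraction: one row per report, one column per transformation.
conn.execute(
    "CREATE TABLE operating_costs ("
    "document_id TEXT PRIMARY KEY, fiscal_year INTEGER, total_cost REAL)"
)
conn.executemany(
    "INSERT INTO operating_costs VALUES (?, ?, ?)",
    [("acme-10k-2022", 2022, 120.0), ("acme-10k-2023", 2023, 150.0)],
)

# Months later: a new data point becomes one additional column.
conn.execute("ALTER TABLE operating_costs ADD COLUMN rd_cost REAL")
conn.executemany(
    "UPDATE operating_costs SET rd_cost = ? WHERE document_id = ?",
    [(30.0, "acme-10k-2022"), (75.0, "acme-10k-2023")],
)

# The new column is immediately available to regular SQL queries.
rd_share = conn.execute(
    "SELECT fiscal_year, rd_cost / total_cost "
    "FROM operating_costs ORDER BY fiscal_year"
).fetchall()
print(rd_share)
```

The existing rows and columns stay untouched, so earlier queries keep working while new ones can use the added column.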

This method streamlines the data extraction process and reduces the need for extensive manual review, making it easier to obtain actionable insights from large volumes of documents.

Conclusion

Traditional OCR and ICR methods are inadequate for extracting complex information from financial reports. Trellis overcomes these limitations using Large Language Models (LLMs) to transform unstructured data into SQL-queryable columns. This developer-friendly technology enables companies to provide their clients with deeper, more holistic, actionable financial analysis.

[1]: https://focus.world-exchanges.org/issue/february-2024/market-statistics

© 2024 Trellis. All rights reserved.