Predli Blog - RAG Series: Making Sense of Internal Data With GraphRAG

Introduction

Retrieval-Augmented Generation (RAG) systems enhance the capabilities of large language models (LLMs) by retrieving relevant information from external sources beyond the models' internal knowledge. This lets users pose questions whose answers are derived from a defined corpus of documents. In a typical Naïve RAG system, relevant text chunks are retrieved from a vector database using semantic similarity search and passed to an LLM as context to ground its responses. This approach works well when the answer can be found directly in the text, but struggles with questions that require understanding how the information fits into the broader context of the dataset.
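
As a rough illustration of the retrieval step in a Naïve RAG system, the sketch below ranks text chunks by cosine similarity to a query and keeps the top matches. The bag-of-words `embed` function is an assumed stand-in for a real embedding model, and the chunks are made-up examples rather than text from any real dataset.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    # Rank all chunks by similarity to the query and return the k best.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Made-up chunks for illustration only.
chunks = [
    "revenue increased due to data center demand",
    "the company leases office space in several countries",
    "data center revenue growth offset lower gaming revenue",
]
top_chunks = retrieve("data center revenue", chunks, k=2)
```

Only the retrieved chunks reach the LLM, so anything outside the top matches is invisible to it. That is the limitation the rest of this post is concerned with.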

GraphRAG, introduced by Microsoft in 2024, addresses these limitations by constructing a knowledge graph from the corpus. This graph captures entities, relationships, and semantic structures within the data, thereby preserving context and enabling a more organized and interpretable representation of the information. GraphRAG has been shown to outperform traditional Naïve RAG systems in scenarios that demand a broader understanding of the dataset. For an introduction to RAG and GraphRAG, see our previous blog post.

To explore its real-world applicability, Predli commissioned a master’s thesis project comparing GraphRAG with traditional RAG systems and assessing its potential to make sense of internal data. This blog post shares some of the key findings so far.

Constructing the Graph

The knowledge graph serves as a structured representation of the underlying data. Its primary goal is to capture relevant information in a way that reflects both the entities involved and the relationships between them. The graph is organized into Entities, Relationships, and Communities.

1. Entities encompass both tangible objects and abstract concepts, such as individuals, organizations, and locations. Each entity is defined by a title and a detailed description that offers clarity and contextual relevance.

2. Relationships are the contextual connections between entities. For instance, they may represent social ties among friends or professional associations between colleagues. Each relationship is defined with a description that explains how the two entities interact.

3. Communities are clusters of entities and their relationships organized around a shared theme. Each community is summarized in a thorough community report that outlines its core focus and significance.
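
To make the three components concrete, here is a minimal sketch of how they could be held in memory. The class and field names are our own assumptions for illustration, not the schema used by GraphRAG itself, and the example descriptions are placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    title: str
    description: str

@dataclass
class Relationship:
    source: str       # title of the first entity
    target: str       # title of the second entity
    description: str  # how the two entities interact

@dataclass
class Community:
    theme: str
    report: str
    members: list = field(default_factory=list)  # entity titles

@dataclass
class KnowledgeGraph:
    entities: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)
    communities: list = field(default_factory=list)

    def add_entity(self, entity):
        self.entities[entity.title] = entity

    def add_relationship(self, relationship):
        # Both endpoints must already exist as entities.
        for title in (relationship.source, relationship.target):
            if title not in self.entities:
                raise KeyError(f"unknown entity: {title}")
        self.relationships.append(relationship)

    def neighbors(self, title):
        # Titles of all entities directly connected to the given one.
        linked = set()
        for r in self.relationships:
            if r.source == title:
                linked.add(r.target)
            elif r.target == title:
                linked.add(r.source)
        return sorted(linked)

# Placeholder content for illustration only.
graph = KnowledgeGraph()
graph.add_entity(Entity("Amazon", "Technology company (illustrative description)"))
graph.add_entity(Entity("Data centers", "Computing facilities (illustrative description)"))
graph.add_relationship(
    Relationship("Amazon", "Data centers", "Amazon operates data centers (illustrative)")
)
graph.communities.append(
    Community("Cloud infrastructure", "Illustrative community report.", ["Amazon", "Data centers"])
)
```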

GraphRAG utilizes an LLM in an automated pipeline to systematically extract the graph components from the source documents. Once the graph has been constructed, hosting it in a dedicated graph database can make it easier to query, maintain, and integrate into downstream applications. To support this, Predli has partnered with Neo4j, a leading platform purpose-built for managing graph data at scale. Neo4j provides a dedicated Python package for GraphRAG, which streamlines the process.
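
One way to load extracted components into a graph database is to turn each record into a parameterized Cypher statement. The sketch below only builds the statements and their parameters. It does not connect to a database and does not use Neo4j's GraphRAG package. The `Entity` label, the `RELATED` relationship type, and the property names are assumptions chosen for illustration.

```python
def entity_statement(title, description):
    # MERGE avoids creating a duplicate node if the entity already exists.
    query = "MERGE (e:Entity {title: $title}) SET e.description = $description"
    return query, {"title": title, "description": description}

def relationship_statement(source, target, description):
    # Match both endpoints by title, then create the edge if it is missing.
    query = (
        "MATCH (a:Entity {title: $source}), (b:Entity {title: $target}) "
        "MERGE (a)-[r:RELATED]->(b) SET r.description = $description"
    )
    return query, {"source": source, "target": target, "description": description}

# Placeholder records for illustration only.
statements = [
    entity_statement("Amazon", "Illustrative description"),
    entity_statement("Data centers", "Illustrative description"),
    relationship_statement("Amazon", "Data centers", "Illustrative relationship"),
]
```

Each `(query, parameters)` pair could then be executed through a database session. Passing values as parameters instead of formatting them into the query string keeps entity text containing quotes from breaking the statement.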

Querying the Graph - Local and Global Search

Two distinct approaches to querying the graph are the Local method and the Global method.

The Local method builds upon the traditional vector approach while incorporating knowledge from the graph. It begins by performing an initial similarity search across the graph's entities to pinpoint those most relevant to the user's query. These identified entities serve as entry points for a graph traversal that collects adjacent entities, their relations, and associated community reports. This enriched context enables the LLM to generate more accurate and contextually grounded responses.
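
The Local method's steps can be sketched roughly as follows. This is a simplified illustration under our own assumptions: word overlap stands in for embedding similarity, the traversal is limited to one hop, and all entity descriptions and reports are placeholders.

```python
def similarity(query, text):
    # Toy Jaccard word overlap, standing in for embedding similarity.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0

def local_search_context(query, entities, relationships, community_of, reports, top_k=1):
    # 1. Similarity search over entity descriptions to find entry points.
    ranked = sorted(entities, key=lambda title: similarity(query, entities[title]), reverse=True)
    entry_points = ranked[:top_k]
    # 2. One-hop traversal: keep relationships that touch an entry point.
    selected_rels = [r for r in relationships if r[0] in entry_points or r[1] in entry_points]
    selected_entities = set(entry_points)
    for source, target, _ in selected_rels:
        selected_entities.update((source, target))
    # 3. Collect the community reports of every selected entity.
    selected_reports = {reports[community_of[e]] for e in selected_entities if e in community_of}
    return {
        "entry_points": entry_points,
        "entities": sorted(selected_entities),
        "relationships": selected_rels,
        "reports": sorted(selected_reports),
    }

# Placeholder graph content for illustration only.
entities = {
    "Amazon": "online retail and cloud company",
    "Data centers": "facilities that host cloud computing infrastructure",
    "Inflation": "general rise in prices",
}
relationships = [("Amazon", "Data centers", "Amazon operates data centers")]
community_of = {"Amazon": "c1", "Data centers": "c1", "Inflation": "c2"}
reports = {"c1": "Illustrative report on cloud infrastructure.", "c2": "Illustrative report on macro trends."}

context = local_search_context(
    "retail company with cloud infrastructure", entities, relationships, community_of, reports
)
```

The returned dictionary is what would be formatted into the LLM prompt as context. Note that the unrelated "Inflation" entity and its community report are left out.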

In contrast, the Global method takes a holistic approach. Rather than relying solely on local entry points such as individual entities, it systematically filters and processes community summaries to build a broader contextual understanding of the query. By extracting key facts from these summaries, the algorithm generates a set of analytical insights that contribute to a more comprehensive understanding of the complete dataset. The Global method is advantageous for questions that demand a wide-ranging perspective of the data.
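
The Global method can be pictured as a map step over community summaries followed by a reduce step that keeps the most relevant key points. The sketch below is a simplification under our own assumptions: `map_step` uses word overlap where a real system would call an LLM, the final answer-generation call is omitted, and the community reports are invented placeholders.

```python
def words(text):
    # Lowercased words with surrounding punctuation stripped.
    return {w.strip(".,?!").lower() for w in text.split()}

def map_step(query, report):
    # Stand-in for an LLM call that extracts scored key points from one report.
    q = words(query)
    points = []
    for sentence in report.split(". "):
        score = len(q & words(sentence))
        if score > 0:
            points.append((score, sentence.strip(". ")))
    return points

def global_search_points(query, community_reports, max_points=3):
    # Map: extract key points from every community report.
    all_points = []
    for report in community_reports:
        all_points.extend(map_step(query, report))
    # Reduce: keep the highest-scoring points as context for the final answer.
    all_points.sort(key=lambda p: p[0], reverse=True)
    return [text for _, text in all_points[:max_points]]

# Invented community reports for illustration only.
community_reports = [
    "This community covers Nvidia and its data center revenue. Demand for GPUs drives results.",
    "This community covers office leases. Lease terms vary by country.",
]
key_points = global_search_points("Which companies report data center revenue?", community_reports)
```

Because every community report is considered, the answer can draw on the whole dataset instead of a handful of retrieved chunks, at the cost of more processing per query.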

GraphRAG vs Naïve RAG

To compare the two systems, we used a curated dataset of Form 10-Q reports submitted to the Securities and Exchange Commission between 2022 and 2023. This collection comprises reports from the technology companies Nvidia, Apple, Amazon, Microsoft, and Intel. Recognizing that a clear understanding of the underlying data is essential, we assessed each system’s comprehension by posing a question directly related to the dataset’s content.

“What companies exist in the corpus?”

Table 1: Comparison of query responses to the question “What companies exist in the corpus?”

GraphRAG answers the question correctly, identifying and listing all five companies present in the text corpus. The Naïve RAG approach falls short because its retrieved context does not cover all of them. It correctly references Intel but mistakenly identifies it as the primary focus. It also mentions Brookfield Asset Management, which appears only briefly in a few reports and is not one of the five companies featured in the dataset. GraphRAG, on the other hand, correctly interprets the question as asking for the companies whose reports make up the dataset.

This contrast highlights a fundamental difference in methodology. The Naïve RAG retrieves a few isolated text snippets based on the query “What companies exist in the corpus?”, which leads to an incomplete understanding and limited context. As a result, it misses key information and misjudges relevance. By using the community summaries, GraphRAG gains a holistic view that allows it to identify the most relevant companies with precision. Since the dataset consists of reports from five major technology firms, GraphRAG’s comprehensive approach makes it both effective and reliable at pinpointing them.

The accompanying graph visualization exemplifies this concept. Amazon, which is mentioned frequently in the text corpus, appears as an entity with a robust network of relationships. These connections, which include topics such as business license requirements, data centers, international market dynamics, and retail operations, collectively underscore its prominence. Modeling these associations in a knowledge graph enables GraphRAG to derive an understanding of Amazon's contextual importance, a capability that Naïve RAG systems lack.

Scenario: Relating Financial Performance Trends to Market Events

Imagine you are a financial analyst preparing a briefing to explain the underlying factors influencing quarterly financial results. Beyond the raw numbers, leadership wants to understand how shifts in revenue, margins, and cash flow map to the market‑moving events described in the very same Form 10‑Q filings, such as supply chain constraints, inflationary pressure, or new trade restrictions. To ground that analysis, you ask your RAG system:

“How do financial performance trends relate to market events in this dataset?”

Table 2: Comparison of query responses to the question “How do financial performance trends relate to market events in this dataset?”

The Naïve RAG retrieves isolated text segments mentioning terms such as net income, market volatility, and fair-value measurements. Because these segments are disconnected from their broader textual context, the Naïve RAG struggles to associate them with a specific company. This issue is evident in one of the highlighted excerpts, where the Naïve RAG states, “For instance, the net income stood at $768 million for a specific quarter…” without identifying the company, the quarter, or providing insight into the net income trend. Consequently, the response does not provide any actionable insights.

In contrast, GraphRAG can connect companies in the corpus with specific events (e.g., supply chain disruptions, regulatory changes, shifts in market share) and relevant metrics (such as revenue and R&D spending). As a result, GraphRAG can generate insights such as:

• Nvidia ⇒ Market share ↑ ⇒ Revenue ↑

• Microsoft ⇒ Inflation ↑ ⇒ Pricing strategy ↻ (revise)

Since the relationships are explicit in the graph, GraphRAG can discover cause-and-effect relationships, which are precisely the kinds of insights a financial analyst or CFO requires when linking performance trends to real-world events.
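
To show why explicit relationships make such chains easy to surface, the sketch below walks a small directed graph and prints every chain that starts at a given company. The graph simply encodes the two example insights above with ASCII labels. The traversal function is our own illustration, not GraphRAG's internal algorithm.

```python
def chains_from(graph, start):
    # Depth-first walk returning every path from start to a node with no further links.
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        # Skip nodes already on the path so cycles cannot loop forever.
        successors = [n for n in graph.get(path[-1], []) if n not in path]
        if not successors:
            paths.append(path)
        for nxt in successors:
            stack.append(path + [nxt])
    return paths

# Edges taken from the two example insights in this post.
graph = {
    "Nvidia": ["Market share up"],
    "Market share up": ["Revenue up"],
    "Microsoft": ["Inflation up"],
    "Inflation up": ["Pricing strategy revised"],
}

nvidia_chains = [" => ".join(p) for p in chains_from(graph, "Nvidia")]
microsoft_chains = [" => ".join(p) for p in chains_from(graph, "Microsoft")]
```

A Naïve RAG system has no equivalent structure to walk, which is why it struggles to tie a metric back to a company or an event.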

Final Thoughts

By leveraging a knowledge graph, GraphRAG can support a more structured and coherent interpretation of complex textual data. This approach offers practical advantages over traditional RAG systems, particularly when addressing analytical questions that involve connecting information to a broader textual context. When analysis requires more than surface-level retrieval, GraphRAG offers a scalable solution for extracting actionable intelligence from unstructured text.
