
Retrieval-Augmented Generation (RAG): Cost Calculator for OpenAI

  • Writer: Dan Carotenuto, Jason Harvey
  • Apr 1
  • 16 min read

Updated: Apr 13

Code Walkthrough


Calculate proprietary LLM cost for your generative AI app with and without Retrieval-Augmented Generation (RAG)

Level: Beginner - Intermediate

Languages: Python, Google Colab Notebook


Retrieval-Augmented Generation (RAG) provides a smarter way to use Large Language Models (LLMs) with local knowledge sources while reducing expenses, improving accuracy, enabling scalability, and making generative AI apps economically viable.


This notebook shows how RAG can be used to help manage costs of proprietary LLMs, specifically OpenAI. You won't need any licenses or paid subscriptions. See our blog, "AI Done Right: How RAG Saves Money & Delivers Results" for a discussion of the business value of RAG in an LLM app.


The new ChatGPT “Internal Knowledge” feature from OpenAI (in beta as of this writing) likely uses a RAG-style framework, although OpenAI has not publicly disclosed details about the feature's underlying technologies.


What is RAG?

RAG is an AI framework that enhances LLMs by retrieving relevant business information before generating responses. Instead of solely relying on a model's pre-trained knowledge, RAG pulls in real-time, business-specific information—typically in text documents—to provide accurate, contextual, and cost-effective answers.


Unlike traditional LLMs that require frequent fine-tuning, RAG dynamically retrieves information from local knowledge stores, allowing businesses to:


  • Reduce AI token usage (lower API costs for proprietary LLMs)

  • Improve accuracy (minimize AI hallucinations)

  • Integrate real-time company data (no need for constant re-training)

  • Ensure compliance & data security (keep proprietary data internal)


Cost without RAG for OpenAI

Figure 1, "Yearly Cost Forecast: OpenAI LLM API - RAG vs Without RAG," was created using the notebook in this blog and shows the impact RAG can have on proprietary LLMs.


Figure 1, "Yearly Cost Forecast: OpenAI LLM API - RAG vs Without RAG" shows the impact RAG can have on proprietary LLMs.
Figure 1, "Yearly Cost Forecast: OpenAI LLM API - RAG vs Without RAG" shows the impact RAG can have on proprietary LLMs.

RAG is a Two-step Process

RAG is fundamentally a two-step process: retrieval and generation.


Retrieval:

  • Business knowledge is stored in a vector database (Pinecone, Weaviate, ChromaDB) that indexes and organizes documents as numerical embeddings.

  • When a user submits a query, RAG retrieves only the most relevant information, reducing AI token consumption.

Generation:

  • The retrieved knowledge is appended to the AI prompt before generating a response.

  • This ensures the LLM is context-aware, improving accuracy and minimizing hallucinations.
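
Under simplified assumptions, this two-step flow can be sketched in a few lines of Python. The helper names (embed, search_index, call_llm) are placeholders for whatever embedding model, vector index, and LLM API you use, not a specific vendor's API:

# Minimal sketch of the RAG flow (placeholder helpers, not a specific vendor API)
def answer_with_rag(query, chunks, index, embed, search_index, call_llm, top_k=2):
    # Retrieval: embed the query and pull only the most relevant chunks
    query_vector = embed(query)
    best_chunk_ids = search_index(index, query_vector, top_k)
    context = "\n\n".join(chunks[i] for i in best_chunk_ids)

    # Generation: prepend the retrieved context to the prompt sent to the LLM
    prompt = f"Context: {context}\n\nQuery: {query}\nAnswer: "
    return call_llm(prompt)

Only the retrieved chunks travel to the LLM, which is why the input token count (and cost) drops compared to sending the entire knowledge base.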


As shown in Figure 2, "Retrieval-Augmented Generation (RAG) Costs Analysis," using a RAG framework can result in significant cost savings for generative AI apps that use proprietary LLMs.


Figure 2: Retrieval-Augmented Generation (RAG) Cost Analysis - A RAG framework can result in significant cost savings for generative AI apps that use a proprietary LLM.

Business Case / Problem Summary


Market researchers at an organization need to analyze financial reports from competitors the day they are published. They want to ask questions about the reports instead of having to read through them. Sample queries are as follows:


  • What was the revenue for the first quarter of fiscal 2025?

  • Can you explain the Spacejet platform mentioned in the report?

  • What are the expected revenue projections for the second quarter of fiscal 2025?


The organization wants to make these reports available through a hosted generative AI application based on OpenAI, but also needs to understand corresponding usage costs to help manage them effectively.


Objective

The objective of this notebook is to build an analysis tool for OpenAI's usage costs for input tokens, with and without RAG. The tool can then be leveraged to create LLM cost management strategies, including:


  • Create LLM API cost forecasts and budgets

  • Set pre-emptive LLM API cost alerts (a simple alert sketch follows this list)

  • Optimize LLM user queries with caching and batching techniques
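
As an illustration of the alerting idea, the sketch below compares a period's forecasted spend to a budget threshold. It assumes a monthly summary DataFrame shaped like the cost_summary() output built later in this notebook; the budget value is purely hypothetical:

# Hypothetical budget alert check (column names follow this notebook's cost_summary() output)
MONTHLY_BUDGET_USD = 100  # example budget, not a recommendation

def check_budget_alerts(monthly_summary, cost_column="Total Cost with RAG"):
    """Print and return the months whose forecasted cost exceeds the budget."""
    over_budget = monthly_summary[monthly_summary[cost_column] > MONTHLY_BUDGET_USD]
    for _, row in over_budget.iterrows():
        print(f"ALERT: {row['Month']} forecast ${row[cost_column]:,} exceeds the "
              f"${MONTHLY_BUDGET_USD:,} budget")
    return over_budget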


NOTE: This example will focus on OpenAI GPT models and can be extended to evaluate other proprietary LLMs like Anthropic Claude models. A separate notebook will focus on reasoning models like OpenAI's o1 and o3-mini, which behave differently.


Key Questions

  • What is the cost difference when using RAG?

  • What is the magnitude of the cost difference when using RAG?

  • What would the cost difference look like over a 1 year period?

  • How much would be saved?


Solution

Identifying usage costs of proprietary LLM APIs requires accurately counting input tokens—tokens sent as input to the hosted LLM API—and output tokens—tokens returned as output by the hosted LLM API. Output tokens are typically more expensive than input tokens. Different LLMs do not count tokens the same way and, as a result, token counts for the same text may differ between LLMs. OpenAI defines tokens as follows:


Tokens can be thought of as pieces of words. Before the API processes the request, the input is broken down into tokens. These tokens are not cut up exactly where the words start or end - tokens can include trailing spaces and even sub-words. Here are some helpful rules of thumb for understanding tokens in terms of lengths:


  • 1 token ~= 4 chars in English

  • 1 token ~= ¾ words

  • 100 tokens ~= 75 words
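
To see these rules of thumb in action, here is a quick sketch using tiktoken (installed in the walkthrough below); the sample sentence is arbitrary:

import tiktoken

# Count tokens for a short sample sentence using the encoding behind the GPT-4o family
encoding = tiktoken.get_encoding("o200k_base")
sample = "Retrieval-Augmented Generation reduces the tokens sent to the API."
tokens = encoding.encode(sample)
print(len(sample), "characters ->", len(tokens), "tokens")  # roughly 4 characters per token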



Notebook Walkthrough

The notebook uses the following steps to ensure it is self-contained and can be run without any additional configuration:



1. Set up Required Libraries, Imports and Defaults

# Necessary Libraries
import warnings
warnings.filterwarnings('ignore')
import importlib.util
import subprocess
import sys

import pandas as pd
import numpy as np
from datetime import datetime
from pandas.tseries.offsets import BDay
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

Embeddings Related Imports

# Installs embeddings related libraries if not already installed
def is_library_installed(library_name):
    return importlib.util.find_spec(library_name) is not None

def ensure_library_installed(library_name):
    if not is_library_installed(library_name):
        print(f"{library_name} not found. Installing...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", library_name])
        print(f"{library_name} installed successfully.")
    else:
        print(f"{library_name} is already installed.")

# Confirm tiktoken library installed
ensure_library_installed("tiktoken")

# Confirm FAISS library installed
ensure_library_installed("faiss-cpu")

Import tiktoken to count OpenAI tokens
# Setting up OpenAI tiktoken library to count tokens
import tiktoken

Import FAISS

We will use Facebook AI Similarity Search (FAISS) to search through the embeddings version of our internal knowledge document. In a production implementation you would have your knowledge documents stored in a vector database and you would use the respective vendor's API to search and retrieve relevant parts of the document. FAISS is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other. It solves limitations of traditional query search engines that are optimized for hash-based searches, and provides more scalable similarity search functions.

import faiss
# Test FAISS import by printing the version of FAISS
# print(faiss.__version__)
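
As a quick, self-contained illustration of what a FAISS similarity search returns (random vectors only; the real document embeddings are built later in this notebook):

# Illustrative only: index 5 random 8-dimensional vectors and query the nearest neighbors
rng = np.random.default_rng(0)
demo_vectors = rng.random((5, 8), dtype=np.float32)
demo_index = faiss.IndexFlatL2(8)        # exact L2-distance index
demo_index.add(demo_vectors)
distances, indices = demo_index.search(demo_vectors[:1], 2)
print(indices)  # nearest neighbors of the first vector (itself first, with distance 0)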

Import Embedding libraries

This app will use the "all-MiniLM-L6-v2" model from Hugging Face to convert the document into embeddings. It is a sentence-transformers model that maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

from transformers import AutoTokenizer, AutoModel
import torch
Defaults and Constants
# Notebook Constants
llm_role                   = "user"

# --- Default Cost Rates (USD per token) ---
DEFAULT_COST_RATES = {
    "openai": {
        "gpt-4o-2024-08-06"      : 0.0000025,    # $2.50 per 1M tokens
        "gpt-4o-mini-2024-07-18" : 0.00000015    # $0.15 per 1M tokens
    },
}
gpt_4o = list(DEFAULT_COST_RATES["openai"].keys())[0]
gpt_4o_mini = list(DEFAULT_COST_RATES["openai"].keys())[1]
llm_word_output_limit = 100     # Used in the prompt to limit output words 
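
A quick sanity check on the rate constants: multiplying each per-token rate by one million tokens should reproduce the per-1M prices noted in the comments above.

# Sanity check: per-token rates scaled back to per-1M-token prices
for model_name, rate in DEFAULT_COST_RATES["openai"].items():
    print(f"{model_name}: ${rate * 1_000_000:.2f} per 1M input tokens")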

2. Set up a Token Calculator


Define a function to count tokens for OpenAI LLM API GPT Models (not reasoning models)

# Function to count tokens for OpenAI GPT models API
#   input: GPT models formatted message (role, content)
#   output: number of tokens
#   Supported models for this notebook: 
#      gpt-4o-mini (gpt-4o-mini-2024-07-18), gpt-4o (gpt-4o-2024-08-06)
def get_openai_messages_token_count(messages, model="gpt-4o-mini-2024-07-18"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using o200k_base encoding.")
        encoding = tiktoken.get_encoding("o200k_base")
    if model in {
        "gpt-4o-mini-2024-07-18",     # gpt-4o-mini
        "gpt-4o-2024-08-06"           # gpt-4o
        }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. "
              "Returning num tokens assuming gpt-3.5-turbo-0125.")
        return get_openai_messages_token_count(messages, model="gpt-3.5-turbo-0125")
    elif "gpt-4o-mini" in model:
        print("Warning: gpt-4o-mini may update over time. " 
              "Returning num tokens assuming gpt-4o-mini-2024-07-18.")
        return get_openai_messages_token_count(messages, model="gpt-4o-mini-2024-07-18")
    elif "gpt-4o" in model:
        print("Warning: gpt-4o and gpt-4o-mini may update over time. "
              "Returning num tokens assuming gpt-4o-2024-08-06.")
        return get_openai_messages_token_count(messages, model="gpt-4o-2024-08-06")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. "
              "Returning num tokens assuming gpt-4-0613.")
        return get_openai_messages_token_count(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""get_openai_messages_token_count() is not implemented for model {model}."""
        )

    # If a single message is passed (dict), wrap it in a list
    if isinstance(messages, dict):
        messages = [messages]

    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
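
A quick usage example of the counter on a single hypothetical message (the question text is arbitrary):

# Example: count tokens for one sample message in OpenAI (role, content) format
example_message = {"role": "user", "content": "What was the revenue for the first quarter of fiscal 2025?"}
print(get_openai_messages_token_count([example_message], model=gpt_4o_mini))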

Define a Function to get number of tokens in a GPT Model message for all supported models

# --- Main Function: Returns a Pandas DataFrame ---
def get_token_costs_for_all_models(text: dict, cost_rates: dict = DEFAULT_COST_RATES):
    rows = []
    for provider, models in cost_rates.items():
        for model_prefix, rate in models.items():
            try:
                if provider == "openai":
                    token_count = get_openai_messages_token_count(text, model_prefix)
                else:
                    continue  # Skip unknown provider
                token_cost = token_count * rate
                rows.append({
                    "provider": provider,
                    "model": model_prefix,
                    "token_count": token_count,
                    "token_cost": round(token_cost, 6)
                })
            except Exception as e:
                print(f"Error processing {provider}/{model_prefix}: {e}")
    return pd.DataFrame(rows)
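
As the note above mentions, this structure can be extended to other proprietary LLMs. A hedged sketch of what that might look like, with a placeholder model name, a placeholder rate, and a rough character-based counter (not Anthropic's official tokenizer or current pricing; use the vendor's own token-counting API and published rates in practice):

# Hypothetical extension to another provider (placeholder model name and rate)
EXTENDED_COST_RATES = {
    **DEFAULT_COST_RATES,
    "anthropic": {
        "claude-example-model": 0.0   # placeholder: fill in the vendor's per-token input rate
    },
}

def get_approximate_token_count(messages):
    """Rough estimate using the ~4 characters per token rule of thumb."""
    if isinstance(messages, dict):
        messages = [messages]
    total_chars = sum(len(m.get("content", "")) for m in messages)
    return total_chars // 4

# get_token_costs_for_all_models() would also need a branch for the new provider that
# calls get_approximate_token_count() (or the vendor's token-counting API) instead of
# get_openai_messages_token_count().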

Set up an empty pandas DataFrame to load the number of tokens for each provider/model

# --- Setup: Initialize an empty DataFrame ---
token_costs_df = pd.DataFrame(columns=["provider", "model", "token_count", "token_cost"])

Get the number of tokens for a message and load them into the dataframe

# --- Function to load new token cost data into the DataFrame ---
def load_token_costs(text, df):
    new_data = get_token_costs_for_all_models(text)

    # Optional: Filter duplicates if needed
    df = df[~df["model"].isin(new_data["model"])]  # Remove existing rows for same models

    # Append new results
    return pd.concat([df, new_data], ignore_index=True)

3. Create a Sample Local Knowledge Base


We will now set up a local document to search. For this example, we’ll use a sample financial report based on a fictitious company called MyFakeCompany. See the downloadable notebook on GitHub, "Retrieval-Augmented Generation (RAG) Cost Analysis with OpenAI," for the full version of the document.

knowledge_base_text = """
MyFakeCompany Announces Financial Results for First Quarter Fiscal 2025
• Record quarterly revenue of $26.0 billion, up 18% from Q4 and up 262% from a year ago
• Record quarterly Data Center revenue of $22.6 billion, up 23% from Q4 and up 427% from a year ago
• Quarterly cash dividend raised 150% to $0.01 per share on a post-split basis

...

Print out the total number of characters in the document.

print("Total characters in the knowledge base source text = ",len(knowledge_base_text))
Total characters in the knowledge base source text = 16937

4. Convert the Knowledge Base to Embeddings


To retrieve specific segments of documents rather than entire documents, we can implement a process known as chunking. This involves dividing large documents into smaller, semantically coherent segments, allowing the system to retrieve and process only the most relevant portions in response to a query.

# Chunking the document into sections
chunks = knowledge_base_text.split('\n\n')

Print out the total number of embedding chunks.

print("Total embedding chunks created:",len(chunks))
Total embedding chunks created: 18

Configure the Embedding Tokenizer and model using a Hugging Face Model

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

Define a function to convert the text document to an embedding

# Function to convert the text document to an embedding
def get_embedding(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    # Get the embeddings from the model
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling to get the sentence embedding
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings[0].numpy()
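
A quick sanity check (the sample sentence is arbitrary) confirming the 384-dimensional output mentioned above:

# all-MiniLM-L6-v2 should produce a 384-dimensional vector
sample_embedding = get_embedding("MyFakeCompany reported record quarterly revenue.")
print("Embedding shape:", sample_embedding.shape)  # expected: (384,)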

Generate vector embeddings for each chunk using a suitable embedding model. This allows the system to represent each chunk in a high-dimensional space, facilitating efficient similarity searches.

# Generate embeddings for each chunk
chunk_embeddings = [get_embedding(chunk) for chunk in chunks]

Store the embeddings in a FAISS index to enable rapid similarity searches. This setup allows the system to quickly identify and retrieve the most relevant chunks in response to a query.

# Index the embeddings with FAISS
embedding_dim = chunk_embeddings[0].shape[0]
index = faiss.IndexFlatL2(embedding_dim)
index.add(np.array(chunk_embeddings))

Define a function to retrieve relevant chunks: (Retrieval part of RAG). This function retrieves the most relevant document chunk based on the user’s query.

  • For a given user query, convert the query into an embedding and search the FAISS index to find the most similar chunks.

  • Retrieve these chunks to provide context for generating accurate responses.

def retrieve_chunks(query, top_k=1):
    # Generate embedding for the query
    query_embedding = get_embedding(query)
    # Search for the nearest chunks in the index
    distances, indices = index.search(np.array([query_embedding]), k=top_k)
    # Retrieve the corresponding chunks
    retrieved_chunks = [chunks[idx] for idx in indices[0]]
    return retrieved_chunks

5. Convert User Query to OpenAI RAG Message

Define a function that creates the RAG-enabled content part of an OpenAI GPT model message

Key Points

  • Search and extract only the relevant parts of the knowledge base based on a user query. This is the Retrieval part of RAG

  • Create an input prompt as text based on the search results

  • Create an input message (role + content) based on the input prompt

  • The resulting message will be sent as input text to OpenAI and should be significantly smaller than the entire knowledge base

# Function to generate an LLM prompt based on the embedding chunks
def generate_llm_prompt(query):
    # Retrieve the most relevant chunks
    retrieved_chunks = retrieve_chunks(query, top_k=2)
    # Combine retrieved chunks to form the context
    context = "\n\n".join(retrieved_chunks)
    instructions = (f"Keep the answer succinct, limit to 1 sentence, "
                   f"total words to {llm_word_output_limit}")
    # Create a prompt combining the query and the retrieved context
    prompt = (f"Context: {context}\n\nQuery: {query}"
             f"\nInstructions: {instructions}.\nAnswer: ")
    return prompt

6. Create Synthetic Usage Data and Sample Queries


Synthetic Data Assumptions

  • Number of daily users that run LLM requests randomly ranges from 50 to 200

  • Number of daily LLM run requests for each user randomly ranges from 15 to 25

  • Data is created for each business day (excluding US federal holidays) from January 1, 2025 to December 31, 2025

# Function to create synthetic usage data
# - Number of daily users that run LLM requests randomly range from 50 to 200
# - Number of daily LLM run requests for each user randomly ranges from 15 to 25
# - Data is for business days, excludes US federal holidays, Jan 1 2025 to Dec 31 2025
def generate_llm_app_usage(start_date="2025-01-01", end_date="2025-12-31", seed=42):
    np.random.seed(seed)

    # Define the US Federal Holiday Calendar
    us_holidays = USFederalHolidayCalendar().holidays(start=start_date, end=end_date)

    # Create a Custom Business Day offset excluding US federal holidays
    custom_bday = CustomBusinessDay(holidays=us_holidays)

    # Generate date range for business days excluding US federal holidays
    dates = pd.date_range(start=start_date, end=end_date, freq=custom_bday)

    # Generate random average number of runs per user between 15 and 25 (inclusive)
    avg_runs = np.random.randint(15, 26, size=len(dates))

    # Simulate number of users per day (between 50 and 200)
    num_users = np.random.randint(50, 201, size=len(dates))

    # Create DataFrame
    llm_app_usage = pd.DataFrame({
        "date": dates,
        "average_number_runs_per_user": avg_runs,
        "number_of_users": num_users
    })

    return llm_app_usage

# Generate synthetic usage data
llm_app_usage = generate_llm_app_usage()

Show synthetic usage data information

llm_app_usage.describe()

This should show the following:

  • The synthetic usage data covers 250 business days

  • The "Average Number of LLM Queries Per User" is randomized with an average of 20 runs per day, a minimum of 15 and a maximum of 25 runs per day.

  • The "Number of Users" running LLM queries each day is randomized with an average of 127 users, a minimum of 50 and a maximum of 200.


Define a function to forecast usage costs of an OpenAI query against a local knowledge base for 1 year with and without RAG


Forecast Usage Costs Assumptions

  • Use the same RAG-based prompt for all LLM run requests to maintain cost uniformity

  • Leverage synthetic usage data

# Function to calculate LLM API input token usage costs based on synthetic usage data
def calculate_total_token_usage_and_costs(message_text,
                        knowledge_base_text,
                        llm_app_usage,
                        token_costs_func=get_token_costs_for_all_models):

    # Get per-run token cost data for the Message text
    text = message_text
    message_token_costs_df = token_costs_func(text)

    # Update the DataFrame in place and set token columns to "message" token columns
    message_token_costs_df.rename(columns={'token_count': 'message_token_count'},
                        inplace=True)
    message_token_costs_df.rename(columns={'token_cost': 'message_token_cost'},
                        inplace=True)

    # Get per-run token cost data for the Knowledge Base text
    text = knowledge_base_text
    kb_token_costs_df = token_costs_func(text)

    # Update DataFrame in place, set token columns to "knowledge_base" token columns
    kb_token_costs_df.rename(columns={'token_count': 'kb_token_count'}, inplace=True)
    kb_token_costs_df.rename(columns={'token_cost': 'kb_token_cost'}, inplace=True)

    # Columns to append 
    cols_to_append = ['kb_token_count', 'kb_token_cost']

    # Perform the merge
    token_costs_df = pd.merge(message_token_costs_df, 
					kb_token_costs_df[['provider','model'] + cols_to_append],
					on=['provider', 'model'], 
					how='left')

    # Add a dummy key for cross join
    llm_app_usage['key'] = 1
    token_costs_df['key'] = 1

    # Cross join usage with token costs per model
    usage_expanded = pd.merge(llm_app_usage, 
					   token_costs_df, on='key').drop(columns=['key'])

    # Calculate total runs per day
    usage_expanded['total_runs'] = ( 
		usage_expanded['average_number_runs_per_user'] 
		* usage_expanded['number_of_users'])

    # Compute total Message token usage and cost per model per day
    usage_expanded['message_total_tokens'] = (
		usage_expanded['total_runs'] 
		* usage_expanded['message_token_count'])

    usage_expanded['message_total_cost'] = (
		usage_expanded['total_runs'] 
		* usage_expanded['message_token_cost'])

    # Compute total Knowledge Base token usage and cost per model per day
    usage_expanded['kb_total_tokens'] = ( 
		usage_expanded['total_runs'] 
		* usage_expanded['kb_token_count'])

    usage_expanded['kb_total_cost'] = (
		usage_expanded['total_runs'] 
		* usage_expanded['kb_token_cost'])

    # Select and order columns for output
    usage_summary = usage_expanded[[
        'date', 'provider', 'model',
        'average_number_runs_per_user', 'number_of_users',
        'total_runs', 'message_token_count', 'message_token_cost',
        'kb_token_count', 'kb_token_cost',
        'message_total_tokens', 'message_total_cost',
        'kb_total_tokens', 'kb_total_cost'
    ]].copy()

    usage_summary['Year'] = usage_summary['date'].dt.to_period('Y')
    usage_summary['Quarter'] = usage_summary['date'].dt.to_period('Q')
    usage_summary['Month'] = usage_summary['date'].dt.to_period('M')
    usage_summary['Week'] = usage_summary['date'].dt.to_period('W')

    return usage_summary

Create sample user queries

# Create sample user queries
queries = [
    "What was MyFakeCompany's revenue for the first quarter of fiscal 2025?",
    "Can you explain the Spacejet platform mentioned in the report?",
    "What are the expected revenue projections for the second quarter of fiscal 2025?",
    "Can you explain the Odin platform?"
]

7. Generate a RAG Message from a User Query

# Generate an LLM prompt based on a sample query
query = queries[0]
rag_prompt = generate_llm_prompt(query)

8. Forecast 1 year of OpenAI API Cost for a Query


Define a function to expand the input prompt into an input message: convert the RAG-enabled "content" and the knowledge base "content" to the OpenAI GPT model message format (role, content)


Key Points:

  • The OpenAI GPT model API requires that input be configured as a "message" with "role=[role]" and "content=[prompt]" key-value pairs.

  • This will ensure accurate token counts and corresponding token cost.

  • This resulting token count will be higher than the total number of tokens derived from only the prompt text.

# Get LLM API usage costs for the current prompt and full knowledge base text
#  - Requires adding the role format
def get_usage_cost(llm_role, rag_prompt, knowledge_base_prompt):
    rag_message = [
        {
            "role": llm_role,
            "content": rag_prompt,
        },
    ]
    kb_message = [
        {
            "role": llm_role,
            "content": knowledge_base_prompt,
        },
    ]
    usage_costs_df = calculate_total_token_usage_and_costs(rag_message, kb_message,
                                                           llm_app_usage)
    return usage_costs_df

Start Calculating Tokens and Cost

  • Calculate the tokens, token costs and usage data for the RAG message and the knowledge base message

  • Load the results into a dataframe for analysis

usage_costs_df = get_usage_cost(llm_role, rag_prompt, knowledge_base_text)


9. Observations and Insights


Define a function that prepares a dataframe for analysis including aggregating token costs by period

# Define a function that prepares a dataframe for aggregating token costs by period
def cost_summary(period="Month", gpt_model="gpt-4o-2024-08-06"):
    period = period.lower()
    match period:
        case "year":
          period = "Year"
        case "quarter":
          period = "Quarter"
        case "month":
          period = "Month"
        case "week":
          period = "Week"
    df = usage_costs_df.copy()

    # Filter on the GPT model
    filtered_df = df[df['model'] == gpt_model]

    # Group by period and sum both cost columns
    cost_summary = (
        filtered_df.groupby(period).agg(
            message_token_count=('message_token_count', 'max'),
            message_token_cost=('message_token_cost', 'max'),
            kb_token_count=('kb_token_count', 'max'),
            kb_token_cost=('kb_token_cost', 'max'),
            message_total_cost=('message_total_cost', 'sum'),
            kb_total_cost=('kb_total_cost', 'sum')
        ).reset_index()    )

    # Round the cost columns to integers
    cost_summary['message_total_cost'] = ( cost_summary['message_total_cost']
                                             .round(0).astype(int) )
    cost_summary['kb_total_cost'] = ( cost_summary['kb_total_cost']
                                        .round(0).astype(int) )

    # Rename columns
    cost_summary.rename(columns={
        'message_total_cost': 'Total Cost with RAG',
        'kb_total_cost': 'Total Cost without RAG',
        'message_token_cost': '1 Query Cost with RAG',
        'kb_token_cost': '1 Query Cost without RAG',
        'message_token_count': '1 Query Tokens with RAG',
        'kb_token_count': '1 Query Tokens without RAG' }, inplace=True)

    # Calculate cost without RAG (knowledge base) relative ratio
    cost_summary['Cost without RAG Relative Ratio'] = (
                      cost_summary['1 Query Cost without RAG'] 
                      / cost_summary['1 Query Cost with RAG'] )

    # Calculate percent change from 'Message Cost' to 'Cost without RAG'
    cost_summary['Cost Percent Change'] = (
                 ((cost_summary['1 Query Cost without RAG'] 
                   - cost_summary['1 Query Cost with RAG']) 
                     / cost_summary['1 Query Cost with RAG'])  * 100 )

    # Optional: Round for readability
    cost_summary['Cost without RAG Relative Ratio'] = (
                    cost_summary['Cost without RAG Relative Ratio'].round(2))
    cost_summary['Cost Percent Change'] = ( 
                  cost_summary['Cost Percent Change'].round(2))
    return cost_summary

Define a function that creates a bar chart of token costs with options for period and sorting by a metric.

# Define a function that creates a bar chart of token costs by period
def plot_costs_comparison( period='Month',sorted_by='none'):
    summary_df = cost_summary(period)
    match sorted_by:
        case 'none':
            df_long = summary_df
        case 'Total Cost without RAG':
            # Sort DataFrame by 'Cost without RAG' descending
            summary_df = summary_df.sort_values(by=sorted_by, ascending=False)

    # Melt for seaborn
    df_long = pd.melt(
        summary_df,
        id_vars=period,
        value_vars=['Total Cost with RAG', 'Total Cost without RAG'],
        var_name='Cost Type',
        value_name='Total Cost'
    )

    plt.figure(figsize=(12, 6))
    ax = sns.barplot(data=df_long, x=period, y='Total Cost', hue='Cost Type')

    # Add the main title
    plt.suptitle(f'Cost Forecast by {period}: OpenAI LLM API - RAG vs Without RAG',
                fontsize=16, fontweight='bold')

    # Add the subtitle
    ax.set_title(f'Cost without RAG: {rel_ratio_cost}x higher, GPT Model : {gpt_4o}', 
                  fontsize=12, style='normal')

    # Add a multi-line footer
    footer_text = (f"Sample Usage Data: daily users=50-200, daily LLM runs=15-25, "
                  f"Business days Jan 1 2025 - Dec 31 2025\n"
                  f"Internal knowledge document tokens={kb_tokens}, "
                  f"RAG-based message tokens={message_tokens}")

    plt.figtext(0.5, -0.05, footer_text, ha='center', 
                fontsize=12, alpha=0.7, linespacing=1.5)

    # Annotate bars
    for container in ax.containers:
        ax.bar_label(container, fmt='${:,.0f}', padding=3)

    plt.xlabel(period)
    plt.ylabel('Total Cost (USD)')
    plt.xticks(rotation=45)
    plt.legend(title='Cost Type')
    plt.tight_layout()
    plt.show()

Show OpenAI API token counts and cost for 1 user query without RAG versus with RAG.

summary_df = cost_summary("Year")
message_tokens  = summary_df.at[0,'1 Query Tokens with RAG']
message_cost    = summary_df.at[0,'1 Query Cost with RAG']
kb_tokens       = summary_df.at[0,'1 Query Tokens without RAG']
kb_cost         = summary_df.at[0,'1 Query Cost without RAG']
rel_ratio_cost  = summary_df.at[0,'Cost without RAG Relative Ratio']
pct_change_cost = summary_df.at[0,'Cost Percent Change']

print("Tokens and cost of 1 user query:")
print(f"With RAG - Tokens.   : {message_tokens:,.0f}")
print(f"With RAG - Cost      : ${message_cost:,.7f}")
print(f"Without RAG - Tokens : {kb_tokens:,.0f}")
print(f"Without RAG - Cost   : ${kb_cost:,.7f}")
print(f"Relative Cost Ratio  : {rel_ratio_cost:,.2f}")
print(f"Percent Change       : {pct_change_cost:,.2f}%")

This should output the following:

Tokens and cost of 1 user query:
With RAG - Tokens    : 345
With RAG - Cost      : $0.0008630
Without RAG - Tokens : 3,949
Without RAG - Cost   : $0.0098730
Relative Cost Ratio  : 11.44
Percent Change       : 1,044.03%

The following are the OpenAI API token counts and cost for 1 user query without RAG versus with RAG:

  • The number of tokens in the financial report is 3,949.

  • The cost of using the financial report as input to OpenAI is $0.0098730

  • The number of tokens in the RAG message is 345.

  • The cost of using the RAG message as input to OpenAI is $0.0008630.

  • The cost without RAG is approximately 11 times higher than the cost with RAG.
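
These figures follow directly from the per-token rate. A small cross-check, reading the gpt-4o rate from DEFAULT_COST_RATES (tiny differences from the printed values come from the 6-decimal rounding applied earlier):

# Cross-check the reported figures from the underlying rate ($2.50 per 1M gpt-4o input tokens)
rate = DEFAULT_COST_RATES["openai"][gpt_4o]   # 0.0000025 USD per token
print(f"RAG message cost     : ${message_tokens * rate:.7f}")
print(f"Knowledge base cost  : ${kb_tokens * rate:.7f}")
print(f"Token ratio          : {kb_tokens / message_tokens:.2f}")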


Create a bar chart of token costs by year

plot_costs_comparison("Year")
Figure 3, "Yearly Cost Forecast: OpenAI LLM API - RAG vs Without RAG" shows the impact RAG can have on proprietary LLMs.
Figure 3, "Cost Forecast by Year: OpenAI LLM API - RAG vs Without RAG" - A year of running 15 to 25 queries a day by 50 to 200 daily users for a message of 345 tokens can result in over $6,000 of annual costs if you do not use RAG

Create a bar chart of token costs by month

plot_costs_comparison("Month")





Figure 4, "Cost Forecast by Month: OpenAI LLM API - RAG vs Without RAG"
Figure 4, "Cost Forecast by Month: OpenAI LLM API - RAG vs Without RAG" - The monthly costs span from over $400 to over $600


Insights Summary

The following are the OpenAI API token counts and cost for the same user query forecasted over a 1 year period:


Without RAG:
  • Total 1 year cost : $6,266

  • The highest total cost for a month : $608

  • The lowest total cost for a month : $405


With RAG:
  • Total 1 year cost : $548

  • The highest total cost for a month : $53

  • The lowest total cost for a month : $35
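
These figures can be read straight from the yearly and monthly summaries; for example:

# Reproduce the insight figures from the yearly and monthly cost summaries
yearly_df  = cost_summary("Year")
monthly_df = cost_summary("Month")

print("Without RAG - 1 year total :", yearly_df['Total Cost without RAG'].sum())
print("Without RAG - highest month:", monthly_df['Total Cost without RAG'].max())
print("Without RAG - lowest month :", monthly_df['Total Cost without RAG'].min())
print("With RAG    - 1 year total :", yearly_df['Total Cost with RAG'].sum())
print("With RAG    - highest month:", monthly_df['Total Cost with RAG'].max())
print("With RAG    - lowest month :", monthly_df['Total Cost with RAG'].min())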


10. Conclusions


  • The organization should leverage the RAG AI framework to enable their users to ask questions about recently published financial reports.

  • Estimates show the RAG approach could be approximately 11 times less costly than not using RAG.

  • Estimates show total cost savings can run into the thousands of dollars per year (approximately $5,700 in this example), depending on the number of users and how often they use the generative AI application with a proprietary LLM.

  • The organization should leverage and customize this cost analysis tool to develop cost optimization strategies for proprietary LLMs in their production environments.


Bibliography

OpenAI


Need Help with Your AI Initiative?

Our AI Strategy Advisory Services can help you leverage the latest AI technologies as a force multiplier across your organization. Contact Us to discuss how RAG can make AI scalable and cost-efficient for your business.
