Retrieval-Augmented Generation (RAG): Cost Calculator for OpenAI
- Dan Carotenuto, Jason Harvey
- Apr 1
Updated: Apr 13
Code Walkthrough

Level: Beginner - Intermediate
Languages: Python, Google Colab Notebook
Download from GitHub: Retrieval-Augmented Generation (RAG) Cost Analysis with OpenAI
Retrieval-Augmented Generation (RAG) provides a smarter way to leverage Large Language Models (LLMs) with local knowledge sources while reducing expenses, improving accuracy, enabling scalability, and making generative AI apps economically viable.
This notebook shows how RAG can be used to help manage costs of proprietary LLMs, specifically OpenAI. You won't need any licenses or paid subscriptions. See our blog, "AI Done Right: How RAG Saves Money & Delivers Results" for a discussion of the business value of RAG in an LLM app.
The new ChatGPT “Internal Knowledge” feature from OpenAI (beta as of the writing of this blog) is likely using a RAG style framework, although OpenAI has not publicly disclosed details about the feature's underlying technologies.
What is RAG?
RAG is an AI framework that enhances LLMs by retrieving relevant business information before generating responses. Instead of solely relying on a model's pre-trained knowledge, RAG pulls in real-time, business-specific information—typically in text documents—to provide accurate, contextual, and cost-effective answers.
Unlike traditional LLMs that require frequent fine-tuning, RAG dynamically retrieves information from local knowledge stores, allowing businesses to:
Reduce AI token usage (lower API costs for proprietary LLMs)
Improve accuracy (minimize AI hallucinations)
Integrate real-time company data (no need for constant re-training)
Ensure compliance & data security (keep proprietary data internal)
Cost without RAG for OpenAI
Figure 1, "Yearly Cost Forecast: OpenAI LLM API - RAG vs Without RAG," was created using the notebook in this blog and shows the impact RAG can have on proprietary LLMs.

RAG is a Two-step Process
RAG is fundamentally a two-step process: retrieval and generation.
Retrieval:
Business knowledge is stored in a vector database (Pinecone, Weaviate, ChromaDB) that indexes and organizes documents as numerical embeddings.
When a user submits a query, RAG retrieves only the most relevant information, reducing AI token consumption.
Generation:
The retrieved knowledge is appended to the AI prompt before generating a response.
This ensures the LLM is context-aware, improving accuracy and minimizing hallucinations.
As shown in Figure 2, "Retrieval-Augmented Generation (RAG) Costs Analysis," using a RAG framework can result in significant cost savings for generative AI apps that use proprietary LLMs.

Business Case / Problem Summary
Market researchers at an organization need to analyze financial reports from competitors the day they are published. They want to ask questions about the reports instead of having to read through them. Sample queries are as follows:
What was the revenue for the first quarter of fiscal 2025?
Can you explain the Spacejet platform mentioned in the report?
What are the expected revenue projections for the second quarter of fiscal 2025?
The organization wants to make these reports available through a hosted generative AI application based on OpenAI, but also needs to understand corresponding usage costs to help manage them effectively.
Objective
The objective of this notebook is to build an analysis tool for OpenAI's usage costs for input tokens with RAG and without RAG. The tool can then be leveraged to create LLM cost management strategies, including:
Creating LLM API cost forecasts and budgets
Setting pre-emptive LLM API cost alerts
Optimizing LLM user queries with caching and batching techniques
NOTE: This example will focus on OpenAI GPT models and can be extended to evaluate other proprietary LLMs like Anthropic Claude models. A separate notebook will focus on reasoning models like OpenAI's o1 and o3-mini, which behave differently.
Key Questions
What is the cost difference when using RAG?
What is the magnitude of the cost difference when using RAG?
What would the cost difference look like over a 1 year period?
How much would be saved?
Solution
Identifying usage costs of proprietary LLM APIs requires accurately counting input tokens—tokens sent as input to the hosted LLM API—and output tokens—tokens returned by the hosted LLM API. Output tokens are typically more expensive than input tokens. Different LLMs do not count tokens the same way, so token counts for the same text may differ between LLMs. OpenAI defines tokens as follows:
Tokens can be thought of as pieces of words. Before the API processes the request, the input is broken down into tokens. These tokens are not cut up exactly where the words start or end - tokens can include trailing spaces and even sub-words. Here are some helpful rules of thumb for understanding tokens in terms of lengths:
1 token ~= 4 chars in English
1 token ~= ¾ words
100 tokens ~= 75 words
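These are only rules of thumb. As a quick, illustrative check (not part of the original notebook), tiktoken's o200k_base encoding—the encoding used by the gpt-4o family—can show how actual counts vary for a given string:
import tiktoken

# Illustrative check of the ~4 characters per token rule of thumb
enc = tiktoken.get_encoding("o200k_base")
text = "Retrieval-Augmented Generation helps manage LLM API costs."
tokens = enc.encode(text)
print(len(text), len(tokens), len(text) / len(tokens))  # chars, tokens, chars per token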
Notebook Walkthrough
The notebook uses the following steps and is self-contained, so it can be run without any required configuration:
1. Set up Required Libraries, Imports and Defaults
# Necessary Libraries
import warnings
warnings.filterwarnings('ignore')
import importlib.util
import subprocess
import sys
import pandas as pd
import numpy as np
from datetime import datetime
from pandas.tseries.offsets import BDay
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
Embeddings Related Imports
# Installs embeddings related libraries if not already installed
def is_library_installed(library_name):
    return importlib.util.find_spec(library_name) is not None

def ensure_library_installed(library_name):
    if not is_library_installed(library_name):
        print(f"{library_name} not found. Installing...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", library_name])
        print(f"{library_name} installed successfully.")
    else:
        print(f"{library_name} is already installed.")

# Confirm tiktoken library installed
ensure_library_installed("tiktoken")
# Confirm FAISS library installed
ensure_library_installed("faiss-cpu")
Import tiktoken to count OpenAI tokens
# Setting up OpenAI tiktoken library to count tokens
import tiktoken
Import FAISS
We will use Facebook AI Similarity Search (FAISS) to search through the embeddings version of our internal knowledge document. In a production implementation you would have your knowledge documents stored in a vector database and you would use the respective vendor's API to search and retrieve relevant parts of the document. FAISS is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other. It solves limitations of traditional query search engines that are optimized for hash-based searches, and provides more scalable similarity search functions.
import faiss
# Test FAISS import by printing the version of FAISS
# print(faiss.__version__)
Import Embedding libraries
This app will use the "all-MiniLM-L6-v2" model from Hugging Face to convert the document into embeddings. It is a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
from transformers import AutoTokenizer, AutoModel
import torch
Defaults and Constants
# Notebook Constants
llm_role = "user"
# --- Default Cost Rates (USD per token) ---
DEFAULT_COST_RATES = {
    "openai": {
        "gpt-4o-2024-08-06": 0.0000025,       # $2.50 per 1M tokens
        "gpt-4o-mini-2024-07-18": 0.00000015  # $0.15 per 1M tokens
    },
}
gpt_4o = list(DEFAULT_COST_RATES["openai"].keys())[0]
gpt_4o_mini = list(DEFAULT_COST_RATES["openai"].keys())[1]
llm_word_output_limit = 100  # Used in the prompt to limit output words
2. Set up a Token Calculator
Define a function to count tokens for OpenAI LLM API GPT Models (not reasoning models)
# Function to count tokens for OpenAI GPT models API
# input: GPT models formatted message (role, content)
# output: number of tokens
# Supported models for this notebook:
# gpt-4o-mini (gpt-4o-mini-2024-07-18), gpt-4o (gpt-4o-2024-08-06)
def get_openai_messages_token_count(messages, model="gpt-4o-mini-2024-07-18"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using o200k_base encoding.")
        encoding = tiktoken.get_encoding("o200k_base")
    if model in {
        "gpt-4o-mini-2024-07-18",  # gpt-4o-mini
        "gpt-4o-2024-08-06"        # gpt-4o
    }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. "
              "Returning num tokens assuming gpt-3.5-turbo-0125.")
        return get_openai_messages_token_count(messages, model="gpt-3.5-turbo-0125")
    elif "gpt-4o-mini" in model:
        print("Warning: gpt-4o-mini may update over time. "
              "Returning num tokens assuming gpt-4o-mini-2024-07-18.")
        return get_openai_messages_token_count(messages, model="gpt-4o-mini-2024-07-18")
    elif "gpt-4o" in model:
        print("Warning: gpt-4o and gpt-4o-mini may update over time. "
              "Returning num tokens assuming gpt-4o-2024-08-06.")
        return get_openai_messages_token_count(messages, model="gpt-4o-2024-08-06")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. "
              "Returning num tokens assuming gpt-4-0613.")
        return get_openai_messages_token_count(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""get_openai_messages_token_count() is not implemented for model {model}."""
        )
    # If a single message is passed (dict), wrap it in a list
    if isinstance(messages, dict):
        messages = [messages]
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
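As a quick sanity check, the counter can be called with a single illustrative chat message (the content below is made up for this walkthrough; exact counts depend on your tiktoken version):
# Illustrative check of the token counter (sample content is made up)
sample_message = [{"role": "user", "content": "What was the revenue for the first quarter of fiscal 2025?"}]
print(get_openai_messages_token_count(sample_message, model=gpt_4o))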
Define a function to get the number of tokens in a GPT model message for all supported models
# --- Main Function: Returns a Pandas DataFrame ---
def get_token_costs_for_all_models(text: dict, cost_rates: dict = DEFAULT_COST_RATES):
    rows = []
    for provider, models in cost_rates.items():
        for model_prefix, rate in models.items():
            try:
                if provider == "openai":
                    token_count = get_openai_messages_token_count(text, model_prefix)
                else:
                    continue  # Skip unknown provider
                token_cost = token_count * rate
                rows.append({
                    "provider": provider,
                    "model": model_prefix,
                    "token_count": token_count,
                    "token_cost": round(token_cost, 6)
                })
            except Exception as e:
                print(f"Error processing {provider}/{model_prefix}: {e}")
    return pd.DataFrame(rows)
Set up an empty Pandas DataFrame to load the number of tokens for each provider/model
# --- Setup: Initialize an empty DataFrame ---
token_costs_df = pd.DataFrame(columns=["provider", "model", "token_count", "token_cost"])
Get the number of tokens for a message and load them into the DataFrame
# --- Function to load new token cost data into the DataFrame ---
def load_token_costs(text, df):
    new_data = get_token_costs_for_all_models(text)
    # Optional: Filter duplicates if needed
    df = df[~df["model"].isin(new_data["model"])]  # Remove existing rows for same models
    # Append new results
    return pd.concat([df, new_data], ignore_index=True)
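For illustration, the loader can be exercised with a single made-up message; the text below is only a placeholder and is not from the notebook's sample data:
# Illustrative call with a made-up message (placeholder content)
sample_query_message = {"role": "user", "content": "What was Q1 revenue?"}
token_costs_df = load_token_costs(sample_query_message, token_costs_df)
print(token_costs_df)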
3. Create a Sample Local Knowledge Base
We will now set up a local document to search. For this example, we'll use a sample financial report for a fictitious company called MyFakeCompany. See the downloadable notebook on GitHub, "Retrieval-Augmented Generation (RAG) Cost Analysis with OpenAI," for the full version of the document.
knowledge_base_text = """
MyFakeCompany Announces Financial Results for First Quarter Fiscal 2025
• Record quarterly revenue of $26.0 billion, up 18% from Q4 and up 262% from a year ago
• Record quarterly Data Center revenue of $22.6 billion, up 23% from Q4 and up 427% from a year ago
• Quarterly cash dividend raised 150% to $0.01 per share on a post-split basis
...
"""
Print out the total number of characters in the document.
print("Total characters in the knowledge base source text = ", len(knowledge_base_text))
Total characters in the knowledge base source text = 16937
4. Convert the Knowledge Base to Embeddings
To retrieve specific segments of documents rather than entire documents, we can implement a process known as chunking. This involves dividing large documents into smaller, semantically coherent segments, allowing the system to retrieve and process only the most relevant portions in response to a query.
# Chunking the document into sections
chunks = knowledge_base_text.split('\n\n')
Print out the total number of embedding chunks.
print("Total embedding chunks created:", len(chunks))
Total embedding chunks created: 18
Configure the Embedding Tokenizer and model using a Hugging Face Model
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
Define a function to convert the text document to an embedding
# Function to convert the text document to an embedding
def get_embedding(text):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    # Get the embeddings from the model
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean pooling to get the sentence embedding
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings[0].numpy()
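As a quick check (an illustrative snippet, not part of the original notebook), the embedding for any short string should be a 384-dimensional vector, matching the all-MiniLM-L6-v2 model described above:
# Sanity check: all-MiniLM-L6-v2 embeddings should be 384-dimensional
sample_vec = get_embedding("MyFakeCompany quarterly revenue")
print(sample_vec.shape)  # expected: (384,)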
Generate vector embeddings for each chunk using a suitable embedding model. This allows the system to represent each chunk in a high-dimensional space, facilitating efficient similarity searches.
# Generate embeddings for each chunk
chunk_embeddings = [get_embedding(chunk) for chunk in chunks]
Store the embeddings in a FAISS index to enable rapid similarity searches. This setup allows the system to quickly identify and retrieve the most relevant chunks in response to a query.
# Index the embeddings with FAISS
embedding_dim = chunk_embeddings[0].shape[0]
index = faiss.IndexFlatL2(embedding_dim)
index.add(np.array(chunk_embeddings))
Define a function to retrieve relevant chunks (the Retrieval part of RAG). This function retrieves the most relevant document chunks based on the user's query.
For a given user query, convert the query into an embedding and search the FAISS index to find the most similar chunks.
Retrieve these chunks to provide context for generating accurate responses.
def retrieve_chunks(query, top_k=1):
    # Generate embedding for the query
    query_embedding = get_embedding(query)
    # Search for the nearest chunks in the index
    distances, indices = index.search(np.array([query_embedding]), k=top_k)
    # Retrieve the corresponding chunks
    retrieved_chunks = [chunks[idx] for idx in indices[0]]
    return retrieved_chunks
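A quick illustrative retrieval follows (the query text is made up; the best-matching chunk on your run may differ):
# Illustrative retrieval check with a made-up query
for chunk in retrieve_chunks("What was the Data Center revenue?", top_k=1):
    print(chunk[:200])  # show the start of the best-matching chunk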
5. Convert User Query to OpenAI RAG Message
Define a function that creates the RAG-enabled content part of an OpenAI GPT model message
Key Points
Search and extract only the relevant parts of the knowledge base based on a user query. This is the Retrieval part of RAG
Create an input prompt as text based on the search results
Create an input message (role + content) based on the input prompt
The resulting message will be sent as input text to OpenAI and should be significantly smaller than the entire knowledge base
# Function to generate an LLM prompt based on the embedding chunks
def generate_llm_prompt(query):
    # Retrieve the most relevant chunks
    retrieved_chunks = retrieve_chunks(query, top_k=2)
    # Combine retrieved chunks to form the context
    context = "\n\n".join(retrieved_chunks)
    instructions = (f"Keep the answer succinct, limit to 1 sentence, "
                    f"total words to {llm_word_output_limit}")
    # Create a prompt combining the query and the retrieved context
    prompt = (f"Context: {context}\n\nQuery: {query}"
              f"\nInstructions: {instructions}.\nAnswer: ")
    return prompt
6. Create Synthetic Usage Data and Sample Queries
Synthetic Data Assumptions
The number of daily users that run LLM requests randomly ranges from 50 to 200
The number of daily LLM run requests for each user randomly ranges from 15 to 25
Data is created for each business day for 1 year, from January 1, 2025 to December 31, 2025
# Function to create synthetic usage data
# - Number of daily users that run LLM requests randomly ranges from 50 to 200
# - Number of daily LLM run requests for each user randomly ranges from 15 to 25
# - Data is for business days, excludes US federal holidays, Jan 1 2025 to Dec 31 2025
def generate_llm_app_usage(start_date="2025-01-01", end_date="2025-12-31", seed=42):
    np.random.seed(seed)
    # Define the US Federal Holiday Calendar
    us_holidays = USFederalHolidayCalendar().holidays(start=start_date, end=end_date)
    # Create a Custom Business Day offset excluding US federal holidays
    custom_bday = CustomBusinessDay(holidays=us_holidays)
    # Generate date range for business days excluding US federal holidays
    dates = pd.date_range(start=start_date, end=end_date, freq=custom_bday)
    # Generate random average number of runs per user between 15 and 25 (inclusive)
    avg_runs = np.random.randint(15, 26, size=len(dates))
    # Simulate number of users per day (between 50 and 200)
    num_users = np.random.randint(50, 201, size=len(dates))
    # Create DataFrame
    llm_app_usage = pd.DataFrame({
        "date": dates,
        "average_number_runs_per_user": avg_runs,
        "number_of_users": num_users
    })
    return llm_app_usage

# Generate synthetic usage data
llm_app_usage = generate_llm_app_usage()
Show synthetic usage data information
llm_app_usage.describe()
This should show the following:
The synthetic usage data has data for 250 days
The "Average Number of LLM Queries Per User" is randomized with an average of 20 run per day, a minimum of 15 and a maximum of 25 runs per day.
The "Number of Users" running LLM queries each day is radomized with an average of 127 users running LLM queries each day, a minimum of 50 and a maximum of 200.
Define a function to forecast usage costs of an OpenAI query against a local knowledge base for 1 year with and without RAG
Forecast Usage Costs Assumptions
Use the same RAG-based prompt for all LLM run requests to maintain cost uniformity
Leverage synthetic usage data
# Function to calculate LLM API input token usage costs based on synthetic usage data
def calculate_total_token_usage_and_costs(message_text,
                                          knowledge_base_text,
                                          llm_app_usage,
                                          token_costs_func=get_token_costs_for_all_models):
    # Get per-run token cost data for the Message text
    text = message_text
    message_token_costs_df = token_costs_func(text)
    # Update the DataFrame in place and set token columns to "message" token columns
    message_token_costs_df.rename(columns={'token_count': 'message_token_count'},
                                  inplace=True)
    message_token_costs_df.rename(columns={'token_cost': 'message_token_cost'},
                                  inplace=True)
    # Get per-run token cost data for the Knowledge Base text
    text = knowledge_base_text
    kb_token_costs_df = token_costs_func(text)
    # Update DataFrame in place, set token columns to "knowledge_base" token columns
    kb_token_costs_df.rename(columns={'token_count': 'kb_token_count'}, inplace=True)
    kb_token_costs_df.rename(columns={'token_cost': 'kb_token_cost'}, inplace=True)
    # Columns to append
    cols_to_append = ['kb_token_count', 'kb_token_cost']
    # Perform the merge
    token_costs_df = pd.merge(message_token_costs_df,
                              kb_token_costs_df[['provider', 'model'] + cols_to_append],
                              on=['provider', 'model'],
                              how='left')
    # Add a dummy key for cross join
    llm_app_usage['key'] = 1
    token_costs_df['key'] = 1
    # Cross join usage with token costs per model
    usage_expanded = pd.merge(llm_app_usage,
                              token_costs_df, on='key').drop(columns=['key'])
    # Calculate total runs per day
    usage_expanded['total_runs'] = (
        usage_expanded['average_number_runs_per_user']
        * usage_expanded['number_of_users'])
    # Compute total Message token usage and cost per model per day
    usage_expanded['message_total_tokens'] = (
        usage_expanded['total_runs']
        * usage_expanded['message_token_count'])
    usage_expanded['message_total_cost'] = (
        usage_expanded['total_runs']
        * usage_expanded['message_token_cost'])
    # Compute total Knowledge Base token usage and cost per model per day
    usage_expanded['kb_total_tokens'] = (
        usage_expanded['total_runs']
        * usage_expanded['kb_token_count'])
    usage_expanded['kb_total_cost'] = (
        usage_expanded['total_runs']
        * usage_expanded['kb_token_cost'])
    # Select and order columns for output
    usage_summary = usage_expanded[[
        'date', 'provider', 'model',
        'average_number_runs_per_user', 'number_of_users',
        'total_runs', 'message_token_count', 'message_token_cost',
        'kb_token_count', 'kb_token_cost',
        'message_total_tokens', 'message_total_cost',
        'kb_total_tokens', 'kb_total_cost'
    ]].copy()
    usage_summary['Year'] = usage_summary['date'].dt.to_period('Y')
    usage_summary['Quarter'] = usage_summary['date'].dt.to_period('Q')
    usage_summary['Month'] = usage_summary['date'].dt.to_period('M')
    usage_summary['Week'] = usage_summary['date'].dt.to_period('W')
    return usage_summary
Create sample user queries
# Create sample user queries
queries = [
    "What was MyFakeCompany's revenue for the first quarter of fiscal 2025?",
    "Can you explain the Spacejet platform mentioned in the report?",
    "What are the expected revenue projections for the second quarter of fiscal 2025?",
    "Can you explain the Odin platform?"
]
7. Generate a RAG Message from a User Query
# Generate an LLM prompt based on a sample query
query = queries[0]
rag_prompt = generate_llm_prompt(query)
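As a rough, illustrative comparison (character counts only; token counts are computed in the next step), the RAG prompt should be much smaller than the full knowledge base it draws from:
# Rough size comparison in characters between the RAG prompt and the full document
print("RAG prompt characters     :", len(rag_prompt))
print("Knowledge base characters :", len(knowledge_base_text))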
8. Forecast 1 year of OpenAI API Cost for a Query
Define a function to expand the input prompt to an input message, i.e., to convert the RAG-enabled "content" and the knowledge base "content" to the OpenAI GPT model format (role, content)
Key Points:
The OpenAI GPT model API requires input to be configured as a "message" with "role=[role]" and "content=[prompt]" key-value pairs.
This ensures accurate token counts and corresponding token costs.
The resulting token count will be higher than the total number of tokens derived from the prompt text alone.
# Get LLM API usage costs for the current prompt and full knowledge base text
# - Requires adding the role format
def get_usage_cost(llm_role, rag_prompt, knowledge_base_prompt):
    rag_message = [
        {
            "role": llm_role,
            "content": rag_prompt,
        },
    ]
    kb_message = [
        {
            "role": llm_role,
            "content": knowledge_base_prompt,
        },
    ]
    usage_costs_df = calculate_total_token_usage_and_costs(rag_message, kb_message, llm_app_usage)
    return usage_costs_df
Start Calculating Tokens and Cost
Calculate the tokens, token costs and usage data for the RAG message and the knowledge base message
Load the results into a dataframe for analysis
usage_costs_df = get_usage_cost(llm_role, rag_prompt, knowledge_base_text)
9. Observations and Insights
Define a function that prepares a dataframe for analysis including aggregating token costs by period
# Define a function that prepares a dataframe for aggregating token costs by period
def cost_summary(period="Month", gpt_model="gpt-4o-2024-08-06"):
    period = period.lower()
    match period:
        case "year":
            period = "Year"
        case "quarter":
            period = "Quarter"
        case "month":
            period = "Month"
        case "week":
            period = "Week"
        case "day":
            period = "date"  # daily costs are grouped by the raw date column
    df = usage_costs_df.copy()
    # Filter on the GPT model
    filtered_df = df[df['model'] == gpt_model]
    # Group by period and sum both cost columns
    cost_summary = (
        filtered_df.groupby(period).agg(
            message_token_count=('message_token_count', 'max'),
            message_token_cost=('message_token_cost', 'max'),
            kb_token_count=('kb_token_count', 'max'),
            kb_token_cost=('kb_token_cost', 'max'),
            message_total_cost=('message_total_cost', 'sum'),
            kb_total_cost=('kb_total_cost', 'sum')
        ).reset_index())
    # Round the cost columns to integers
    cost_summary['message_total_cost'] = (cost_summary['message_total_cost']
                                          .round(0).astype(int))
    cost_summary['kb_total_cost'] = (cost_summary['kb_total_cost']
                                     .round(0).astype(int))
    # Rename columns
    cost_summary.rename(columns={
        'message_total_cost': 'Total Cost with RAG',
        'kb_total_cost': 'Total Cost without RAG',
        'message_token_cost': '1 Query Cost with RAG',
        'kb_token_cost': '1 Query Cost without RAG',
        'message_token_count': '1 Query Tokens with RAG',
        'kb_token_count': '1 Query Tokens without RAG'}, inplace=True)
    # Calculate cost without RAG (knowledge base) relative ratio
    cost_summary['Cost without RAG Relative Ratio'] = (
        cost_summary['1 Query Cost without RAG']
        / cost_summary['1 Query Cost with RAG'])
    # Calculate percent change from cost with RAG to cost without RAG
    cost_summary['Cost Percent Change'] = (
        ((cost_summary['1 Query Cost without RAG']
          - cost_summary['1 Query Cost with RAG'])
         / cost_summary['1 Query Cost with RAG']) * 100)
    # Optional: Round for readability
    cost_summary['Cost without RAG Relative Ratio'] = (
        cost_summary['Cost without RAG Relative Ratio'].round(2))
    cost_summary['Cost Percent Change'] = (
        cost_summary['Cost Percent Change'].round(2))
    return cost_summary
Define a function that creates a bar chart of token costs with options for period and sorting by a metric.
# Define a function that creates a bar chart of token costs by period
def plot_costs_comparison(period='Month', sorted_by='none'):
    summary_df = cost_summary(period)
    match sorted_by:
        case 'none':
            pass  # keep the natural period order
        case 'Total Cost without RAG':
            # Sort DataFrame by 'Total Cost without RAG' descending
            summary_df = summary_df.sort_values(by=sorted_by, ascending=False)
    # Melt for seaborn
    df_long = pd.melt(
        summary_df,
        id_vars=period,
        value_vars=['Total Cost with RAG', 'Total Cost without RAG'],
        var_name='Cost Type',
        value_name='Total Cost'
    )
    plt.figure(figsize=(12, 6))
    ax = sns.barplot(data=df_long, x=period, y='Total Cost', hue='Cost Type')
    # Add the main title
    plt.suptitle(f'Cost Forecast by {period}: OpenAI LLM API - RAG vs Without RAG',
                 fontsize=16, fontweight='bold')
    # Add the subtitle
    ax.set_title(f'Cost without RAG: {rel_ratio_cost}x higher, GPT Model : {gpt_4o}',
                 fontsize=12, style='normal')
    # Add a multi-line footer
    footer_text = (f"Sample Usage Data: daily users=50-200, daily LLM runs=15-25, "
                   f"Business days Jan 1 2025 - Dec 31 2025\n"
                   f"Internal knowledge document tokens={kb_tokens}, "
                   f"RAG-based message tokens={message_tokens}")
    plt.figtext(0.5, -0.05, footer_text, ha='center',
                fontsize=12, alpha=0.7, linespacing=1.5)
    # Annotate bars
    for container in ax.containers:
        ax.bar_label(container, fmt='${:,.0f}', padding=3)
    plt.xlabel(period)
    plt.ylabel('Total Cost (USD)')
    plt.xticks(rotation=45)
    plt.legend(title='Cost Type')
    plt.tight_layout()
    plt.show()
Show OpenAI API token counts and cost for 1 user query without RAG versus with RAG.
summary_df = cost_summary("Year")
message_tokens = summary_df.at[0,'1 Query Tokens with RAG']
message_cost = summary_df.at[0,'1 Query Cost with RAG']
kb_tokens = summary_df.at[0,'1 Query Tokens without RAG']
kb_cost = summary_df.at[0,'1 Query Cost without RAG']
rel_ratio_cost = summary_df.at[0,'Cost without RAG Relative Ratio']
pct_change_cost = summary_df.at[0,'Cost Percent Change']
print("Tokens and cost of 1 user query:")
print(f"With RAG - Tokens. : {message_tokens:,.0f}")
print(f"With RAG - Cost : ${message_cost:,.7f}")
print(f"Without RAG - Tokens : {kb_tokens:,.0f}")
print(f"Without RAG - Cost : ${kb_cost:,.7f}")
print(f"Relative Cost Ratio : {rel_ratio_cost:,.2f}")
print(f"Percent Change : {pct_change_cost:,.2f}%")This should output the following:
Tokens and cost of 1 user query:
With RAG - Tokens. : 345
With RAG - Cost : $0.0008630
Without RAG - Tokens : 3,949
Without RAG - Cost : $0.0098730
Relative Cost Ratio : 11.44
Percent Change : 1,044.03%
The following are the OpenAI API token counts and cost for 1 user query without RAG versus with RAG:
The number of tokens in the financial report is 3,949.
The cost of using the financial report as input to OpenAI is $0.0098730.
The number of tokens in the RAG message is 345.
The cost of using the RAG message as input to OpenAI is $0.0008630.
The cost without RAG is approximately 11 times higher than the cost with RAG.
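These figures line up with the default gpt-4o input rate of $2.50 per 1M tokens defined earlier; a quick back-of-the-envelope check using the numbers above:
# Approximate per-query cost check at $2.50 per 1M input tokens (gpt-4o rate above)
print(345 * 0.0000025)    # ~0.00086 with RAG
print(3949 * 0.0000025)   # ~0.00987 without RAG
print(3949 / 345)         # ~11.4x more input tokens (and cost) without RAG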
Create a bar chart of token costs by year
plot_costs_comparison("Year")
Create a bar chart of token costs by month
plot_costs_comparison("Month")
Insights Summary
The following are the OpenAI API token counts and cost for the same user query forecasted over a 1 year period:
Without RAG:
Total 1 year cost : $6,266
The highest total cost for a month : $608
The lowest total cost for a month : $405
With RAG:
Total 1 year cost : $548
The highest total cost for a month : $53
The lowest total cost for a month : $35
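A short, illustrative calculation of the implied savings from these forecast totals:
# Implied annual savings from the forecast totals above
savings = 6266 - 548           # ≈ $5,718 per year
pct_saved = savings / 6266     # ≈ 0.91, i.e. roughly 91% lower input-token spend with RAG
print(savings, round(pct_saved * 100, 1))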
10. Conclusions
The organization should leverage the RAG AI framework to enable their users to ask questions about recently published financial reports.
Estimates show RAG could be approximately 10 times less costly than not using a RAG approach.
Estimates show total cost savings can run into the thousands of dollars per year—roughly $5,700 in this scenario—depending on the number of users and how often they use the generative AI application with a proprietary LLM.
The organization should leverage and customize this cost analysis tool to develop cost optimization strategies for proprietary LLMs in their production environments.
Bibliography
OpenAI
Need Help with Your AI Initiative?
Our AI Strategy Advisory Services can help you leverage the latest AI technologies as a force multiplier across your organization. Contact Us to discuss how RAG can make AI scalable and cost-efficient for your business.
