What is LangSmith? Complete Guide to LLM Observability

Introduction to LangSmith

LangSmith is a developer platform created by LangChain for debugging, testing, evaluating, and monitoring LLM applications. Think of it as the "DevTools" for AI applications: just as browser developer tools help you debug web applications, LangSmith helps you understand what's happening inside your LLM-powered systems.

When you build applications with LLMs, a lot happens behind the scenes: prompts are constructed, models are called, responses are parsed, and chains of operations execute. Without proper visibility into this process, debugging issues or understanding why your AI behaves a certain way becomes nearly impossible. LangSmith solves this problem.

Why LangSmith Exists

Building with LLMs presents unique challenges that traditional debugging tools weren't designed for:

The Black Box Problem

LLMs are nondeterministic. The same prompt can produce different outputs, and it's often unclear why an agent took a particular action. LangSmith records every step of execution so you can inspect exactly what happened.

Complex Chains and Agents

Modern AI applications involve multiple LLM calls, tool executions, and conditional logic. When something goes wrong in a 10-step agent workflow, finding the problem without proper tracing is like finding a needle in a haystack.

Quality Measurement

How do you know if your AI is getting better? Traditional software has tests that pass or fail. LLM outputs are nuanced and require systematic evaluation frameworks.

Production Monitoring

Once deployed, you need to track costs, latency, errors, and user satisfaction. LangSmith provides production-grade observability for AI systems.

Core Concepts

1. Tracing

Tracing is the foundation of LangSmith. Every operation in your LLM application is captured as a "run" with complete details:

import os
from langchain_openai import ChatOpenAI

# Enable tracing by setting these environment variables before making any LLM calls
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"

# Your LangChain code is automatically traced
# (ChatOpenAI also needs OPENAI_API_KEY to be set)
llm = ChatOpenAI()
response = llm.invoke("Explain machine learning")

# View the trace in LangSmith dashboard showing:
# - Input prompt
# - Output response
# - Latency
# - Token usage
# - Cost

With the @traceable decorator, you can also trace your own custom functions:

from langsmith import traceable

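@traceable
def my_ai_function(user_query):
    # LangChain calls and other @traceable functions invoked here appear as child runs.
    # retrieve_context and build_prompt are placeholders for your own helpers;
    # llm is the ChatOpenAI instance created earlier.
    context = retrieve_context(user_query)
    prompt = build_prompt(user_query, context)
    response = llm.invoke(prompt)
    return response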
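
To make the idea of a "run" concrete, here is a minimal standard-library sketch of the kind of record a tracing decorator captures. It is not LangSmith's implementation: toy_traceable, RUNS, and the record fields are hypothetical names chosen for illustration, and the sketch is not thread-safe.

```python
import functools
import time
import uuid

RUNS = []    # hypothetical in-memory store; a real tracer sends runs to a backend
_stack = []  # ids of runs currently executing, used to find the parent run

def toy_traceable(func):
    """Record inputs, output, error, latency, and parent run for each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        run = {
            "id": str(uuid.uuid4()),
            "name": func.__name__,
            "parent_id": _stack[-1] if _stack else None,
            "inputs": {"args": args, "kwargs": kwargs},
            "outputs": None,
            "error": None,
        }
        _stack.append(run["id"])
        start = time.perf_counter()
        try:
            run["outputs"] = func(*args, **kwargs)
            return run["outputs"]
        except Exception as exc:
            run["error"] = repr(exc)
            raise
        finally:
            # Runs are stored when they finish, so children are stored before parents
            run["latency_s"] = time.perf_counter() - start
            _stack.pop()
            RUNS.append(run)
    return wrapper

@toy_traceable
def retrieve_context(query):
    return "stub context for: " + query

@toy_traceable
def answer(query):
    return "Answer based on " + retrieve_context(query)

answer("What is LangSmith?")
for run in RUNS:
    print(run["name"], "has parent:", run["parent_id"] is not None)
```

The nested call to retrieve_context is recorded with answer as its parent, which is the relationship a trace view displays as a tree.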

2. Projects

Projects organize your traces. You might have separate projects for development, staging, and production, or for different features of your application:

import os

# APP_ENV is a placeholder for however your application identifies its environment
environment = os.getenv("APP_ENV", "development")

# Log traces to a different project for each environment
if environment == "production":
    os.environ["LANGCHAIN_PROJECT"] = "my-chatbot-prod"
else:
    os.environ["LANGCHAIN_PROJECT"] = "my-chatbot-dev"

3. Datasets

Datasets are collections of input-output examples used for testing and evaluation. You can create them manually, from production traces, or from existing data:

from langsmith import Client

client = Client()

# Create a dataset
dataset = client.create_dataset("qa-test-cases")

# Add examples
client.create_example(
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "Paris"},
    dataset_id=dataset.id
)

client.create_example(
    inputs={"question": "Who wrote Romeo and Juliet?"},
    outputs={"answer": "William Shakespeare"},
    dataset_id=dataset.id
)
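
For the "from existing data" case, one possible approach is to convert rows into the same inputs/outputs shape used above before uploading them. The sketch below uses only the standard library; RAW and rows_to_examples are hypothetical names, and the CSV content is made up for illustration.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be read from a file
RAW = """question,answer
What is the capital of France?,Paris
Who wrote Romeo and Juliet?,William Shakespeare
"""

def rows_to_examples(csv_text):
    """Convert CSV rows into {"inputs": ..., "outputs": ...} records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "inputs": {"question": row["question"]},
            "outputs": {"answer": row["answer"]},
        }
        for row in reader
    ]

examples = rows_to_examples(RAW)
print(len(examples), examples[0])
```

Each record can then be passed to client.create_example(inputs=..., outputs=..., dataset_id=dataset.id) in a loop, as in the snippet above.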

4. Evaluations

Evaluations let you systematically measure how well your AI performs. LangSmith supports custom evaluators and built-in metrics:

from langsmith.evaluation import evaluate, LangChainStringEvaluator

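# Define the system under test (llm is the ChatOpenAI instance created earlier)
def my_qa_system(inputs):
    return {"answer": llm.invoke(inputs["question"]).content}

# Run evaluation with different metrics
results = evaluate(
    my_qa_system,
    data="qa-test-cases",
    evaluators=[
        # Correctness is judged against the reference answer, so it uses labeled_criteria
        LangChainStringEvaluator("labeled_criteria", config={"criteria": "correctness"}),
        # Helpfulness needs no reference answer, so it uses the plain criteria evaluator
        LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}),
    ],
    experiment_prefix="v1-gpt4"
)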

# Compare different versions (my_qa_system_v2 is a second implementation with the same signature)
results_v2 = evaluate(
    my_qa_system_v2,
    data="qa-test-cases",
    evaluators=[...],
    experiment_prefix="v2-claude"
)
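
A custom evaluator is, at its core, a function that compares a system output with a reference and returns a named score. The sketch below is plain Python and can be tested on its own. The (outputs, reference_outputs) signature and the returned dictionary shape are assumptions here; some SDK versions expect a (run, example) signature instead, so check the version you have installed before passing such a function to evaluate.

```python
def exact_match(outputs, reference_outputs):
    """Score 1 if the answer matches the reference, ignoring case and surrounding whitespace."""
    predicted = outputs.get("answer", "").strip().lower()
    expected = reference_outputs.get("answer", "").strip().lower()
    return {"key": "exact_match", "score": int(predicted == expected)}

print(exact_match({"answer": " paris "}, {"answer": "Paris"}))
print(exact_match({"answer": "Lyon"}, {"answer": "Paris"}))
```

Exact match is strict, so it suits short factual answers like the examples in the dataset above; longer free-form answers usually need a model-graded or similarity-based metric.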

5. Annotation Queues

For human evaluation, LangSmith provides annotation queues where team members can review and score AI outputs:

Human feedback: Collect ratings, corrections, and comments
Quality assurance: Review production outputs for issues
Training data: Generate high-quality examples for fine-tuning

6. Monitoring Dashboard

The LangSmith dashboard provides real-time visibility into your production system:

Trace counts: How many requests per hour/day
Latency metrics: P50, P95, P99 response times
Error rates: Failed requests and error types
Cost tracking: Token usage and API costs
Feedback scores: User satisfaction trends
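
To clarify what the latency and error metrics mean, here is a small sketch that computes them from raw request records using the nearest-rank percentile definition. The request data is synthetic and the names are hypothetical; the dashboard computes these figures for you, possibly with a different percentile method.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Synthetic request records: (latency in milliseconds, succeeded?)
requests = [(10 * i, i % 20 != 0) for i in range(1, 101)]
latencies_ms = [latency for latency, _ in requests]

summary = {
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
    "error_rate": sum(not ok for _, ok in requests) / len(requests),
}
print(summary)
```

P95 and P99 matter because an average hides the slow tail: here the median request takes 500 ms, while one request in a hundred takes 990 ms or more.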

When to Use LangSmith

Development & Debugging

Understand exactly what prompts are sent, what responses come back, and where errors occur in your chains.

Testing & CI/CD

Run automated evaluations on every code change to catch regressions before they reach production.
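
One way to turn evaluation scores into a CI check is a simple threshold gate. This is a sketch under stated assumptions: passes_gate, the baseline value, and the tolerance are hypothetical, and the scores would come from your own evaluation run.

```python
def passes_gate(scores, baseline_mean, tolerance=0.02):
    """Return True if the mean score is no more than `tolerance` below the baseline."""
    if not scores:
        raise ValueError("no scores to check")
    mean = sum(scores) / len(scores)
    return mean >= baseline_mean - tolerance

# Hypothetical per-example scores from the latest evaluation run (mean 0.8)
current_scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

print(passes_gate(current_scores, baseline_mean=0.80))
print(passes_gate(current_scores, baseline_mean=0.90))
```

In a CI job, a False result would typically end the script with a non-zero exit code (for example via sys.exit(1)) so the pipeline fails.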

Prompt Engineering

Compare different prompt versions systematically to find what works best for your use case.

Production Monitoring

Track costs, latency, and errors in real time. Get alerts when things go wrong.

Model Comparison

Evaluate different models (GPT-4 vs Claude vs open-source) on your specific tasks.

Team Collaboration

Share traces with teammates, annotate issues, and build datasets together.

Getting Started

Setting up LangSmith is straightforward:

# 1. Sign up at smith.langchain.com and get your API key

# 2. Install the SDK (plus langchain-openai for the example in step 4)
pip install langsmith langchain-openai

# 3. Set environment variables (shell)
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-api-key
export LANGCHAIN_PROJECT=my-project

# 4. Your LangChain code is automatically traced (Python)
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()
response = llm.invoke("Hello!")
# Check smith.langchain.com to see the trace

Using LangSmith Without LangChain

LangSmith works with any LLM code, not just LangChain:

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable(run_type="llm")
def call_openai(prompt):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# This function call will be traced in LangSmith
result = call_openai("What is 2+2?")

Key Features Deep Dive

Trace Visualization

LangSmith displays traces as a tree structure, showing parent-child relationships between operations. Consider an agent that:

Receives a user question
Decides to search the web
Calls the search tool
Synthesizes an answer

You'll see each step with its inputs, outputs, and timing. Click any node to see the full prompt and response.
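
The tree view can be pictured as a flat list of runs linked by parent IDs. The following sketch rebuilds and prints such a tree for the agent described above; the run records, field names, and latencies are hypothetical and do not reflect LangSmith's actual schema.

```python
from collections import defaultdict

# Hypothetical flat run records for the agent described above
runs = [
    {"id": "r1", "parent_id": None, "name": "agent", "latency_s": 3.2},
    {"id": "r2", "parent_id": "r1", "name": "llm: decide action", "latency_s": 0.9},
    {"id": "r3", "parent_id": "r1", "name": "tool: web_search", "latency_s": 1.1},
    {"id": "r4", "parent_id": "r1", "name": "llm: synthesize answer", "latency_s": 1.2},
]

def render_tree(runs):
    """Return indented lines showing each run beneath its parent."""
    children = defaultdict(list)
    for run in runs:
        children[run["parent_id"]].append(run)

    lines = []

    def walk(parent_id, depth):
        for run in children[parent_id]:
            lines.append(f"{'  ' * depth}{run['name']} ({run['latency_s']}s)")
            walk(run["id"], depth + 1)

    walk(None, 0)  # root runs have no parent
    return lines

tree_lines = render_tree(runs)
print("\n".join(tree_lines))
```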

Comparing Experiments

When you run evaluations, LangSmith lets you compare results side by side:

See which version performed better on each test case
Identify specific examples where one model excels
Track improvement trends over time
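
A per-example comparison boils down to lining up the scores of two experiments by example and labelling the change. A minimal sketch, assuming each experiment is reduced to a dictionary of scores keyed by example ID (the IDs and scores below are made up):

```python
# Hypothetical per-example scores keyed by example id, one dict per experiment
v1 = {"ex1": 1.0, "ex2": 0.0, "ex3": 0.5}
v2 = {"ex1": 1.0, "ex2": 1.0, "ex3": 0.0}

def compare(a, b):
    """Label each shared example as improved, regressed, or unchanged going from a to b."""
    result = {}
    for example_id in sorted(a.keys() & b.keys()):
        delta = b[example_id] - a[example_id]
        if delta > 0:
            result[example_id] = "improved"
        elif delta < 0:
            result[example_id] = "regressed"
        else:
            result[example_id] = "unchanged"
    return result

comparison = compare(v1, v2)
print(comparison)
```

Looking at individual regressions, not only the average, is what reveals cases where a new version is better overall but worse on specific inputs.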

Filtering and Search

Find specific traces quickly with powerful filters:

By time range
By latency (e.g., "show me slow requests")
By error status
By feedback score
By custom metadata tags
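
The filters above combine with AND semantics: a trace is shown only if it satisfies every active filter. The sketch below illustrates that logic on local records. It is an assumption-laden illustration, not LangSmith's query interface; filter_runs and the record fields are hypothetical.

```python
# Hypothetical run records; the field names are illustrative only
runs = [
    {"name": "chat", "latency_s": 0.8, "error": None, "tags": ["prod"]},
    {"name": "chat", "latency_s": 6.3, "error": None, "tags": ["prod", "beta"]},
    {"name": "chat", "latency_s": 2.1, "error": "TimeoutError", "tags": ["dev"]},
]

def filter_runs(runs, min_latency_s=None, errored=None, tag=None):
    """Keep runs that match every filter whose value is not None."""
    selected = []
    for run in runs:
        if min_latency_s is not None and run["latency_s"] < min_latency_s:
            continue
        if errored is not None and (run["error"] is not None) != errored:
            continue
        if tag is not None and tag not in run["tags"]:
            continue
        selected.append(run)
    return selected

slow_prod = filter_runs(runs, min_latency_s=5, tag="prod")  # "show me slow requests" in prod
failed = filter_runs(runs, errored=True)
print(len(slow_prod), len(failed))
```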

Best Practices

Use meaningful project names: Organize by feature, environment, or team
Add metadata to traces: Include user IDs, session IDs, or feature flags for better filtering
Build golden datasets early: Collect examples of good inputs/outputs as you develop
Run evaluations in CI: Catch regressions before they reach production
Monitor production daily: Set up alerts for error spikes or latency increases
Review traces regularly: Even successful requests might reveal optimization opportunities
Collect user feedback: Use thumbs up/down or ratings to track real-world satisfaction

LangSmith vs Alternatives

While there are other observability tools, LangSmith stands out for:

Native LangChain integration: Zero-config tracing for LangChain apps
Complete workflow: Tracing + datasets + evaluations + monitoring in one platform
LLM-specific features: Built for AI applications, not generic APM
Generous free tier: Suitable for development and small-scale production

Alternatives include Weights & Biases (Weave), Arize Phoenix, and Helicone, each with its own strengths.

