Introduction to LangSmith
LangSmith is a developer platform created by LangChain for debugging, testing, evaluating, and monitoring LLM applications. Think of it as the "DevTools" for AI applications: just as browser developer tools help you debug web applications, LangSmith helps you understand what is happening inside your LLM-powered systems.
When you build applications with LLMs, a lot happens behind the scenes: prompts are constructed, models are called, responses are parsed, and chains of operations execute. Without proper visibility into this process, debugging issues or understanding why your AI behaves a certain way becomes nearly impossible. LangSmith solves this problem.
Why LangSmith Exists
Building with LLMs presents unique challenges that traditional debugging tools weren't designed for:
The Black Box Problem
LLM outputs are nondeterministic: the same prompt can produce different outputs, and it is often unclear why an agent took a particular action. LangSmith records each step of execution so you can inspect what actually happened.
Complex Chains and Agents
Modern AI applications involve multiple LLM calls, tool executions, and conditional logic. When something goes wrong in a 10-step agent workflow, finding the problem without proper tracing is like finding a needle in a haystack.
Quality Measurement
How do you know if your AI is getting better? Traditional software has tests that pass or fail. LLM outputs are nuanced and require systematic evaluation frameworks.
Production Monitoring
Once deployed, you need to track costs, latency, errors, and user satisfaction. LangSmith provides production-grade observability for AI systems.
Core Concepts
1. Tracing
Tracing is the foundation of LangSmith. Every operation in your LLM application is captured as a "run" with complete details:
import os
from langsmith import traceable
from langchain_openai import ChatOpenAI
# Enable tracing (set these environment variables)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
# Your LangChain code is automatically traced
llm = ChatOpenAI()
response = llm.invoke("Explain machine learning")
# View the trace in LangSmith dashboard showing:
# - Input prompt
# - Output response
# - Latency
# - Token usage
# - Cost
With the @traceable decorator, you can also trace your own custom functions:
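from langsmith import traceable
@traceable
def my_ai_function(user_query):
    # The function itself becomes a run; traced calls made inside it
    # (LLM invocations, other @traceable functions) appear as child runs.
    # retrieve_context and build_prompt stand in for your own helpers.
    context = retrieve_context(user_query)
    prompt = build_prompt(user_query, context)
    response = llm.invoke(prompt)
    return response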
2. Projects
Projects organize your traces. You might have separate projects for development, staging, and production, or for different features of your application:
# Specify which project to log traces to
os.environ["LANGCHAIN_PROJECT"] = "my-chatbot-dev"
# Or use different projects for different environments
# (APP_ENV is a hypothetical variable name; use whatever your deployment sets)
environment = os.getenv("APP_ENV", "development")
if environment == "production":
    os.environ["LANGCHAIN_PROJECT"] = "my-chatbot-prod"
else:
    os.environ["LANGCHAIN_PROJECT"] = "my-chatbot-dev"
3. Datasets
Datasets are collections of input-output examples used for testing and evaluation. You can create them manually, from production traces, or from existing data:
from langsmith import Client
client = Client()
# Create a dataset
dataset = client.create_dataset("qa-test-cases")
# Add examples
client.create_example(
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "Paris"},
    dataset_id=dataset.id
)
client.create_example(
    inputs={"question": "Who wrote Romeo and Juliet?"},
    outputs={"answer": "William Shakespeare"},
    dataset_id=dataset.id
)
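When examples already live in a spreadsheet or CSV export, they first need to be reshaped into the inputs/outputs pairs shown above. Here is a minimal sketch using only the standard library. The column names `question` and `answer` and the helper `rows_to_examples` are assumptions for illustration, not part of the LangSmith SDK.

```python
import csv
import io


def rows_to_examples(csv_text: str) -> list[dict]:
    """Convert CSV rows into dicts with 'inputs' and 'outputs' keys.

    Assumes (hypothetically) that the CSV has 'question' and 'answer' columns.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    examples = []
    for row in reader:
        question = row["question"].strip()
        answer = row["answer"].strip()
        if not question or not answer:
            continue  # skip incomplete rows rather than upload bad examples
        examples.append(
            {"inputs": {"question": question}, "outputs": {"answer": answer}}
        )
    return examples


# Made-up sample data; the last row is incomplete on purpose
SAMPLE_CSV = """question,answer
What is the capital of France?,Paris
Who wrote Romeo and Juliet?,William Shakespeare
,Orphan answer
"""

examples = rows_to_examples(SAMPLE_CSV)
for example in examples:
    print(example)
```

Each resulting dict could then be passed to `client.create_example(inputs=..., outputs=..., dataset_id=dataset.id)` in a loop.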
4. Evaluations
Evaluations let you systematically measure how well your AI performs. LangSmith supports custom evaluators and built-in metrics:
from langsmith.evaluation import evaluate, LangChainStringEvaluator
# Define what you want to evaluate
def my_qa_system(inputs):
    return {"answer": llm.invoke(inputs["question"]).content}
# Run evaluation with different metrics
results = evaluate(
    my_qa_system,
    data="qa-test-cases",
    evaluators=[
        # Criteria such as correctness and helpfulness are configured
        # on the criteria-based evaluators
        LangChainStringEvaluator("labeled_criteria", config={"criteria": "correctness"}),
        LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}),
    ],
    experiment_prefix="v1-gpt4"
)
# Compare different versions (my_qa_system_v2 is a second implementation you define)
results_v2 = evaluate(
    my_qa_system_v2,
    data="qa-test-cases",
    evaluators=[...],
    experiment_prefix="v2-claude"
)
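As for the custom evaluators mentioned above, here is a minimal sketch of one: a plain function that compares the system's answer with the reference answer and returns a named score. The names `normalize` and `exact_match` are hypothetical. The `key`/`score` result shape follows the convention used by LangSmith evaluators, but the exact function signature that `evaluate` accepts depends on your SDK version, so treat it as an assumption to check against the documentation.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1.0 when the predicted answer equals the reference answer."""
    predicted = normalize(outputs.get("answer", ""))
    expected = normalize(reference_outputs.get("answer", ""))
    return {"key": "exact_match", "score": 1.0 if predicted == expected else 0.0}


# Local check that needs no LangSmith account
print(exact_match({"answer": "  paris "}, {"answer": "Paris"}))
print(exact_match({"answer": "Lyon"}, {"answer": "Paris"}))
```

If your SDK version accepts evaluators with this signature, the function can be listed in `evaluators=[exact_match]` next to the built-in ones.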
5. Annotation Queues
For human evaluation, LangSmith provides annotation queues where team members can review and score AI outputs:
- Human feedback: Collect ratings, corrections, and comments
- Quality assurance: Review production outputs for issues
- Training data: Generate high-quality examples for fine-tuning
6. Monitoring Dashboard
The LangSmith dashboard provides real-time visibility into your production system:
- Trace counts: How many requests per hour/day
- Latency metrics: P50, P95, P99 response times
- Error rates: Failed requests and error types
- Cost tracking: Token usage and API costs
- Feedback scores: User satisfaction trends
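The latency metrics above (P50, P95, P99) are percentiles. A small worked example shows why they say more than an average. This sketch uses the nearest-rank definition; the dashboard may use a different interpolation method, and the sample latencies are made up.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() requires at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical request latencies in seconds (made-up sample data)
latencies = [0.8, 1.1, 0.9, 1.4, 0.7, 6.2, 1.0, 1.2, 0.95, 1.3]

for pct in (50, 95, 99):
    print(f"P{pct}: {percentile(latencies, pct):.2f}s")
```

In this sample the median stays at 1.0 s while P95 and P99 expose the single 6.2 s outlier, which is the kind of tail behavior users notice.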
When to Use LangSmith
Development & Debugging
Understand exactly what prompts are sent, what responses come back, and where errors occur in your chains.
Testing & CI/CD
Run automated evaluations on every code change to catch regressions before they reach production.
Prompt Engineering
Compare different prompt versions systematically to find what works best for your use case.
Production Monitoring
Track costs, latency, and errors in real time. Get alerts when things go wrong.
Model Comparison
Evaluate different models (GPT-4 vs Claude vs open-source) on your specific tasks.
Team Collaboration
Share traces with teammates, annotate issues, and build datasets together.
Getting Started
Setting up LangSmith is straightforward:
# 1. Sign up at smith.langchain.com and get your API key
# 2. Install the SDK (plus langchain-openai, which the example in step 4 uses)
pip install langsmith langchain-openai
# 3. Set environment variables (in your shell)
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-api-key
export LANGCHAIN_PROJECT=my-project
export OPENAI_API_KEY=your-openai-api-key  # required by ChatOpenAI
# 4. In Python, your LangChain code is now traced automatically
from langchain_openai import ChatOpenAI
llm = ChatOpenAI()
response = llm.invoke("Hello!")
# Check smith.langchain.com to see the trace
Using LangSmith Without LangChain
LangSmith works with any LLM code, not just LangChain:
from langsmith import traceable
from openai import OpenAI
client = OpenAI()
@traceable(run_type="llm")
def call_openai(prompt):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
# This function call will be traced in LangSmith
result = call_openai("What is 2+2?")
Key Features Deep Dive
Trace Visualization
LangSmith displays traces as a tree structure, showing parent-child relationships between operations. For an agent that:
- Receives a user question
- Decides to search the web
- Calls the search tool
- Synthesizes an answer
You'll see each step with its inputs, outputs, and timing. Click any node to see the full prompt and response.
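To make the tree structure concrete, here is a minimal sketch that models the agent steps listed above as nested runs and prints them with indentation. The `Run` dataclass and `render` helper are simplified stand-ins for illustration, not the SDK's own classes, and the timings are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """A simplified stand-in for one traced operation (not the SDK's Run class)."""
    name: str
    run_type: str
    latency_s: float
    children: list["Run"] = field(default_factory=list)


def render(run: Run, depth: int = 0) -> list[str]:
    """Return one indented line per run, with parents before their children."""
    lines = [f"{'  ' * depth}{run.name} [{run.run_type}] {run.latency_s:.1f}s"]
    for child in run.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical agent run: the root receives the question, the children are the steps
trace = Run("agent", "chain", 4.0, [
    Run("decide_action", "llm", 1.2),
    Run("web_search", "tool", 1.5),
    Run("synthesize_answer", "llm", 1.3),
])

print("\n".join(render(trace)))
```

The parent's latency covers its children, so a slow root run can be attributed to a specific child by reading down the tree.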
Comparing Experiments
When you run evaluations, LangSmith lets you compare results side-by-side:
- See which version performed better on each test case
- Identify specific examples where one model excels
- Track improvement trends over time
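The per-example comparison described above can be sketched locally. This assumes you have exported per-example scores for two experiments as dictionaries keyed by example ID. The helper `compare_experiments` and the scores are hypothetical.

```python
def compare_experiments(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    """Compare per-example scores from two experiments run on the same dataset."""
    shared = sorted(baseline.keys() & candidate.keys())
    improved = [ex for ex in shared if candidate[ex] > baseline[ex]]
    regressed = [ex for ex in shared if candidate[ex] < baseline[ex]]
    return {
        "improved": improved,
        "regressed": regressed,
        "unchanged": len(shared) - len(improved) - len(regressed),
    }


# Hypothetical scores keyed by example ID (made-up data)
v1 = {"ex-1": 1.0, "ex-2": 0.0, "ex-3": 1.0, "ex-4": 0.0}
v2 = {"ex-1": 1.0, "ex-2": 1.0, "ex-3": 0.0, "ex-4": 0.0}

print(compare_experiments(v1, v2))
```

Both versions have the same average here, yet they fail on different examples, which is why a per-example view is more informative than a single aggregate score.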
Filtering and Search
Find specific traces quickly with powerful filters:
- By time range
- By latency (e.g., "show me slow requests")
- By error status
- By feedback score
- By custom metadata tags
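The filters above combine with AND semantics: a trace is shown only if it satisfies every condition you set. The following local sketch illustrates that logic over plain dictionaries. The record layout and the `filter_traces` helper are assumptions for illustration and do not reflect the SDK's query API.

```python
# Hypothetical trace records (made-up data, not the SDK's run objects)
traces = [
    {"id": "t1", "latency_s": 0.9, "error": None, "metadata": {"user_id": "u1"}},
    {"id": "t2", "latency_s": 7.4, "error": None, "metadata": {"user_id": "u2"}},
    {"id": "t3", "latency_s": 2.1, "error": "RateLimitError", "metadata": {"user_id": "u1"}},
]


def filter_traces(records, min_latency_s=None, errors_only=False, metadata=None):
    """Return the IDs of records that satisfy every filter that was supplied."""
    selected = []
    for record in records:
        if min_latency_s is not None and record["latency_s"] < min_latency_s:
            continue
        if errors_only and record["error"] is None:
            continue
        if metadata and any(record["metadata"].get(k) != v for k, v in metadata.items()):
            continue
        selected.append(record["id"])
    return selected


print(filter_traces(traces, min_latency_s=5.0))            # slow requests
print(filter_traces(traces, errors_only=True))             # failed requests
print(filter_traces(traces, metadata={"user_id": "u1"}))   # one user's traces
```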
Best Practices
- Use meaningful project names: Organize by feature, environment, or team
- Add metadata to traces: Include user IDs, session IDs, or feature flags for better filtering
- Build golden datasets early: Collect examples of good inputs/outputs as you develop
- Run evaluations in CI: Catch regressions before they reach production
- Monitor production daily: Set up alerts for error spikes or latency increases
- Review traces regularly: Even successful requests might reveal optimization opportunities
- Collect user feedback: Use thumbs up/down or ratings to track real-world satisfaction
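Running evaluations in CI usually comes down to a quality gate: collect the per-example scores and fail the build when the aggregate drops below a threshold. Here is a minimal sketch of such a gate. The helper `check_quality_gate`, the scores, and the 0.7 threshold are hypothetical, and how you extract scores from an evaluation result depends on your SDK version.

```python
def check_quality_gate(scores: list[float], threshold: float) -> tuple[bool, float]:
    """Return (passed, mean_score); an empty score list counts as a failure."""
    if not scores:
        return False, 0.0
    mean_score = sum(scores) / len(scores)
    return mean_score >= threshold, mean_score


# Hypothetical per-example scores from an evaluation run (made-up data)
scores = [1.0, 1.0, 0.0, 1.0]
passed, mean_score = check_quality_gate(scores, threshold=0.7)
print(f"mean={mean_score:.2f} passed={passed}")

# In a CI job, a failed gate would typically end with a non-zero exit code:
# if not passed:
#     raise SystemExit(1)
```

Treating an empty result as a failure is a deliberate choice: an evaluation that silently produced no scores should not let a change through.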
LangSmith vs Alternatives
While there are other observability tools, LangSmith stands out for:
- Native LangChain integration: Zero-config tracing for LangChain apps
- Complete workflow: Tracing + datasets + evaluations + monitoring in one platform
- LLM-specific features: Built for AI applications, not generic APM
- Generous free tier: Suitable for development and small-scale production
Alternatives include Weights & Biases (Weave), Arize Phoenix, and Helicone, each with their own strengths.