Intro to Large Language Models (LLMs): concepts, tooling, and pitfalls
Sep 12, 2025
ai · llm · nlp · rag
Large Language Models (LLMs) are probability machines trained to predict the next token (piece of text) given context. With enough data and compute, they learn useful capabilities: reasoning, translation, extraction, code generation, and conversation.
Key concepts in 3 minutes
- Tokens: Models read/write tokens, not characters. Cost and limits are per token.
- Context window: Max number of tokens a model can consider at once. Long context helps, but retrieval is still essential.
- Embeddings: Fixed‑length vectors that represent text meaning. Used for search, clustering, and Retrieval‑Augmented Generation (RAG); see the similarity sketch after this list.
- RAG: Fetch relevant snippets from your data (via embeddings + vector DB) and provide them as context to the model.
- Fine‑tuning: Train the model on your examples to nudge behavior; best for narrow formats/style, not new knowledge.
- Function/tool calling: Let the model request tools (DB queries, APIs) and you execute them.
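To make the embeddings and RAG items above concrete, here is a minimal similarity-search sketch in Python. embed() is a stand-in for whatever embedding model or API you call (an assumption, not a real library); only the cosine ranking over precomputed vectors is shown.

import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vectors, k=3):
    # Rank document ids by similarity of their vectors to the query vector
    scored = sorted(doc_vectors.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Hypothetical usage, assuming embed(text) returns a list of floats:
# doc_vectors = {doc_id: embed(text) for doc_id, text in documents.items()}
# best = top_k(embed("how do I reset my password?"), doc_vectors, k=3)

A production RAG setup delegates this ranking to a vector database, but the idea is the same: nearest vectors in, relevant snippets out.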
When to use LLMs
Great for:
- Natural language interfaces, summarization, structured extraction
- Drafting emails/docs, writing tests, code review assistants
- Search and Q&A over private data (with RAG)
Not ideal for:
- Exact arithmetic, hard real‑time constraints, or tasks where hallucinations cannot be tolerated and no guardrails are in place.
Minimal building blocks
- Prompting: Give clear instructions, role, format, and examples.
- Retrieval: Embed your documents and retrieve top‑k relevant chunks.
- Guardrails: Validate output, enforce JSON schemas, check policies.
- Evaluation: Measure quality with golden tests and automated judges.
Prompting template (baseline)
System: You are a concise, truthful assistant. If unsure, say you don't know.
User: {question}
Context (non‑public):
{top_k_snippets}
Instructions:
- Cite sources by title.
- Return JSON with fields: answer, sources.
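In code, the template above can live as a plain format string; a minimal sketch (the TEMPLATE name is chosen to match the retrieval pipeline below, and the exact wording is yours to tune):

TEMPLATE = """System: You are a concise, truthful assistant. If unsure, say you don't know.
User: {question}
Context (non-public):
{top_k_snippets}
Instructions:
- Cite sources by title.
- Return JSON with fields: answer, sources."""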
Retrieval pipeline (pseudocode)
from my_embeddings import embed
from my_llm import generate

def answer(query):
    # Embed the query and fetch the most similar chunks from the vector store
    q_vec = embed(query)
    docs = vector_store.similarity_search(q_vec, top_k=5)
    # Fill the prompt template with the question and formatted context
    prompt = TEMPLATE.format(question=query, top_k_snippets=format_docs(docs))
    # Ask the model for a JSON object and validate it before returning
    out = generate(prompt, response_format={"type": "json_object"})
    return validate(out)
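The pipeline leans on helpers it does not define. A minimal sketch of format_docs, assuming each retrieved document is a dict with title and text keys (field names are an assumption about your store); validate is sketched under Output control below.

def format_docs(docs):
    # Render retrieved chunks as titled snippets so the model can cite them by title
    return "\n\n".join(f"[{d['title']}]\n{d['text']}" for d in docs)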
Choosing a model
- Start with a capable general model (GPT‑4o‑mini/4.1‑mini, Claude‑3.5 Sonnet, Llama‑3.1 70B). Use smaller models for low‑latency or edge.
- Consider latency, cost per 1K tokens, context length, and tool‑use quality.
Fine‑tuning vs RAG
- Use RAG when answers depend on your changing knowledge base.
- Use fine‑tuning for formatting/style or to reduce prompt complexity; keep training sets small but clean.
Output control
- Ask for structured output (JSON) and validate it against a schema (sketched below).
- Post‑process with deterministic code; never blindly trust free‑text.
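A minimal sketch of the validate step used in the pipeline, assuming the model was asked for a JSON object with answer and sources fields (as in the template above); anything that fails to parse or is missing a field raises, so deterministic code can retry or fall back.

import json

REQUIRED_FIELDS = ("answer", "sources")

def validate(raw):
    # Parse model output and enforce the expected shape; raise on anything unexpected
    data = json.loads(raw)
    for field in REQUIRED_FIELDS:
        if field not in data:
            raise ValueError(f"missing field: {field}")
    if not isinstance(data["sources"], list):
        raise ValueError("sources must be a list")
    return data

In practice a schema library such as jsonschema or pydantic does this more thoroughly.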
Hallucinations and safety
- Ground answers in retrieved context; ask the model to cite its sources and to say when it is unsure.
- Add refusal rules (no medical/legal advice), profanity filters, and PII scrubbing where needed; a crude PII‑scrubbing sketch follows.
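As one concrete example, a regex-based PII scrub (a sketch only; the patterns are illustrative, not exhaustive, and real deployments usually rely on a dedicated PII-detection step):

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text):
    # Replace obvious emails and phone numbers before logging or sending text onward
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)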
Evaluation (don’t skip this)
- Create a test set of real prompts + expected outputs.
- Use metrics: accuracy/F1 for extraction, preference votes for generation, latency, and cost.
- Run evals on every change (like unit tests for prompts); a minimal loop is sketched below.
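A minimal eval loop, assuming a golden set of (prompt, expected answer) pairs and an answer() function like the one in the retrieval pipeline; exact-match accuracy suits extraction-style tasks, while free-form generation usually needs preference votes or an automated judge.

def run_evals(golden_set, answer_fn):
    # golden_set: list of (prompt, expected_answer) pairs
    passed, failures = 0, []
    for prompt, expected in golden_set:
        got = answer_fn(prompt)["answer"].strip()
        if got == expected.strip():
            passed += 1
        else:
            failures.append((prompt, expected, got))
    return passed / len(golden_set), failures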
Cost & performance tips
- Chunking: 300–800 tokens per chunk with some overlap works well for RAG (see the sketch after this list).
- Caching: Cache embeddings and successful completions.
- Streaming: Stream tokens to improve UX.
- Batching: Batch embeddings and tool calls.
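A minimal chunking sketch following the 300–800-token guidance above; it splits on whitespace words as a rough stand-in for tokens (swap in your tokenizer's counts for production).

def chunk(text, size=500, overlap=50):
    # Split text into overlapping word-based chunks (words approximate tokens here)
    words = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks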
Quick start (generic HTTP call)
curl https://api.llm.example/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [
      {"role":"system","content":"You are concise and truthful."},
      {"role":"user","content":"Give me 3 bullet points on vector search."}
    ],
    "temperature": 0.2
  }'
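The same call from Python with the requests library, assuming an OpenAI-compatible endpoint and response shape (choices[0].message.content is an assumption about the response format):

import os
import requests

resp = requests.post(
    "https://api.llm.example/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json={
        "model": "my-llm",
        "messages": [
            {"role": "system", "content": "You are concise and truthful."},
            {"role": "user", "content": "Give me 3 bullet points on vector search."},
        ],
        "temperature": 0.2,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])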
Checklist for production LLM features
- Clear prompts with examples and strict output schemas
- Retrieval with high‑quality embeddings and chunking
- JSON validation + fallbacks + retries
- Telemetry (prompt, latency, tokens, cost, outcomes)
- Evals and canary rollouts before enabling for all users
- Red‑team tests and safety filters
LLMs are powerful generalists. Pair them with retrieval, guardrails, and rigorous evaluation, and you can ship reliable, delightful features.