Intro to Large Language Models (LLMs): concepts, tooling, and pitfalls

Sep 12, 2025

Large Language Models (LLMs) are probability machines trained to predict the next token (piece of text) given context. With enough data and compute, they learn useful capabilities: reasoning, translation, extraction, code generation, and conversation.

Key concepts in 3 minutes

  • Tokens: Models read/write tokens, not characters. Cost and limits are per token.
  • Context window: Max number of tokens a model can consider at once. Long context helps, but retrieval is still essential.
  • Embeddings: Fixed‑length vectors that represent text meaning. Used for search, clustering, and Retrieval‑Augmented Generation (RAG); a similarity sketch follows this list.
  • RAG: Fetch relevant snippets from your data (via embeddings + vector DB) and provide them as context to the model.
  • Fine‑tuning: Train the model on your examples to nudge behavior; best for narrow formats/style, not new knowledge.
  • Function/tool calling: Let the model request tools (DB queries, APIs) and you execute them.
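
Embeddings are easiest to grok with a tiny example. Here is a minimal sketch of the similarity math behind semantic search; embed() stands in for whatever embeddings client you use, and vectors are fixed-length lists of floats:

import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors: closer to 1.0 means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical usage with an embeddings client:
# q = embed("reset my password")
# d = embed("How to change your account password")
# cosine_similarity(q, d)  # high score -> semantically related, a good RAG candidate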

When to use LLMs

Great for:

  • Natural language interfaces, summarization, structured extraction
  • Drafting emails/docs, writing tests, code review assistants
  • Search and Q&A over private data (with RAG)

Not ideal for:

  • Exact arithmetic, hard real‑time constraints, or tasks where hallucinations are unacceptable unless you add strong guardrails.

Minimal building blocks

  1. Prompting: Give clear instructions, role, format, and examples.
  2. Retrieval: Embed your documents and retrieve top‑k relevant chunks.
  3. Guardrails: Validate output, enforce JSON schemas, check policies.
  4. Evaluation: Measure quality with golden tests and automated judges.

Prompting template (baseline)

System: You are a concise, truthful assistant. If unsure, say you don't know.
User: {question}
Context (non‑public):
{top_k_snippets}
Instructions:
- Cite sources by title.
- Return JSON with fields: answer, sources.
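
In code, this can live as a plain Python format string; a minimal sketch of the TEMPLATE that the retrieval pseudocode below fills in (the field names are the only assumption):

TEMPLATE = """System: You are a concise, truthful assistant. If unsure, say you don't know.
User: {question}
Context (non-public):
{top_k_snippets}
Instructions:
- Cite sources by title.
- Return JSON with fields: answer, sources.
"""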

Retrieval pipeline (pseudocode)

from my_embeddings import embed        # your embeddings client
from my_llm import generate            # your chat-completions client
from my_store import vector_store      # hypothetical vector DB client (pgvector, Pinecone, etc.)

def format_docs(docs):
    # Label each retrieved chunk with its source title so the model can cite it.
    return "\n\n".join(f"[{doc.title}]\n{doc.text}" for doc in docs)

def answer(query):
    q_vec = embed(query)                                    # embed the user question
    docs = vector_store.similarity_search(q_vec, top_k=5)   # fetch the top-k nearest chunks
    prompt = TEMPLATE.format(question=query, top_k_snippets=format_docs(docs))
    out = generate(prompt, response_format={"type": "json_object"})  # ask for JSON output
    return validate(out)                                    # schema-check before returning
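
With those pieces wired up, usage is a single call (validate() is sketched under Output control below):

result = answer("How do I rotate my API key?")
print(result)   # validated object with answer + sources fields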

Choosing a model

  • Start with a capable general model (GPT‑4o‑mini/4.1‑mini, Claude‑3.5 Sonnet, Llama‑3.1 70B). Use smaller models for low‑latency or edge.
  • Consider latency, cost per 1K tokens, context length, and tool‑use quality.

Fine‑tuning vs RAG

  • Use RAG when answers depend on your changing knowledge base.
  • Use fine‑tuning for formatting/style or to reduce prompt complexity; keep training sets small but clean.

Output control

  • Ask for structured output (JSON) and validate against a schema (a validation sketch follows this list).
  • Post‑process with deterministic code; never blindly trust free‑text.
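
One way to implement that validation step, and the validate() used in the pipeline above, is a small Pydantic model. A minimal sketch, assuming Pydantic v2 and the answer/sources schema from the prompt template:

from pydantic import BaseModel, ValidationError

class Answer(BaseModel):
    answer: str
    sources: list[str]

def validate(raw_json: str) -> Answer:
    # Parse and schema-check the model's JSON output.
    try:
        return Answer.model_validate_json(raw_json)
    except ValidationError:
        # In production: retry with a repair prompt, or fall back to a safe default.
        raise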

Hallucinations and safety

  • Provide sufficient context; add citations and confidence.
  • Add refusal rules (no medical/legal advice), profanity filters, and PII scrubbing where needed (a simple scrubbing sketch follows).
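
A minimal sketch of regex-based scrubbing; the patterns are illustrative only, and real PII detection warrants a dedicated library or service:

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text: str) -> str:
    # Mask obvious emails and phone numbers before logging or sending text to the model.
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)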

Evaluation (don’t skip this)

  • Create a test set of real prompts + expected outputs (a minimal harness is sketched after this list).
  • Use metrics: accuracy/F1 for extraction, preference votes for generation, latency, and cost.
  • Run evals on every change (like unit tests for prompts).
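
A minimal golden-test harness, assuming the answer() pipeline and Answer schema sketched above; the cases and expected values are illustrative:

GOLDEN_CASES = [
    # (question, substring the answer must contain) - curated from real traffic
    ("What is the refund window?", "30 days"),
    ("Which plans include SSO?", "Enterprise"),
]

def run_evals():
    passed = 0
    for question, expected in GOLDEN_CASES:
        result = answer(question)            # the RAG pipeline defined earlier
        if expected in result.answer:        # swap in exact match, F1, or an LLM judge as needed
            passed += 1
    print(f"{passed}/{len(GOLDEN_CASES)} golden cases passed")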

Cost & performance tips

  • Chunking: 300–800 tokens per chunk with some overlap is a good starting point for RAG (a simple chunker is sketched after this list).
  • Caching: Cache embeddings and successful completions.
  • Streaming: Stream tokens to improve UX.
  • Batching: Batch embeddings and tool calls.
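
A minimal chunking sketch; words stand in for tokens here, and in practice you would count with your model's tokenizer:

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Split text into overlapping windows of roughly `size` tokens (approximated by words).
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]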

Quick start (generic HTTP call)

curl https://api.llm.example/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-llm",
        "messages": [
          {"role":"system","content":"You are concise and truthful."},
          {"role":"user","content":"Give me 3 bullet points on vector search."}
        ],
        "temperature": 0.2
      }'

Checklist for production LLM features

  • Clear prompts with examples and strict output schemas
  • Retrieval with high‑quality embeddings and chunking
  • JSON validation + fallbacks + retries
  • Telemetry (prompt, latency, tokens, cost, outcomes)
  • Evals and canary rollouts before enabling for all users
  • Red‑team tests and safety filters

LLMs are powerful generalists. Pair them with retrieval, guardrails, and rigorous evaluation, and you can ship reliable, delightful features.