Intro to Large Language Models (LLMs): concepts, tooling, and pitfalls

Sep 12, 2025

Large Language Models (LLMs) are probability machines trained to predict the next token (piece of text) given context. With enough data and compute, they learn useful capabilities: reasoning, translation, extraction, code generation, and conversation.

Key concepts in 3 minutes

  • Tokens: Models read/write tokens, not characters. Cost and limits are per token.
  • Context window: Max number of tokens a model can consider at once. Long context helps, but retrieval is still essential.
  • Embeddings: Fixed‑length vectors that represent text meaning. Used for search, clustering, and Retrieval‑Augmented Generation (RAG); a similarity sketch follows this list.
  • RAG: Fetch relevant snippets from your data (via embeddings + vector DB) and provide them as context to the model.
  • Fine‑tuning: Train the model on your examples to nudge behavior; best for narrow formats/style, not new knowledge.
  • Function/tool calling: Let the model request tools (DB queries, APIs) and you execute them.
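
Embeddings are easiest to grok with a tiny example. Here is a minimal sketch of the similarity math behind semantic search; embed() stands in for whatever embeddings client you use, and vectors are fixed-length lists of floats:

import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors: closer to 1.0 means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical usage with an embeddings client:
# q = embed("reset my password")
# d = embed("How to change your account password")
# cosine_similarity(q, d)  # high score -> semantically related, a good RAG candidate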

When to use LLMs

Great for:

  • Natural language interfaces, summarization, structured extraction
  • Drafting emails/docs, writing tests, code review assistants
  • Search and Q&A over private data (with RAG)

Not ideal for:

  • Exact arithmetic, hard real‑time constraints, or tasks where hallucinations are unacceptable unless you add strong guardrails.

Minimal building blocks

  1. Prompting: Give clear instructions, role, format, and examples.
  2. Retrieval: Embed your documents and retrieve top‑k relevant chunks.
  3. Guardrails: Validate output, enforce JSON schemas, check policies.
  4. Evaluation: Measure quality with golden tests and automated judges.

Prompting template (baseline)

System: You are a concise, truthful assistant. If unsure, say you don't know.
User: {question}
Context (non‑public):
{top_k_snippets}
Instructions:
- Cite sources by title.
- Return JSON with fields: answer, sources.
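
In code, this can live as a plain Python format string; a minimal sketch of the TEMPLATE that the retrieval pseudocode below fills in (the field names are the only assumption):

TEMPLATE = """System: You are a concise, truthful assistant. If unsure, say you don't know.
User: {question}
Context (non-public):
{top_k_snippets}
Instructions:
- Cite sources by title.
- Return JSON with fields: answer, sources.
"""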

Retrieval pipeline (pseudocode)

from my_embeddings import embed        # your embeddings client
from my_llm import generate            # your chat-completions client
from my_store import vector_store      # hypothetical vector DB client (pgvector, Pinecone, etc.)

def format_docs(docs):
    # Label each retrieved chunk with its source title so the model can cite it.
    return "\n\n".join(f"[{doc.title}]\n{doc.text}" for doc in docs)

def answer(query):
    q_vec = embed(query)                                    # embed the user question
    docs = vector_store.similarity_search(q_vec, top_k=5)   # fetch the top-k nearest chunks
    prompt = TEMPLATE.format(question=query, top_k_snippets=format_docs(docs))
    out = generate(prompt, response_format={"type": "json_object"})  # ask for JSON output
    return validate(out)                                    # schema-check before returning
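
With those pieces wired up, usage is a single call (validate() is sketched under Output control below):

result = answer("How do I rotate my API key?")
print(result)   # validated object with answer + sources fields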

Choosing a model

  • Start with a capable general model (GPT‑4o‑mini/4.1‑mini, Claude‑3.5 Sonnet, Llama‑3.1 70B). Use smaller models for low‑latency or edge.
  • Consider latency, cost per 1K tokens, context length, and tool‑use quality.

Fine‑tuning vs RAG

  • Use RAG when answers depend on your changing knowledge base.
  • Use fine‑tuning for formatting/style or to reduce prompt complexity; keep training sets small but clean.

Output control

  • Ask for structured output (JSON) and validate against a schema (a validation sketch follows this list).
  • Post‑process with deterministic code; never blindly trust free‑text.
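
One way to implement that validation step, and the validate() used in the pipeline above, is a small Pydantic model. A minimal sketch, assuming Pydantic v2 and the answer/sources schema from the prompt template:

from pydantic import BaseModel, ValidationError

class Answer(BaseModel):
    answer: str
    sources: list[str]

def validate(raw_json: str) -> Answer:
    # Parse and schema-check the model's JSON output.
    try:
        return Answer.model_validate_json(raw_json)
    except ValidationError:
        # In production: retry with a repair prompt, or fall back to a safe default.
        raise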

Hallucinations and safety

  • Provide sufficient context; add citations and confidence.
  • Add refusal rules (no medical/legal advice), profanity filters, and PII scrubbing where needed (a simple scrubbing sketch follows).
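
A minimal sketch of regex-based scrubbing; the patterns are illustrative only, and real PII detection warrants a dedicated library or service:

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text: str) -> str:
    # Mask obvious emails and phone numbers before logging or sending text to the model.
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)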

Evaluation (don’t skip this)

  • Create a test set of real prompts + expected outputs (a minimal harness is sketched after this list).
  • Use metrics: accuracy/F1 for extraction, preference votes for generation, latency, and cost.
  • Run evals on every change (like unit tests for prompts).
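
A minimal golden-test harness, assuming the answer() pipeline and Answer schema sketched above; the cases and expected values are illustrative:

GOLDEN_CASES = [
    # (question, substring the answer must contain) - curated from real traffic
    ("What is the refund window?", "30 days"),
    ("Which plans include SSO?", "Enterprise"),
]

def run_evals():
    passed = 0
    for question, expected in GOLDEN_CASES:
        result = answer(question)            # the RAG pipeline defined earlier
        if expected in result.answer:        # swap in exact match, F1, or an LLM judge as needed
            passed += 1
    print(f"{passed}/{len(GOLDEN_CASES)} golden cases passed")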

Cost & performance tips

  • Chunking: 300–800 tokens per chunk with some overlap is a good starting point for RAG (a simple chunker is sketched after this list).
  • Caching: Cache embeddings and successful completions.
  • Streaming: Stream tokens to improve UX.
  • Batching: Batch embeddings and tool calls.
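
A minimal chunking sketch; words stand in for tokens here, and in practice you would count with your model's tokenizer:

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Split text into overlapping windows of roughly `size` tokens (approximated by words).
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]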

Quick start (generic HTTP call)

curl https://api.llm.example/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-llm",
        "messages": [
          {"role":"system","content":"You are concise and truthful."},
          {"role":"user","content":"Give me 3 bullet points on vector search."}
        ],
        "temperature": 0.2
      }'

Checklist for production LLM features

  • Clear prompts with examples and strict output schemas
  • Retrieval with high‑quality embeddings and chunking
  • JSON validation + fallbacks + retries
  • Telemetry (prompt, latency, tokens, cost, outcomes)
  • Evals and canary rollouts before enabling for all users
  • Red‑team tests and safety filters

LLMs are powerful generalists. Pair them with retrieval, guardrails, and rigorous evaluation, and you can ship reliable, delightful features.