AI TRAINING

Embeddings & Semantic Search Foundations

Build a working semantic search system using embeddings, similarity indices, and reranking techniques.

Format: bootcamp
Duration: 14-24h
Level: practitioner
Group size: 6-16
Price / participant: €1K-€3K
Group price: €12K-€28K
Audience: Software engineers, ML engineers, and data scientists building search or retrieval features
Prerequisites: Proficiency in Python; familiarity with basic ML concepts and REST APIs; no prior vector DB experience required

What it covers

This hands-on training covers the full pipeline for semantic search: from selecting and fine-tuning embedding models to chunking strategies, vector indexing, and reranking. Participants will build a functional semantic search prototype by the end of the session. The format combines short concept modules with guided coding labs, targeting engineers and data practitioners who want to move beyond keyword search. Learners leave with reusable code patterns and a clear mental model for integrating semantic search into production systems.

What you'll be able to do

Select and justify the right embedding model for a given domain and latency budget
Design and implement a chunking pipeline that preserves semantic coherence across document types
Build and query a vector index (FAISS or Qdrant) from scratch in Python
Add a cross-encoder reranker to a bi-encoder retrieval pipeline and measure the quality uplift
Evaluate retrieval quality using MRR and Recall@K on a labelled test set
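
As a rough illustration of the last outcome, MRR and Recall@K can be computed from nothing more than ranked result lists and a labelled set of relevant documents per query. The sketch below is a minimal version under stated assumptions: the function names, the query and document IDs, and the toy data are all hypothetical and not taken from the course material.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def evaluate(results: Dict[str, List[str]],
             labels: Dict[str, Set[str]],
             k: int) -> Dict[str, float]:
    """Average both metrics over every labelled query."""
    queries = list(labels)
    mrr = sum(reciprocal_rank(results.get(q, []), labels[q]) for q in queries)
    recall = sum(recall_at_k(results.get(q, []), labels[q], k) for q in queries)
    return {"mrr": mrr / len(queries), f"recall@{k}": recall / len(queries)}


# Hypothetical golden set: query ID -> IDs of the documents judged relevant.
labels = {"q1": {"d1", "d4"}, "q2": {"d7"}}
# Hypothetical system output: query ID -> ranked document IDs.
results = {"q1": ["d3", "d1", "d9"], "q2": ["d7", "d2", "d5"]}

scores = evaluate(results, labels, k=3)
print(scores)
```

Running the same function on a baseline and on a candidate pipeline gives a like-for-like comparison, which is the point of keeping a labelled test set.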

Topics covered

Embedding model taxonomy: dense vs sparse, open-source vs API-based
Text chunking strategies: fixed-size, sentence, semantic, and recursive splitting
Vector similarity metrics: cosine, dot product, Euclidean, and the trade-offs between them
Vector databases and ANN indexes: FAISS, Qdrant, Weaviate, pgvector
Approximate nearest-neighbour search algorithms (HNSW, IVF)
Reranking bi-encoder retrieval results with cross-encoders
Evaluation metrics for retrieval quality: MRR, NDCG, Recall@K
Production considerations: latency, scaling, hybrid search (BM25 + dense)
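
To make the similarity-metric trade-offs in the list above concrete, the sketch below compares cosine similarity, dot product, and Euclidean distance on a few toy vectors. It uses plain Python rather than a vector library, and the vectors and function names are illustrative assumptions. The pattern it shows: on unnormalised vectors the dot product can favour a long vector that points in a worse direction, while on unit-length vectors the three metrics produce the same ranking.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot product of the two vectors after length normalisation."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    length = norm(a)
    return [x / length for x in a]


# Toy 2-D "embeddings" (hypothetical values chosen for illustration).
query = [1.0, 0.0]
close_short = [0.9, 0.1]   # nearly the same direction as the query, small norm
far_long = [3.0, 3.0]      # 45 degrees away from the query, large norm

# Dot product rewards magnitude, so it prefers the long vector...
print(dot(query, close_short), dot(query, far_long))
# ...while cosine and Euclidean distance both prefer the well-aligned one.
print(cosine(query, close_short), cosine(query, far_long))
print(euclidean(query, close_short), euclidean(query, far_long))

# For unit vectors, squared Euclidean distance equals 2 - 2 * cosine,
# so ranking by either metric (or by dot product) gives the same order.
unit_close = normalize(close_short)
lhs = euclidean(query, unit_close) ** 2
rhs = 2 - 2 * cosine(query, close_short)
print(lhs, rhs)
```

In practice this is one reason many pipelines normalise embeddings before indexing: the choice of metric then mostly affects speed and index support rather than ranking.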

Delivery

Typically delivered over 2-3 days in-person or live-virtual (Zoom/Teams). Each half-day block pairs a 30-minute concept session with a 90-minute guided coding lab using Jupyter notebooks. Participants receive a GitHub repo with starter code, pre-indexed datasets, and solution branches. Remote delivery works well; in-person is preferred for the debugging-heavy indexing labs. A cloud sandbox (Google Colab Pro or a provisioned GPU instance) is provided so participants can run experiments without local setup.

What makes it work

Use a real internal document corpus in the labs; participants retain far more when the data is familiar
Benchmark hybrid search (BM25 + dense) against pure dense from day one to build intuition
Pair engineers with a data owner who can label a small golden test set for immediate evaluation
Follow up with a short architecture review session 2-4 weeks post-training to unblock production decisions
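
Benchmarking hybrid search against pure dense retrieval requires some way to merge a BM25 ranking with a dense ranking. One widely used option is reciprocal rank fusion (RRF), sketched below. This page does not say which fusion method the training uses, so treat RRF as an assumption here. The document IDs and both rankings are made up, and `k=60` is the constant commonly used with RRF rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists by summing 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for the same query.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

RRF uses only rank positions, so BM25 scores and embedding similarities never need to be put on a common scale. That makes it a convenient first baseline before trying weighted score combinations.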

Common mistakes

Using a single generic embedding model across all domains without evaluating domain-specific alternatives
Ignoring chunk size and overlap tuning, leading to poor retrieval precision on long documents
Skipping reranking entirely and assuming ANN retrieval quality is sufficient for production
Neglecting evaluation: shipping semantic search without a labelled test set or baseline comparison
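
The chunk size and overlap mistake above is easier to reason about with the two parameters in front of you. The sketch below is a deliberately simple fixed-size, word-based chunker. The function name and the example text are illustrative assumptions, and a real pipeline would more likely count tokens or split on sentence boundaries.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # so no trailing chunk is made up purely of overlap.
        if start + chunk_size >= len(words):
            break
    return chunks


# Hypothetical ten-word document.
chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Sweeping `chunk_size` and `overlap` while watching Recall@K on a labelled test set is one straightforward way to tune them instead of guessing.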

When NOT to take this

Teams that have not yet shipped any ML model to production and are still debating whether to use AI at all should start with an AI literacy or use-case scoping workshop, not a hands-on embeddings bootcamp.

This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.