AI TRAINING
Embeddings & Semantic Search Foundations
Build a working semantic search system using embeddings, similarity indices, and reranking techniques.
What it covers
This hands-on training covers the full pipeline for semantic search: from selecting and fine-tuning embedding models to chunking strategies, vector indexing, and reranking. Participants will build a functional semantic search prototype by the end of the session. The format combines short concept modules with guided coding labs, targeting engineers and data practitioners who want to move beyond keyword search. Learners leave with reusable code patterns and a clear mental model for integrating semantic search into production systems.
What you'll be able to do
- Select and justify the right embedding model for a given domain and latency budget
- Design and implement a chunking pipeline that preserves semantic coherence across document types
- Build and query a vector index (FAISS or Qdrant) from scratch in Python
- Add a cross-encoder reranker to a bi-encoder retrieval pipeline and measure the quality uplift
- Evaluate retrieval quality using MRR and Recall@K on a labelled test set
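As a rough illustration of the last outcome, the two metrics can be computed in a few lines of plain Python. This is a minimal sketch, not the course's lab code: the function names, the query IDs, and the document IDs are all hypothetical, and it assumes each query has a set of relevant document IDs and a ranked list of retrieved IDs.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def evaluate(runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int) -> Dict[str, float]:
    """Average both metrics over every labelled query."""
    queries = list(labels)
    mrr = sum(reciprocal_rank(runs.get(q, []), labels[q]) for q in queries) / len(queries)
    recall = sum(recall_at_k(runs.get(q, []), labels[q], k) for q in queries) / len(queries)
    return {"mrr": mrr, f"recall@{k}": recall}


# Hypothetical toy data: two queries with invented document IDs.
labels = {"q1": {"d1", "d4"}, "q2": {"d7"}}
runs = {"q1": ["d3", "d1", "d9"], "q2": ["d7", "d2", "d5"]}
scores = evaluate(runs, labels, k=3)
print(scores)
```

Queries with no retrieved results count as zero rather than being skipped, so a system cannot improve its score by returning nothing for hard queries.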
Topics covered
- Embedding model taxonomy: dense vs sparse, open-source vs API-based
- Text chunking strategies: fixed-size, sentence, semantic, and recursive splitting
- Vector similarity metrics: cosine, dot product, and Euclidean distance, with their trade-offs
- Vector databases and ANN indexes: FAISS, Qdrant, Weaviate, pgvector
- Approximate nearest-neighbour search algorithms (HNSW, IVF)
- Reranking bi-encoder retrieval results with cross-encoders
- Evaluation metrics for retrieval quality: MRR, NDCG, Recall@K
- Production considerations: latency, scaling, hybrid search (BM25 + dense)
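To make the similarity-metric trade-off concrete, the sketch below compares the three metrics on hand-picked vectors using only the standard library. The vectors and helper names are hypothetical and chosen purely for illustration; real embeddings have hundreds of dimensions and would normally be handled with a numerical library.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product: sensitive to both direction and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: direction only, magnitude is divided out."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalize(a: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]


# Hypothetical 2-d vectors chosen to make the disagreement visible.
query = [1.0, 0.0]
doc_long = [10.0, 10.0]  # large magnitude, 45 degrees away from the query
doc_close = [0.9, 0.1]   # small magnitude, nearly parallel to the query

# Dot product favours doc_long; cosine and Euclidean favour doc_close.
print("dot:      ", dot(query, doc_long), dot(query, doc_close))
print("cosine:   ", cosine(query, doc_long), cosine(query, doc_close))
print("euclidean:", euclidean(query, doc_long), euclidean(query, doc_close))

# On unit-length vectors the metrics rank identically,
# because ||u - v||^2 = 2 - 2 * cos(u, v).
u, v = normalize(query), normalize(doc_close)
print(euclidean(u, v) ** 2, 2 - 2 * cosine(u, v))
```

The practical takeaway is that the choice of metric matters mainly when embeddings are not normalised; whether a given model expects normalised vectors should be checked against that model's documentation.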
Delivery
Typically delivered over 2-3 days, in person or live-virtual (Zoom/Teams). Each half-day block pairs a 30-minute concept session with a 90-minute guided coding lab using Jupyter notebooks. Participants receive a GitHub repo with starter code, pre-indexed datasets, and solution branches. Remote delivery works well; in-person is preferred for the debugging-heavy indexing labs. A cloud sandbox (Google Colab Pro or a provisioned GPU instance) is provided so participants can run experiments without local setup friction.
What makes it work
- Use a real internal document corpus in the labs; participants retain far more when the data is familiar
- Benchmark hybrid search (BM25 + dense) against pure dense from day one to build intuition
- Pair engineers with a data owner who can label a small golden test set for immediate evaluation
- Follow up with a short architecture review session 2-4 weeks post-training to unblock production decisions
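For the hybrid-versus-dense benchmark mentioned above, the two ranked lists have to be combined somehow before they can be scored. The page does not say which fusion method the labs use; the sketch below assumes reciprocal rank fusion (RRF), one common choice, with the conventional constant k = 60. The rankings and document IDs are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a BM25 retriever and a dense retriever for one query.
bm25_ranking = ["d2", "d5", "d1"]
dense_ranking = ["d1", "d2", "d8"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

RRF only needs ranks, not raw scores, which sidesteps the problem that BM25 scores and embedding similarities live on different scales. The fused list can then be scored with the same retrieval metrics as the dense-only list.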
Common mistakes
- Using a single generic embedding model across all domains without evaluating domain-specific alternatives
- Ignoring chunk size and overlap tuning, leading to poor retrieval precision on long documents
- Skipping reranking entirely and assuming ANN retrieval quality is sufficient for production
- Neglecting evaluation: shipping semantic search without a labelled test set or baseline comparison
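The chunk size and overlap parameters in the second mistake are easy to expose for tuning. The sketch below is a minimal fixed-size splitter that counts whitespace-separated words; that unit is an assumption made for brevity, since production pipelines usually count model tokens or split on sentence boundaries. The function name and sample text are hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size words, sharing overlap words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text: ten words, windows of four with two shared.
text = "one two three four five six seven eight nine ten"
chunks = chunk_words(text, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Sweeping `chunk_size` and `overlap` over a small grid and re-running the retrieval evaluation on a labelled test set is one straightforward way to tune them instead of guessing.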
When NOT to take this
A team that has not yet shipped any ML model to production and is still debating whether to use AI at all needs an AI literacy or use-case scoping workshop first, not a hands-on embeddings bootcamp.
Use cases this training unlocks
- Enterprise Knowledge Graph and Semantic Search: Connect documents, code, and conversations into a searchable knowledge graph for knowledge workers.
- Contextual Content Discovery Engine: Surface the right content to each user by combining NLP, mood, and real-time context signals.
- AI Legal Research Assistant: Accelerate legal research for lawyers by surfacing relevant case law, statutes, and citations instantly.
- Hyper-Personalized Content Recommendation Engine: Boost engagement by surfacing the right content to each user at the right moment.
- AI Patient Matching for Clinical Trials: Match eligible patients to clinical trials automatically by parsing medical records against study criteria.
- Podcast Discovery and Episode Matching: Match listeners to relevant podcasts and episodes using NLP-driven preference analysis.
This training is part of a Data & AI catalog built for leaders serious about execution. Take the free diagnostic to see which trainings your team needs.