Build Club Banter: Embeddings Part 2
Part 2: Embeddings
This article is brought to you by Build Club, the best place in the world to build in AI. We are top AI builders, shipping together. You can join us here
Contributors: Nissan Dookeran, Daniel Han, Isabelle De Backer, Daniel Bertram, 🤓 Daniel Wirjo
In the last article we explored vector databases: databases of embeddings that you can search. So now, let’s deep dive into embeddings.
Introduction
Embeddings in machine learning are high-dimensional, continuous vector representations. They encode complex data structures, such as words, images, or graphs, into a dense list of numbers that captures semantic or relational information. For example, “Hello there” might be encoded as [-0.134, 0.834] whilst “Build Club” is [3.234, -0.452] (real embeddings typically have hundreds or thousands of dimensions). The essence of an embedding lies in its ability to translate intricate, often non-numeric attributes into numerical space while preserving inherent relationships and properties. This transformation enables mathematical operations and comparisons that are otherwise challenging in the original space. Embeddings are generated through various techniques, with recent methods driven by neural network architectures such as autoencoders or language models in natural language processing (NLP). Recent models are trained on large datasets, enabling the embedding vectors to capture subtle patterns and relationships.
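To make this concrete, here is a minimal sketch of generating embeddings with the open-source sentence-transformers library. The model name below is just an illustrative choice on our part, not something prescribed by the text:

```python
# A minimal sketch: turning text into embedding vectors with sentence-transformers.
# "all-MiniLM-L6-v2" is an example model choice, not a recommendation.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors

sentences = ["Hello there", "Build Club"]
embeddings = model.encode(sentences)

print(embeddings.shape)      # (2, 384)
print(embeddings[0][:5])     # first few numbers of the "Hello there" vector
```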
The power of embeddings is evident in similarity search, clustering, and retrieval-augmented generation. In similarity search, embeddings allow for efficient comparisons between entities. Once items are represented as vectors, similarity metrics such as cosine similarity or Euclidean distance can be employed to find closely related items in the embedding space. This approach is foundational in systems like recommendation engines and document retrieval, where the goal is to find items similar to a user's preferences or query. In clustering, embeddings enable grouping of similar entities, often uncovering latent structures within the data. Techniques like K-means or hierarchical clustering can operate on these dense vectors to identify clusters, aiding in data exploration and pattern recognition.
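As a rough illustration of both ideas, here is a sketch that runs cosine similarity and K-means over a handful of toy vectors (the numbers stand in for real embeddings like the ones produced above):

```python
# A rough sketch of similarity search and clustering over embedding vectors.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

# Toy example: pretend these are embeddings for four documents.
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
])

# Similarity search: rank documents by cosine similarity to a query vector.
query = np.array([[0.85, 0.15, 0.05]])
scores = cosine_similarity(query, vectors)[0]
print(scores.argsort()[::-1])  # indices of the most similar documents first

# Clustering: group the vectors into two clusters with K-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)
```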
Embeddings in LLM Applications
Large Language Models (LLMs) internally convert natural language inputs into proprietary embeddings, a process typically outside the control of AI engineers. The focus for these engineers lies instead in the application and design of external embeddings, particularly in retrieval systems and clustering tasks. Whether for clustering (for example in summarization https://arxiv.org/abs/2401.18059, or finding patterns in large amounts of data), retrieval applications, or search engines, the right embedding selection significantly influences the effectiveness and accuracy of the application, underscoring the AI engineer's role in optimizing embeddings for specific tasks.
Choosing existing embeddings
The quality of text embeddings is highly dependent on the embedding model used. MTEB (the Massive Text Embedding Benchmark) measures the performance of text embedding models across a wide range of tasks and is designed to help you find the best embedding model out there for your use case!
Live doc here (Credit Daniel Han)
The cool thing about it: using the MTEB library, you can benchmark any model that produces embeddings and add its results to the public leaderboard!
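As a rough sketch of what that looks like in practice (the model and task names here are illustrative examples only; check the MTEB docs for the current API and the full task list):

```python
# A hedged sketch of benchmarking an embedding model with the MTEB library.
# Model and task names are examples, not recommendations.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Evaluate on a single task to keep the run small; the full benchmark covers many more.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```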
Fine-tuning embeddings
Of course, you can also fine-tune your embedding model (which is different from fine-tuning the LLM itself) to improve RAG accuracy and recall. For example, this blog post goes over fine-tuning of embeddings and compares it with fine-tuning of LLMs.
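As a rough sketch of what fine-tuning an embedding model can look like with sentence-transformers (the base model, training pairs, and hyperparameters below are purely illustrative; the linked post goes into much more depth):

```python
# An illustrative sketch of fine-tuning an embedding model with sentence-transformers
# on (query, relevant passage) pairs. All data and settings here are made up.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # example base model

# Tiny illustrative training set: each example pairs a query with a relevant passage.
train_examples = [
    InputExample(texts=["What is an embedding?",
                        "An embedding is a dense vector representation of data."]),
    InputExample(texts=["What is a vector database?",
                        "A vector database stores embeddings and supports similarity search."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# MultipleNegativesRankingLoss treats the other passages in a batch as negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
```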
And that’s it! We hope this article helped you understand embeddings and the different models available.
Stay tuned for our next series where we deep-dive into the wonderful world of RAG!