Managing Embeddings

Modern machine learning models can be trained to convert raw data into embeddings, which are vectors of floating-point numbers. The position of an embedding in vector space captures the semantics of the data, so vectors that are close to each other are considered similar. LanceDB provides an embedding function registry, in OSS as well as in its Cloud and Enterprise versions (see below), that automatically generates vector embeddings during data ingestion and querying. The API abstracts embedding generation, allowing you to focus on your application logic.
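
To make "close vectors are similar" concrete, here is a minimal sketch that compares three hand-written vectors using cosine similarity. The vectors and the `cosine_similarity` helper are illustrative assumptions, not output from a real model and not part of the LanceDB API; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-written toy vectors standing in for model output.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)
sim_far = cosine_similarity(cat, invoice)
print(f"cat vs kitten:  {sim_close:.3f}")
print(f"cat vs invoice: {sim_far:.3f}")
```

The first score comes out close to 1 and the second close to 0, which is the sense in which nearby vectors are treated as semantically similar.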

Embedding Registry

In LanceDB OSS, you can get a supported embedding function from the registry and then use it in your table schema. Once configured, the embedding function automatically generates embeddings when you insert data into the table. When you query the table, you can provide a query string or other input, and the embedding function generates an embedding for it.

from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

# Get a sentence-transformer function
func = get_registry().get("sentence-transformers").create()

class MySchema(LanceModel):
    # Embed the 'text' field automatically
    text: str = func.SourceField()
    # Store the embeddings in the 'vector' field
    vector: Vector(func.ndims()) = func.VectorField()

# Create a LanceDB table with the schema
import lancedb
db = lancedb.connect("./mydb")
table = db.create_table("mytable", schema=MySchema)
# Insert data - embeddings are generated automatically
table.add([
    {"text": "This is a test."},
    {"text": "Another example."}
])

# Query the table - embeddings are generated for the query
results = table.search("test example").limit(5).to_pandas()
print(results)

## Example Output
#                                   text                            vector  _distance
# 0                     This is a test.  [0.0123, -0.0456, ..., 0.0789]  0.123456
# 1                     Another example.  [0.0234, -0.0567, ..., 0.0890]  0.234567
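
The flow above (embed rows at insert time, embed the query at search time, rank by distance) can be sketched without LanceDB at all. Everything in this block is hypothetical: `toy_embed` is a crude letter-count embedding, `ToyTable` is an in-memory stand-in, and the search is a brute-force scan using Euclidean distance chosen for illustration. It is not how LanceDB stores or indexes data.

```python
import math
import string

def toy_embed(text: str) -> list[float]:
    """Hypothetical embedding: normalized letter counts (26 dimensions)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

class ToyTable:
    """In-memory stand-in for a table with an attached embedding function."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []

    def add(self, records: list[dict]) -> None:
        # Embed the source field at insert time and store the vector with the row.
        for record in records:
            self.rows.append({**record, "vector": self.embed(record["text"])})

    def search(self, query: str, limit: int = 5) -> list[dict]:
        # Embed the query with the same function, then rank rows by distance.
        q = self.embed(query)
        scored = [
            {"text": row["text"], "_distance": math.dist(q, row["vector"])}
            for row in self.rows
        ]
        return sorted(scored, key=lambda r: r["_distance"])[:limit]

table = ToyTable(toy_embed)
table.add([{"text": "This is a test."}, {"text": "Another example."}])
results = table.search("test example")
for row in results:
    print(row["text"], round(row["_distance"], 3))
```

The point of the sketch is that the same function must embed both the stored rows and the query; otherwise the distances are not comparable.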

Using an embedding function

The .create() method accepts several arguments to configure the embedding function’s behavior. max_retries is a special argument that applies to all providers.

Argument	Type	Description
`name`	`str`	The name of the model to use (e.g., `text-embedding-3-small`).
`max_retries`	`int`	The maximum number of times to retry on a failed API request. Defaults to 7.
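
As a rough illustration of what a retry limit means, the sketch below wraps a flaky call and retries it up to `max_retries` additional times. The `call_with_retries` helper, the exponential backoff, and the `flaky_request` function are assumptions made for this example; they do not describe LanceDB's internal retry logic.

```python
import time

def call_with_retries(request, max_retries: int = 7, base_delay: float = 0.0):
    """Call request(); on failure, retry up to max_retries more times."""
    attempt = 0
    while True:
        try:
            return request()
        except Exception:
            if attempt >= max_retries:
                raise
            # Exponential backoff between attempts (an assumption for this sketch).
            time.sleep(base_delay * (2 ** attempt))
            attempt += 1

# Hypothetical flaky request: fails twice, then succeeds.
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return [0.1, 0.2, 0.3]

embedding = call_with_retries(flaky_request, max_retries=7)
print(embedding, "after", calls["count"], "calls")
```

In this sketch a lower limit fails faster when a provider is down, while a higher limit tolerates more transient errors at the cost of longer waits.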

Other arguments are provider-specific. Common arguments include the following:

Argument	Type	Description
`batch_size`	`int`	The number of inputs to process in a single batch. Provider-specific.
`api_key`	`str`	The API key for the embedding provider. Can also be set via environment variables.
`device`	`str`	The device to run the model on (e.g., “cpu”, “cuda”). Defaults to automatic detection.

Find the full list of arguments for each provider in the integrations section.

Embedding model providers

LanceDB supports most popular embedding providers.

Text embeddings

Provider	Model ID	Default Model
OpenAI	`openai`	`text-embedding-ada-002`
Sentence Transformers	`sentence-transformers`	`all-MiniLM-L6-v2`
Hugging Face	`huggingface`	`colbert-ir/colbertv2.0`
Cohere	`cohere`	`embed-english-v3.0`
…	…	…

Multimodal embeddings

Provider	Model ID	Supported Inputs
OpenCLIP	`open-clip`	Text, Images
ImageBind	`imagebind`	Text, Images, Audio, Video
…	…	…

You can find all supported embedding models in the integrations section.

Embeddings in LanceDB Cloud and Enterprise

Currently, on LanceDB Cloud and Enterprise, the embedding registry supports automatic embedding generation during data ingestion only: embeddings are generated on the client side and stored in the remote table. Automatic query-time embedding generation is not yet supported, though it is planned for a future release. For now, generate the query embedding manually with the same embedding function that was used during ingestion, and pass the resulting vector to the search function.

import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

db = lancedb.connect(...)
func = get_registry().get("sentence-transformers").create()

class MySchema(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()

table = db.create_table("mytable", schema=MySchema)
table.add([
    {"text": "This is a test."},
    {"text": "Another example."}
])

# Manually generate embeddings for the query
query_vector = func.generate_embeddings(["test example"])[0]
results = table.search(query_vector).limit(5).to_pandas()

Custom Embedding Functions

You can always implement your own embedding function by inheriting from TextEmbeddingFunction (for text) or EmbeddingFunction (for multimodal data).

from lancedb.embeddings import register, TextEmbeddingFunction
from functools import cached_property

@register("my-embedder")
class MyTextEmbedder(TextEmbeddingFunction):
    model_name: str = "my-model"
    
    def generate_embeddings(self, texts: list[str]) -> list[list[float]]:
        # Your embedding logic here
        return self._model.encode(texts).tolist()
    
    def ndims(self) -> int:
        # Return the dimensionality of the embeddings
        return len(self.generate_embeddings(["test"])[0])
    
    @cached_property
    def _model(self):
        # Initialize your model once; MyEmbeddingModel is a placeholder
        # for your own model class
        return MyEmbeddingModel(self.model_name)
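
To experiment with a class like the one above before wiring in a real model, a deterministic stand-in for the placeholder `MyEmbeddingModel` can be useful. The `ToyHashingModel` below is hypothetical and not part of LanceDB: it hashes each token into one of a fixed number of buckets and normalizes the result. Its `encode` returns plain Python lists, so a caller using it would drop the `.tolist()` call.

```python
import hashlib
import math

class ToyHashingModel:
    """Hypothetical stand-in for MyEmbeddingModel; not part of LanceDB."""

    def __init__(self, name: str, ndims: int = 8):
        self.name = name
        self.ndims = ndims

    def encode(self, texts: list[str]) -> list[list[float]]:
        # One fixed-length vector per input text, in input order.
        return [self._embed(text) for text in texts]

    def _embed(self, text: str) -> list[float]:
        vec = [0.0] * self.ndims
        for token in text.lower().split():
            # Hash the token to pick a bucket deterministically.
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[digest[0] % self.ndims] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        # Leave the all-zero vector untouched for empty input.
        return [x / norm for x in vec] if norm else vec

model = ToyHashingModel("toy")
vectors = model.encode(["hello world", "hello world", ""])
print(len(vectors), "vectors of length", len(vectors[0]))
```

The sketch shows the contract an embedding function relies on: one vector per input, every vector the same length, and identical inputs producing identical vectors. The vectors carry no real semantics.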
