title

summary

aliases

Get Started with TiDB + AI via Python

Learn how to get started with vector search in TiDB using Python SDK.

/tidb/stable/vector-search-get-started-using-python/

/tidb/dev/vector-search-get-started-using-python/

/tidbcloud/vector-search-get-started-using-python/

Get Started with TiDB + AI via Python

This document demonstrates how to get started with Vector Search in TiDB using Python SDK. Follow along to build your first AI application working with TiDB.

By following this document, you will learn how to:

Connect to TiDB using the TiDB Python SDK.
Generate text embeddings with popular embedding models.
Store vectors in TiDB tables.
Perform semantic search using vector similarity.

Note:

The vector search feature is in beta and might be changed without prior notice. If you find a bug, you can report an issue on GitHub.

The vector search feature is available on TiDB Self-Managed, {{{ .starter }}}, {{{ .essential }}}, and TiDB Cloud Dedicated. For TiDB Self-Managed and TiDB Cloud Dedicated, the TiDB version must be v8.4.0 or later (v8.5.0 or later is recommended).

Prerequisites

Go to tidbcloud.com to create a TiDB Cloud Starter cluster for free or using tiup playground to deploy a TiDB Self-Managed cluster for local testing.

Installation

pytidb is the official Python SDK for TiDB, designed to help developers build AI applications efficiently.

To install the Python SDK, run the following command:

pip install pytidb

To use built-in embedding function, install the models extension (alternative):

pip install "pytidb[models]"

Connect to database

You can get these connection parameters from the TiDB Cloud console:

Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
Click Connect in the upper-right corner. A connection dialog is displayed, with connection parameters listed.

For example, if the connection parameters are displayed as follows:

HOST:     gateway01.us-east-1.prod.shared.aws.tidbcloud.com
PORT:     4000
USERNAME: 4EfqPF23YKBxaQb.root
PASSWORD: abcd1234
DATABASE: test
CA:       /etc/ssl/cert.pem

The corresponding Python code to connect to the TiDB Cloud Starter cluster would be as follows:

from pytidb import TiDBClient

client = TiDBClient.connect(
    host="gateway01.us-east-1.prod.shared.aws.tidbcloud.com",
    port=4000,
    username="4EfqPF23YKBxaQb.root",
    password="abcd1234",
    database="test",
)

Note:

The preceding example is for demonstration purposes only. You need to fill in the parameters with your own values and keep them secure.

Here is a basic example for connecting to a self-managed TiDB cluster:

from pytidb import TiDBClient

client = TiDBClient.connect(
    host="localhost",
    port=4000,
    username="root",
    password="",
    database="test",
    ensure_db=True,
)

Note:

Make sure to update the connection parameters according to your actual deployment.

Once connected, you can use the client object to operate tables, query data, and more.

Create an embedding function

When working with embedding models, you can leverage the embedding function to automatically vectorize your data at both insertion and query stages. It natively supports popular embedding models like OpenAI, Jina AI, Hugging Face, Sentence Transformers, and others.

Go to OpenAI platform to create your API key for embedding.

from pytidb.embeddings import EmbeddingFunction

text_embed = EmbeddingFunction(
    model_name="openai/text-embedding-3-small",
    api_key="<your-openai-api-key>",
)

Go to Jina AI to create your API key for embedding.

from pytidb.embeddings import EmbeddingFunction

text_embed = EmbeddingFunction(
    model_name="jina/jina-embeddings-v3",
    api_key="<your-jina-api-key>",
)

Create a table

As an example, create a table named chunks with the following columns:

id (int): the ID of the chunk.
text (text): the text content of the chunk.
text_vec (vector): the vector embeddings of the text.
user_id (int): the ID of the user who created the chunk.

from pytidb.schema import TableModel, Field, VectorField

class Chunk(TableModel):
    id: int | None = Field(default=None, primary_key=True)
    text: str = Field()
    text_vec: list[float] = text_embed.VectorField(source_field="text")
    user_id: int = Field()

table = client.create_table(schema=Chunk, if_exists="overwrite")

Once created, you can use the table object to insert data, search data, and more.

Insert Data

Now let's add some sample data to our table.

table.bulk_insert([
    # 👇 The text will be automatically embedded and populated into the `text_vec` field.
    Chunk(text="PyTiDB is a Python library for developers to connect to TiDB.", user_id=2),
    Chunk(text="LlamaIndex is a framework for building AI applications.", user_id=2),
    Chunk(text="OpenAI is a company and platform that provides AI models service and tools.", user_id=3),
])

Search for nearest neighbors

To search for nearest neighbors of a given query, you can use the table.search() method. This method performs a vector search by default.

table.search(
    # 👇 Pass the query text directly, it will be embedded to a query vector automatically.
    "A library for my artificial intelligence software"
)
.limit(3).to_list()

In this example, vector search compares the query vector with the stored vectors in the text_vec field of the chunks table and returns the top 3 most semantically relevant results based on similarity scores.

The closer _distance means the more similar the two vectors are.

[
    {
        'id': 2,
        'text': 'LlamaIndex is a framework for building AI applications.',
        'text_vec': [...],
        'user_id': 2,
        '_distance': 0.5719928358786761,
        '_score': 0.4280071641213239
    },
    {
        'id': 3,
        'text': 'OpenAI is a company and platform that provides AI models service and tools.',
        'text_vec': [...],
        'user_id': 3,
        '_distance': 0.603133726213383,
        '_score': 0.396866273786617
    },
    {
        'id': 1,
        'text': 'PyTiDB is a Python library for developers to connect to TiDB.',
        'text_vec': [...],
        'user_id': 2,
        '_distance': 0.6202191842385758,
        '_score': 0.3797808157614242
    }
]

Delete data

To delete a specific row from the table, you can use the table.delete() method:

table.delete({
    "id": 1
})

Drop table

When you no longer need a table, you can drop it using the client.drop_table() method:

client.drop_table("chunks")

Next steps

Learn more details about Vector Search, Full-Text Search and Hybrid Search in TiDB.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Get Started with TiDB + AI via Python

Prerequisites

Installation

Connect to database

Create an embedding function

Create a table

Insert Data

Search for nearest neighbors

Delete data

Drop table

Next steps

FilesExpand file tree

quickstart-via-python.md

Latest commit

History

quickstart-via-python.md

File metadata and controls

Get Started with TiDB + AI via Python

Prerequisites

Installation

Connect to database

Create an embedding function

Create a table

Insert Data

Search for nearest neighbors

Delete data

Drop table

Next steps