| title | summary |
|---|---|
| Get Started with TiDB + AI via Python | Learn how to get started with vector search in TiDB using Python SDK. |

This document demonstrates how to get started with vector search in TiDB using the Python SDK. Follow along to build your first AI application with TiDB.
By following this document, you will learn how to:
- Connect to TiDB using the TiDB Python SDK.
- Generate text embeddings with popular embedding models.
- Store vectors in TiDB tables.
- Perform semantic search using vector similarity.
Note:
- The vector search feature is in beta and might be changed without prior notice. If you find a bug, you can report an issue on GitHub.
- The vector search feature is available on TiDB Self-Managed, TiDB Cloud Starter, TiDB Cloud Essential, and TiDB Cloud Dedicated. For TiDB Self-Managed and TiDB Cloud Dedicated, the TiDB version must be v8.4.0 or later (v8.5.0 or later is recommended).
- Go to tidbcloud.com to create a TiDB Cloud Starter cluster for free, or use tiup playground to deploy a TiDB Self-Managed cluster for local testing.
pytidb is the official Python SDK for TiDB, designed to help developers build AI applications efficiently.
To install the Python SDK, run the following command:

pip install pytidb

Alternatively, to use the built-in embedding function, install pytidb with the models extension:

pip install "pytidb[models]"

You can get these connection parameters from the TiDB Cloud console:
- Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
- Click Connect in the upper-right corner. A connection dialog is displayed, with connection parameters listed.
For example, if the connection parameters are displayed as follows:
HOST: gateway01.us-east-1.prod.shared.aws.tidbcloud.com
PORT: 4000
USERNAME: 4EfqPF23YKBxaQb.root
PASSWORD: abcd1234
DATABASE: test
CA: /etc/ssl/cert.pem
The corresponding Python code to connect to the TiDB Cloud Starter cluster would be as follows:
from pytidb import TiDBClient

client = TiDBClient.connect(
    host="gateway01.us-east-1.prod.shared.aws.tidbcloud.com",
    port=4000,
    username="4EfqPF23YKBxaQb.root",
    password="abcd1234",
    database="test",
)

Note:
The preceding example is for demonstration purposes only. You need to fill in the parameters with your own values and keep them secure.
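
One way to keep credentials out of source code is to read them from environment variables. The following is a minimal sketch under stated assumptions: the variable names (`TIDB_HOST`, `TIDB_PORT`, `TIDB_USERNAME`, `TIDB_PASSWORD`, `TIDB_DATABASE`) and the helper `load_tidb_params()` are hypothetical and not part of pytidb. Only the keyword names are taken from the `TiDBClient.connect()` call shown in this document.

```python
import os

# Hypothetical environment variable names; pytidb does not define them.
REQUIRED_VARS = ("TIDB_HOST", "TIDB_USERNAME", "TIDB_PASSWORD")


def load_tidb_params(env=os.environ):
    """Build keyword arguments for TiDBClient.connect() from a mapping."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": env["TIDB_HOST"],
        # Fall back to the port and database used in this document's examples.
        "port": int(env.get("TIDB_PORT", "4000")),
        "username": env["TIDB_USERNAME"],
        "password": env["TIDB_PASSWORD"],
        "database": env.get("TIDB_DATABASE", "test"),
    }
```

With the variables exported in your shell, the connection call could then be written as `client = TiDBClient.connect(**load_tidb_params())`.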
Here is a basic example for connecting to a self-managed TiDB cluster:
from pytidb import TiDBClient

client = TiDBClient.connect(
    host="localhost",
    port=4000,
    username="root",
    password="",
    database="test",
    ensure_db=True,
)

Note:
Make sure to update the connection parameters according to your actual deployment.
Once connected, you can use the client object to operate tables, query data, and more.
When working with embedding models, you can leverage the embedding function to automatically vectorize your data at both insertion and query stages. It natively supports popular embedding models like OpenAI, Jina AI, Hugging Face, Sentence Transformers, and others.
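
To illustrate what automatic vectorization at both stages means, the following toy sketch applies one embedding callable when rows are inserted and again when a query arrives. It does not use pytidb or a real model: `toy_embed()` and `ToyTable` are hypothetical stand-ins, and the word-hashing vector and Euclidean ranking are chosen only to keep the example self-contained.

```python
import math
import zlib


def toy_embed(text, dim=32):
    """Stand-in embedding: hash each word into one of `dim` buckets and count."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    return vec


class ToyTable:
    """In-memory table that embeds text automatically on insert and on search."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (text, vector) pairs

    def insert(self, text):
        # Insertion stage: the vector is derived from the text itself.
        self.rows.append((text, self.embed(text)))

    def search(self, query, limit=3):
        # Query stage: the query text goes through the same embedding function.
        query_vec = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: math.dist(query_vec, row[1]))
        return [text for text, _ in ranked[:limit]]


table_sketch = ToyTable(toy_embed)
table_sketch.insert("vector search in a database")
table_sketch.insert("a recipe for baking sourdough bread at home")
print(table_sketch.search("vector search in a database", limit=1))
```

The caller only ever passes text. A real embedding model replaces the word-hashing step, but the shape of the flow is the same.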
Go to the OpenAI platform to create your API key for embedding.

from pytidb.embeddings import EmbeddingFunction

text_embed = EmbeddingFunction(
    model_name="openai/text-embedding-3-small",
    api_key="<your-openai-api-key>",
)

Go to Jina AI to create your API key for embedding.

from pytidb.embeddings import EmbeddingFunction

text_embed = EmbeddingFunction(
    model_name="jina/jina-embeddings-v3",
    api_key="<your-jina-api-key>",
)

As an example, create a table named chunks with the following columns:
- id (int): the ID of the chunk.
- text (text): the text content of the chunk.
- text_vec (vector): the vector embeddings of the text.
- user_id (int): the ID of the user who created the chunk.

from pytidb.schema import TableModel, Field, VectorField

class Chunk(TableModel):
    id: int | None = Field(default=None, primary_key=True)
    text: str = Field()
    text_vec: list[float] = text_embed.VectorField(source_field="text")
    user_id: int = Field()

table = client.create_table(schema=Chunk, if_exists="overwrite")

Once created, you can use the table object to insert data, search data, and more.
Now let's add some sample data to our table.
table.bulk_insert([
    # 👇 The text will be automatically embedded and populated into the `text_vec` field.
    Chunk(text="PyTiDB is a Python library for developers to connect to TiDB.", user_id=2),
    Chunk(text="LlamaIndex is a framework for building AI applications.", user_id=2),
    Chunk(text="OpenAI is a company and platform that provides AI models service and tools.", user_id=3),
])

To search for nearest neighbors of a given query, you can use the table.search() method. This method performs a vector search by default.
table.search(
    # 👇 Pass the query text directly; it is embedded into a query vector automatically.
    "A library for my artificial intelligence software"
).limit(3).to_list()

In this example, vector search compares the query vector with the stored vectors in the text_vec field of the chunks table and returns the top 3 most semantically relevant results based on similarity scores.
A smaller _distance means that the two vectors are more similar.
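
The relationship between `_distance` and `_score` can be sketched in plain Python. This is an illustration, not pytidb's implementation: it assumes the metric is cosine distance, and it assumes `_score = 1 - _distance`, which matches the sample values in this document (each pair sums to 1). The helpers `cosine_distance()` and `rank()` and the two-dimensional sample vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank(query_vec, rows, limit=3):
    """Sort rows by ascending distance and attach _distance and _score."""
    scored = []
    for row in rows:
        distance = cosine_distance(query_vec, row["vec"])
        # Assumed relationship, inferred from the sample output in this document.
        scored.append({**row, "_distance": distance, "_score": 1.0 - distance})
    scored.sort(key=lambda r: r["_distance"])
    return scored[:limit]


sample_rows = [
    {"id": 1, "vec": [1.0, 0.0]},  # same direction as the query
    {"id": 2, "vec": [0.0, 1.0]},  # orthogonal to the query
    {"id": 3, "vec": [1.0, 1.0]},  # 45 degrees from the query
]
top = rank([1.0, 0.0], sample_rows)
print([row["id"] for row in top])
```

The row pointing in the same direction as the query has the smallest distance and ranks first; the orthogonal row ranks last. For comparison, the actual output of the `table.search()` call in this document follows.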
[
    {
        'id': 2,
        'text': 'LlamaIndex is a framework for building AI applications.',
        'text_vec': [...],
        'user_id': 2,
        '_distance': 0.5719928358786761,
        '_score': 0.4280071641213239
    },
    {
        'id': 3,
        'text': 'OpenAI is a company and platform that provides AI models service and tools.',
        'text_vec': [...],
        'user_id': 3,
        '_distance': 0.603133726213383,
        '_score': 0.396866273786617
    },
    {
        'id': 1,
        'text': 'PyTiDB is a Python library for developers to connect to TiDB.',
        'text_vec': [...],
        'user_id': 2,
        '_distance': 0.6202191842385758,
        '_score': 0.3797808157614242
    }
]

To delete a specific row from the table, you can use the table.delete() method:

table.delete({
    "id": 1
})

When you no longer need a table, you can drop it using the client.drop_table() method:

client.drop_table("chunks")

- Learn more details about Vector Search, Full-Text Search and Hybrid Search in TiDB.