Getting Started with Vector Databases with `pgvector` and Python. 1/n
Whether building Retrieval-Augmented Generation (RAG) for LLMs or a face recognition application, we must compare and find the closest embeddings. To achieve this, we can use specialized vector databases, known for their capability to perform similarity searches. Popular options include: ChromaDB, pinecone, GCP Vector Search, etc.
In this tutorial, we’ll install and upgrade our PostgreSQL database with the pgvector
extension, empowering it for vector search tasks.
We will be installing PostgreSQL as well as pgvector
on the Ubuntu 22.04 server. To follow along access your server via SSH or a direct terminal session.
Our Roadmap:
- Prepare the Foundation: Install PostgreSQL.
- Enhance with Vector Power: Install the
pgvector
extension. - Set Up Your Data Playground: Create a user, database, and table for testing.
- Bridge to Code: Install Python dependencies.
- Make the Connection: Link PostgreSQL and your Python environment.
1. Install PostgreSQL
PostgreSQL is a relational database for storing and organizing structured data. Whether building a web application, managing customer records, or analyzing complex datasets, PostgreSQL provides the tools needed.
Let’s install PostgreSQL using following commands:
# in user@server
sudo apt update
sudo apt install -y postgresql postgresql-contrib
# replace <14> in the following command with your version
# check using `psql --version`
sudo apt install postgresql-server-dev-14
2. Install `pgvector`
pgvector
unlocks the ability of vector similarity search within the PostgreSQL database. It introduces specialized data types, functions, and operators designed to store, manipulate, and efficiently search through high-dimensional vector representations. By embedding pgvector
into your existing PostgreSQL setup, you can seamlessly perform tasks like finding similar images, identifying related text content, or building recommendation systems directly alongside your structured data.
To install pgvector
, follow these steps:
Prepare your system (Ubuntu 22.04 example):
sudo apt install -y build-essential # Install 'gcc' compiler
Download and Install pgvector
:
cd /tmp
git clone --branch v0.6.2 https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
We are all set! Now we can harness the power of vector search within PostgreSQL.
3. Create User, Database, Table, and Insert Items
We will use the following steps to configure the PostgreSQL database for vector search functionality.
- First, we gain administrative privileges as the
postgres
user.
# 1. in user@server
sudo -i -u postgres
2. Then, we will create a dedicated user anas
and database embeddings_db
for storing embeddings.
# 2. in postgres@pgserver#
createuser --interactive #i used anas as username and yes for superuser
createdb --owner=anas embeddings_db
3. Inside the database, we enable the pgvector
extension, create a table to hold embeddings, and add some sample data.
psql
# 3. in postgres=#
\password
\c embeddings_db
CREATE EXTENSION vector;
CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));
INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]');
4. Finally, we demonstrate a simple similarity search and exit the PostgreSQL environment.
# sample query
SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;
\q
# in postgres@pgserver#
exit
# now we should be in <user>@<server>
4. Install Python and dependencies
We will follow these steps to prepare our server for working with Python.
- First, we install Python 3.10 (a specific version of the language) and its virtual environment tool (
python3.10-venv
).
# in user@server
sudo apt install python3.10 python3.10-venv
2. Next, we create a virtual environment named, isolating project dependencies.
cd ~
python -m venv env
3. We activate the environment (. env/bin/activate
) ensures we're working within that isolated space.
. env/bin/activate
4. Finally, we install psycopg2-binary
, to interact with your PostgreSQL database.
pip install psycopg2-binary
5. Link PostgreSQL and your Python environment
To begin creating your Python script, we’ll use nano
text editor and create a file named main.py
.
#in user@server
nano main.py
In this step, we’ll write a Python script to connect to your ‘items’ database. For simplicity, we’re directly including the username (‘anas’) and password (‘dba’) in the code. Important: In a real-world application, it’s strongly recommended to use more secure techniques for managing passwords, such as environment variables or secrets managers.
import psycopg2
conn = psycopg2.connect(
database="embeddings_db",
user="anas",
password="dba",
host="localhost",
port="5432"
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;")
results = cursor.fetchall()
# Print results (if any)
for row in results:
print(row)
# Close connections
cursor.close()
conn.close()
We will use Ctrl + S
to save our changes. Once saved, press Ctrl + X
to exit the ‘nano’ editor and return to the terminal.
Now it’s time to see our setup in action! By executing the command python main.py
within your activated virtual environment ('env'), you run your Python script.
#in (env) user@server
python main.py
# output should be like this
(1, '[1,2,3]')
(2, '[4,5,6]')
The expected output indicates a successful connection and query! This demonstrates that your PostgreSQL database, empowered by pgvector
, is ready for vector-based similarity search!
And There We Have It! Together, we’ve successfully integrated thepgvector
extension into our PostgreSQL database and harnessed it from Python. This unlocks a whole new world of applications — from image similarity search to recommendation engines and more.