Extending

Asset Index is modular by design: each component (analyzer, embed, store, search) can be replaced without rewriting the rest.

Adding a new format

By default, the supported extensions are defined in skills/_shared/pipeline_utils.py:

MEDIA_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".gif", ...}
MEDIA_VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".webm", ...}
MEDIA_AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ...}

To add .heic:

Add the extension to the corresponding set

Edit skills/_shared/pipeline_utils.py:

MEDIA_IMAGE_EXTENSIONS = {..., ".heic"}
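How these sets are consumed is not shown here, but a lookup by lowercased suffix is the likely shape. A minimal sketch, assuming a hypothetical helper media_type_for() and abbreviated copies of the sets (the real ones in pipeline_utils.py contain more entries):

```python
from __future__ import annotations

from pathlib import Path

# Abbreviated copies for illustration; the real sets live in pipeline_utils.py.
MEDIA_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".gif", ".heic"}
MEDIA_VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".webm"}
MEDIA_AUDIO_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}


def media_type_for(path: Path) -> str | None:
    """Return "image", "video", or "audio" for a known suffix, else None (hypothetical helper)."""
    suffix = path.suffix.lower()  # compare case-insensitively so IMG_0001.HEIC matches
    if suffix in MEDIA_IMAGE_EXTENSIONS:
        return "image"
    if suffix in MEDIA_VIDEO_EXTENSIONS:
        return "video"
    if suffix in MEDIA_AUDIO_EXTENSIONS:
        return "audio"
    return None
```

Lowercasing the suffix matters for HEIC in particular, since phone exports often use upper-case extensions.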

Make sure Gemini Vision can read it

Gemini supports HEIC directly. For other formats (for example, .tiff that the LLM cannot read), pre-convert with ffmpeg in image_gemini.analyze():

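import subprocess  # at module top

if path.suffix.lower() == ".tiff":
    # Convert to PNG next to the source; -y overwrites a stale conversion
    png = path.with_suffix(".png")
    subprocess.run(["ffmpeg", "-y", "-i", str(path), str(png)], check=True)
    path = png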

Test

cp test.heic raw_assets/images/
.venv/bin/python -m tools.asset_index.search "<query>" --media image

Changing the embedding model

Default: OpenAI text-embedding-3-small (1536 dimensions). To switch to text-embedding-3-large (3072 dimensions):

Update embed.py

DEFAULT_MODEL = "text-embedding-3-large"

Update schema.sql

CREATE VIRTUAL TABLE assets_vec USING vec0(
  id TEXT PRIMARY KEY,
  embedding FLOAT[3072]   -- was 1536
);

Migrate the existing DB

Because the dimension changes, the old embeddings are incompatible. Re-embed everything:

rm .asset_index/index.db
.venv/bin/python -m tools.asset_index.watcher --scan-on-start

Or keep the old DB and build a new one alongside it (if you want to A/B test).

Changing the embedding model is a breaking change: everything must be re-embedded. Weigh the cost first (it consumes OpenAI quota).
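Because a model and schema mismatch only surfaces at insert or query time, a guard that fails early can save a wasted re-embed run. A minimal sketch, assuming a hypothetical check_embedding_dim() helper that is not part of Asset Index; the dimensions are the ones quoted in this guide:

```python
from __future__ import annotations

# Dimensions quoted in this guide for each model.
EMBED_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "BAAI/bge-m3": 1024,
}


def check_embedding_dim(vector: list[float], model: str, schema_dim: int) -> None:
    """Raise ValueError if the model, the vector, and the schema disagree on dimension."""
    expected = EMBED_DIMS.get(model)
    if expected is None:
        raise ValueError(f"unknown embedding model: {model}")
    if expected != schema_dim:
        raise ValueError(
            f"{model} produces {expected}-dim vectors but the schema declares FLOAT[{schema_dim}]"
        )
    if len(vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(vector)}")
```

Calling it once on the first embedding of a scan would be enough to catch a schema.sql that was not updated.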

Switching to an OSS model

To run embeddings locally (no API cost):

from sentence_transformers import SentenceTransformer

class LocalEmbedder:
    def __init__(self):
        self.model = SentenceTransformer("BAAI/bge-m3")

    def embed(self, text: str) -> list[float]:
        return self.model.encode(text).tolist()

Update the import in embed.py and the dimension in the schema. BGE-M3 produces 1024-dimensional vectors.

BGE-M3 is multilingual, strong on Vietnamese, free, and runs fine on a laptop CPU. Trade-off: quality is often lower than OpenAI for complex domains.
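One way to keep embed.py indifferent to the backend is to have callers depend on a small interface rather than a concrete class. A sketch, assuming a hypothetical Embedder protocol plus register_backend() and get_embedder() helpers (none of these are confirmed to exist in Asset Index). The sentence_transformers import is deferred so that other backends do not need the package installed:

```python
from __future__ import annotations

from typing import Callable, Protocol


class Embedder(Protocol):
    """Anything with embed(text) -> list[float] can back the index (hypothetical interface)."""

    def embed(self, text: str) -> list[float]: ...


class LocalEmbedder:
    def __init__(self, model_name: str = "BAAI/bge-m3"):
        # Imported lazily so the package is only required when this backend is used.
        from sentence_transformers import SentenceTransformer

        self.model = SentenceTransformer(model_name)

    def embed(self, text: str) -> list[float]:
        return self.model.encode(text).tolist()


# Backend name -> zero-argument constructor; the names are illustrative.
_BACKENDS: dict[str, Callable[[], Embedder]] = {"local": LocalEmbedder}


def register_backend(name: str, factory: Callable[[], Embedder]) -> None:
    _BACKENDS[name] = factory


def get_embedder(name: str) -> Embedder:
    factory = _BACKENDS.get(name)
    if factory is None:
        raise ValueError(f"unknown embed backend: {name}")
    return factory()
```

With this shape, switching models becomes a configuration change (the backend name) plus the schema dimension update described above.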

Custom analyzer

To add an analyzer for a special format (for example, Photoshop .psd):

Create the analyzer file

tools/asset_index/analyzers/psd_custom.py:

from pathlib import Path
from typing import Any

def analyze(path: Path, *, env_file: Path) -> dict[str, Any]:
    # extract layer info, thumbnail, OCR text...
    return {
        "media_type": "image",
        "summary": "...",
        "raw_json": {...},
    }

Wire it into router.py

if path.suffix.lower() == ".psd":
    return psd_custom.analyze(path, env_file=env_file)
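If several custom analyzers accumulate, a suffix-to-analyzer table may scale better than a chain of if statements. A minimal sketch, assuming hypothetical register() and route() helpers; the actual structure of router.py is not shown in this guide:

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable

Analyzer = Callable[..., dict[str, Any]]

# Suffix -> analyzer callable, filled in by register() (hypothetical registry).
_ANALYZERS: dict[str, Analyzer] = {}


def register(suffix: str, analyzer: Analyzer) -> None:
    """Associate a file suffix such as ".psd" with an analyzer function."""
    _ANALYZERS[suffix.lower()] = analyzer


def route(path: Path, *, env_file: Path) -> dict[str, Any] | None:
    """Return the custom analyzer's result, or None so the caller can fall back to defaults."""
    analyzer = _ANALYZERS.get(path.suffix.lower())
    if analyzer is None:
        return None
    return analyzer(path, env_file=env_file)
```

Under these assumptions, wiring in the PSD analyzer would be a single call: register(".psd", psd_custom.analyze).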

Test

cp test.psd raw_assets/images/
cat .asset_index/state.json

New search filter

search.py joins assets_vec with assets. Add a filter via an argument plus a WHERE clause. For example, to filter by mood (already stored in mood_json):

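def search_assets(
    query: str,
    *,
    mood: str | None = None,
    # ... other existing parameters
):
    where, params = [], []
    if mood:
        # Bind the value as a parameter instead of interpolating it (avoids SQL injection)
        where.append("json_extract(a.mood_json, '$') LIKE ?")
        params.append(f"%{mood}%")
    where_clause = " AND ".join(where) if where else "1=1"

    # Execute with the filter params followed by the LIMIT value
    sql = f"""
    SELECT a.*, v.distance
    FROM assets_vec v
    JOIN assets a ON a.id = v.id
    WHERE {where_clause}
    ORDER BY v.distance
    LIMIT ?
    """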

CLI:

.venv/bin/python -m tools.asset_index.search \
  "phong cảnh" \
  --mood "calm,peaceful"
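A single LIKE pattern treats "calm,peaceful" as one substring, which would not match if mood_json holds a JSON array such as ["calm", "peaceful"]. How the CLI parses --mood is not shown here. A minimal sketch, assuming comma-separated values that must all match and a hypothetical helper build_mood_filter():

```python
from __future__ import annotations


def build_mood_filter(mood_arg: str | None) -> tuple[str, list[str]]:
    """Turn "calm,peaceful" into one parameterized LIKE clause per mood (hypothetical helper)."""
    moods = [m.strip() for m in (mood_arg or "").split(",") if m.strip()]
    if not moods:
        return "1=1", []
    # One placeholder per mood; values are bound at execute time, never interpolated.
    clause = " AND ".join("json_extract(a.mood_json, '$') LIKE ?" for _ in moods)
    params = [f"%{m}%" for m in moods]
    return clause, params
```

Joining with " OR " instead of " AND " would match assets that carry any of the listed moods rather than all of them.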

Custom hook (after indexing)

To run your own logic after each successful index (for example, sending a notification or syncing to the cloud), edit the end of router.process_file():

result = {"status": "ok", ...}
_run_post_index_hooks(result)
return result

Implement _run_post_index_hooks() so that it reads a list of hooks from an env var or config file and runs each external script.
