Embedding
Runnable intermediate-level script example that uses docling and ipykernel.
Key Facts
- Level: intermediate
- Runtime: Python • OpenAI API
- Pattern: Context-backed research with explicit evidence
- Interaction: Live sandbox • Script
- Updated: 14 March 2026
3-embedding.py
python
from typing import List
import lancedb
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from dotenv import load_dotenv
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
from openai import OpenAI
from utils.tokenizer import OpenAITokenizerWrapper
load_dotenv()
# Initialize OpenAI client (make sure you have OPENAI_API_KEY in your environment variables)
client = OpenAI()
tokenizer = OpenAITokenizerWrapper() # Load our custom tokenizer for OpenAI
MAX_TOKENS = 8191 # text-embedding-3-large's maximum context length
# --------------------------------------------------------------
# Extract the data
# --------------------------------------------------------------
converter = DocumentConverter()
result = converter.convert("https://arxiv.org/pdf/2408.09869")
# --------------------------------------------------------------
# Apply hybrid chunking
# --------------------------------------------------------------
chunker = HybridChunker(
    tokenizer=tokenizer,
    max_tokens=MAX_TOKENS,
    merge_peers=True,
)
chunk_iter = chunker.chunk(dl_doc=result.document)
chunks = list(chunk_iter)
# --------------------------------------------------------------
# Create a LanceDB database and table
# --------------------------------------------------------------
# Create a LanceDB database
db = lancedb.connect("data/lancedb")
# Get the OpenAI embedding function
func = get_registry().get("openai").create(name="text-embedding-3-large")
# Define a simplified metadata schema
class ChunkMetadata(LanceModel):
    """
    You must order the fields in alphabetical order.
    This is a requirement of the Pydantic implementation.
    """

    filename: str | None
    page_numbers: List[int] | None
    title: str | None
# Define the main Schema
class Chunks(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()  # type: ignore
    metadata: ChunkMetadata
table = db.create_table("docling", schema=Chunks, mode="overwrite")
# --------------------------------------------------------------
# Prepare the chunks for the table
# --------------------------------------------------------------
# Create table with processed chunks
processed_chunks = [
    {
        "text": chunk.text,
        "metadata": {
            "filename": chunk.meta.origin.filename,
            "page_numbers": [
                page_no
                for page_no in sorted(
                    set(
                        prov.page_no
                        for item in chunk.meta.doc_items
                        for prov in item.prov
                    )
                )
            ]
            or None,
            "title": chunk.meta.headings[0] if chunk.meta.headings else None,
        },
    }
    for chunk in chunks
]
# --------------------------------------------------------------
# Add the chunks to the table (automatically embeds the text)
# --------------------------------------------------------------
table.add(processed_chunks)
# --------------------------------------------------------------
# Load the table
# --------------------------------------------------------------
table.to_pandas()
table.count_rows()
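The "Prepare the chunks for the table" step in the script packs three transformations into one comprehension: page numbers are de-duplicated and sorted, an empty page list becomes None, and the first heading (if any) becomes the title. The sketch below isolates that logic so it can be checked without docling, LanceDB, or an API key. The dataclasses (Prov, DocItem, Origin, Meta, Chunk), the helper name chunk_to_row, and the sample values are hypothetical stand-ins that model only the attributes the script reads; they are not docling's actual classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-ins for the docling objects the script reads from.
# Only the attributes accessed in the script are modelled here.
@dataclass
class Prov:
    page_no: int


@dataclass
class DocItem:
    prov: List[Prov] = field(default_factory=list)


@dataclass
class Origin:
    filename: Optional[str] = None


@dataclass
class Meta:
    origin: Origin
    doc_items: List[DocItem] = field(default_factory=list)
    headings: Optional[List[str]] = None


@dataclass
class Chunk:
    text: str
    meta: Meta


def chunk_to_row(chunk: Chunk) -> dict:
    """Flatten one chunk into the {"text", "metadata"} row shape used by the script."""
    # Collect every page number referenced by the chunk, de-duplicated and sorted.
    pages = sorted(
        {prov.page_no for item in chunk.meta.doc_items for prov in item.prov}
    )
    return {
        "text": chunk.text,
        "metadata": {
            "filename": chunk.meta.origin.filename,
            "page_numbers": pages or None,  # an empty list is stored as None
            "title": chunk.meta.headings[0] if chunk.meta.headings else None,
        },
    }


# Made-up sample chunk: two items spanning pages 2, 1 and 2 again.
example = Chunk(
    text="Sample chunk text.",
    meta=Meta(
        origin=Origin(filename="sample.pdf"),
        doc_items=[DocItem([Prov(2), Prov(1)]), DocItem([Prov(2)])],
        headings=["Abstract", "Overview"],
    ),
)
row = chunk_to_row(example)
print(row["metadata"])
```

Pulling the comprehension into a named function like this is only one possible design choice; it makes the None handling for missing headings and empty page lists easier to unit-test before any rows reach the table.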
Related principles
- P4 (trust): Apply progressive disclosure to system agency. Provide the minimum information necessary by default, while enabling users to inspect additional detail when confidence, understanding, or intervention is required.
- P6 (visibility): Expose meaningful operational state, not internal complexity. Present the state of the system in language and structures that are relevant to the user, rather than exposing low-level internals that do not support action or understanding.
- P7 (trust): Establish trust through inspectability. Users should be able to examine how a result was produced when confidence, accountability, or decision quality is important.