Convertor turns a PDF into a stream of clean, heading-aware text chunks ready to embed and add to a vector table. It parses the document structure — headings, paragraphs, and lists — and groups content under each heading, tracking the pages each chunk spans.
Converting a PDF
Create a Convertor with the path to a source file, then iterate convert(). It is an async generator: each iteration yields one chunk as it is produced, so you can embed and store chunks while the rest of the document is still being parsed.
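As a minimal sketch of that loop, the block below uses a hypothetical stand-in for Convertor so that it runs on its own. The chunk fields (text, metadata.heading, metadata.pages) and the collectHeadings helper are assumptions for illustration, not the library's confirmed API.

```typescript
// Assumed chunk shape; the library's real ChunkType may differ.
interface Chunk {
  text: string;
  metadata: { heading: string; pages: number[] };
}

// Hypothetical stand-in for the library's Convertor class.
class Convertor {
  constructor(readonly sourcePath: string) {}

  // Async generator: yields one chunk at a time, as convert() is described to do.
  async *convert(): AsyncGenerator<Chunk, void, undefined> {
    yield {
      text: `Intro from ${this.sourcePath}`,
      metadata: { heading: "Introduction", pages: [1] },
    };
    yield { text: "How to use it", metadata: { heading: "Usage", pages: [1, 2] } };
  }
}

// Consume chunks as they arrive and collect their headings.
async function collectHeadings(path: string): Promise<string[]> {
  const headings: string[] = [];
  for await (const chunk of new Convertor(path).convert()) {
    // In real use, embed chunk.text and store it here, before the next chunk arrives.
    headings.push(chunk.metadata.heading);
  }
  return headings;
}
```

Because the loop body runs once per yielded chunk, slow work such as embedding can overlap with parsing of the remaining pages.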
Feeding chunks into a table
A chunk maps directly onto a record: use the chunk's text as the embedded text, and lift its metadata into your metadata object.
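One way to express that mapping is sketched below. Both the chunk shape and the record shape are assumptions, and chunkToRecord and the source field are hypothetical names; adapt them to whatever your vector table expects.

```typescript
// Assumed chunk shape; the library's real ChunkType may differ.
interface Chunk {
  text: string;
  metadata: { heading: string; pages: number[] };
}

// Hypothetical record shape for a vector table.
interface TableRecord {
  text: string;
  metadata: { heading: string; pages: number[]; source: string };
}

// Use the chunk text as the embedded text and lift the chunk metadata into the record.
function chunkToRecord(chunk: Chunk, source: string): TableRecord {
  return {
    text: chunk.text,
    // Copy the pages array so later changes to the chunk do not leak into the record.
    metadata: { ...chunk.metadata, pages: [...chunk.metadata.pages], source },
  };
}
```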
Options
Pass an options object to convert(). All fields are optional.
| Option | Type | Description |
|---|---|---|
| outputDir | string | Directory where converted files (JSON, Markdown, extracted images) are written. Each conversion gets its own sub-directory. |
| password | string | Password used to unlock an encrypted PDF before conversion. |
| imageFormat | "png" \| "jpeg" | Format for images extracted from the document. |
| pages | string | Page selection to convert, e.g. "1,3-5". When omitted, the whole document is processed. |
| quiet | boolean | Suppress progress and informational output from the underlying converter. |
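The table can be written down as a type, as in the sketch below. The ConvertOptions name is an assumption, and expandPages is a hypothetical helper that shows one common reading of a selection such as "1,3-5" (comma-separated pages and inclusive ranges); the underlying converter may interpret the string differently.

```typescript
// Mirrors the options table; every field is optional. The interface name is assumed.
interface ConvertOptions {
  outputDir?: string;
  password?: string;
  imageFormat?: "png" | "jpeg";
  pages?: string;
  quiet?: boolean;
}

// Hypothetical helper: expands "1,3-5" into [1, 3, 4, 5].
function expandPages(selection: string): number[] {
  const pages: number[] = [];
  for (const part of selection.split(",")) {
    // A single page such as "1" is treated as the range 1-1.
    const [startText, endText = startText] = part.trim().split("-");
    const start = Number(startText);
    const end = Number(endText);
    if (!Number.isInteger(start) || !Number.isInteger(end) || start < 1 || end < start) {
      throw new Error(`Invalid page selection: "${part}"`);
    }
    for (let page = start; page <= end; page++) pages.push(page);
  }
  return pages;
}

// Example options object; the values are illustrative only.
const convertOptions: ConvertOptions = {
  outputDir: "./converted",
  imageFormat: "png",
  pages: "1,3-5",
  quiet: true,
};
```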
The return value
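A for await loop discards whatever a generator returns, so reading a return value means calling next() yourself. The sketch below shows that pattern against a hypothetical stand-in generator; the jsonPath and markdownPath field names are assumptions, not the library's confirmed result shape.

```typescript
// Hypothetical stand-ins: the chunk and result field names are assumptions.
interface Chunk {
  text: string;
}
interface ConvertResult {
  jsonPath: string;
  markdownPath: string;
}

// Stand-in generator that yields chunks and then returns artifact paths.
async function* convert(): AsyncGenerator<Chunk, ConvertResult, undefined> {
  yield { text: "first chunk" };
  yield { text: "second chunk" };
  return { jsonPath: "out/doc.json", markdownPath: "out/doc.md" };
}

// Drive the iterator manually so the final return value is not lost.
async function drain(): Promise<{ chunks: Chunk[]; result: ConvertResult }> {
  const iterator = convert();
  const chunks: Chunk[] = [];
  for (;;) {
    const step = await iterator.next();
    // When done is true, value holds what the generator returned.
    if (step.done) return { chunks, result: step.value };
    chunks.push(step.value);
  }
}
```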
Beyond the yielded chunks, the generator returns the paths of the converted JSON and Markdown files once iteration completes. Capture it with a manual iterator if you need those artifacts.
Chunk shape
Each yielded value is a ChunkType.
A heading starts a new chunk, and the paragraph and list content that follows is grouped into it until the next heading. The pages array accumulates every page the grouped content appears on.
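The grouping rule can be sketched as follows. This is not the library's implementation: ParsedElement, the chunk fields, and groupByHeading are hypothetical, and placing content that precedes the first heading into an untitled chunk is an assumption.

```typescript
// Hypothetical parsed-element and chunk shapes; the library's real types may differ.
interface ParsedElement {
  kind: "heading" | "paragraph" | "list";
  text: string;
  page: number;
}
interface Chunk {
  text: string;
  metadata: { heading: string; pages: number[] };
}

function groupByHeading(elements: ParsedElement[]): Chunk[] {
  const chunks: Chunk[] = [];
  let current: Chunk | undefined;
  for (const element of elements) {
    if (element.kind === "heading") {
      // A heading closes the previous chunk and opens a new one.
      current = { text: "", metadata: { heading: element.text, pages: [element.page] } };
      chunks.push(current);
      continue;
    }
    if (!current) {
      // Assumption: content before the first heading goes into an untitled chunk.
      current = { text: "", metadata: { heading: "", pages: [] } };
      chunks.push(current);
    }
    current.text += (current.text ? "\n" : "") + element.text;
    // Record each page once, in the order it is first seen.
    if (!current.metadata.pages.includes(element.page)) {
      current.metadata.pages.push(element.page);
    }
  }
  return chunks;
}
```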
Exceptions
ConvertorException is thrown when conversion fails — for example when no JSON or Markdown output is produced, or the underlying parser errors. It carries a machine-readable key (such as NO_JSON_OUTPUT, NO_MARKDOWN_OUTPUT, or CHUNKING_FAILED) and the source path in its data.
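A handler might branch on the key, as in this sketch. The class here is a hypothetical stand-in built from the description above; in particular, the source field name inside data and the describeFailure helper are assumptions.

```typescript
// Hypothetical stand-in: a machine-readable key plus the source path in data.
class ConvertorException extends Error {
  constructor(
    readonly key: string,
    readonly data: { source: string },
  ) {
    super(`Conversion failed: ${key}`);
    this.name = "ConvertorException";
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Turn a conversion failure into a message; rethrow anything unrelated.
function describeFailure(error: unknown): string {
  if (!(error instanceof ConvertorException)) throw error;
  switch (error.key) {
    case "NO_JSON_OUTPUT":
    case "NO_MARKDOWN_OUTPUT":
      return `No output was produced for ${error.data.source}`;
    case "CHUNKING_FAILED":
      return `Chunking failed for ${error.data.source}`;
    default:
      return `Conversion failed (${error.key}) for ${error.data.source}`;
  }
}
```

Call describeFailure from the catch block around the loop that iterates convert(), so unrelated errors still propagate.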