The Convertor turns a PDF into a stream of clean, heading-aware text chunks ready to embed and add to a vector table. It parses the document structure — headings, paragraphs, and lists — and groups content under each heading, tracking the pages each chunk spans.

Converting a PDF

Create a Convertor with the path to a source file, then iterate over convert(). The method is an async generator: each iteration yields one chunk as soon as it is produced, so you can embed and store chunks while the rest of the document is still being parsed.
import { Convertor } from "@ooneex/rag";

const convertor = new Convertor("./docs/handbook.pdf");

for await (const chunk of convertor.convert({ outputDir: "./output" })) {
  console.log(chunk.metadata.heading); // section heading, or null
  console.log(chunk.metadata.pages);   // pages the chunk spans, e.g. [3, 4]
  console.log(chunk.text);             // the chunk content
}
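
Because chunks arrive one at a time, they can be grouped into small batches and handed to an embedding step while parsing continues. The sketch below shows one way to do that. `batch`, `fakeChunks`, and `demo` are hypothetical names introduced for illustration and are not part of @ooneex/rag; `fakeChunks` merely stands in for convertor.convert().

```typescript
// Group items from any async iterable into arrays of at most `size` items.
// `batch` is a hypothetical helper, not part of @ooneex/rag.
async function* batch<T>(source: AsyncIterable<T>, size: number): AsyncGenerator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  let current: T[] = [];
  for await (const item of source) {
    current.push(item);
    if (current.length === size) {
      yield current;
      current = [];
    }
  }
  // Flush the final, possibly shorter, batch.
  if (current.length > 0) {
    yield current;
  }
}

// Stand-in for convertor.convert(): yields five fake chunk texts.
async function* fakeChunks(): AsyncGenerator<string> {
  for (let i = 1; i <= 5; i++) {
    yield `chunk ${i}`;
  }
}

async function demo(): Promise<string[][]> {
  const batches: string[][] = [];
  for await (const group of batch(fakeChunks(), 2)) {
    // In real use, embed and store `group` here before pulling more chunks.
    batches.push(group);
  }
  return batches;
}

// Produces three batches: two of size 2 and a final one of size 1.
demo().then((batches) => console.log(batches));
```

With a real Convertor, passing convertor.convert({ outputDir: "./output" }) as the source should work the same way, with the caveat that the inner for await loop discards the generator's return value.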

Feeding chunks into a table

A chunk maps directly onto a record — use the chunk text as the embedded text, and lift its metadata into your metadata object.
const records = [];
let i = 0;

for await (const chunk of convertor.convert({ outputDir: "./output" })) {
  records.push({
    id: `handbook-${i++}`,
    text: chunk.text,
    metadata: {
      heading: chunk.metadata.heading ?? "",
      source: chunk.metadata.source ?? "",
      page: chunk.metadata.page ?? 0,
    },
  });
}

// `table` is assumed to be an existing vector table instance.
await table.add(records);

Options

Pass an options object to convert(). All fields are optional.
Option       Type             Description
outputDir    string           Directory where converted files (JSON, Markdown, extracted images) are written. Each conversion gets its own sub-directory.
password     string           Password used to unlock an encrypted PDF before conversion.
imageFormat  "png" | "jpeg"   Format for images extracted from the document.
pages        string           Page selection to convert, e.g. "1,3-5". When omitted, the whole document is processed.
quiet        boolean          Suppress progress and informational output from the underlying converter.

for await (const chunk of convertor.convert({
  outputDir: "./output",
  password: "secret",
  imageFormat: "png",
  pages: "1-10",
  quiet: true,
})) {
  // ...
}
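
When the pages to convert are known as a list of numbers, the selection string can be built programmatically. The following is a minimal sketch: `toPageSelection` is a hypothetical helper, and the output syntax (comma-separated pages with hyphenated ranges) is inferred only from the "1,3-5" example above, so anything beyond that form is an assumption.

```typescript
// Build a page-selection string such as "1,3-5" from a list of page numbers.
// Hypothetical helper; the syntax is inferred from the documented example.
function toPageSelection(pages: number[]): string {
  // Keep positive integers only, de-duplicate, and sort ascending.
  const sorted = [...new Set(pages)]
    .filter((p) => Number.isInteger(p) && p > 0)
    .sort((a, b) => a - b);

  // Collapse runs of consecutive pages into [start, end] ranges.
  const ranges: Array<[number, number]> = [];
  for (const page of sorted) {
    const open = ranges[ranges.length - 1];
    if (open !== undefined && page === open[1] + 1) {
      open[1] = page; // extend the open range
    } else {
      ranges.push([page, page]); // start a new range
    }
  }

  return ranges
    .map(([start, end]) => (start === end ? `${start}` : `${start}-${end}`))
    .join(",");
}

console.log(toPageSelection([5, 1, 3, 4, 3])); // prints "1,3-5"
```

An empty input yields an empty string; in that case it is probably safer to leave pages out of the options altogether so the whole document is processed.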

The return value

Beyond the yielded chunks, the generator returns the paths of the converted JSON and Markdown files once iteration completes. A for await...of loop discards that return value, so drive the iterator manually if you need those artifacts:
const iterator = convertor.convert({ outputDir: "./output" });

let next = await iterator.next();
while (!next.done) {
  const chunk = next.value;
  // ... process chunk
  next = await iterator.next();
}

// next.value is the final result once done
const { json, markdown } = next.value;
console.log(json.path);     // path to the converted JSON file
console.log(markdown.path); // path to the converted Markdown file
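
If this pattern is needed in several places, the manual loop can be wrapped in a small generic helper. The sketch below is one possible shape: `drain` and `fakeConvert` are hypothetical names, and `fakeConvert` is a stand-in generator used so the example runs without a PDF. It assumes convert() is typed as an AsyncGenerator whose return type carries the final result.

```typescript
// Drain an async generator, keeping both the yielded values and the return value.
// `drain` is a hypothetical helper, not part of @ooneex/rag.
async function drain<T, R>(
  generator: AsyncGenerator<T, R>,
): Promise<{ items: T[]; result: R }> {
  const items: T[] = [];
  let next = await generator.next();
  while (!next.done) {
    items.push(next.value);
    next = await generator.next();
  }
  // Once `done` is true, `value` holds whatever the generator returned.
  return { items, result: next.value };
}

// Stand-in for convertor.convert(): yields two values, then returns a summary.
async function* fakeConvert(): AsyncGenerator<string, { chunkCount: number }> {
  yield "first";
  yield "second";
  return { chunkCount: 2 };
}

drain(fakeConvert()).then(({ items, result }) => {
  console.log(items.length, result.chunkCount); // prints "2 2"
});
```

Note that this collects every chunk in memory before returning, which gives up the streaming benefit; for large documents the manual loop with inline processing is likely the better fit.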

Chunk shape

Each yielded value is a ChunkType:
type ChunkType = {
  text: string;
  metadata: {
    heading: string | null;  // heading the chunk falls under
    page: number | null;     // the chunk's starting page
    pages: number[];         // all pages the chunk spans
    source: string | null;   // the source file name
  };
};
Chunks are formed by walking the document: each heading starts a new chunk, and the paragraph and list content that follows is grouped into it until the next heading. The pages array accumulates every page the grouped content appears on.
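
The grouping rule described above can be sketched as follows. This is an illustration of the idea, not the library's implementation: the `Block` input type and `groupByHeading` function are hypothetical, and two details are assumptions, namely that the heading text is kept only in the metadata and that content appearing before the first heading forms a chunk with a null heading.

```typescript
// Hypothetical parsed-document block; the real parser output may differ.
type Block = { kind: "heading" | "paragraph" | "list"; text: string; page: number };

type Chunk = {
  text: string;
  metadata: {
    heading: string | null;
    page: number | null;
    pages: number[];
    source: string | null;
  };
};

function groupByHeading(blocks: Block[], source: string | null = null): Chunk[] {
  const chunks: Chunk[] = [];
  let current: Chunk | null = null;

  for (const block of blocks) {
    const isHeading = block.kind === "heading";
    if (isHeading || current === null) {
      // A heading (or content before any heading) opens a new chunk.
      current = {
        text: isHeading ? "" : block.text,
        metadata: {
          heading: isHeading ? block.text : null,
          page: block.page,
          pages: [block.page],
          source,
        },
      };
      chunks.push(current);
      continue;
    }
    // Paragraph or list content joins the chunk opened by the last heading.
    current.text = current.text === "" ? block.text : `${current.text}\n${block.text}`;
    if (!current.metadata.pages.includes(block.page)) {
      current.metadata.pages.push(block.page);
    }
  }
  return chunks;
}

const sample: Block[] = [
  { kind: "heading", text: "Intro", page: 1 },
  { kind: "paragraph", text: "A", page: 1 },
  { kind: "list", text: "B", page: 2 },
  { kind: "heading", text: "Usage", page: 2 },
  { kind: "paragraph", text: "C", page: 3 },
];
// Two chunks: "Intro" spanning pages [1, 2] and "Usage" spanning pages [2, 3].
console.log(groupByHeading(sample, "handbook.pdf").map((c) => c.metadata));
```

Here page is the page on which the chunk starts, while pages grows as grouped content crosses page boundaries, which matches the distinction between the two fields in ChunkType.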

Exceptions

ConvertorException is thrown when conversion fails — for example when no JSON or Markdown output is produced, or the underlying parser errors. It carries a machine-readable key (such as NO_JSON_OUTPUT, NO_MARKDOWN_OUTPUT, or CHUNKING_FAILED) and the source path in its data.
import { ConvertorException } from "@ooneex/rag";

try {
  for await (const chunk of convertor.convert()) {
    // ...
  }
} catch (error) {
  if (error instanceof ConvertorException) {
    console.error(`[${error.key}] ${error.message}`, error.data);
  }
}