
# Convertor

> Convert PDF documents into structured chunks for embedding

The `Convertor` turns a PDF into a stream of clean, heading-aware text chunks ready to embed and add to a [vector table](/ai/rag/vector-table). It parses the document structure — headings, paragraphs, and lists — and groups content under each heading, tracking the pages each chunk spans.

## Converting a PDF

Create a `Convertor` with the path to a source file, then iterate over `convert()`. The method returns an async generator: each iteration yields one chunk as soon as it is produced, so you can embed and store chunks while the rest of the document is still being parsed.

```typescript theme={null}
import { Convertor } from "@ooneex/rag";

const convertor = new Convertor("./docs/handbook.pdf");

for await (const chunk of convertor.convert({ outputDir: "./output" })) {
  console.log(chunk.metadata.heading); // section heading, or null
  console.log(chunk.metadata.pages);   // pages the chunk spans, e.g. [3, 4]
  console.log(chunk.text);             // the chunk content
}
```

### Feeding chunks into a table

A chunk maps directly onto a record — use the chunk `text` as the embedded `text`, and lift its metadata into your `metadata` object.

```typescript theme={null}
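// Assumes `convertor` from the example above and an existing vector `table`.
const records = [];
let i = 0; // running index used to build unique record ids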

for await (const chunk of convertor.convert({ outputDir: "./output" })) {
  records.push({
    id: `handbook-${i++}`,
    text: chunk.text,
    metadata: {
      heading: chunk.metadata.heading ?? "",
      source: chunk.metadata.source ?? "",
      page: chunk.metadata.page ?? 0,
    },
  });
}

await table.add(records);
```
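Collecting every record before a single `table.add` call keeps the whole document in memory. One possible alternative is to write records in fixed-size batches as chunks arrive. The sketch below is a minimal, generic helper for that; `inBatches` is a hypothetical name and is not part of `@ooneex/rag`, and whether batched writes suit your table is an assumption to verify.

```typescript
// Hypothetical helper (not part of @ooneex/rag): groups items from any
// async iterable into arrays holding at most `size` items each.
async function* inBatches<T>(
  source: AsyncIterable<T>,
  size: number,
): AsyncGenerator<T[], void, undefined> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }

  let batch: T[] = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = []; // start a fresh array so the yielded one is not mutated
    }
  }

  // Flush the final, possibly partial, batch.
  if (batch.length > 0) {
    yield batch;
  }
}
```

With this in place, the loop above could iterate `inBatches(convertor.convert({ outputDir: "./output" }), 50)`, map each batch of chunks to records, and call `table.add` once per batch.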

## Options

Pass an options object to `convert()`. All fields are optional.

| Option        | Type              | Description                                                                                                                 |
| ------------- | ----------------- | --------------------------------------------------------------------------------------------------------------------------- |
| `outputDir`   | `string`          | Directory where converted files (JSON, Markdown, extracted images) are written. Each conversion gets its own sub-directory. |
| `password`    | `string`          | Password used to unlock an encrypted PDF before conversion.                                                                 |
| `imageFormat` | `"png" \| "jpeg"` | Format for images extracted from the document.                                                                              |
| `pages`       | `string`          | Page selection to convert, e.g. `"1,3-5"`. When omitted, the whole document is processed.                                   |
| `quiet`       | `boolean`         | Suppress progress and informational output from the underlying converter.                                                   |

```typescript theme={null}
for await (const chunk of convertor.convert({
  outputDir: "./output",
  password: "secret",
  imageFormat: "png",
  pages: "1-10",
  quiet: true,
})) {
  // ...
}
```
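If the `pages` value comes from user input, it can help to validate it before starting a conversion. The sketch below expands a selection such as `"1,3-5"` into explicit page numbers. It assumes comma-separated, 1-based pages and inclusive ranges, which is inferred from the example in the table; the grammar the underlying converter actually accepts may be broader, and `expandPageSelection` is a hypothetical helper, not a library function.

```typescript
// Hypothetical helper. Assumption: a selection is a comma-separated list of
// 1-based page numbers or inclusive ranges, e.g. "1,3-5".
function expandPageSelection(selection: string): number[] {
  const pages = new Set<number>();

  for (const part of selection.split(",")) {
    // Matches "7" or "3-5", allowing surrounding whitespace.
    const match = /^\s*(\d+)(?:\s*-\s*(\d+))?\s*$/.exec(part);
    if (!match) {
      throw new SyntaxError(`Invalid page selection part: "${part}"`);
    }

    const start = Number(match[1]);
    const end = match[2] === undefined ? start : Number(match[2]);
    if (start < 1 || end < start) {
      throw new RangeError(`Invalid page range: "${part.trim()}"`);
    }

    for (let page = start; page <= end; page++) {
      pages.add(page); // the Set removes duplicates from overlapping parts
    }
  }

  return Array.from(pages).sort((a, b) => a - b);
}
```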

## The return value

Beyond the yielded chunks, the generator *returns* the paths of the converted JSON and Markdown files once iteration completes. A `for await` loop discards that return value, so drive the iterator manually if you need those artifacts:

```typescript theme={null}
const iterator = convertor.convert({ outputDir: "./output" });

let next = await iterator.next();
while (!next.done) {
  const chunk = next.value;
  // ... process chunk
  next = await iterator.next();
}

// next.value is the final result once done
const { json, markdown } = next.value;
console.log(json.path);     // path to the converted JSON file
console.log(markdown.path); // path to the converted Markdown file
```
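When streaming is not required, the manual loop can be wrapped in a small reusable function. The following is a sketch of a hypothetical `drain` helper (not part of `@ooneex/rag`) that works with any async generator and returns both the yielded items and the final return value. It buffers every item, so it trades the streaming benefit for convenience.

```typescript
// Hypothetical helper: runs an async generator to completion and returns
// everything it yielded together with its return value.
async function drain<T, R>(
  generator: AsyncGenerator<T, R, undefined>,
): Promise<{ items: T[]; result: R }> {
  const items: T[] = [];

  let next = await generator.next();
  while (!next.done) {
    items.push(next.value); // a yielded value
    next = await generator.next();
  }

  // Once `done` is true, `value` holds the generator's return value.
  return { items, result: next.value };
}
```

Applied to the converter, `const { items, result } = await drain(convertor.convert({ outputDir: "./output" }))` would give the chunks in `items` and the output file information in `result`.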

## Chunk shape

Each yielded value is a `ChunkType`:

```typescript theme={null}
type ChunkType = {
  text: string;
  metadata: {
    heading: string | null;  // heading the chunk falls under
    page: number | null;     // the chunk's starting page
    pages: number[];         // all pages the chunk spans
    source: string | null;   // the source file name
  };
};
```

Chunks are formed by walking the document: each `heading` starts a new chunk, and the `paragraph` and `list` content that follows is grouped into it until the next heading. The `pages` array accumulates every page the grouped content appears on.
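The grouping rule described above can be sketched as follows. This is an illustration, not the library's implementation: the `DocElement` input type and the `groupByHeading` function are hypothetical, and the sketch assumes that the heading is kept in metadata rather than in the chunk text, that body parts are joined with blank lines, and that a heading with no body produces no chunk.

```typescript
type ChunkType = {
  text: string;
  metadata: {
    heading: string | null;
    page: number | null;
    pages: number[];
    source: string | null;
  };
};

// Hypothetical parsed element; the real parser output may look different.
type DocElement = {
  type: "heading" | "paragraph" | "list";
  text: string;
  page: number;
};

function groupByHeading(
  elements: DocElement[],
  source: string | null = null,
): ChunkType[] {
  const chunks: ChunkType[] = [];
  let heading: string | null = null; // null until the first heading is seen
  let parts: string[] = [];
  let pages: number[] = [];

  // Emits the chunk collected so far (if it has body content) and resets state.
  const flush = () => {
    if (parts.length > 0) {
      chunks.push({
        text: parts.join("\n\n"),
        metadata: { heading, page: pages[0] ?? null, pages, source },
      });
    }
    parts = [];
    pages = [];
  };

  for (const element of elements) {
    if (element.type === "heading") {
      flush(); // a heading closes the previous chunk
      heading = element.text;
    } else {
      parts.push(element.text);
    }
    // Record each page once, in the order encountered.
    if (!pages.includes(element.page)) pages.push(element.page);
  }

  flush(); // emit the last chunk
  return chunks;
}
```

In this sketch a chunk's `page` is the page of its heading when one exists, otherwise the page of its first body element.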

## Exceptions

`ConvertorException` is thrown when conversion fails — for example when no JSON or Markdown output is produced, or the underlying parser errors. It carries a machine-readable `key` (such as `NO_JSON_OUTPUT`, `NO_MARKDOWN_OUTPUT`, or `CHUNKING_FAILED`) and the `source` path in its `data`.

```typescript theme={null}
import { ConvertorException } from "@ooneex/rag";

try {
  for await (const chunk of convertor.convert()) {
    // ...
  }
} catch (error) {
  if (error instanceof ConvertorException) {
    console.error(`[${error.key}] ${error.message}`, error.data);
  }
}
```
