# HTML

> Parse HTML and extract structured content like links, images, headings, videos, and tasks with a Cheerio-powered API.

`@ooneex/html` wraps [Cheerio](https://cheerio.js.org) in a small, typed `Html` class for parsing HTML and pulling out structured data. Load markup from a string or a URL, then call focused extractors that return plain typed objects for images, links, headings, videos, and checkbox tasks. It's built for scraping and content analysis rather than DOM mutation.

## Installation

Install the package with Bun.

```bash
bun add @ooneex/html
```

## Usage

Create an `Html` instance from a markup string (or with no arguments to start empty), then query it. Every extractor returns a typed array that you can iterate directly.

```typescript
import { Html } from "@ooneex/html";

const html = new Html(`
  <article>
    <h1 id="title">Getting Started</h1>
    <p>Read the <a href="/docs" title="Docs">documentation</a>.</p>
    <img src="/hero.png" alt="Hero" width="800" />
  </article>
`);

html.getHeadings();
// [{ level: 1, text: "Getting Started", id: "title" }]

html.getLinks();
// [{ href: "/docs", text: "documentation", title: "Docs", target: null, rel: null }]

html.getImages();
// [{ src: "/hero.png", alt: "Hero", title: null, width: "800", height: null }]
```
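Because the extractors return plain objects, ordinary array methods are enough to post-process them. The sketch below turns a heading list into an indented Markdown outline. The `Heading` type and the `toOutline` helper are hypothetical, written for illustration: the type mirrors the shape of the `getHeadings()` output shown above and assumes that `id` is `null` when a heading has no `id` attribute. To keep the snippet runnable without the package, it uses a literal array in place of a real `getHeadings()` call.

```typescript
// Assumed shape, mirroring the getHeadings() output shown above.
// `id: null` for headings without an id attribute is an assumption.
type Heading = { level: number; text: string; id: string | null };

// Hypothetical helper: render headings as an indented Markdown outline.
function toOutline(headings: Heading[]): string {
  if (headings.length === 0) return "";
  // Indent relative to the shallowest heading so the top level starts at column 0.
  const minLevel = Math.min(...headings.map((h) => h.level));
  return headings
    .map((h) => {
      const indent = "  ".repeat(h.level - minLevel);
      // Link to the heading only when it has an id to anchor to.
      const label = h.id ? `[${h.text}](#${h.id})` : h.text;
      return `${indent}- ${label}`;
    })
    .join("\n");
}

// Stand-in for the result of html.getHeadings().
const outline = toOutline([
  { level: 1, text: "Getting Started", id: "title" },
  { level: 2, text: "Installation", id: null },
]);

console.log(outline);
```

In real use you would pass `html.getHeadings()` instead of the literal array.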

### Loading from a URL

Use `loadUrl` to fetch and parse a remote page. The call is asynchronous and resolves to the instance, so once you `await` it you can call any extractor.

```typescript
import { Html } from "@ooneex/html";

const page = await new Html().loadUrl("https://example.com");

const links = page.getLinks();
const text = page.getContent(); // trimmed plain text of the whole document
```
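This page doesn't say whether `getLinks()` resolves relative URLs for a fetched page. If the `href` values come back as written in the markup, as in the string example earlier, you may want to make them absolute yourself. The following is a minimal sketch using the standard `URL` constructor. The `Link` type and `resolveLinks` helper are hypothetical names, and the literal array stands in for a real `getLinks()` result.

```typescript
// Assumed subset of the link shape shown in the getLinks() output above.
type Link = { href: string; text: string };

// Hypothetical helper: resolve each href against the URL the page was loaded from.
function resolveLinks(links: Link[], baseUrl: string): string[] {
  const resolved: string[] = [];
  for (const link of links) {
    try {
      // new URL(relative, base) applies standard URL resolution rules.
      resolved.push(new URL(link.href, baseUrl).href);
    } catch {
      // Skip hrefs that cannot be parsed as URLs.
    }
  }
  return resolved;
}

// Stand-in for page.getLinks() on a page loaded from https://example.com.
const absolute = resolveLinks(
  [
    { href: "/docs", text: "documentation" },
    { href: "https://cheerio.js.org", text: "Cheerio" },
  ],
  "https://example.com",
);

console.log(absolute);
```

Already-absolute links pass through unchanged apart from URL normalization, such as a trailing slash added to a bare origin.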

You can also reuse an instance and swap its content with `load`, which returns `this` for chaining.

```typescript
import { Html } from "@ooneex/html";

const html = new Html();

// load() replaces the current content and returns the same instance
html.load("<h2>Section</h2>").getHeadings();
```

### Extracting videos and tasks

`getVideos` collects `<video>` elements with their attributes and nested `<source>` tags, and `getTasks` reads checkbox list items.

```typescript
import { Html } from "@ooneex/html";

const html = new Html(`
  <video poster="/poster.jpg" controls>
    <source src="/clip.webm" type="video/webm" />
  </video>
  <ul>
    <li><input type="checkbox" checked /> Write docs</li>
    <li><input type="checkbox" /> Ship release</li>
  </ul>
`);

html.getVideos();
// [{ src: null, poster: "/poster.jpg", controls: true, autoplay: false,
//    loop: false, muted: false, width: null, height: null,
//    sources: [{ src: "/clip.webm", type: "video/webm" }] }]

html.getTasks();
// [{ text: "Write docs", checked: true }, { text: "Ship release", checked: false }]
```
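The task objects are easy to aggregate. As a rough illustration, the sketch below computes a progress summary from a task list. The `Task` type mirrors the `getTasks()` output shown above, `summarizeTasks` is a hypothetical helper, and the literal array stands in for a real `getTasks()` call.

```typescript
// Assumed shape, mirroring the getTasks() output shown above.
type Task = { text: string; checked: boolean };

type TaskSummary = { done: number; total: number; pending: string[] };

// Hypothetical helper: count completed tasks and list the ones still open.
function summarizeTasks(tasks: Task[]): TaskSummary {
  const pending = tasks.filter((t) => !t.checked).map((t) => t.text);
  return { done: tasks.length - pending.length, total: tasks.length, pending };
}

// Stand-in for html.getTasks().
const summary = summarizeTasks([
  { text: "Write docs", checked: true },
  { text: "Ship release", checked: false },
]);

console.log(`${summary.done}/${summary.total} done; pending: ${summary.pending.join(", ")}`);
```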

You can also get the serialized markup back with `getHtml()`.

## When to use it

* Scraping or analyzing remote pages: fetch with `loadUrl`, then pull out links, images, or headings.
* Extracting a table of contents or outline from rendered HTML via `getHeadings()`.
* Collecting media references (`getImages`, `getVideos`) from user-supplied or fetched markup.
* Parsing checkbox task lists out of HTML (e.g. rendered Markdown) with `getTasks()`.
* Grabbing the clean text content of a document with `getContent()`.

You don't need it if you are only building or templating HTML strings, or if you're already running in a browser with direct DOM access. Reach for it when you need to parse existing markup and extract data from it on the server.