Parsing HTML

Package: @lanexio/parser-grammar-html (Stable)
Layer: 2 (Grammar). Depends only on @lanexio/parser-core.
Runtime: Universal (browser, server, edge worker).

Overview

parseHtml implements the full WHATWG HTML parsing algorithm, including all 23 insertion modes, the adoption agency algorithm, foster parenting, and foreign content (SVG, MathML). It passes every ingested html5lib tree-construction and tokenizer case with empty skip and expected-fail lists. The full package suite has 5,241 passing tests; its only 8 skips are scripting-enabled (#script-on) variants, which no static parser implements.

Parse input is always a Uint8Array. Output is always a LexTree, even for empty or malformed input.

Import

import {
  parseHtml,
  serializeHtml,
  HtmlKind,
  HtmlField,
  HtmlParseMode,
  type ParseHtmlOptions,
  type SerializeHtmlOptions,
} from '@lanexio/parser-grammar-html';

Parse a document

import { parseHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<!doctype html><html><body><p>Hello</p></body></html>'));

console.log(tree.nodeCount);   // total nodes
console.log(tree.root.kind);   // Document root kind id

parseHtml accepts a Uint8Array. Always use TextEncoder when converting a string to bytes.
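
To illustrate the string-to-bytes step, the sketch below uses only the standard TextEncoder API, which always emits UTF-8. The toBytes helper name is hypothetical and not part of the package. It shows that byte length and string length diverge as soon as the markup contains non-ASCII text.

```typescript
// Hypothetical helper (not part of the package): encode markup as UTF-8 bytes.
const sharedEncoder = new TextEncoder();

function toBytes(markup: string): Uint8Array {
  return sharedEncoder.encode(markup);
}

const asciiBytes = toBytes('<p>cafe</p>');
const accentedBytes = toBytes('<p>café</p>');

// ASCII markup: one byte per character.
console.log(asciiBytes.length);    // 11
// 'é' (U+00E9) takes two bytes in UTF-8, so the byte count exceeds the string length.
console.log(accentedBytes.length); // 12
```

Reusing one TextEncoder instance, as above, avoids allocating a new encoder per call.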

ParseHtmlOptions

Field	Type	Default	Description
`mode`	`HtmlParseMode`	`'document'`	Parse as a full document or as a fragment.
`contextElement`	`string`	`undefined`	Context element name for fragment parsing (e.g. `'body'`).
`contextElementLeaf`	`string`	`undefined`	Leaf element name for void-element fragments.
`scriptingEnabled`	`boolean`	`false`	Whether scripting is considered enabled. Affects `<noscript>` parsing.
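
The defaults in the table can be read as the following sketch. The types here are local stand-ins written for illustration, not the package's exported ParseHtmlOptions, and the 'fragment' string value and the resolveParseOptions helper are assumptions.

```typescript
// Local mirror of the options table, for illustration only.
// 'document' is the documented default; 'fragment' is an assumed value.
type ParseModeSketch = 'document' | 'fragment';

interface ParseOptionsSketch {
  mode?: ParseModeSketch;
  contextElement?: string;
  contextElementLeaf?: string;
  scriptingEnabled?: boolean;
}

interface ResolvedParseOptionsSketch {
  mode: ParseModeSketch;
  contextElement: string | undefined;
  contextElementLeaf: string | undefined;
  scriptingEnabled: boolean;
}

// Hypothetical helper: fill in the defaults listed in the table.
function resolveParseOptions(options: ParseOptionsSketch = {}): ResolvedParseOptionsSketch {
  return {
    mode: options.mode ?? 'document',
    contextElement: options.contextElement,
    contextElementLeaf: options.contextElementLeaf,
    scriptingEnabled: options.scriptingEnabled ?? false,
  };
}

const resolvedDefaults = resolveParseOptions();
const resolvedFragment = resolveParseOptions({ mode: 'fragment', contextElement: 'ul' });
```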

Fragment parsing

import { parseHtml, HtmlParseMode } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(
  encoder.encode('<li>Item one</li><li>Item two</li>'),
  { mode: HtmlParseMode.Fragment, contextElement: 'ul' }
);

Fragment parsing follows the same algorithm browsers use for innerHTML. Pass the tag name of the element the markup would be inserted into as contextElement.

Detect LexError nodes

import { parseHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<table><p>bad nesting'));

for (const node of tree.root.children()) {
  if (node.hasError) {
    console.log('parse error at', node.range);
  }
}

parseHtml never throws. Malformed input (bad nesting, unclosed tags, illegal characters) produces LexError nodes in the tree, and the parser always recovers and continues.
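
The loop above inspects only the direct children of the root. A minimal sketch of a whole-tree walk follows, assuming nodes expose hasError, range, and children() as in that example. The ErrorCheckable interface and the collectErrorRanges helper are hypothetical, and whether hasError marks only the offending node or also its ancestors is not specified here.

```typescript
// Minimal structural view of a node, assumed from the example above.
interface ErrorCheckable<R> {
  hasError: boolean;
  range: R;
  children(): Iterable<ErrorCheckable<R>>;
}

// Walk the whole tree with an explicit stack (no recursion) and collect
// the range of every node whose hasError flag is set, in document order.
function collectErrorRanges<R>(rootNode: ErrorCheckable<R>): R[] {
  const ranges: R[] = [];
  const stack: ErrorCheckable<R>[] = [rootNode];
  while (stack.length > 0) {
    const node = stack.pop();
    if (!node) break;
    if (node.hasError) ranges.push(node.range);
    // Push children in reverse so the first child is popped first.
    const kids = Array.from(node.children()).reverse();
    for (const kid of kids) stack.push(kid);
  }
  return ranges;
}
```

With a real tree the call would presumably be collectErrorRanges(tree.root), provided LexNode satisfies this structural shape.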

Navigate by field name

HTML trees carry field metadata, so LexNode.childByField() and fieldName() work out of the box. An Element node’s structural children are addressable by role:

Field name	Child	Description
`tag`	`StartTag`	The element’s start-tag node (source bytes of `<div …>`).
`attrs`	`AttributeList`	Container of `Attribute` nodes.
`name`	`AttributeName`	An attribute’s name (child of `Attribute`).
`value`	`AttributeValue`	An attribute’s value (child of `Attribute`; absent when the attribute has no value).
`body`	`Element` (body role)	Container of the element’s content children.

import { parseHtml, HtmlKind } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<a href="/docs" class="nav">Docs</a>'));

// Find the first Element, then read its attributes by field.
for (const node of tree.root.children()) {
  // The Document's first child is the body-role container; descend to elements.
  for (const el of node.children()) {
    if (el.kind !== HtmlKind.Element) continue;
    const attrList = el.childByField('attrs');
    if (!attrList) continue;
    for (const attr of attrList.children()) {
      const name = attr.childByField('name');
      const value = attr.childByField('value');
      console.log(name?.text, '=', value?.text ?? '(no value)');
      // href = "/docs"   class = "nav"
    }
  }
}

tree.fieldIdForName('attrs') returns the numeric field id for hot loops that compare node.fieldId directly.
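
A hedged sketch of such a hot loop: resolve the field name to its numeric id once, then compare integers inside the loop. The two interfaces are local stand-ins that assume fieldIdForName returns a number and that fieldId is a numeric property; countChildrenWithField is a hypothetical helper.

```typescript
// Structural stand-ins for the tree and node, assumed for illustration.
interface FieldIdSource {
  fieldIdForName(name: string): number;
}

interface FieldTaggedNode {
  fieldId: number;
  children(): Iterable<FieldTaggedNode>;
}

// Count the children of `parent` that occupy the named field.
function countChildrenWithField(
  lookup: FieldIdSource,
  parent: FieldTaggedNode,
  fieldName: string,
): number {
  // Resolve the name once, outside the loop.
  const wanted = lookup.fieldIdForName(fieldName);
  let count = 0;
  for (const child of parent.children()) {
    // Integer comparison only; no string lookups per node.
    if (child.fieldId === wanted) count++;
  }
  return count;
}
```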

Serialize back to HTML

import { parseHtml, serializeHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<p>Hello <b>world</b>'));

const html = serializeHtml(tree);
console.log(html);   // "<html><head></head><body><p>Hello <b>world</b></p></body></html>"

serializeHtml always returns a string and never throws — including on pathologically deep trees (it is fully iterative; 50,000+ levels of nesting are covered by regression tests). Results are memoized per tree: serializing the same LexTree twice returns the cached string.
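
Per-tree memoization of this kind is commonly built on a WeakMap keyed by the tree object. The sketch below shows the general technique with a stand-in object; it is not the package's implementation, and every name in it is hypothetical.

```typescript
// Generic per-object memoizer: a sketch of the idea, not the package's code.
function memoizePerObject<K extends object, V>(compute: (key: K) => V): (key: K) => V {
  // WeakMap entries disappear when the key object is garbage collected.
  const cache = new WeakMap<K, V>();
  return (key: K): V => {
    if (cache.has(key)) return cache.get(key) as V;
    const value = compute(key);
    cache.set(key, value);
    return value;
  };
}

// Demo with a stand-in "tree" and a counter that tracks real work.
let computeCalls = 0;
const serializeOnce = memoizePerObject((doc: { markup: string }) => {
  computeCalls++;
  return doc.markup.toUpperCase();
});

const demoDoc = { markup: '<p>hi</p>' };
serializeOnce(demoDoc);
serializeOnce(demoDoc);
console.log(computeCalls); // 1
```

Keying on object identity means two trees parsed from identical bytes are still cached separately.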

SerializeHtmlOptions

Field	Type	Default	Description
`outer`	`boolean`	`true`	When `true`, serialize the root node and all its children (outerHTML semantics). When `false`, serialize only the children (innerHTML semantics).
`wasmScanner`	`WasmScanExports`	`undefined`	Optional WASM escape-scanner exports for native-speed text/attribute escaping. When omitted, the pure-TypeScript byte path is used. Output is byte-identical either way (cross-checked by an escape-consistency property suite).
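
The outer flag can be pictured with a toy serializer. This is only an illustration of outerHTML versus innerHTML semantics on a made-up node shape: it is recursive, performs no escaping, and shares nothing with serializeHtml beyond the option name.

```typescript
// Toy node shape, invented for this example.
interface ToyElement {
  tag: string;
  children: (ToyElement | string)[];
}

// innerHTML semantics: only the children.
function toyInner(el: ToyElement): string {
  return el.children
    .map((child) => (typeof child === 'string' ? child : toyOuter(child)))
    .join('');
}

// outerHTML semantics: the element itself plus its children.
function toyOuter(el: ToyElement): string {
  return `<${el.tag}>${toyInner(el)}</${el.tag}>`;
}

function toySerialize(el: ToyElement, options: { outer?: boolean } = {}): string {
  // Mirrors the documented default of outer: true.
  return (options.outer ?? true) ? toyOuter(el) : toyInner(el);
}

const toyDiv: ToyElement = {
  tag: 'div',
  children: [
    { tag: 'p', children: ['First'] },
    { tag: 'p', children: ['Second'] },
  ],
};

console.log(toySerialize(toyDiv));                   // <div><p>First</p><p>Second</p></div>
console.log(toySerialize(toyDiv, { outer: false })); // <p>First</p><p>Second</p>
```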

Partial serialization

import { parseHtml, serializeHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<div><p>First</p><p>Second</p></div>'));

// Serialize only the children of the first element
const root = tree.root;
const first = root.child(0);
if (first) {
  console.log(serializeHtml(first, { outer: false }));
}

HtmlKind constants

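import { parseHtml, HtmlKind } from '@lanexio/parser-grammar-html';

// HtmlKind is a const object (the 'as const' pattern), not an enum.
// Derive the union of its numeric ids for type annotations.
const kind: (typeof HtmlKind)[keyof typeof HtmlKind] = HtmlKind.Element;

// Example: walk the whole tree and report only element nodes.
const tree = parseHtml(new TextEncoder().encode('<div><p>Hello</p></div>'));
const cursor = tree.cursor();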
visit: while (true) {
  if (cursor.current.kind === HtmlKind.Element) {
    console.log('element at', cursor.current.range);
  }
  if (cursor.gotoFirstChild()) continue;
  while (!cursor.gotoNextSibling()) {
    if (!cursor.gotoParent()) break visit;
  }
}

HtmlKind is a const object, and its numeric kind IDs are stable across versions. Even so, never use raw numbers: always reference HtmlKind.<name> so that future kind additions don't silently break your code.
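
For readers unfamiliar with the const-object pattern, here is a minimal sketch. DemoKind and its numeric values are invented for illustration; the real ids live in HtmlKind and should be read from there.

```typescript
// Hypothetical kinds and ids; not the values used by HtmlKind.
const DemoKind = {
  Document: 1,
  Element: 2,
  Text: 3,
} as const;

// Union of the numeric ids: 1 | 2 | 3.
type DemoKindId = (typeof DemoKind)[keyof typeof DemoKind];

// Compare against the named constant rather than a raw number.
function isDemoElement(id: DemoKindId): boolean {
  return id === DemoKind.Element;
}

// Reverse lookup, handy for logging.
function demoKindName(id: number): string {
  for (const [kindName, value] of Object.entries(DemoKind)) {
    if (value === id) return kindName;
  }
  return `unknown(${id})`;
}

console.log(isDemoElement(DemoKind.Element)); // true
console.log(demoKindName(DemoKind.Text));     // Text
```

Unlike a TypeScript enum, the const object is a plain JavaScript value at runtime, and the derived union type rejects numbers that are not declared kinds.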

Full exports

Export	Type	Description
`parseHtml`	`(source: Uint8Array, options?: ParseHtmlOptions) => LexTree`	Parse HTML. Never throws.
`serializeHtml`	`(input: LexTree | LexNode, options?: SerializeHtmlOptions) => string`	Serialize to HTML string.
`HtmlKind`	`const` object	Numeric kind IDs for all HTML node types.
`HtmlField`	`const` object	Numeric field IDs (`Tag`, `Attrs`, `Name`, `Value`, `Body`, `Error`, …).
`HTML_FIELD_NAMES_BY_ID`	`Readonly<Record<number, string>>`	Field-name lookup by numeric field ID. Wired into every parsed tree, so `fieldName()`/`childByField()` work without setup.
`HtmlParseMode`	`const` object	`Document`, `Fragment`
`HtmlParseErrorCode`	`const` object	Parse error code constants.
`tokenize`	`(source, options?) => Token[]`	Standalone WHATWG tokenizer. Most applications use `parseHtml` instead.
`TokenKind`	`const` object	Token kind IDs for the standalone tokenizer.
`DEFAULT_CONTENT_MODES`	`ContentModeOverrides`	Default rawtext/RCDATA content-mode table for the tokenizer.
`htmlGrammar`	`LanexioParserPureGrammar`	Grammar descriptor — pass to `createParser` from `@lanexio/parser`.