Parsing HTML

Package: @lanexio/parser-grammar-html Stable
Layer: 2 (Grammar). Depends only on @lanexio/parser-core.
Runtime: Universal (browser, server, edge worker).

parseHtml implements the full WHATWG HTML parsing algorithm, including all 23 insertion modes, the adoption agency algorithm, foster parenting, and foreign content (SVG, MathML). It passes every ingested html5lib tree-construction and tokenizer case with empty skip and expected-fail lists. The full package suite has 5,241 passing tests; its only 8 skips are scripting-enabled (#script-on) variants, which no static parser implements.

Parse input is always a Uint8Array. Output is always a LexTree, even for empty or malformed input.

import {
  parseHtml,
  serializeHtml,
  HtmlKind,
  HtmlField,
  HtmlParseMode,
  type ParseHtmlOptions,
  type SerializeHtmlOptions,
} from '@lanexio/parser-grammar-html';

import { parseHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<!doctype html><html><body><p>Hello</p></body></html>'));
console.log(tree.nodeCount); // total nodes
console.log(tree.root.kind); // Document root kind id

parseHtml accepts a Uint8Array. Always use TextEncoder when converting a string to bytes.
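Because the parser only ever sees bytes, it can help to keep the string-to-bytes step in one place. The sketch below uses only the standard TextEncoder, which always emits UTF-8. The `toBytes` name is hypothetical and not part of the package.

```typescript
// Hypothetical helper (not part of @lanexio/parser-grammar-html).
// A single TextEncoder instance can be reused for every call.
const utf8 = new TextEncoder();

function toBytes(html: string): Uint8Array {
  return utf8.encode(html);
}

const ascii = toBytes('<p>Hello</p>');
const accented = toBytes('<p>é</p>');

console.log(ascii.length);    // 12: one byte per ASCII character
console.log(accented.length); // 9: "é" takes two bytes in UTF-8
```

The returned array is the kind of value that would be handed to parseHtml. Note that byte length and string length differ as soon as the input contains non-ASCII characters.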

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| mode | HtmlParseMode | 'document' | Parse as a full document or as a fragment. |
| contextElement | string | undefined | Context element name for fragment parsing (e.g. 'body'). |
| contextElementLeaf | string | undefined | Leaf element name for void-element fragments. |
| scriptingEnabled | boolean | false | Whether scripting is considered enabled. Affects <noscript> parsing. |

import { parseHtml, HtmlParseMode } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(
  encoder.encode('<li>Item one</li><li>Item two</li>'),
  { mode: HtmlParseMode.Fragment, contextElement: 'ul' }
);

Fragment parsing is how browsers parse innerHTML. Pass the tag name of the context element.

import { parseHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<table><p>bad nesting'));
for (const node of tree.root.children()) {
  if (node.hasError) {
    console.log('parse error at', node.range);
  }
}

parseHtml never throws. Malformed input — bad nesting, unclosed tags, illegal characters — produces LexError nodes in the AST. The parser always recovers and continues.
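The loop above inspects only the root's direct children. To find flagged nodes at any depth, one option is an explicit-stack walk, sketched below. It assumes only the `children()`, `hasError`, and `range` members used in that example. The `ErrorNodeLike` interface and `collectErrorNodes` function are hypothetical names, and since it is not stated here whether `hasError` is also set on ancestors of an error node, the sketch simply reports every node that carries the flag.

```typescript
// Structural stand-in for the node members used in the error example.
// This is an assumption for illustration, not the library's LexNode type.
interface ErrorNodeLike {
  hasError: boolean;
  range: unknown;
  children(): Iterable<ErrorNodeLike>;
}

// Collect every node whose hasError flag is set, at any depth.
// An explicit stack is used, so deep trees do not exhaust the call stack.
function collectErrorNodes(root: ErrorNodeLike): ErrorNodeLike[] {
  const found: ErrorNodeLike[] = [];
  const stack: ErrorNodeLike[] = [root];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (node.hasError) found.push(node);
    for (const child of node.children()) stack.push(child);
  }
  return found;
}

// Mock tree standing in for a parsed document (ranges are plain labels here).
const leaf = (range: string, hasError: boolean): ErrorNodeLike => ({
  hasError,
  range,
  children: () => [],
});
const mockRoot: ErrorNodeLike = {
  hasError: false,
  range: 'root',
  children: () => [
    leaf('a', false),
    { hasError: false, range: 'b', children: () => [leaf('b1', true)] },
  ],
};

console.log(collectErrorNodes(mockRoot).map((n) => n.range)); // [ 'b1' ]
```

With a real tree, `tree.root` would be passed in place of `mockRoot`. Nodes are visited in stack order, not document order, so sort the result if ordering matters.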

HTML trees carry field metadata, so LexNode.childByField() and fieldName() work out of the box. An Element node’s structural children are addressable by role:

| Field name | Child | Description |
| --- | --- | --- |
| tag | StartTag | The element’s start-tag node (source bytes of <div …>). |
| attrs | AttributeList | Container of Attribute nodes. |
| name | AttributeName | An attribute’s name (child of Attribute). |
| value | AttributeValue | An attribute’s value (child of Attribute; absent when the attribute has no value). |
| body | Element (body role) | Container of the element’s content children. |

import { parseHtml, HtmlKind } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<a href="/docs" class="nav">Docs</a>'));
// Find the first Element, then read its attributes by field.
for (const node of tree.root.children()) {
  // The Document's first child is the body-role container; descend to elements.
  for (const el of node.children()) {
    if (el.kind !== HtmlKind.Element) continue;
    const attrList = el.childByField('attrs');
    if (!attrList) continue;
    for (const attr of attrList.children()) {
      const name = attr.childByField('name');
      const value = attr.childByField('value');
      console.log(name?.text, '=', value?.text ?? '(no value)');
      // href = "/docs" class = "nav"
    }
  }
}

tree.fieldIdForName('attrs') returns the numeric field id for hot loops that compare node.fieldId directly.
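A sketch of the hot-loop pattern described above: resolve the field id once, then compare integers inside the loop. Only `fieldIdForName` and `fieldId` come from the text. The two interfaces, the `childrenWithField` helper, the assumption that `fieldIdForName` returns a plain number, and the mock ids are all illustrative.

```typescript
// Structural stand-ins (assumptions, not the library's own types).
interface FieldNodeLike {
  fieldId: number;
  children(): Iterable<FieldNodeLike>;
}
interface FieldTreeLike {
  fieldIdForName(name: string): number;
}

// Return the children of `parent` whose fieldId matches the named field.
function childrenWithField(
  tree: FieldTreeLike,
  parent: FieldNodeLike,
  fieldName: string,
): FieldNodeLike[] {
  const wanted = tree.fieldIdForName(fieldName); // one lookup, outside the loop
  const matches: FieldNodeLike[] = [];
  for (const child of parent.children()) {
    if (child.fieldId === wanted) matches.push(child); // integer comparison only
  }
  return matches;
}

// Mock tree with made-up field ids.
const mockIds: Record<string, number> = { tag: 1, attrs: 2, body: 3 };
const mockTree: FieldTreeLike = {
  fieldIdForName: (name) => mockIds[name] ?? -1,
};
const mockChild = (fieldId: number): FieldNodeLike => ({ fieldId, children: () => [] });
const mockElement: FieldNodeLike = {
  fieldId: 0,
  children: () => [mockChild(1), mockChild(2), mockChild(3)],
};

console.log(childrenWithField(mockTree, mockElement, 'attrs').length); // 1
```

For one-off lookups, childByField() as shown earlier is simpler. The id comparison only pays off when the same field is tested across many nodes.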

import { parseHtml, serializeHtml } from '@lanexio/parser-grammar-html';
const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<p>Hello <b>world</b>'));
const html = serializeHtml(tree);
console.log(html); // "<html><head></head><body><p>Hello <b>world</b></p></body></html>"

serializeHtml always returns a string and never throws — including on pathologically deep trees (it is fully iterative; 50,000+ levels of nesting are covered by regression tests). Results are memoized per tree: serializing the same LexTree twice returns the cached string.
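To illustrate why an iterative serializer is unaffected by nesting depth, here is a minimal sketch over a toy tree. It is not the package's implementation: `ToyNode` and `serializeIterative` are hypothetical, and escaping, attributes, text, and void elements are left out. The only point is that an explicit stack replaces recursion.

```typescript
// Toy tree shape, invented for this sketch.
interface ToyNode {
  tag: string;
  children: ToyNode[];
}

function serializeIterative(root: ToyNode): string {
  const out: string[] = [];
  // Each stack entry is either a node still to be opened
  // or a closing-tag string waiting to be emitted.
  const stack: Array<ToyNode | string> = [root];
  while (stack.length > 0) {
    const item = stack.pop()!;
    if (typeof item === 'string') {
      out.push(item);
      continue;
    }
    out.push(`<${item.tag}>`);
    stack.push(`</${item.tag}>`);
    // Push children in reverse so the first child is popped first.
    for (let i = item.children.length - 1; i >= 0; i--) {
      stack.push(item.children[i]);
    }
  }
  return out.join('');
}

const small: ToyNode = {
  tag: 'p',
  children: [{ tag: 'b', children: [] }, { tag: 'i', children: [] }],
};
console.log(serializeIterative(small)); // <p><b></b><i></i></p>

// Build a chain 50,000 levels deep without recursion, then serialize it.
let deep: ToyNode = { tag: 'a', children: [] };
for (let i = 1; i < 50000; i++) deep = { tag: 'a', children: [deep] };
console.log(serializeIterative(deep).length); // 350000
```

A recursive version of the same function would need one call-stack frame per nesting level, which is what typically fails on inputs this deep.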

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| outer | boolean | true | When true, serialize the root node and all its children (outerHTML semantics). When false, serialize only the children (innerHTML semantics). |
| wasmScanner | WasmScanExports | undefined | Optional WASM escape-scanner exports for native-speed text/attribute escaping. When omitted, the pure-TypeScript byte path is used. Output is byte-identical either way (cross-checked by an escape-consistency property suite). |

import { parseHtml, serializeHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<div><p>First</p><p>Second</p></div>'));
// Serialize only the children of the root's first child
const root = tree.root;
const first = root.child(0);
if (first) {
  console.log(serializeHtml(first, { outer: false }));
}

import { parseHtml, HtmlKind } from '@lanexio/parser-grammar-html';

// HtmlKind is a const object. Use 'as const' pattern, never enum.
const kind: typeof HtmlKind[keyof typeof HtmlKind] = HtmlKind.Element;

// Example: walk only element nodes
const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<p>Hello <b>world</b></p>'));
const cursor = tree.cursor();
visit: while (true) {
  if (cursor.current.kind === HtmlKind.Element) {
    console.log('element at', cursor.current.range);
  }
  // Descend first; otherwise move to the next sibling, climbing as needed.
  if (cursor.gotoFirstChild()) continue;
  while (!cursor.gotoNextSibling()) {
    if (!cursor.gotoParent()) break visit; // back at the root: traversal done
  }
}

HtmlKind is a const object. Numeric kind IDs are stable across versions. Never use raw numbers — always reference HtmlKind.<name> so that future kind additions don’t silently break your code.
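The const-object pattern mentioned above can be sketched with a toy table. `ToyKind` and its numeric values are invented for illustration and do not reflect the real HtmlKind ids. The sketch shows how a union type is derived from the object and how code can stay free of raw numbers.

```typescript
// Toy stand-in for a kind table; the values are made up.
const ToyKind = {
  Document: 0,
  Element: 1,
  Text: 2,
} as const;

// Union of the object's values: 0 | 1 | 2.
type ToyKindId = (typeof ToyKind)[keyof typeof ToyKind];

// Compare against the named member, never against a literal number.
function isElement(kind: ToyKindId): boolean {
  return kind === ToyKind.Element;
}

// Reverse lookup for logging: find the member name for an id.
function kindName(kind: ToyKindId): string {
  for (const [name, id] of Object.entries(ToyKind)) {
    if (id === kind) return name;
  }
  return 'unknown';
}

console.log(isElement(ToyKind.Element)); // true
console.log(isElement(ToyKind.Text));    // false
console.log(kindName(ToyKind.Text));     // Text
```

Unlike a TypeScript enum, the object is an ordinary runtime value with no generated reverse mapping, and the derived union type rejects numbers that are not in the table at compile time.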

| Export | Type | Description |
| --- | --- | --- |
| parseHtml | (source: Uint8Array, options?: ParseHtmlOptions) => LexTree | Parse HTML. Never throws. |
| serializeHtml | (input: LexTree \| LexNode, options?: SerializeHtmlOptions) => string | Serialize to HTML string. |
| HtmlKind | const object | Numeric kind IDs for all HTML node types. |
| HtmlField | const object | Numeric field IDs (Tag, Attrs, Name, Value, Body, Error, …). |
| HTML_FIELD_NAMES_BY_ID | Readonly<Record<number, string>> | Field-name lookup by numeric field ID. Wired into every parsed tree, so fieldName()/childByField() work without setup. |
| HtmlParseMode | const object | Document, Fragment |
| HtmlParseErrorCode | const object | Parse error code constants. |
| tokenize | (source, options?) => Token[] | Standalone WHATWG tokenizer. Most applications use parseHtml instead. |
| TokenKind | const object | Token kind IDs for the standalone tokenizer. |
| DEFAULT_CONTENT_MODES | ContentModeOverrides | Default rawtext/RCDATA content-mode table for the tokenizer. |
| htmlGrammar | LanexioParserPureGrammar | Grammar descriptor; pass to createParser from @lanexio/parser. |