Parsing HTML
Package: @lanexio/parser-grammar-html (Stable)
Layer: 2 (Grammar). Depends only on @lanexio/parser-core.
Runtime: Universal (browser, server, edge worker).
Overview
`parseHtml` implements the full WHATWG HTML parsing algorithm, including all 23 insertion-mode state machines, the adoption agency algorithm, foster parenting, and foreign content (SVG, MathML). It passes every ingested html5lib tree-construction and tokenizer case with empty skip and expected-fail lists. The full package suite has 5,241 passing tests; its only 8 skips are scripting-enabled (#script-on) variants, which no static parser implements.
Parse input is always a Uint8Array. Output is always a LexTree, even for empty or malformed input.
Import
```typescript
import {
  parseHtml,
  serializeHtml,
  HtmlKind,
  HtmlField,
  HtmlParseMode,
  type ParseHtmlOptions,
  type SerializeHtmlOptions,
} from '@lanexio/parser-grammar-html';
```

Parse a document
```typescript
import { parseHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<!doctype html><html><body><p>Hello</p></body></html>'));

console.log(tree.nodeCount); // total nodes
console.log(tree.root.kind); // Document root kind id
```

`parseHtml` accepts a `Uint8Array`. Always use `TextEncoder` when converting a string to bytes.
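Byte length and string length diverge as soon as the input contains non-ASCII characters, which is one reason to leave the conversion to `TextEncoder`. A minimal sketch follows; `toBytes` is a hypothetical helper for illustration, not part of the package.

```typescript
// Hypothetical helper (not part of the package): encode a string as UTF-8 bytes.
const sharedEncoder = new TextEncoder();

function toBytes(source: string): Uint8Array {
  // TextEncoder always produces UTF-8.
  return sharedEncoder.encode(source);
}

const markup = '<p>\u00e9</p>'; // <p>é</p>
const bytes = toBytes(markup);

console.log(markup.length); // 8 UTF-16 code units
console.log(bytes.length);  // 9 bytes: U+00E9 takes two bytes in UTF-8
```

Reusing a single encoder instance, as above, avoids allocating a new one for every call.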
ParseHtmlOptions
| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | `HtmlParseMode` | `'document'` | Parse as a full document or as a fragment. |
| `contextElement` | `string` | `undefined` | Context element name for fragment parsing (e.g. `'body'`). |
| `contextElementLeaf` | `string` | `undefined` | Leaf element name for void-element fragments. |
| `scriptingEnabled` | `boolean` | `false` | Whether scripting is considered enabled. Affects `<noscript>` parsing. |
Fragment parsing
```typescript
import { parseHtml, HtmlParseMode } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(
  encoder.encode('<li>Item one</li><li>Item two</li>'),
  { mode: HtmlParseMode.Fragment, contextElement: 'ul' }
);
```

Fragment parsing is how browsers parse `innerHTML`. Pass the tag name of the context element.
Detect LexError nodes
```typescript
import { parseHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<table><p>bad nesting'));

for (const node of tree.root.children()) {
  if (node.hasError) {
    console.log('parse error at', node.range);
  }
}
```

`parseHtml` never throws. Malformed input (bad nesting, unclosed tags, illegal characters) produces `LexError` nodes in the AST. The parser always recovers and continues.
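The loop above inspects only the root's direct children. To find errors at any depth, the same check can be applied during a full traversal. The sketch below is written against a minimal structural type that mirrors only the members used in this section (`hasError`, `range`, `children()`). `ErrorCarrier` and `collectErrors` are hypothetical names, and it assumes `hasError` is set on the specific nodes that carry an error.

```typescript
// Hypothetical structural type: only the members this section uses.
interface ErrorCarrier<R> {
  readonly hasError: boolean;
  readonly range: R;
  children(): Iterable<ErrorCarrier<R>>;
}

// Collect the ranges of all nodes flagged with hasError, in document order.
// An explicit stack is used so deeply nested input cannot overflow the call stack.
function collectErrors<R>(root: ErrorCarrier<R>): R[] {
  const ranges: R[] = [];
  const stack: ErrorCarrier<R>[] = [root];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (node.hasError) ranges.push(node.range);
    // Push children in reverse so the first child is visited first.
    for (const child of Array.from(node.children()).reverse()) {
      stack.push(child);
    }
  }
  return ranges;
}
```

With a parsed tree this would be called as `collectErrors(tree.root)`, assuming the node type satisfies the shape above.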
Navigate by field name
HTML trees carry field metadata, so `LexNode.childByField()` and `fieldName()` work out of the box. An Element node's structural children are addressable by role:

| Field name | Child | Description |
|---|---|---|
| `tag` | StartTag | The element's start-tag node (source bytes of `<div …>`). |
| `attrs` | AttributeList | Container of Attribute nodes. |
| `name` | AttributeName | An attribute's name (child of Attribute). |
| `value` | AttributeValue | An attribute's value (child of Attribute; absent when the attribute has no value). |
| `body` | Element (body role) | Container of the element's content children. |
```typescript
import { parseHtml, HtmlKind } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<a href="/docs" class="nav">Docs</a>'));

// Find the first Element, then read its attributes by field.
for (const node of tree.root.children()) {
  // The Document's first child is the body-role container; descend to elements.
  for (const el of node.children()) {
    if (el.kind !== HtmlKind.Element) continue;
    const attrList = el.childByField('attrs');
    if (!attrList) continue;
    for (const attr of attrList.children()) {
      const name = attr.childByField('name');
      const value = attr.childByField('value');
      console.log(name?.text, '=', value?.text ?? '(no value)');
      // href = "/docs"
      // class = "nav"
    }
  }
}
```

`tree.fieldIdForName('attrs')` returns the numeric field id for hot loops that compare `node.fieldId` directly.
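As a sketch of that hot-loop pattern: resolve the field id once, then compare integers per node instead of looking up names. The interfaces below are hypothetical stand-ins that mirror only `fieldIdForName` and `fieldId` as named in this section; the return type of `fieldIdForName` is an assumption.

```typescript
// Hypothetical stand-ins mirroring only the members named in this section.
interface FieldedNode {
  readonly fieldId: number;
  children(): Iterable<FieldedNode>;
}
interface FieldedTree {
  fieldIdForName(name: string): number; // assumed to return a plain number
}

// Count the direct children of `parent` that occupy the named field.
function countChildrenInField(
  tree: FieldedTree,
  parent: FieldedNode,
  field: string,
): number {
  const wanted = tree.fieldIdForName(field); // one lookup, outside the loop
  let count = 0;
  for (const child of parent.children()) {
    if (child.fieldId === wanted) count++; // integer comparison only
  }
  return count;
}
```

The name-based `childByField()` remains the more readable choice outside performance-sensitive loops.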
Serialize back to HTML
```typescript
import { parseHtml, serializeHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<p>Hello <b>world</b>'));

const html = serializeHtml(tree);
console.log(html); // "<html><head></head><body><p>Hello <b>world</b></p></body></html>"
```

`serializeHtml` always returns a string and never throws, even on pathologically deep trees: it is fully iterative, and 50,000+ levels of nesting are covered by regression tests. Results are memoized per tree, so serializing the same `LexTree` twice returns the cached string.
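To illustrate why an iterative serializer cannot overflow the call stack, and how per-tree memoization can work, here is a toy sketch. It is not the package's implementation: `ToyNode`, `serializeToy`, and the `WeakMap` cache are assumptions made for illustration, and escaping and void elements are left out.

```typescript
// Toy model for illustration only; not the package's node type.
interface ToyNode {
  readonly tag?: string;  // set for elements
  readonly text?: string; // set for text nodes
  readonly children: readonly ToyNode[];
}

// Cache keyed by root object; entries disappear when the tree is collected.
const toyCache = new WeakMap<ToyNode, string>();

function serializeToy(root: ToyNode): string {
  const cached = toyCache.get(root);
  if (cached !== undefined) return cached;

  const parts: string[] = [];
  // Work items: a node still to emit, or a closing tag waiting for its children.
  const stack: Array<ToyNode | string> = [root];
  while (stack.length > 0) {
    const item = stack.pop()!;
    if (typeof item === 'string') {
      parts.push(item); // pending close tag
    } else if (item.tag === undefined) {
      parts.push(item.text ?? '');
    } else {
      parts.push(`<${item.tag}>`);
      stack.push(`</${item.tag}>`); // popped after all children
      for (let i = item.children.length - 1; i >= 0; i--) {
        stack.push(item.children[i]!);
      }
    }
  }
  const result = parts.join('');
  toyCache.set(root, result);
  return result;
}

// Build 50,000 nested <div> levels, then serialize without recursion.
let deep: ToyNode = { text: 'x', children: [] };
for (let i = 0; i < 50000; i++) deep = { tag: 'div', children: [deep] };
const deepHtml = serializeToy(deep);
console.log(deepHtml.length); // 550001
```

A recursive version of the same function would need one call frame per nesting level, which is what makes very deep input risky.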
SerializeHtmlOptions
| Field | Type | Default | Description |
|---|---|---|---|
| `outer` | `boolean` | `true` | When `true`, serialize the root node and all its children (`outerHTML` semantics). When `false`, serialize only the children (`innerHTML` semantics). |
| `wasmScanner` | `WasmScanExports` | `undefined` | Optional WASM escape-scanner exports for native-speed text/attribute escaping. When omitted, the pure-TypeScript byte path is used. Output is byte-identical either way (cross-checked by an escape-consistency property suite). |
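The escaping mentioned for `wasmScanner` is the step that makes text and attribute values safe to emit. A simplified string-level sketch is shown below, loosely following the HTML fragment serialization rules. `escapeText` and `escapeAttr` are hypothetical names, and this is not the package's byte-level implementation.

```typescript
// Simplified sketch; hypothetical helpers, not the package's implementation.

// Text content: escape &, non-breaking space, < and >.
// The ampersand must be replaced first so later entities are not double-escaped.
function escapeText(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/\u00a0/g, '&nbsp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Double-quoted attribute values: escape &, non-breaking space and ".
function escapeAttr(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/\u00a0/g, '&nbsp;')
    .replace(/"/g, '&quot;');
}

console.log(escapeText('a < b && c')); // a &lt; b &amp;&amp; c
console.log(escapeAttr('say "hi"'));   // say &quot;hi&quot;
```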
Partial serialization
```typescript
import { parseHtml, serializeHtml } from '@lanexio/parser-grammar-html';

const encoder = new TextEncoder();
const tree = parseHtml(encoder.encode('<div><p>First</p><p>Second</p></div>'));

// Serialize only the children of the first element
const root = tree.root;
const first = root.child(0);
if (first) {
  console.log(serializeHtml(first, { outer: false }));
}
```

HtmlKind constants
```typescript
import { parseHtml, HtmlKind } from '@lanexio/parser-grammar-html';

// HtmlKind is a const object. Use 'as const' pattern, never enum.
const kind: (typeof HtmlKind)[keyof typeof HtmlKind] = HtmlKind.Element;

// Example: walk only element nodes
const tree = parseHtml(new TextEncoder().encode('<p>Hello <b>world</b></p>'));
const cursor = tree.cursor();
visit: while (true) {
  if (cursor.current.kind === HtmlKind.Element) {
    console.log('element at', cursor.current.range);
  }
  // Pre-order walk: descend first, then move sideways, then climb.
  if (cursor.gotoFirstChild()) continue;
  while (!cursor.gotoNextSibling()) {
    if (!cursor.gotoParent()) break visit;
  }
}
```

`HtmlKind` is a const object. Numeric kind IDs are stable across versions. Never use raw numbers; always reference `HtmlKind.<name>` so that future kind additions don't silently break your code.
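For readers unfamiliar with the const-object pattern, the sketch below shows how an `as const` object plus a derived union type can stand in for an enum. `ToyKind` and its numeric values are invented for illustration and are not the real `HtmlKind` ids.

```typescript
// Illustrative only: names and numbers are invented, not the real HtmlKind ids.
const ToyKind = {
  Document: 0,
  Element: 1,
  Text: 2,
} as const;

// Union of the object's values: 0 | 1 | 2.
type ToyKindId = (typeof ToyKind)[keyof typeof ToyKind];

// Compare against named constants, never against raw numbers.
function isElement(kind: ToyKindId): boolean {
  return kind === ToyKind.Element;
}

console.log(isElement(ToyKind.Element)); // true
console.log(isElement(ToyKind.Text));    // false

// A value outside the union is rejected at compile time:
// const bad: ToyKindId = 7; // error: 7 is not assignable to 0 | 1 | 2
```

Unlike an enum, the object adds no generated runtime code beyond a plain object literal, and its value type is derived from the object itself.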
Full exports
| Export | Type | Description |
|---|---|---|
| `parseHtml` | `(source: Uint8Array, options?: ParseHtmlOptions) => LexTree` | Parse HTML. Never throws. |
| `serializeHtml` | `(input: LexTree \| LexNode, options?: SerializeHtmlOptions) => string` | Serialize to HTML string. |
| `HtmlKind` | const object | Numeric kind IDs for all HTML node types. |
| `HtmlField` | const object | Numeric field IDs (Tag, Attrs, Name, Value, Body, Error, …). |
| `HTML_FIELD_NAMES_BY_ID` | `Readonly<Record<number, string>>` | Field-name lookup by numeric field ID. Wired into every parsed tree, so `fieldName()`/`childByField()` work without setup. |
| `HtmlParseMode` | const object | Document, Fragment |
| `HtmlParseErrorCode` | const object | Parse error code constants. |
| `tokenize` | `(source, options?) => Token[]` | Standalone WHATWG tokenizer. Most applications use `parseHtml` instead. |
| `TokenKind` | const object | Token kind IDs for the standalone tokenizer. |
| `DEFAULT_CONTENT_MODES` | `ContentModeOverrides` | Default rawtext/RCDATA content-mode table for the tokenizer. |
| `htmlGrammar` | `LanexioParserPureGrammar` | Grammar descriptor; pass to `createParser` from `@lanexio/parser`. |