Parsing Markdown

Package: @lanexio/parser-grammar-markdown (Stable)
Layer: 2 (Grammar). Depends only on @lanexio/parser-core.
Runtime: Universal (browser, server, edge worker). Serializer: serializeMarkdown — full roundtrip (MD-S4).

Overview

parseMarkdown produces a flat AST from CommonMark 0.31.2 and GitHub Flavored Markdown (GFM) documents. The full spec-example corpora for both dialects are ingested as tests with empty skip and expected-fail lists (1,898 passing tests in the full package suite). GFM extensions are enabled by default. serializeMarkdown roundtrips the parsed tree back to source text — parse → serialize → re-parse produces a structurally equivalent tree.

Import

import {
  parseMarkdown,
  serializeMarkdown,
  MdKind,
  MdField,
  type ParseMarkdownOptions,
} from '@lanexio/parser-grammar-markdown';

Serialize back to Markdown

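serializeMarkdown converts a parsed Markdown tree back to source text. Parse → serialize → re-parse produces a structurally identical AST for all CommonMark and GFM features. The output text is normalized rather than byte-identical to the input: in the example below, a trailing newline is added.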

import { parseMarkdown, serializeMarkdown } from '@lanexio/parser-grammar-markdown';

const encoder = new TextEncoder();
const tree = parseMarkdown(encoder.encode('# Hello\n\nA paragraph with **bold** text.'));
const markdown = serializeMarkdown(tree);
// → "# Hello\n\nA paragraph with **bold** text.\n"

// Golden roundtrip: parse → serialize → re-parse produces identical AST
const tree2 = parseMarkdown(encoder.encode(markdown));
// tree2 is structurally identical to tree
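
One way to check the roundtrip claim in a test is to compare the two trees node by node. The sketch below assumes only that each node exposes `kind` and `children()`, as the examples in this document do. `NodeLike` and `sameShape` are hypothetical names, not part of the package, and the comparison deliberately ignores ranges because the serialized text may differ in whitespace from the original.

```typescript
// Hypothetical minimal node shape; the real node type may expose more.
interface NodeLike {
  kind: number;
  children(): Iterable<NodeLike>;
}

// Compare two trees by node kind and child order. An explicit work list
// is used instead of recursion so deep trees do not overflow the call stack.
function sameShape(a: NodeLike, b: NodeLike): boolean {
  const pending: Array<[NodeLike, NodeLike]> = [[a, b]];
  while (pending.length > 0) {
    const [x, y] = pending.pop()!;
    if (x.kind !== y.kind) return false;
    const xs = Array.from(x.children());
    const ys = Array.from(y.children());
    if (xs.length !== ys.length) return false;
    for (let i = 0; i < xs.length; i++) {
      pending.push([xs[i], ys[i]]);
    }
  }
  return true;
}
```

With the trees from the example above, `sameShape(tree.root, tree2.root)` would be expected to return `true`, assuming `tree.root` satisfies this shape.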

How it works

Node	Serialization
`Document`	Children separated by blank lines
`AtxHeading`	`#` × level + space + content
`SetextHeading`	Content + `\n===` (h1) or `\n---` (h2)
`Paragraph`	Inline content + newline
`Text`	Escaped literal text
`Emphasis` / `Strong`	`*content*` / `**content**`
`CodeSpan`	Backtick-delimited with escaping
`Link` / `Image`	`[text](url)` / `![alt](url)` (inline-only)
`Autolink`	`<url>`
`BlockQuote`	`>` prefix per line
`List` (unordered)	`-` bullet
`List` (ordered)	`1.` , `2.` , …
`Table` (GFM)	Pipe table with alignment
`ThematicBreak`	`---`
`FencedCodeBlock`	``` + info + content + ```
`IndentedCodeBlock`	4-space indented lines
`HtmlBlock` / `RawHtml`	Verbatim
`CharacterReference`	`&#NNN;` verbatim

Characters that would trigger inline formatting (*, _, `, [, ], ~, \) are backslash-escaped to preserve roundtrip fidelity.
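
The escaping rule can be illustrated with a small standalone function. This is a minimal sketch, not the package's implementation: `escapeInline` and `INLINE_SPECIALS` are hypothetical names, and the sketch escapes every listed character unconditionally, whereas a real serializer may escape only where the character would actually be ambiguous.

```typescript
// The characters listed above as triggering inline formatting.
const INLINE_SPECIALS = new Set(['*', '_', '`', '[', ']', '~', '\\']);

// Prefix each special character with a backslash; copy everything else as is.
function escapeInline(text: string): string {
  let out = '';
  for (const ch of text) {
    out += INLINE_SPECIALS.has(ch) ? '\\' + ch : ch;
  }
  return out;
}

console.log(escapeInline('2 * 3 [draft]')); // prints: 2 \* 3 \[draft\]
```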

serializeMarkdown always returns a string and never throws — it is fully iterative, and 50,000-level-deep block-quote nesting is covered by a regression test in the never-throw suite.
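
To show why an iterative design matters at this depth, here is a simplified sketch that serializes nested block quotes with a loop rather than recursion. The `Block` type and `serializeQuote` are hypothetical and far smaller than the real tree: each quote is assumed to hold exactly one child, and the innermost block is a single-line paragraph.

```typescript
// Hypothetical toy model: a quote wraps exactly one child block.
type Block =
  | { type: 'paragraph'; text: string }
  | { type: 'quote'; child: Block };

// Walk down the nesting with a loop, accumulating one "> " per level.
// No function call is made per level, so depth is bounded by memory only.
function serializeQuote(root: Block): string {
  let prefix = '';
  let current: Block = root;
  for (;;) {
    if (current.type === 'paragraph') {
      return prefix + current.text + '\n';
    }
    prefix += '> ';
    current = current.child;
  }
}

// Build a 50,000-level-deep quote iteratively, then serialize it.
let deep: Block = { type: 'paragraph', text: 'hi' };
for (let i = 0; i < 50000; i++) {
  deep = { type: 'quote', child: deep };
}
const deepOutput = serializeQuote(deep);
console.log(deepOutput.length); // 100003: 50,000 two-character prefixes plus "hi\n"
```

A recursive version of the same function would add one stack frame per level and could exceed the engine's call-stack limit on input like this.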

Parse a document

import { parseMarkdown } from '@lanexio/parser-grammar-markdown';

const encoder = new TextEncoder();
const tree = parseMarkdown(encoder.encode('# Hello\n\nA paragraph with **bold** text.'));

console.log(tree.nodeCount);
console.log(tree.root.kind);   // Document root kind id

parseMarkdown accepts a Uint8Array, not a string, so encode string input with TextEncoder first. parseMarkdown never throws.
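
Because the input is UTF-8 bytes, offsets into it are not the same as JavaScript string indices once the text contains non-ASCII characters. The sketch below uses only the standard `TextEncoder` and `TextDecoder`. It assumes that positions reported by the parser are byte offsets into the input, which this document does not state, and `sliceBytes` is a hypothetical helper.

```typescript
const utf8Encoder = new TextEncoder();
const utf8Decoder = new TextDecoder();

// The string "héllo **wörld**", written with escapes to pin the code points.
const source = 'h\u00e9llo **w\u00f6rld**';
const bytes = utf8Encoder.encode(source);

console.log(source.length); // 15 UTF-16 code units
console.log(bytes.length);  // 17 UTF-8 bytes: é and ö take two bytes each

// Hypothetical helper: recover text for a byte range [start, end).
function sliceBytes(input: Uint8Array, start: number, end: number): string {
  return utf8Decoder.decode(input.subarray(start, end));
}

console.log(sliceBytes(bytes, 9, 15)); // "wörld"
console.log(source.slice(9, 15));      // "örld**", string indices do not line up
```

If ranges turn out to be byte offsets, slicing the encoded bytes rather than the original string keeps the extracted text aligned.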

ParseMarkdownOptions

Field	Type	Default	Description
`gfm`	`boolean`	`true`	Enable GitHub Flavored Markdown extensions (tables, task lists, strikethrough, autolinks).
`extendedAutolink`	`boolean`	`true` (when `gfm: true`)	Enable GFM extended autolink detection. Has no effect when `gfm: false`.
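
The interaction between the two defaults can be written out as a small function. This is a sketch of the behavior the table describes, not the package's code: `OptionsSketch` and `resolveOptions` are hypothetical names.

```typescript
// Hypothetical mirror of the two documented fields.
interface OptionsSketch {
  gfm?: boolean;
  extendedAutolink?: boolean;
}

// Resolve defaults as the table describes them.
function resolveOptions(options: OptionsSketch = {}): Required<OptionsSketch> {
  const gfm = options.gfm ?? true;
  // extendedAutolink defaults to true under GFM and has no effect without it.
  const extendedAutolink = gfm && (options.extendedAutolink ?? true);
  return { gfm, extendedAutolink };
}

console.log(resolveOptions());
// { gfm: true, extendedAutolink: true }
console.log(resolveOptions({ gfm: false, extendedAutolink: true }));
// { gfm: false, extendedAutolink: false }
```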

CommonMark-only mode

import { parseMarkdown } from '@lanexio/parser-grammar-markdown';

const encoder = new TextEncoder();
const tree = parseMarkdown(
  encoder.encode('# CommonMark only\n\nNo GFM tables or task lists.'),
  { gfm: false }
);

GFM tables

import { parseMarkdown } from '@lanexio/parser-grammar-markdown';

const encoder = new TextEncoder();
const markdown = `
| Name     | Score |
| -------- | ----- |
| Alice    | 100   |
| Bob      | 95    |
`;
const tree = parseMarkdown(encoder.encode(markdown));  // gfm: true by default

Detect LexError nodes

import { parseMarkdown } from '@lanexio/parser-grammar-markdown';

const encoder = new TextEncoder();
const tree = parseMarkdown(encoder.encode('This is valid. [incomplete link'));

for (const node of tree.root.children()) {
  if (node.hasError) {
    console.log('parse error at', node.range);
  }
}

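parseMarkdown never throws. Malformed input produces LexError nodes instead, which the loop above detects through node.hasError. CommonMark is forgiving by design, so many inputs that look like “errors” are actually valid by spec: an unclosed [ like the one above is literal text under CommonMark, so it may not yield an error node at all.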
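
The loop above inspects only the direct children of the root. To look at every node, a full traversal is needed. The sketch below assumes only the members used in this document (`hasError`, `range`, `children()`). `ErrorNodeLike` and `collectErrorNodes` are hypothetical names, and whether `hasError` is set only on the error node itself or also on its ancestors is not stated here, so the result may include containers.

```typescript
// Hypothetical minimal node shape based on the members used above.
interface ErrorNodeLike {
  hasError: boolean;
  range: unknown;
  children(): Iterable<ErrorNodeLike>;
}

// Collect every node flagged with hasError, in document (pre-order) order,
// using an explicit stack instead of recursion.
function collectErrorNodes(root: ErrorNodeLike): ErrorNodeLike[] {
  const found: ErrorNodeLike[] = [];
  const stack: ErrorNodeLike[] = [root];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (node.hasError) found.push(node);
    const kids = Array.from(node.children());
    // Push in reverse so the first child is visited first.
    for (let i = kids.length - 1; i >= 0; i--) {
      stack.push(kids[i]);
    }
  }
  return found;
}
```

Assuming `tree.root` satisfies this shape, `collectErrorNodes(tree.root).map((n) => n.range)` would list the ranges of all flagged nodes.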

Markdown serializer

serializeMarkdown is available as of v1.0 (MD-S4). See Serialize back to Markdown above for usage.

The CLI serialize subcommand supports Markdown input:

parser serialize --grammar markdown document.md

MdKind constants

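import { parseMarkdown, MdKind } from '@lanexio/parser-grammar-markdown';

const encoder = new TextEncoder();
const tree = parseMarkdown(encoder.encode('# Title\n\nBody text.'));

// Depth-first walk over every node using a cursor (no recursion).
const cursor = tree.cursor();
visit: while (true) {
  if (cursor.current.kind === MdKind.AtxHeading) {
    console.log('heading at', cursor.current.range);
  }
  // Descend first; otherwise move to the next sibling, climbing as needed.
  if (cursor.gotoFirstChild()) continue;
  while (!cursor.gotoNextSibling()) {
    if (!cursor.gotoParent()) break visit; // back at the root: done
  }
}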

MdKind is a const object of numeric kind IDs. Compare node.kind against its named members (for example MdKind.AtxHeading) rather than raw numbers.

Full exports

Export	Type	Description
`parseMarkdown`	`(source: Uint8Array, options?: ParseMarkdownOptions) => LexTree`	Parse Markdown. Never throws.
`serializeMarkdown`	`(input: LexTree | LexNode) => string`	Serialize to Markdown source. Never throws; fully iterative.
`MdKind`	`const` object	Numeric kind IDs for all Markdown node types.
`MdField`	`const` object	Numeric field IDs for Markdown element slots.
`MD_FIELD_NAMES_BY_ID`	`Readonly<Record<number, string>>`	Field-name lookup by numeric field ID.
`MdParseErrorCode`	`const` object	Parse error code constants.
`markdownGrammar`	`LanexioParserPureGrammar`	Grammar descriptor — pass to `createParser` from `@lanexio/parser`.