modules/html/tokenizer.zzm

NAME

html/tokenizer - HTML input stream and tokenizer.

SYNOPSIS

  from html/tokenizer import HTMLTokenizer;

  let tokenizer := new HTMLTokenizer( _input: "<p title='x'>&amp;</p>" );
  let tokens := tokenizer.tokenize();

NOTE

This module is not normally useful to end users; use html/parser instead.

DESCRIPTION

This module implements the tokenizer layer for html/parser. It accepts already-decoded ZuzuScript strings, normalizes line endings, tracks source position, emits HTML tokenizer tokens, and records non-fatal parse errors with line, column, offset, and tokenizer state.
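The line-ending normalization and position tracking described above can be illustrated with a short sketch. It is written in Python purely as a language-neutral illustration and is not the module's source. The class name InputStreamSketch is hypothetical, and the choice of 1-based line and column numbers with a 0-based offset is an assumption, since this page does not state the numbering base.

```python
class InputStreamSketch:
    """Illustrative stand-in; NOT the module's HTMLInputStream."""

    def __init__(self, source: str) -> None:
        # Replace CRLF before lone CR so that "\r\n" becomes one LF, not two.
        self.source = source.replace("\r\n", "\n").replace("\r", "\n")
        self.offset = 0   # assumption: 0-based index into the normalized text
        self.line = 1     # assumption: 1-based line number
        self.column = 1   # assumption: 1-based column number

    def eof(self) -> bool:
        return self.offset >= len(self.source)

    def peek(self) -> str:
        # Look at the next character without consuming it.
        return "" if self.eof() else self.source[self.offset]

    def consume(self) -> str:
        # Return the next character and advance the position counters.
        if self.eof():
            return ""
        ch = self.source[self.offset]
        self.offset += 1
        if ch == "\n":
            self.line += 1
            self.column = 1
        else:
            self.column += 1
        return ch


stream = InputStreamSketch("a\r\nb\rc")
consumed = []
while not stream.eof():
    consumed.append(stream.consume())
```

Normalizing once, up front, means every later position is reported against the normalized text, so a parse error's line and column agree with what the tokenizer actually consumed.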

It intentionally does not build a DOM tree. html/parser re-exports these classes for focused tokenizer tests and for the tree builder. The tokenizer exposes setAllowCDATA and allowCDATA so the tree builder can recognise CDATA sections only while processing SVG or MathML foreign content.

EXPORTS

Classes

HTMLTokenizer
Tokenizer for HTML strings. Construct it with _input, or call reset(String input, String state?) to reuse an existing instance. tokenize() returns all tokens. nextToken() returns one token per call, ending with an EOF token.

Public state methods are state, setState, setLastStartTagName, lastStartTagName, setAllowCDATA, allowCDATA, and errors. setState accepts tokenizer state names such as data, rcdata, rawtext, script_data, and plaintext. errors() returns a copy of the parse errors emitted during the last tokenization run.
HTMLInputStream
Input stream used by HTMLTokenizer. It normalizes CRLF and CR line endings to LF, tracks source position, and exposes source, offset, line, column, lastOffset, lastLine, lastColumn, eof, consume, reconsume, peek, and match. Most users should use HTMLTokenizer instead of consuming the stream directly.
HTMLToken
Token object emitted by the tokenizer. type() returns values such as start_tag, end_tag, characters, comment, doctype, and eof. Other accessors are data, tagName, attributes, getAttribute, hasAttribute, selfClosing, publicId, systemId, forceQuirks, and toDebugString.
HTMLParseError
Non-fatal tokenizer or tree-construction error. Accessors are code, message, line, column, offset, state, and to_String.
HTMLNamedCharacterReferences
Small named-reference table wrapper. table() returns the current mapping, get(String name) returns a named reference or null, isComplete() returns false, and coverage() describes the partial coverage.
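
The state names accepted by setState correspond to the content models that a tree builder selects after certain start tags. The following sketch, in Python for illustration only, shows one way a caller might choose the state name to pass. The element lists follow the WHATWG HTML parsing rules; the function name state_after_start_tag is hypothetical, and whether html/parser makes the choice in exactly this way is an assumption. The noscript case, which depends on the scripting flag, is left out.

```python
# Elements whose content is tokenized as RCDATA (character references
# are decoded, but tags are not recognised).
RCDATA_ELEMENTS = {"title", "textarea"}

# Elements whose content is tokenized as RAWTEXT (neither character
# references nor tags are recognised).
RAWTEXT_ELEMENTS = {"style", "xmp", "iframe", "noembed", "noframes"}


def state_after_start_tag(tag_name: str) -> str:
    """Return a tokenizer state name for the content after a start tag.

    Hypothetical helper: the returned strings are the state names listed
    in this page, but the helper itself is not part of html/tokenizer.
    """
    name = tag_name.lower()
    if name in RCDATA_ELEMENTS:
        return "rcdata"
    if name in RAWTEXT_ELEMENTS:
        return "rawtext"
    if name == "script":
        return "script_data"
    if name == "plaintext":
        return "plaintext"
    return "data"
```

A caller that switches state this way would normally also record the start tag name (see setLastStartTagName), because the tokenizer needs it to recognise the matching end tag that terminates RCDATA, RAWTEXT, or script data content.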

CHARACTER REFERENCES

Numeric decimal and hexadecimal references are implemented, including HTML replacement handling for null, surrogate, out-of-range, and C1 Windows-1252 values. The named-reference table is deliberately partial and covers the common entities needed by the focused tokenizer suite: amp, lt, gt, quot, apos, nbsp, copy, reg, and not, with their semicolon forms and the legacy no-semicolon forms used by HTML tokenization.
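
The replacement handling for numeric references can be sketched as follows. The code is Python, used only for illustration; it follows the replacement rules of the WHATWG HTML tokenizer as generally understood and is not taken from this module's source. The function name decode_numeric_reference is hypothetical, and the sketch returns only the resulting character, without recording the parse errors that a real tokenizer would also emit.

```python
REPLACEMENT_CHARACTER = "\ufffd"

# C1 control code points that HTML remaps to the characters Windows-1252
# assigns to the same byte values. 0x81, 0x8D, 0x8F, 0x90, and 0x9D have
# no mapping and are passed through unchanged.
C1_REPLACEMENTS = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E,
    0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6,
    0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
    0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019, 0x93: 0x201C,
    0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A,
    0x9C: 0x0153, 0x9E: 0x017E, 0x9F: 0x0178,
}


def decode_numeric_reference(code_point: int) -> str:
    """Map the number from &#N; or &#xN; to the character to emit."""
    if code_point == 0:
        return REPLACEMENT_CHARACTER      # null
    if code_point > 0x10FFFF:
        return REPLACEMENT_CHARACTER      # outside the Unicode range
    if 0xD800 <= code_point <= 0xDFFF:
        return REPLACEMENT_CHARACTER      # surrogate
    if code_point in C1_REPLACEMENTS:
        return chr(C1_REPLACEMENTS[code_point])
    return chr(code_point)
```

For example, under these rules &#x80; yields the euro sign (U+20AC) rather than a C1 control character, while &#0; and &#xD800; both yield U+FFFD.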

LIMITATIONS

This module provides no DOM tree construction, no html5lib .dat harness, no full CSS selector support, and no full WHATWG named-character-reference table.

COPYRIGHT AND LICENCE

html/tokenizer is copyright Toby Inkster.

It is free software; you may redistribute it and/or modify it under the terms of either the Artistic License 1.0 or the GNU General Public License version 2.