NAME
html/tokenizer - HTML input stream and tokenizer.
SYNOPSIS
from html/tokenizer import HTMLTokenizer;
let tokenizer := new HTMLTokenizer( _input: "<p title='x'>&</p>" );
let tokens := tokenizer.tokenize();
NOTE
This module is not normally useful to end users; use html/parser instead.
DESCRIPTION
This module implements the tokenizer layer for html/parser. It accepts already-decoded ZuzuScript strings, normalizes line endings, tracks source position, emits HTML tokenizer tokens, and records non-fatal parse errors with line, column, offset, and tokenizer state.
It intentionally does not build a DOM tree. html/parser re-exports these classes for focused tokenizer tests and for the tree builder. The tokenizer exposes setAllowCDATA and allowCDATA so the tree builder can recognise CDATA sections only while processing SVG or MathML foreign content.
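The line-ending normalization and position tracking described above can be illustrated with a short Python sketch. This is not the module's implementation: the function names are hypothetical, and the 1-based line and column numbering is an assumption, since the numbering convention is not stated here.

```python
from typing import Tuple


def normalize_newlines(text: str) -> str:
    """Replace CRLF pairs and lone CR characters with LF."""
    # CRLF must be handled first so that it does not become two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def position_at(text: str, offset: int) -> Tuple[int, int]:
    """Return (line, column) for an offset into already-normalized text.

    Assumption: lines and columns are both 1-based.
    """
    before = text[:offset]
    line = before.count("\n") + 1
    # rfind returns -1 when no newline precedes the offset, so the
    # start of the current line is index 0 in that case.
    line_start = before.rfind("\n") + 1
    column = offset - line_start + 1
    return line, column


source = normalize_newlines("<p>\r\nhi\r</p>")
print(repr(source))
print(position_at(source, 4))
```

Normalizing before computing positions means a reported offset always refers to the LF-only text, which is the form the tokenizer sees.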
EXPORTS
Classes
HTMLTokenizer
Tokenizer for HTML strings. Construct it with _input or call reset(String input, String state?) to reuse it. tokenize() returns all tokens. nextToken() returns one token at a time and eventually an EOF token.
Public state methods are state, setState, setLastStartTagName, lastStartTagName, setAllowCDATA, allowCDATA, and errors. setState accepts tokenizer state names such as data, rcdata, rawtext, script_data, and plaintext. errors() returns a copy of the parse errors emitted during the last tokenization run.
HTMLInputStream
Input stream used by HTMLTokenizer. It normalizes CRLF and CR line endings to LF, tracks source position, and exposes source, offset, line, column, lastOffset, lastLine, lastColumn, eof, consume, reconsume, peek, and match. Most users should use HTMLTokenizer instead of consuming the stream directly.
HTMLToken
Token object emitted by the tokenizer. type() returns values such as start_tag, end_tag, characters, comment, doctype, and eof. Other accessors are data, tagName, attributes, getAttribute, hasAttribute, selfClosing, publicId, systemId, forceQuirks, and toDebugString.
HTMLParseError
Non-fatal tokenizer or tree-construction error. Accessors are code, message, line, column, offset, state, and to_String.
HTMLNamedCharacterReferences
Small named-reference table wrapper. table() returns the current mapping, get(String name) returns a named reference or null, isComplete() returns false, and coverage() describes the partial coverage.
CHARACTER REFERENCES
Numeric decimal and hexadecimal references are implemented, including HTML replacement handling for null, surrogate, out-of-range, and C1 Windows-1252 values. The named-reference table is deliberately partial and covers the common entities needed by the focused tokenizer suite: amp, lt, gt, quot, apos, nbsp, copy, reg, and not, with their semicolon forms and the legacy no-semicolon forms used by HTML tokenization.
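As an illustration of the replacement handling for numeric references, the following Python sketch follows the rules of the WHATWG HTML Standard. It is not this module's code: the function name and signature are hypothetical, and it only returns the resulting character without recording the parse errors a tokenizer would also emit.

```python
REPLACEMENT = "\ufffd"

# C1 code points that the WHATWG HTML Standard remaps to the
# characters Windows-1252 assigns to the same byte values.
C1_REPLACEMENTS = {
    0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E,
    0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6,
    0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
    0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019, 0x93: 0x201C,
    0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
    0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A,
    0x9C: 0x0153, 0x9E: 0x017E, 0x9F: 0x0178,
}


def resolve_numeric_reference(digits: str, hexadecimal: bool = False) -> str:
    """Map the digits of a numeric reference to the character it denotes."""
    code_point = int(digits, 16 if hexadecimal else 10)
    if code_point == 0:
        return REPLACEMENT  # null reference
    if code_point > 0x10FFFF:
        return REPLACEMENT  # outside the Unicode range
    if 0xD800 <= code_point <= 0xDFFF:
        return REPLACEMENT  # surrogate
    # Remap C1 controls that have a Windows-1252 equivalent;
    # every other code point passes through unchanged.
    return chr(C1_REPLACEMENTS.get(code_point, code_point))


print(resolve_numeric_reference("38"))                     # decimal reference
print(resolve_numeric_reference("80", hexadecimal=True))   # C1 remapping
```

C1 values with no Windows-1252 equivalent (0x81, 0x8D, 0x8F, 0x90, 0x9D) are absent from the table and pass through unchanged in this sketch.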
LIMITATIONS
This module has no DOM tree construction, no html5lib .dat harness, no full CSS selector support, and no full WHATWG named-character-reference table.
COPYRIGHT AND LICENCE
html/tokenizer is copyright Toby Inkster.
It is free software; you may redistribute it and/or modify it under the terms of either the Artistic License 1.0 or the GNU General Public License version 2.