codsen-parser vs. hyntax


Imagine the HTML: <a></b>.

Parser-wise, how would you design the AST architecture, considering cases like the one above?

The hyntax HTML parser, for example, interprets it like this:

{
  "nodeType": "document",
  "content": {
    "children": [
      {
        "nodeType": "tag",
        "parentRef": "[Circular ~]",
        "content": {
          "openStart": {
            "type": "token:open-tag-start",
            "content": "<a",
            "startPosition": 0,
            "endPosition": 1
          },
          "name": "a",
          "openEnd": {
            "type": "token:open-tag-end",
            "content": ">",
            "startPosition": 2,
            "endPosition": 2
          },
          "selfClosing": false
        }
      }
    ]
  }
}

Notice the </b> is not even mentioned here!
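For reference, here is roughly how that tree can be obtained. This is a minimal sketch using hyntax's documented tokenize() and constructTree() exports:

const { tokenize, constructTree } = require("hyntax");
const util = require("util");

const { tokens } = tokenize("<a></b>");
const { ast } = constructTree(tokens);

// the tree carries circular parentRef links, so a plain JSON.stringify
// would throw; util.inspect copes with the circular references
console.log(util.inspect(ast, { depth: null }));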

In contrast, codsen-parser produces:

[
  {
    "type": "tag",
    "start": 0,
    "end": 3,
    "value": "<a>",
    "tagNameStartsAt": 1,
    "tagNameEndsAt": 2,
    "tagName": "a",
    "recognised": true,
    "closing": false,
    "void": false,
    "pureHTML": true,
    "kind": "inline",
    "attribs": [],
    "children": []
  },
  {
    "type": "tag",
    "start": 3,
    "end": 7,
    "value": "</b>",
    "tagNameStartsAt": 5,
    "tagNameEndsAt": 6,
    "tagName": "b",
    "recognised": true,
    "closing": true,
    "void": false,
    "pureHTML": true,
    "kind": "inline",
    "attribs": [],
    "children": []
  }
]
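
A call along these lines should reproduce that output; a minimal sketch, assuming the package's named export is cparser:

const { cparser } = require("codsen-parser");

// the result is a plain array of token objects, nested via "children"
console.log(JSON.stringify(cparser("<a></b>"), null, 2));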

The takeaway is that "normal" parsers like hyntax are not aimed at tackling broken code. That's why I'm working on codsen-parser and codsen-tokenizer. The tokenizer builds tokens (plain objects), and the parser consumes the tokenizer's output, nesting those tokens into an object tree, as the sketch below illustrates.
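Here is a minimal sketch of the tokenizer side, assuming codsen-tokenizer's named export is tokenizer and that it reports each token via a tagCb callback:

const { tokenizer } = require("codsen-tokenizer");

tokenizer("<a></b>", {
  tagCb: (token) => {
    // each token is a flat, plain object: type, start, end, value and so on
    console.log(token.type, token.value);
  },
});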

I decided to ship the parser and tokenizer as separate packages because somebody might not need the AST, the nested object tree, and could work with the flat token stream alone.

Related packages:

📦 codsen-parser 0.12.1
Parser aiming at broken or mixed code, especially HTML & CSS
📦 codsen-tokenizer 6.0.1
HTML and CSS lexer aimed at code with fatal errors, accepts mixed coding languages
📦 hyntax
Straightforward HTML parser for Node.js and browser