Recursive HTML entity decoding for Ranges workflow

§ Quick Take

import { strict as assert } from "assert";
import decode from "ranges-ent-decode";

// see codsen.com/ranges/
assert.deepEqual(decode("a &#x26; b &amp; c"), [
  [2, 8, "&"], // <--- that's Ranges notation, instructing to replace
  [11, 16, "&"],
]);

§ Purpose

This is a wrapper around decode() from he.js, the market-leading HTML entity decoder. Instead of a decoded string, it returns ranges.

We tested the hell out of the code, both directly and up the dependency stream. As a cherry on top, all he.js unit tests were ported to node-tap and they pass.


decode(str, [opts])

In other words, it's a function which takes a string and an optional options object.

§ API - Input

| Input argument | Type         | Obligatory? | Description                                        |
| -------------- | ------------ | ----------- | -------------------------------------------------- |
| input          | String       | yes         | Text you want to decode                            |
| opts           | Plain object | no          | The Optional Options Object, see below for its API |

If any supplied input argument is of a different type, an error is thrown.

§ API - Output

Returns ranges: null, or an array of one or more range arrays.

§ Optional Options Object

The Optional Options Object matches the he.js decode() options as of v1.1.1:

| Key              | Type    | Default | Description |
| ---------------- | ------- | ------- | ----------- |
| isAttributeValue | Boolean | false   | If on, entities are decoded as if they were in attribute values. If off (default), entities are decoded as if they were in HTML text. See the he.js documentation for details. |
| strict           | Boolean | false   | If on, entities that can cause parsing errors will throw. See the he.js documentation for details. |

Here is the Optional Options Object in one place (in case you ever want to copy it whole):

{
  isAttributeValue: false,
  strict: false
}

§ More on the algorithm

The biggest pain to code, and the main USP (unique selling point) of this library, is being able to decode recursively and give the result as ranges.

By "recursively" we mean that the input string is decoded over and over until the result of the latest decoding round no longer differs from the previous one. In practice, this lets us tackle the unlikely but possible cases of double- and triple-encoded strings. For example, this is a double-encoded string: &amp;mdash;. The original m-dash was turned into &mdash; in the first encoding round; then, in the second round, its ampersand was turned into &amp;, which led to &amp;mdash;.
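The "decode until nothing changes" idea can be sketched as a small fixed-point loop. This is only an illustration, not this library's actual source: decodeOnce() and its three-entity lookup are hypothetical stand-ins for he.js decode(), and the sketch returns a plain string rather than ranges.

```javascript
// Toy single-pass decoder (an assumption for illustration): it stands in for
// he.js decode() and knows only three named entities.
const TOY_ENTITIES = { "&amp;": "&", "&mdash;": "\u2014", "&lt;": "<" };

function decodeOnce(str) {
  // String.prototype.replace does not rescan its own output,
  // so one call performs exactly one decoding round.
  return str.replace(/&(?:amp|mdash|lt);/g, (entity) => TOY_ENTITIES[entity]);
}

// Repeat single-pass decoding until the output stops changing.
function decodeUntilStable(str, decodeFn = decodeOnce) {
  let previous = str;
  let current = decodeFn(previous);
  while (current !== previous) {
    previous = current;
    current = decodeFn(previous);
  }
  return current;
}

const doubleEncoded = "&amp;mdash;";
const afterOneRound = decodeOnce(doubleEncoded); // "&mdash;"
const fullyDecoded = decodeUntilStable(doubleEncoded); // "\u2014" (an m-dash)
```

The loop always runs one extra round to confirm that nothing changed, which is the price of not knowing in advance how many times the input was encoded.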

By "ranges" we mean that the result is not a decoded string but a set of instructions: what to change in the string in order to decode it. In practice, this means we decode without losing the original character indexes. In turn, we can gather more "instructions" (ranges) from other operations and join them later.
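To make the notation concrete, here is a minimal sketch of how such instructions could be applied to a string. applyRanges() is a hypothetical helper written for this example; it assumes the ranges are sorted by start index and do not overlap. In a real project, the ranges-apply and ranges-merge packages listed under "Related packages" are meant for this job.

```javascript
// Hypothetical helper: applies [from, to, replacement] ranges to a string.
// Assumes ranges are sorted by start index and do not overlap.
function applyRanges(str, ranges) {
  if (ranges === null) return str; // null means "nothing to change"
  let result = "";
  let cursor = 0;
  for (const [from, to, replacement] of ranges) {
    // Copy the untouched text before the range, then the replacement.
    result += str.slice(cursor, from) + (replacement ?? "");
    cursor = to;
  }
  return result + str.slice(cursor);
}

const source = "a &#x26; b &amp; c";
const decodeRanges = [
  [2, 8, "&"],
  [11, 16, "&"],
];
const decoded = applyRanges(source, decodeRanges); // "a & b & c"

// Because indexes still refer to the original string, ranges gathered by
// another (here made-up) step can be joined in before anything is applied.
const extraRanges = [[0, 1, "A"]];
const combined = [...decodeRanges, ...extraRanges].sort((x, y) => x[0] - y[0]);
const decodedAndEdited = applyRanges(source, combined); // "A & b & c"
```

Every range refers to positions in the untouched source string, which is why several independent operations can contribute ranges and the string is rewritten only once.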

§ Where's encode?

If you wonder where encode() is in ranges, we don't need it! When you traverse the string and gather ranges, you can pass each grapheme (so that an emoji six code units long counts as one) through he.js encode(), compare "before" and "after", and, if the two differ, create a new range for it.

decode() is not that simple: an entity spans several characters, so the input string has to be processed as a whole. You can't iterate grapheme-by-grapheme (or character-by-character, if you don't care about Unicode's astral characters).
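The traversal described above might look roughly like the sketch below. It is an assumption-laden illustration: toyEncode() is a hypothetical stand-in for he.js encode() that knows only four characters, encodeToRanges() is a name invented here, and grapheme splitting relies on the built-in Intl.Segmenter (available in recent Node.js versions and browsers).

```javascript
// Toy encoder (an assumption for illustration): stands in for he.js encode().
const TOY_ENCODINGS = new Map([
  ["&", "&amp;"],
  ["<", "&lt;"],
  [">", "&gt;"],
  ["\u00A3", "&pound;"],
]);

function toyEncode(grapheme) {
  return TOY_ENCODINGS.get(grapheme) ?? grapheme;
}

// Walk the string one grapheme at a time and record a range wherever
// encoding changes the grapheme.
function encodeToRanges(str, encodeFn = toyEncode) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const ranges = [];
  for (const { segment, index } of segmenter.segment(str)) {
    const encoded = encodeFn(segment);
    if (encoded !== segment) {
      // index is a UTF-16 code unit offset, matching string indexes
      ranges.push([index, index + segment.length, encoded]);
    }
  }
  return ranges.length ? ranges : null;
}

const encodeRanges = encodeToRanges("a & b < c");
// [[2, 3, "&amp;"], [6, 7, "&lt;"]]
```

Returning null when nothing needs encoding mirrors the "null or array of ranges" output described in the API section.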

§ Licence

MIT

Copyright © 2010–2020 Roy Revelt and other contributors

Related packages:

📦 ranges-apply 3.2.3
Take an array of string index ranges, delete/replace the string according to them
📦 ranges-regex 2.1.3
Integrate regex operations into Ranges workflow
📦 ranges-merge 5.0.3
Merge and sort string index ranges
📦 ranges-is-index-within 1.15.2
Checks if index is within any of the given string index ranges
📦 ranges-iterate 1.1.48
Iterate a string and any changes within given string index ranges
📦 ranges-push 3.7.22
Gather string index ranges
📦 ranges-process-outside 2.2.35
Iterate string considering ranges, as if they were already applied