§ Quick Take
import { strict as assert } from "assert";
import decode from "ranges-ent-decode";
// see codsen.com/ranges/
assert.deepEqual(decode("a &#x26; b &amp; c"), [
  [2, 8, "&"], // <--- that's Ranges notation: replace indexes 2 to 8 with "&"
  [11, 16, "&"],
]);
§ Purpose
This is a wrapper on top of the widely used HTML entity decoder, he.js decode(). Instead of a decoded string, it returns ranges.
The code is tested thoroughly, both directly and through the packages that depend on it. As a cherry on top, all he.js unit tests were ported to node-tap, and they pass.
§ API
decode(str, [opts])
In other words, it's a function which takes a string and an optional options object.
§ API - Input
Input argument | Key value's type | Obligatory? | Description |
---|---|---|---|
input | String | yes | Text in which you want to decode HTML entities |
opts | Plain object | no | The Optional Options Object, see below for its API |
If any input argument is of any other type, an error will be thrown.
§ API - Output
Returns ranges: either null or an array of one or more range arrays.
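To make the output format concrete, here is a minimal sketch of how a caller might consume it. The helper applyRanges() is hypothetical and not part of this package, and the input string is an assumed encoded form chosen to match the two ranges from the Quick Take example. It treats null as "nothing to change" and applies the ranges from the end of the string backwards, so that earlier indexes stay valid.

```javascript
// Hypothetical helper, not part of ranges-ent-decode:
// applies [from, to, replacement] ranges to a string.
function applyRanges(str, ranges) {
  // null means there is nothing to change
  if (ranges === null) return str;
  // Sort a copy by start index, then work backwards so that
  // replacing a later span never shifts the indexes of an earlier one.
  const sorted = [...ranges].sort((a, b) => a[0] - b[0]);
  let result = str;
  for (let i = sorted.length - 1; i >= 0; i--) {
    const [from, to, replacement = ""] = sorted[i];
    result = result.slice(0, from) + replacement + result.slice(to);
  }
  return result;
}

// Assumed input: a six-character and a five-character entity,
// matching the two ranges shown in the Quick Take section.
const source = "a &#x26; b &amp; c";
const ranges = [
  [2, 8, "&"],
  [11, 16, "&"],
];

console.log(applyRanges(source, ranges)); // "a & b & c"
console.log(applyRanges("no entities here", null)); // "no entities here"
```

Because the original indexes are preserved, ranges from several sources can be collected into one array before anything is applied.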
§ Optional Options Object
The Optional Options Object completely matches the he.js options as of v.1.1.1:
An Optional Options Object's key | Type of its value | Default | Description |
---|---|---|---|
isAttributeValue | Boolean | false | If on, entities will be decoded as if they were in attribute values. If off (default), entities will be decoded as if they were in HTML text. See the he.js documentation for details. |
strict | Boolean | false | If on, entities that can cause parsing errors will cause an error to be thrown. See the he.js documentation for details. |
Here is the Optional Options Object in one place (in case you ever want to copy it whole):
{
isAttributeValue: false,
strict: false
}
§ More on the algorithm
The biggest pain to code and the main USP of this library is being able to recursively decode and give the result as ranges.
By recursively, we mean that the input string is decoded over and over until a decoding pass no longer changes the result. Practically, this means we can tackle the unlikely but possible cases of double- and triple-encoded strings. For example, this is a double-encoded string: &amp;mdash;. The original m-dash was turned into &mdash; on the first encoding round; then, during the second round, its ampersand was turned into &amp;, which led to &amp;mdash;.
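The "decode until nothing changes" idea can be sketched as a fixed-point loop. This is only an illustration under stated assumptions: decodeOnce() is a tiny stand-in that knows three named entities, not he.js, and the sketch returns a string rather than ranges, so it leaves out the index bookkeeping that this library does.

```javascript
// Tiny stand-in decoder covering three named entities only.
// It is NOT he.js; it exists purely to keep the sketch self-contained.
const ENTITIES = { "&amp;": "&", "&mdash;": "\u2014", "&lt;": "<" };

function decodeOnce(str) {
  return str.replace(/&(?:amp|mdash|lt);/g, (match) => ENTITIES[match]);
}

// Decode repeatedly until a pass changes nothing (a fixed point).
function decodeRecursively(str) {
  let previous = str;
  let current = decodeOnce(previous);
  while (current !== previous) {
    previous = current;
    current = decodeOnce(previous);
  }
  return current;
}

// One pass only peels off the outer layer of encoding
console.log(decodeOnce("&amp;mdash;")); // "&mdash;"
// Repeated passes reach the original character (U+2014, an m-dash)
console.log(decodeRecursively("&amp;mdash;") === "\u2014"); // true
console.log(decodeRecursively("&amp;amp;mdash;") === "\u2014"); // true
```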
By ranges we mean, the result is not a decoded string, but instructions — what to change in that string in order for the string to be decoded. Practically, this means, we decode and don't lose the original character indexes. In turn, this means, we can gather more "instructions" (ranges) and join them later.
§ Where's encode?
If you wonder where encode() is in ranges: we don't need it. When you traverse the string and gather ranges, you can pass each grapheme (where an emoji of length six should be counted as one) through he.js encode, compare "before" and "after", and if the two differ, create a new range for it.
decode() is not that simple, because the input string has to be processed as a whole: you can't iterate grapheme by grapheme (or character by character, if you don't care about Unicode's astral characters).
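The encode-by-comparison approach described above can be sketched as follows. Everything here is an assumption made for illustration: encodeGrapheme() is a hypothetical stand-in for an encoder such as he.js encode and knows only four characters, and graphemes are split with the built-in Intl.Segmenter, which needs a reasonably recent Node.js or browser.

```javascript
// Hypothetical stand-in for a real encoder; it only knows four characters.
const NAMED = { "&": "&amp;", "<": "&lt;", ">": "&gt;", "\u00A9": "&copy;" };

function encodeGrapheme(grapheme) {
  return NAMED[grapheme] ?? grapheme;
}

// Walk the string grapheme by grapheme and record a range
// wherever the encoded form differs from the original.
function encodeToRanges(str) {
  const ranges = [];
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  for (const { segment, index } of segmenter.segment(str)) {
    const encoded = encodeGrapheme(segment);
    if (encoded !== segment) {
      // index and segment.length are in UTF-16 code units,
      // the same units that String.prototype.slice uses
      ranges.push([index, index + segment.length, encoded]);
    }
  }
  // Follow the same convention as the output above: null when there is nothing to do
  return ranges.length ? ranges : null;
}

console.log(encodeToRanges("a & b < c"));
// [[2, 3, "&amp;"], [6, 7, "&lt;"]]
console.log(encodeToRanges("plain text")); // null
```

Each grapheme can be judged in isolation here, which is exactly what decoding does not allow: an entity spans several characters, so a decoder has to look at the string as a whole.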
§ Changelog
See it in the monorepo, on Sourcehut.
§ Licence
Copyright © 2010–2020 Roy Revelt and other contributors