string-fix-broken-named-entities 4.0.1
§ Quick Take
import { strict as assert } from "assert";
import fixEnt from "string-fix-broken-named-entities";
import applyR from "ranges-apply";
const source = "&nsp;x&nsp;y&nsp;";
// returns Ranges notation, see codsen.com/ranges/
assert.deepEqual(fixEnt(source), [
  [0, 5, "&nbsp;"],
  [6, 11, "&nbsp;"],
  [12, 17, "&nbsp;"],
]);
// render result from ranges using "ranges-apply":
assert.equal(applyR(source, fixEnt(source)), "&nbsp;x&nbsp;y&nbsp;");
§ Purpose
This program detects and fixes broken named HTML entities (such as a malformed &nbsp;). Candidates are matched by Levenshtein distance: for shorter entities we allow a distance of 1, for longer entities a distance of 2.
In practice, this means we can catch errors like &nbp; (a mistyped &nbsp;).
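To make the distance-based matching concrete, here is a minimal sketch. It is not the package's actual implementation: the entity list is a tiny hypothetical subset, and the cut-off between "shorter" and "longer" names (4 characters) is an assumption, since this README does not state it.

```javascript
// Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a, b) {
  // prev[j] holds the distance between a.slice(0, i - 1) and b.slice(0, j)
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical, tiny subset of entity names; the real list is far longer.
const KNOWN_ENTITIES = ["nbsp", "amp", "pound", "hellip", "thinsp"];

// ASSUMPTION: "shorter" means up to 4 characters; the real threshold
// is not documented here.
function guessEntity(brokenName) {
  let best = null;
  for (const name of KNOWN_ENTITIES) {
    const maxDistance = name.length <= 4 ? 1 : 2;
    const distance = levenshtein(brokenName, name);
    const closeEnough = distance > 0 && distance <= maxDistance;
    if (closeEnough && (!best || distance < best.distance)) {
      best = { name, distance };
    }
  }
  return best ? best.name : null;
}

console.log(guessEntity("nbp")); // "nbsp"
```

A name that already matches exactly (distance 0) is skipped on purpose, because only broken entities need a suggestion.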
This program also works as a healthy entities catcher: broken entities are fed to one callback (opts.cb), healthy entities are fed to another callback (opts.entityCatcherCb).
There is a decoding function; the algorithm is aware of numeric HTML entities as well.
§ API - Input
The fixEnt you required/imported is a function and it has two input arguments:
Input argument | Type | Obligatory? | Description |
---|---|---|---|
input | String | yes | String, hopefully HTML code |
opts | Plain object | no | The Optional Options Object, see below for its API |
For example:
const fixEnt = require("string-fix-broken-named-entities");
§ Optional Options Object
An Optional Options Object's key | Type of its value | Default | Description |
---|---|---|---|
decode | Boolean | false | Fixed values are normally put as HTML-encoded. Set to true to get raw characters instead. |
cb | Function | see below | Callback function which gives you granular control of the program's output |
entityCatcherCb | Function | null | If you set a function here, every encountered entity will be passed to it, see a dedicated chapter below |
progressFn | Function | null | Used in web worker setups. You pass a function and it gets called once for each natural number from 0 to 99, meaning the percentage of the work done so far |
Here it is in one place:
{
decode: false,
cb: ({ rangeFrom, rangeTo, rangeValEncoded, rangeValDecoded }) =>
rangeValDecoded || rangeValEncoded
? [rangeFrom, rangeTo, opts.decode ? rangeValDecoded : rangeValEncoded]
: [rangeFrom, rangeTo],
entityCatcherCb: null,
progressFn: null
}
§ API - Output
Output: an array of zero or more arrays (ranges).
What are ranges? Composable string amendment instructions. They are arrays containing "from" and "to" string indexes.
For example, a range [1, 5] means an instruction to delete characters which would otherwise fall into String.slice(1, 5).
For example, a range [2, 6, "foo"] means an instruction to replace characters which would otherwise fall into String.slice(2, 6) with string "foo".
That's all there is — we note pieces of string to be deleted or replaced using character indexes and arrays.
For example, four fixed nbsp's:
[
  [6, 11, "&nbsp;"],
  [11, 18, "&nbsp;"],
  [27, 34, "&nbsp;"],
  [34, 41, "&nbsp;"],
];
The output can be further processed by other range libraries: cropping, sorting, merging can be done straight on a ranges notation.
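To show how ranges turn into a final string, here is a minimal sketch of a ranges applier. It only illustrates the notation and is not the code of ranges-apply; it assumes the ranges are sorted by their "from" index and do not overlap.

```javascript
// Applies ranges to a string.
// ASSUMPTION: ranges are sorted by "from" index and do not overlap.
function applyRanges(str, ranges) {
  let result = "";
  let cursor = 0;
  for (const [from, to, insert] of ranges) {
    result += str.slice(cursor, from); // keep untouched text before the range
    if (insert != null) result += insert; // third element means "replace with"
    cursor = to; // skip the characters the range covers
  }
  return result + str.slice(cursor); // keep whatever follows the last range
}

console.log(applyRanges("abcdefgh", [[1, 5]])); // "afgh"
console.log(applyRanges("abcdefgh", [[2, 6, "foo"]])); // "abfoogh"

const fixed = applyRanges("&nsp;x&nsp;y&nsp;", [
  [0, 5, "&nbsp;"],
  [6, 11, "&nbsp;"],
  [12, 17, "&nbsp;"],
]);
console.log(fixed); // "&nbsp;x&nbsp;y&nbsp;"
```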
§ opts.decode
If you set opts.decode and there are healthy encoded entities, those will not be decoded. Only broken entities will be set in ranges as decoded values. If you want full decoding, consider filtering the input with a dedicated decoding library right after filtering using this library.
For example, you'd first filter the string using this library, then you'd filter the same input skipping already recorded ranges, using ranges-ent-decode. Then you'd merge the ranges.
For example:
const fixEnt = require("string-fix-broken-named-entities");
const result = fixEnt("zz nbsp;zz nbsp;", { decode: true });
console.log(JSON.stringify(result, null, 4));
// => [[3, 8, "\xA0"], [11, 16, "\xA0"]]
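The two-pass workflow described above ends with merging two sets of ranges. A rough sketch of that last step follows; the range values are hypothetical hand-written data, not real output of either library, and the helper assumes the two passes never produce overlapping ranges (a dedicated ranges-merging library would be the safer choice when they might).

```javascript
// Combines two arrays of ranges into one, ordered by start index.
// ASSUMPTION: the two passes never produce overlapping ranges.
function combineRanges(rangesA, rangesB) {
  return [...rangesA, ...rangesB].sort((a, b) => a[0] - b[0] || a[1] - b[1]);
}

// Hypothetical data for the input "&pound;5 nbsp;":
const brokenFixes = [[9, 14, "\xA0"]]; // first pass: broken entity, decoded
const healthyDecoded = [[0, 7, "\xA3"]]; // second pass: healthy &pound;, decoded

const allRanges = combineRanges(brokenFixes, healthyDecoded);
console.log(JSON.stringify(allRanges));
```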
§ opts.cb - a callback function
So, normally, the output of this library is an array of zero or more arrays (each meaning string index ranges), for example:
[
[1, 2],
[3, 4]
]
The above means: delete the string from index 1 to 2 and from 3 to 4.
However, in emlint, for example, we need a slightly different format: not only ranges but also issue titles:
[
{
"name": "tag-generic-error",
"position": [[1, 2]]
},
{
"name": "tag-generic-error",
"position": [[3, 4]]
}
]
The callback function in opts.cb allows you to change the output of this library.
The concept is that you pass a function under the options object's key cb. For each result the application is about to push, it calls your function with the findings, all neatly put in a plain object under various keys. Whatever you return is pushed into the results array.
For example, to solve the example above, we would do:
const fixEnt = require("string-fix-broken-named-entities");
const res = fixEnt("zzznbsp;zzznbsp;", {
  cb: (oodles) => {
    // "oodles" or whatever you name it, is a plain object.
    // Grab any content from any of its keys, for example:
    // {
    //   ruleName: "missing semicolon on &pi; (don't confuse with &piv;)",
    //   entityName: "pi",
    //   rangeFrom: 3,
    //   rangeTo: 4,
    //   rangeValEncoded: "&pi;",
    //   rangeValDecoded: "\u03C0"
    // }
    return {
      name: oodles.ruleName,
      position:
        oodles.rangeValEncoded != null
          ? [oodles.rangeFrom, oodles.rangeTo, oodles.rangeValEncoded]
          : [oodles.rangeFrom, oodles.rangeTo],
    };
  },
});
console.log(JSON.stringify(res, null, 4));
// => [
//   {
//     name: "malformed &nbsp;",
//     position: [3, 8, "&nbsp;"]
//   },
//   {
//     name: "malformed &nbsp;",
//     position: [11, 16, "&nbsp;"]
//   }
// ]
Here's the detailed description of all the keys, values and their types:
name of the key in the object in the first argument of a callback function | example value | value's type | description |
---|---|---|---|
ruleName | missing semicolon on &pi; (don't confuse with &piv;) | string | Full name of the issue, suitable for linters |
entityName | pi | string | Just the name of the entity, without ampersand or semicolon. Case sensitive |
rangeFrom | 3 | (natural) number (string index) | Shows from where to delete |
rangeTo | 8 | (natural) number (string index) | Shows up to where to delete |
rangeValEncoded | &pi; | string or null | Encoded entity or null if fix should just delete that index range and there's nothing to insert |
rangeValDecoded | \u03C0 | string or null | Decoded entity or null if fix should just delete that index range and there's nothing to insert |
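Because the callback is a plain function, it can be written and checked on its own before being passed as opts.cb. The sketch below uses a hypothetical helper name (toLintIssue) and a hand-written findings object shaped like the one described above; it wraps the range in an outer array to match the emlint-style "position" shown earlier.

```javascript
// A callback in the shape opts.cb expects: it receives the findings object
// and returns whatever should be pushed into the results array.
// "toLintIssue" is a hypothetical name used for illustration only.
function toLintIssue({ ruleName, rangeFrom, rangeTo, rangeValEncoded }) {
  return {
    name: ruleName,
    position: [
      rangeValEncoded != null
        ? [rangeFrom, rangeTo, rangeValEncoded] // replace the range
        : [rangeFrom, rangeTo], // nothing to insert: just delete the range
    ],
  };
}

// Hand-written sample findings object (not real library output):
const sample = {
  ruleName: "missing semicolon on &pi; (don't confuse with &piv;)",
  entityName: "pi",
  rangeFrom: 3,
  rangeTo: 8,
  rangeValEncoded: "&pi;",
  rangeValDecoded: "\u03C0",
};

const issue = toLintIssue(sample);
console.log(JSON.stringify(issue.position)); // [[3,8,"&pi;"]]
```

It would then be passed as `{ cb: toLintIssue }` in the options object.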
§ opts.decode in relation to opts.cb
It might seem that opts.decode does not matter when a callback is used (because we serve both encoded and decoded values in a callback), but it does.
For example, consider this case, where we have non-breaking spaces without semicolons:
&nbsp,&nbsp,&nbsp
Since we give the user an option to choose between raw and encoded values, the result can come in two ways.
When decoded entities are requested, we replace ranges [0, 5], [6, 11] and [12, 17]:
// ranges:
[
[0, 5, "\xA0"],
[6, 11, "\xA0"],
[12, 17, "\xA0"],
];
But when encoded entities are requested, it's just a matter of sticking in the missing semicolon at indexes 5, 11 and 17:
// ranges:
[
[5, 5, ";"],
[11, 11, ";"],
[17, 17, ";"],
];
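To see what each set of ranges does to the input, here is a small sketch that applies both to the string `&nbsp,&nbsp,&nbsp`. The helper is an illustrative stand-in for a ranges-applying library, assuming sorted, non-overlapping ranges.

```javascript
// Minimal ranges applier.
// ASSUMPTION: ranges are sorted and non-overlapping.
// Working from the last range backwards keeps earlier indexes valid.
function applyRangesFromEnd(str, ranges) {
  return [...ranges].reverse().reduce((acc, [from, to, insert]) => {
    const replacement = insert == null ? "" : insert;
    return acc.slice(0, from) + replacement + acc.slice(to);
  }, str);
}

const input = "&nbsp,&nbsp,&nbsp";

// opts.decode on: each broken entity is replaced with the raw character
const decodedResult = applyRangesFromEnd(input, [
  [0, 5, "\xA0"],
  [6, 11, "\xA0"],
  [12, 17, "\xA0"],
]);

// opts.decode off: zero-length ranges only insert the missing semicolons
const encodedResult = applyRangesFromEnd(input, [
  [5, 5, ";"],
  [11, 11, ";"],
  [17, 17, ";"],
]);

console.log(encodedResult); // "&nbsp;,&nbsp;,&nbsp;"
```

A range whose "from" and "to" are equal deletes nothing, so it acts as a pure insertion at that index.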
§ opts.entityCatcherCb
Broken entities are pinged to the opts.cb callback; all healthy entities are pinged to opts.entityCatcherCb. It's either one or the other:
const inp1 = "y &nbsp; z &nsp;";
const gatheredEntityRanges = [];
fixEnt(inp1, {
entityCatcherCb: (from, to) => gatheredEntityRanges.push([from, to]),
});
console.log(
`${`\u001b[${33}m${`gatheredEntityRanges`}\u001b[${39}m`} = ${JSON.stringify(
gatheredEntityRanges,
null,
4
)}`
);
// => [[2, 8]]
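The "from" and "to" arguments are ordinary string indexes, so the gathered ranges can be fed straight to String.slice to recover the entities themselves. A short sketch, with the gathered ranges written out by hand rather than produced by the library:

```javascript
const entityInput = "y &nbsp; z &nsp;";

// Hand-written stand-in for what entityCatcherCb would have collected:
const gathered = [[2, 8]];

// Each [from, to] pair maps directly onto String.slice(from, to)
const healthyEntities = gathered.map(([from, to]) =>
  entityInput.slice(from, to)
);

console.log(healthyEntities); // [ "&nbsp;" ]
```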
§ opts.progressFn
In web worker setups, a worker can return "in progress" values. When we put this package into a web worker, the callback function under opts.progressFn will be called with a natural number showing the percentage of the work done so far.
It's hard to show minimal worker application here but at least here's how the pinging progress works from the side of this npm package:
// let's define a variable on a higher scope:
let count = 0;
// call application as normal, pass opts.progressFn:
const result = fixEnt(
"text &ang text&ang text text &ang text&ang text text &ang text&ang text",
{
progressFn: (percentageDone) => {
// console.log(`percentageDone = ${percentageDone}`);
console.assert(typeof percentageDone === "number");
count++;
},
}
);
// each time percentage is reported, "count" is incremented
// now imagine if instead of incrementing the count, we pinged the
// value out of the worker
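As a rough sketch of the "ping the value out of the worker" idea, the helper below builds a progressFn that forwards each new percentage to a sender function. The helper name, the message shape `{ type, percentage }` and the de-duplication are all assumptions made for illustration; inside a real worker the sender would typically be `(msg) => self.postMessage(msg)`.

```javascript
// Builds a function suitable for opts.progressFn.
// ASSUMPTION: "post" stands in for the worker's postMessage, and the
// message shape { type, percentage } is hypothetical.
function makeProgressFn(post) {
  let last = -1;
  return (percentageDone) => {
    // forward only when the value actually changes
    if (percentageDone !== last) {
      last = percentageDone;
      post({ type: "progress", percentage: percentageDone });
    }
  };
}

// Usage outside a worker: collect the messages in an array instead
const messages = [];
const progressFn = makeProgressFn((msg) => messages.push(msg));

// Simulate the calls the package would make while working:
[0, 1, 1, 2].forEach((n) => progressFn(n));

console.log(messages.length); // 3
```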
§ Changelog
See it in the monorepo, on Sourcehut.
§ Licence
Copyright © 2010–2020 Roy Revelt and other contributors