Purpose
This program detects and fixes broken named HTML entities (like &nbsp;). Candidates are matched against known entity names by Levenshtein distance: shorter entities must match within distance 1, while longer entities are allowed distance 2.
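The distance-based matching described above can be sketched as follows. This is only an illustration, not the package's actual source: the entity list, the length cutoff `LONG_ENTITY_LENGTH` and the helper name `guessEntity` are all assumptions made up for this example.

```javascript
// Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a, b) {
  // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// ASSUMPTION: a tiny entity list and a cutoff of 5 characters, for illustration only.
const KNOWN_ENTITIES = ["nbsp", "amp", "rarr", "hellip"];
const LONG_ENTITY_LENGTH = 5;

// Returns the closest known entity name within the allowed distance, or null.
function guessEntity(candidate) {
  let best = null;
  for (const name of KNOWN_ENTITIES) {
    const allowed = name.length >= LONG_ENTITY_LENGTH ? 2 : 1;
    const dist = levenshtein(candidate, name);
    if (dist <= allowed && (best === null || dist < best.dist)) {
      best = { name, dist };
    }
  }
  return best ? best.name : null;
}

console.log(guessEntity("nbp")); // => "nbsp" (distance 1)
console.log(guessEntity("hlip")); // => "hellip" (distance 2, allowed for longer names)
console.log(guessEntity("xyz")); // => null
```

A stricter threshold for short names keeps false positives down, since a short word is within distance 2 of many unrelated entity names.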
In practice, this means we can catch errors like &nbp; (a mistyped &nbsp;).
This program also works as a catcher of healthy entities: broken entities are fed to one callback (opts.cb), healthy entities are fed to another callback (opts.entityCatcherCb).
There is a decoding function; the algorithm is aware of numeric HTML entities as well.
API — fixEnt()
The main function fixEnt() is imported like this:
import { fixEnt } from "string-fix-broken-named-entities";
It’s a function which takes two input arguments:
Input argument | Type | Obligatory | Description |
---|---|---|---|
input | String | yes | String, hopefully HTML code |
opts | Plain object | no | Optional Options Object. |
The Optional Options Object has the following shape:
Key | Type | Default | Description |
---|---|---|---|
decode | Boolean | false | Fixed values are normally put as HTML-encoded. Set to true to get raw characters instead. |
cb | Function or null | see below | Callback function which gives you granular control of the program’s output |
entityCatcherCb | Function or null | null | If you set a function here, every encountered healthy entity will be passed to it, see the dedicated chapter below |
textAmpersandCatcherCb | Function or null | null | Each raw text ampersand’s index will be pinged to this function. See more below. |
progressFn | Function or null | null | Used in web worker setups. You pass a function and it gets called once for each natural number from 0 to 99, each meaning the percentage of the work done so far. See more below. |
The function returns ranges (Ranges type below) describing the proposed fixes: either null (theoretically possible) or an array of one or more range arrays.
For example, four fixed nbsp’s:
[
  [6, 11, "&nbsp;"],
  [11, 18, "&nbsp;"],
  [27, 34, "&nbsp;"],
  [34, 41, "&nbsp;"],
];
The output can be further processed by other range libraries: cropping, sorting, merging can be done on range arrays, instead of mutating the input string each time.
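To make the range format concrete, here is a minimal sketch of applying ranges to a string. The helper name `applyRanges` is hypothetical (it is not part of this package, and dedicated range libraries exist for this job); it assumes the ranges are sorted and do not overlap. The sample ranges are the decoded output shown in the opts.decode example of this document.

```javascript
// Apply [from, to, replacement?] ranges to a string.
// ASSUMPTION: ranges are sorted by "from" and do not overlap.
function applyRanges(str, ranges) {
  if (!ranges) return str; // null means there is nothing to fix
  let result = "";
  let cursor = 0;
  for (const [from, to, insert] of ranges) {
    result += str.slice(cursor, from); // keep the untouched text before the range
    if (insert != null) result += insert; // a range without a third element only deletes
    cursor = to;
  }
  return result + str.slice(cursor); // keep the tail after the last range
}

const input = "zz nbsp;zz nbsp;";
const ranges = [
  [3, 8, "\xA0"],
  [11, 16, "\xA0"],
];
console.log(JSON.stringify(applyRanges(input, ranges)));
```

Because the input string is never mutated until the very end, several sets of ranges can be gathered, merged and applied in a single pass.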
opts.decode
If you set opts.decode and the input contains healthy encoded entities, those will not be decoded. Only broken entities are reported in ranges with decoded values. If you want full decoding, consider running the input through a dedicated decoding library right after this one.
For example, you’d first process the string using this library, then process the same input using ranges-ent-decode, skipping the already recorded ranges. Then you’d merge the two sets of ranges.
A minimal example of opts.decode on its own:
import { fixEnt } from "string-fix-broken-named-entities";
const result = fixEnt("zz nbsp;zz nbsp;", { decode: true });
console.log(JSON.stringify(result, null, 4));
// => [[3, 8, "\xA0"], [11, 16, "\xA0"]]
opts.cb
So, normally, the output of this library is an array of zero or more arrays (each meaning string index ranges), for example:
[
[1, 2],
[3, 4]
]
The above means: delete the string from index 1 to 2 and from 3 to 4.
However, in emlint, for example, we need a slightly different format: not only ranges but also issue titles:
[
  {
    "name": "tag-generic-error",
    "position": [[1, 2]]
  },
  {
    "name": "tag-generic-error",
    "position": [[3, 4]]
  }
]
The callback function passed via opts.cb allows you to change the output of this library.
The concept is this: you pass a function under the options object’s key cb. For each result the program is about to push, it calls your function with a plain object which holds all the “ingredients” under various keys. Whatever you return is pushed into the results array instead.
For example, to produce the format shown above, we would do:
import { fixEnt } from "string-fix-broken-named-entities";
const res = fixEnt("zzznbsp;zzznbsp;", {
  cb: (oodles) => {
    // "oodles" or whatever you name it, is a plain object.
    // Grab any content from any of its keys, for example:
    // {
    //   ruleName: "missing semicolon on &pi; (don't confuse with &piv;)",
    //   entityName: "pi",
    //   rangeFrom: 3,
    //   rangeTo: 4,
    //   rangeValEncoded: "&pi;",
    //   rangeValDecoded: "\u03C0"
    // }
    return {
      name: oodles.ruleName,
      // include the replacement value only when there is something to insert
      position:
        oodles.rangeValEncoded != null
          ? [oodles.rangeFrom, oodles.rangeTo, oodles.rangeValEncoded]
          : [oodles.rangeFrom, oodles.rangeTo],
    };
  },
});
console.log(JSON.stringify(res, null, 4));
// => [
//   {
//     name: "malformed &nbsp;",
//     position: [3, 8, "&nbsp;"]
//   },
//   {
//     name: "malformed &nbsp;",
//     position: [11, 16, "&nbsp;"]
//   }
// ]
Here’s the detailed description of all the keys, values and their types:
Key | Example value | Type | Description |
---|---|---|---|
ruleName | missing semicolon on &pi; (don't confuse with &piv;) | string | Full name of the issue, suitable for linters |
entityName | pi | string | Just the name of the entity, without ampersand or semicolon. Case sensitive |
rangeFrom | 3 | (natural) number (string index) | Shows from where to delete |
rangeTo | 8 | (natural) number (string index) | Shows up to where to delete |
rangeValEncoded | &pi; | string or null | Encoded entity or null if fix should just delete that index range and there’s nothing to insert |
rangeValDecoded | \u03C0 | string or null | Decoded entity or null if fix should just delete that index range and there’s nothing to insert |
opts.decode in relation to opts.cb
It might seem that opts.decode does not matter when a callback is used (because the callback receives both encoded and decoded values), but it does.
For example, consider this case, where we have non-breaking spaces without semicolons:
&nbsp,&nbsp,&nbsp
Since we give the user the option to choose between raw and encoded values, the result can come in two forms.
When decoded entities are requested, we replace ranges [0, 5], [6, 11] and [12, 17]:
// ranges:
[
[0, 5, "\xA0"],
[6, 11, "\xA0"],
[12, 17, "\xA0"],
];
But when encoded entities are requested, it’s just a matter of inserting the missing semicolon at indexes 5, 11 and 17:
// ranges:
[
[5, 5, ";"],
[11, 11, ";"],
[17, 17, ";"],
];
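Both sets of ranges can be checked by hand. The sketch below assumes the input is the 17-character string &nbsp,&nbsp,&nbsp (three non-breaking spaces without semicolons) and uses a small hypothetical helper, `applyRanges`, which is not part of this package; it applies the ranges from last to first so that earlier indexes stay valid.

```javascript
// Apply ranges starting from the end of the string, so that
// replacements never shift the indexes of ranges still to be applied.
function applyRanges(str, ranges) {
  return [...ranges]
    .sort((a, b) => b[0] - a[0])
    .reduce(
      (acc, [from, to, insert]) =>
        acc.slice(0, from) + (insert ?? "") + acc.slice(to),
      str
    );
}

const source = "&nbsp,&nbsp,&nbsp";

// opts.decode on: whole broken entities are replaced with raw characters
const decodedRanges = [
  [0, 5, "\xA0"],
  [6, 11, "\xA0"],
  [12, 17, "\xA0"],
];

// opts.decode off: only the missing semicolons are inserted
const encodedRanges = [
  [5, 5, ";"],
  [11, 11, ";"],
  [17, 17, ";"],
];

console.log(JSON.stringify(applyRanges(source, decodedRanges)));
console.log(applyRanges(source, encodedRanges)); // => "&nbsp;,&nbsp;,&nbsp;"
```

A range whose start and end are equal, such as [5, 5, ";"], deletes nothing and acts as a pure insertion at that index.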
opts.entityCatcherCb
Broken entities are pinged to the opts.cb() callback, while all healthy entities are pinged to opts.entityCatcherCb. Each entity goes to one or the other:
import { fixEnt } from "string-fix-broken-named-entities";
// one healthy entity (&nbsp;) and one broken entity (&nsp;)
const inp1 = "y &nbsp; z &nsp;";
const gatheredEntityRanges = [];
fixEnt(inp1, {
  entityCatcherCb: (from, to) => gatheredEntityRanges.push([from, to]),
});
console.log(
  `gatheredEntityRanges = ${JSON.stringify(gatheredEntityRanges, null, 4)}`
);
// => [[2, 8]]
opts.textAmpersandCatcherCb
Sometimes an input string contains both ampersands-as-text and ampersands that are part of entities.
For example, consider the string abc&&nbsp;&xyz. What do you see here? There’s one named HTML entity, &nbsp;, surrounded by two raw text ampersands. The opts.textAmpersandCatcherCb callback can be used to catch these raw ampersands, typically in order to HTML-encode them. Here, this option would call your function twice, with the numbers 3 and 10, the positions of the raw ampersands.
See the supplied example where broken entity is fixed and raw text ampersands are encoded, all within the same string, all done using this program.
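As a rough sketch of what could be done with the reported positions, the snippet below turns each raw ampersand index into a one-character range and encodes it. It assumes the input is abc&&nbsp;&xyz and hard-codes the indexes 3 and 10 in place of a real call to the library; the variable names are made up for illustration.

```javascript
const input = "abc&&nbsp;&xyz";

// Positions of raw text ampersands, as the callback would report them.
// ASSUMPTION: hard-coded here instead of being gathered via the callback.
const rawAmpersandIndexes = [3, 10];

// Each index becomes a one-character range that is replaced with "&amp;".
const ranges = rawAmpersandIndexes.map((idx) => [idx, idx + 1, "&amp;"]);

// Apply from the last range to the first so that earlier indexes stay valid.
let output = input;
for (const [from, to, insert] of [...ranges].reverse()) {
  output = output.slice(0, from) + insert + output.slice(to);
}

console.log(output); // => "abc&amp;&nbsp;&amp;xyz"
```

Expressing the fix as ranges means it can be merged with the ranges returned for broken entities before anything is applied to the string.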
opts.progressFn
In web worker setups, a worker can return “in progress” values. When we put this package into a web worker, the callback function under opts.progressFn will be called with a string containing a natural number, showing the percentage of the work done so far.
It’s hard to show a minimal worker application here, but at least here’s how the progress pinging works from the side of this npm package:
// let's define a variable on a higher scope:
let count = 0;
// call application as normal, pass opts.progressFn:
const result = fixEnt(
  "text &ang text&ang text text &ang text&ang text text &ang text&ang text",
  {
    progressFn: (percentageDone) => {
      // console.log(`percentageDone = ${percentageDone}`);
      count++;
    },
  }
);
// each time percentage is reported, "count" is incremented
// now imagine if instead of incrementing the count, we pinged the
// value out of the worker
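To illustrate the "once per percentage" behaviour, here is a sketch of how a long-running loop might report progress. The function `scanWithProgress` is a hypothetical stand-in, not this package's code, and it passes the percentage as a number for simplicity; the real package may format the value differently.

```javascript
// Hypothetical stand-in for a long-running scan over a string.
function scanWithProgress(str, progressFn) {
  let lastReported = -1;
  for (let i = 0; i < str.length; i++) {
    // ... the per-character work would happen here ...

    // integer arithmetic avoids floating-point rounding surprises
    const percentage = Math.floor((i * 100) / str.length);
    if (progressFn && percentage !== lastReported) {
      lastReported = percentage; // report each whole percentage only once
      progressFn(percentage);
    }
  }
}

const seen = [];
scanWithProgress("x".repeat(1000), (percentageDone) => seen.push(percentageDone));
console.log(seen.length); // => 100
console.log(seen[0], seen[seen.length - 1]); // => 0 99
```

Throttling the reports to whole percentages keeps the number of messages sent out of a worker bounded, no matter how long the input is.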
API — version
You can import version: