string-fix-broken-named-entities 6.0.1

Finds and fixes common and not-so-common broken named HTML entities; returns the fixes as an array of ranges

Quick Take

import { strict as assert } from "assert";
import { fixEnt } from "string-fix-broken-named-entities";
import { rApply } from "ranges-apply";

const source = "&nsp;x&nsp;y&nsp;";

// returns Ranges notation, see codsen.com/ranges/
assert.deepEqual(fixEnt(source), [
  [0, 5, "&nbsp;"],
  [6, 11, "&nbsp;"],
  [12, 17, "&nbsp;"],
]);

// render result from ranges using "ranges-apply":
assert.equal(
  rApply(source, fixEnt(source)),
  "&nbsp;x&nbsp;y&nbsp;"
);

Examples

Purpose

This program detects and fixes broken named HTML entities (like &nbsp;). Matching is based on Levenshtein distance: shorter entities are matched within a distance of 1, longer entities within a distance of 2.

In practice, this means we can catch errors like &nbp; (a mistyped &nbsp;).
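The distance-based matching described above can be sketched as follows. This is only an illustration, not the package's source: the functions `levenshtein` and `closestEntity`, the four-item entity list, and the cutoff of four characters between "shorter" and "longer" names are all assumptions made for this example.

```javascript
// Hypothetical, self-contained sketch; not the package's actual implementation.

// Classic dynamic-programming Levenshtein distance between two strings.
function levenshtein(a, b) {
  // prev[j] = distance between the first i-1 chars of "a" and the first j chars of "b"
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// A tiny, assumed subset of named entities (the real list is far longer).
const knownEntities = ["nbsp", "amp", "pound", "thetasym"];

// Returns the closest known entity name, or null if nothing is close enough.
// Assumption: names of up to 4 characters tolerate distance 1, longer ones distance 2.
function closestEntity(candidate) {
  let best = null;
  let bestDistance = Infinity;
  for (const name of knownEntities) {
    const allowed = name.length <= 4 ? 1 : 2;
    const distance = levenshtein(candidate, name);
    if (distance <= allowed && distance < bestDistance) {
      best = name;
      bestDistance = distance;
    }
  }
  return best;
}

console.log(closestEntity("nbp")); // "nbsp"
console.log(closestEntity("thtasm")); // "thetasym"
console.log(closestEntity("xyz")); // null
```

A tighter threshold for short names keeps unrelated short words from being "corrected" into entities, which is presumably why the distance allowance grows with entity length.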

This program also works as a catcher of healthy entities: broken entities are fed to one callback (opts.cb), healthy entities to another (opts.entityCatcherCb).

There is a decoding function; the algorithm is aware of numeric HTML entities as well.

API - Input

fixEnt(str, [opts])

The imported fixEnt is a function which takes two input arguments:

| Input argument | Type | Obligatory? | Description |
| --- | --- | --- | --- |
| input | String | yes | String, hopefully HTML code |
| opts | Plain object | no | The Optional Options Object, see below for its API |

Optional Options Object

| An Optional Options Object's key | Type of its value | Default | Description |
| --- | --- | --- | --- |
| decode | Boolean | false | Fixed values are normally put as HTML-encoded. Set to true to get raw characters instead. |
| cb | Function or null | see below | Callback function which gives you granular control of the program's output |
| entityCatcherCb | Function or null | null | If you set a function here, every encountered entity will be passed to it, see a dedicated chapter below |
| textAmpersandCatcherCb | Function or null | null | Each raw text ampersand's index will be pinged to this function. See more below. |
| progressFn | Function or null | null | Used in web worker setups. You pass a function and it gets called once for each natural number 0 to 99, meaning a percentage of the work done so far. See more below. |

Here it is in one place:

{
  decode: false,
  cb: ({ rangeFrom, rangeTo, rangeValEncoded, rangeValDecoded }) =>
    rangeValDecoded || rangeValEncoded
      ? [rangeFrom, rangeTo, opts.decode ? rangeValDecoded : rangeValEncoded]
      : [rangeFrom, rangeTo],
  entityCatcherCb: null,
  textAmpersandCatcherCb: null,
  progressFn: null
}
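To see what the default callback shown above produces, here is a standalone sketch which re-creates it outside of the package and feeds it two hand-written payloads. The `makeDefaultCb` wrapper and both payload objects are hypothetical; in real use, the program builds the payloads itself.

```javascript
// Hypothetical wrapper: mirrors the default "cb" for a given opts.decode value.
const makeDefaultCb =
  (opts) =>
  ({ rangeFrom, rangeTo, rangeValEncoded, rangeValDecoded }) =>
    rangeValDecoded || rangeValEncoded
      ? [rangeFrom, rangeTo, opts.decode ? rangeValDecoded : rangeValEncoded]
      : [rangeFrom, rangeTo];

// Made-up payload: a broken entity at indexes 0 to 5, to be replaced
// with a non-breaking space.
const replacePayload = {
  rangeFrom: 0,
  rangeTo: 5,
  rangeValEncoded: "&nbsp;",
  rangeValDecoded: "\xA0",
};

// Made-up payload: nothing to insert, so the result is a plain deletion range.
const deletePayload = {
  rangeFrom: 7,
  rangeTo: 9,
  rangeValEncoded: null,
  rangeValDecoded: null,
};

console.log(makeDefaultCb({ decode: false })(replacePayload)); // [0, 5, "&nbsp;"]
console.log(makeDefaultCb({ decode: true })(replacePayload)); // [0, 5, "\xA0"]
console.log(makeDefaultCb({ decode: false })(deletePayload)); // [7, 9]
```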

API - Output

Output: an array of zero or more arrays (ranges).

What are ranges? Composable string amendment instructions. They are arrays containing "from" and "to" string indexes.

For example, a range [1, 5] means an instruction to delete characters which would otherwise fall into String.slice(1, 5).

For example, a range [2, 6, "foo"] means an instruction to replace characters which would otherwise fall into String.slice(2, 6) with string "foo".

That's all there is — we note pieces of string to be deleted or replaced using character indexes and arrays.
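The two range shapes described above can be tried out with a small stand-in for a ranges-applying utility. The `applyRanges` function below is a simplified, hypothetical sketch written for this example (it assumes the ranges are sorted and do not overlap); for real work, use ranges-apply.

```javascript
// Simplified stand-in for what a ranges-applying utility does.
// Assumes ranges are sorted by their "from" index and do not overlap.
function applyRanges(str, ranges) {
  let result = "";
  let cursor = 0;
  for (const [from, to, replacement] of ranges) {
    result += str.slice(cursor, from); // keep the untouched text before the range
    result += replacement ?? ""; // a two-element range is a plain deletion
    cursor = to;
  }
  return result + str.slice(cursor); // keep whatever follows the last range
}

console.log(applyRanges("abcdefgh", [[1, 5]])); // "afgh"
console.log(applyRanges("abcdefgh", [[2, 6, "foo"]])); // "abfoogh"
```

Because ranges are plain data, several tools can each contribute ranges for the same string, and the string is rebuilt only once at the end.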

For example, four fixed nbsp's:

[
  [6, 11, "&nbsp;"],
  [11, 18, "&nbsp;"],
  [27, 34, "&nbsp;"],
  [34, 41, "&nbsp;"],
];

The output can be further processed by other range libraries: cropping, sorting, merging can be done straight on a ranges notation.

opts.decode

If you set opts.decode, healthy encoded entities are left as they are; only the fixes for broken entities are written into the ranges as decoded values. If you want full decoding, consider also processing the input with a dedicated decoding library, right after processing it with this library.

For example, you'd first process the string using this library, then process the same input with ranges-ent-decode, skipping the already recorded ranges. Then you'd merge the two sets of ranges.

For example:

import { fixEnt } from "string-fix-broken-named-entities";

const result = fixEnt("zz nbsp;zz nbsp;", { decode: true });
console.log(JSON.stringify(result, null, 4));
// => [[3, 8, "\xA0"], [11, 16, "\xA0"]]

opts.cb - a callback function

So, normally, the output of this library is an array of zero or more arrays (each meaning string index ranges), for example:

[
[1, 2],
[3, 4]
]

The above means: delete the characters from index 1 to 2 and from 3 to 4.

However, in emlint, for example, we need a slightly different format: not only ranges but also issue titles:

[
  {
    "name": "tag-generic-error",
    "position": [[1, 2]]
  },
  {
    "name": "tag-generic-error",
    "position": [[3, 4]]
  }
]

Callback function via opts.cb allows you to change the output of this library.

The concept is this: you pass a function under the options object's key cb. Each time the program is about to push a result, it calls your function instead, passing a plain object with all the "ingredients" of the finding under various keys. Whatever you return is pushed into the results array.

For example, to solve the example above, we would do:

import { fixEnt } from "string-fix-broken-named-entities";

const res = fixEnt("zzznbsp;zzznbsp;", {
  cb: (oodles) => {
    // "oodles", or whatever you name it, is a plain object.
    // Grab any content from any of its keys, for example:
    // {
    //   ruleName: "missing semicolon on &pi; (don't confuse with &piv;)",
    //   entityName: "pi",
    //   rangeFrom: 3,
    //   rangeTo: 4,
    //   rangeValEncoded: "&pi;",
    //   rangeValDecoded: "\u03C0"
    // }
    return {
      name: oodles.ruleName,
      position:
        oodles.rangeValEncoded != null
          ? [oodles.rangeFrom, oodles.rangeTo, oodles.rangeValEncoded]
          : [oodles.rangeFrom, oodles.rangeTo],
    };
  },
});
console.log(JSON.stringify(res, null, 4));
// => [
//   {
//     name: "malformed &nbsp;",
//     position: [3, 8, "&nbsp;"]
//   },
//   {
//     name: "malformed &nbsp;",
//     position: [11, 16, "&nbsp;"]
//   }
// ]

Here's the detailed description of all the keys, values and their types:

| Key in the object passed as the callback's first argument | Example value | Value's type | Description |
| --- | --- | --- | --- |
| ruleName | missing semicolon on &pi; (don't confuse with &piv;) | string | Full name of the issue, suitable for linters |
| entityName | pi | string | Just the name of the entity, without ampersand or semicolon. Case sensitive |
| rangeFrom | 3 | (natural) number (string index) | Shows from where to delete |
| rangeTo | 8 | (natural) number (string index) | Shows up to where to delete |
| rangeValEncoded | &pi; | string or null | Encoded entity or null if fix should just delete that index range and there's nothing to insert |
| rangeValDecoded | \u03C0 | string or null | Decoded entity or null if fix should just delete that index range and there's nothing to insert |

opts.decode in relation to opts.cb

It might seem that opts.decode does not matter when a callback is used, because we serve both encoded and decoded values to the callback. But it does matter.

For example, consider this case, where we have non-breaking spaces without semicolons:

&nbsp,&nbsp,&nbsp

Since we give the user a choice between raw and encoded values, the result can come in two forms:

When decoded entities are requested, we replace ranges [0, 5], [6, 11] and [12, 17]:

// ranges:
[
[0, 5, "\xA0"],
[6, 11, "\xA0"],
[12, 17, "\xA0"],
];

But, when encoded entities are requested, it's just a matter of sticking in the missing semicolon, at indexes 5, 11 and 17:

// ranges:
[
[5, 5, ";"],
[11, 11, ";"],
[17, 17, ";"],
];
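To make the difference concrete, here is a small self-contained sketch which applies both sets of ranges to the same input. The `apply` helper is a hypothetical, simplified stand-in for a ranges-applying utility (it assumes non-overlapping ranges sorted by index); it is not code from this package.

```javascript
const input = "&nbsp,&nbsp,&nbsp";

// Hypothetical helper: applies ranges from last to first,
// so that earlier indexes stay valid while the string changes length.
const apply = (str, ranges) =>
  ranges.reduceRight(
    (acc, [from, to, value = ""]) => acc.slice(0, from) + value + acc.slice(to),
    str
  );

// Decoded values requested: each broken entity is replaced whole
// with a raw non-breaking space.
const decoded = apply(input, [
  [0, 5, "\xA0"],
  [6, 11, "\xA0"],
  [12, 17, "\xA0"],
]);

// Encoded values requested: a zero-width range (from === to)
// only inserts the missing semicolon.
const encoded = apply(input, [
  [5, 5, ";"],
  [11, 11, ";"],
  [17, 17, ";"],
]);

console.log(decoded === "\xA0,\xA0,\xA0"); // true
console.log(encoded); // "&nbsp;,&nbsp;,&nbsp;"
```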

opts.entityCatcherCb

While broken entities are pinged to the opts.cb() callback, all healthy entities are pinged to opts.entityCatcherCb. It's either one or the other:

import { fixEnt } from "string-fix-broken-named-entities";

const inp1 = "y &nbsp; z &nsp;";
const gatheredEntityRanges = [];
fixEnt(inp1, {
  entityCatcherCb: (from, to) => gatheredEntityRanges.push([from, to]),
});
console.log(
  `gatheredEntityRanges = ${JSON.stringify(gatheredEntityRanges, null, 4)}`
);
// => [[2, 8]]

opts.textAmpersandCatcherCb

Sometimes an input string contains both ampersands-as-text and ampersands-as-part-of-entities.

For example, consider a string abc&&nbsp;&xyz. What do you see here? There's one named HTML entity, &nbsp;, surrounded by two raw text ampersands. This callback, opts.textAmpersandCatcherCb, can be used to catch raw ampersands, probably with the aim of HTML-encoding them. Specifically, this option would call your function twice, with numbers 3 and 10, the positions of the raw ampersands.

See the supplied example where broken entity is fixed and raw text ampersands are encoded, all within the same string, all done using this program.
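As a rough sketch of what can be done with the reported positions, the snippet below hard-codes the two indexes 3 and 10 (on the assumption that they were collected by an opts.textAmpersandCatcherCb callback) and turns them into ranges which encode each raw ampersand. The variable names are hypothetical.

```javascript
const str = "abc&&nbsp;&xyz";

// Assumed to have been collected via opts.textAmpersandCatcherCb:
const rawAmpersandPositions = [3, 10];

// Each raw ampersand occupies one character, so its range is [idx, idx + 1].
const ampRanges = rawAmpersandPositions.map((idx) => [idx, idx + 1, "&amp;"]);

// Apply the ranges from the end, so that earlier indexes are not shifted.
const encodedStr = ampRanges.reduceRight(
  (acc, [from, to, value]) => acc.slice(0, from) + value + acc.slice(to),
  str
);

console.log(encodedStr); // "abc&amp;&nbsp;&amp;xyz"
```

The healthy &nbsp; in the middle is left alone because its ampersand was never reported as a raw one.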

opts.progressFn

The purpose of web workers is to offload the CPU-heavy calculations to a separate CPU thread. It makes the UI responsive during calculations; the web app does not freeze. They're pretty simple to implement; check the examples on MDN. For example, this website's search function is on a web worker. Detergent uses one, Email-Comb uses one. It's a no-brainer.

In web worker setups, a worker can return "in progress" values. When we put this package into a web worker, the callback function under opts.progressFn will be called with a string containing a natural number, showing the percentage of the work done so far.

It's hard to show a minimal worker application here but at least here's how the pinging progress works from the side of this npm package:

import { fixEnt } from "string-fix-broken-named-entities";

// let's define a variable on a higher scope:
let count = 0;

// call application as normal, pass opts.progressFn:
const result = fixEnt(
  "text &ang text&ang text text &ang text&ang text text &ang text&ang text",
  {
    progressFn: (percentageDone) => {
      // console.log(`percentageDone = ${percentageDone}`);
      count++;
    },
  }
);
// each time percentage is reported, "count" is incremented

// now imagine if instead of incrementing the count, we pinged the
// value out of the worker
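To illustrate the reporting contract only, here is a self-contained sketch of how a character-by-character loop might ping a progress callback at most once per whole percentage. The function `scanWithProgress` is hypothetical and says nothing about how this package is implemented internally.

```javascript
// Hypothetical sketch: walks the string and reports each new whole percentage once.
function scanWithProgress(str, progressFn) {
  let lastReported = -1;
  for (let i = 0; i < str.length; i++) {
    // ... the per-character work would happen here ...
    const percentage = Math.floor((i * 100) / str.length); // always within 0..99
    if (progressFn && percentage !== lastReported) {
      lastReported = percentage;
      progressFn(percentage);
    }
  }
}

const reported = [];
scanWithProgress("x".repeat(1000), (percentageDone) => {
  // inside a web worker, this is where the value would be posted
  // to the main thread instead of being stored
  reported.push(percentageDone);
});

console.log(reported.length); // 100
console.log(reported[0], reported[reported.length - 1]); // 0 99
```

Reporting only when the whole percentage changes keeps the number of messages bounded at 100, however long the input is.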

Changelog

See it in the monorepo, on GitHub.

Contributing

To report bugs or request features or assistance, raise an issue on GitHub.

Any code contributions are welcome! All Pull Requests will be dealt with promptly.

Licence

MIT

Copyright © 2010–2021 Roy Revelt and other contributors

Related packages:

📦 ranges-apply 6.0.1
Take an array of string index ranges, delete/replace the string according to them
📦 detergent 8.0.1
Extracts, cleans and encodes text
📦 leven
Measure the difference between two strings using the Levenshtein distance algorithm
📦 string-range-expander 3.0.1
Expands string index ranges within whitespace boundaries until letters are met
📦 string-match-left-right 8.0.1
Match substrings on the left or right of a given index, ignoring whitespace
📦 string-overlap-one-on-another 3.0.1
Lay one string on top of another, with an optional offset
📦 string-collapse-leading-whitespace 6.0.1
Collapse the leading and trailing whitespace of a string