Installation
Quick Take
The Purpose
We need a program that helps find malformed instances of a string.
For example, consider the opening HTML comment tag, `<!--`.
There can be many things wrong with it:
- Missing characters from the set, for example, `<--` or `<!-`
- Rogue characters present between characters in the set, for example, `<!-.-` or `<z!--`
- Rogue whitespace characters, for example, `<! --` or `<!- -`
In short, we are after anything that is very similar to what we are looking for, but not exactly the same.
Idea
Levenshtein distance is a number that signifies how many single-character changes (insertions, deletions or substitutions) are needed to turn one string into another.
In technical terms, we would look for chunks of characters within Levenshtein distance `1` of the reference string, disregarding whitespace.
The difference between `dog` and `dot` is `1` (“g” needs to be changed into “t”).
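To make the distance concrete, here is a minimal sketch of the textbook dynamic-programming calculation. It is an illustration only: the function name `levenshtein` is made up for this example, and this is not necessarily how the package computes distance internally.

```javascript
// Textbook Levenshtein distance, keeping only two rows of the DP table.
// Illustrative sketch; not taken from the package source.
function levenshtein(a, b) {
  // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // turning i chars into an empty string takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution (or a free match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein("dog", "dot")); // => 1
console.log(levenshtein("<!--", "<--")); // => 1 (missing "!")
console.log(levenshtein("<!--", "<z!--")); // => 1 (rogue "z")
```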
Also, not all characters are equal (pardon the pun): whitespace should, or at least could, be disregarded. For example, five spaces are not the same as any five characters: `<!     --` is definitely an instance of a malformed `<!--`, but `<!<a id--` is very weird, even though both are at Levenshtein distance 5.
Takeaway: the program aggressively chomps whitespace but is sensitive to all other characters.
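The effect of chomping whitespace can be sketched with a small helper. `stripWhitespace` is a hypothetical name used only here, and removing whitespace up front is an assumption made for illustration; the package may skip whitespace differently while it scans.

```javascript
// Hypothetical helper: drop every whitespace character before comparing.
const stripWhitespace = (str) => str.replace(/\s+/g, "");

const reference = "<!--";

// Five rogue spaces vanish, so the chunk matches the reference exactly.
console.log(stripWhitespace("<!     --") === reference); // => true

// Five rogue characters, only one of them whitespace: still far from the reference.
console.log(stripWhitespace("<!<a id--")); // => "<!<aid--"
console.log(stripWhitespace("<!<a id--") === reference); // => false
```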
API — findMalformed()
The main function, `findMalformed()`, is a named export of `string-find-malformed`.
It takes up to four input arguments; the first three are obligatory:
Input argument | Type | Obligatory | Description |
---|---|---|---|
str | String | yes | The string in which you want to perform a search |
refStr | String | yes | What to look for |
cb | Function | yes | You supply a callback function. It will be called on each finding. See its API below. |
opts | Plain object | no | Optional Options Object. |
None of the input arguments will be mutated by this program; we have unit tests to prove that.
The Optional Options Object has the following shape:
Key | Type | Default | Description |
---|---|---|---|
stringOffset | Natural number or zero | 0 | Every index fed to the callback will be incremented by this much. |
maxDistance | Natural number or zero | 1 | Controls how many characters can differ (Levenshtein distance) before we disregard a particular chunk as a result. |
ignoreWhitespace | Boolean | true | Whitespace (characters that trim to zero length) is skipped by default. |
Here are all defaults in one place for copying:
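Going by the table above, the defaults amount to the following object. This is a reconstruction from the documented values rather than a copy of the package source, and the variable name `defaultOpts` is illustrative.

```javascript
// Reconstructed from the options table; the name defaultOpts is illustrative.
const defaultOpts = {
  stringOffset: 0, // added to every index passed to the callback
  maxDistance: 1, // maximum Levenshtein distance still reported
  ignoreWhitespace: true, // whitespace is skipped while matching
};
```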
The function returns `undefined` because it has a callback-style API, just like `Array.prototype.forEach()`, for example.
API — a callback input argument
The third input argument is a callback function that you supply. Each time a result is found, this function is called and a plain object is passed as its first argument.
For example:
import { findMalformed } from "string-find-malformed";
// we create an empty array to dump the results into
const gathered = [];
// we call the function
findMalformed(
  // first input argument: source
  "abcdef",
  // second input argument: what to look for but mangled
  "bde",
  // callback function:
  (obj) => {
    gathered.push(obj);
  },
  // empty options object:
  {}
);
console.log(gathered);
// => [
//   {
//     idxFrom: 1,
//     idxTo: 5
//   }
// ]
// you can double-check with String.slice():
console.log("abcdef".slice(1, 5));
// => "bcde"
// the match is mangled because a rogue letter "c" sits between the "good" letters.
The result above means that a mangled `bde` is present in `abcdef` in the index range from `1` to `5`. The indexes follow the same principles as in `String.slice()`.
API — defaults
You can import `defaults`. It's a plain object that holds the default values of the Optional Options Object described above.
The main function calculates the options to be used by merging the options you passed with these defaults.
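The merge can be pictured as a shallow object spread. The sketch below is an assumption about the general approach, not the package's actual code: `resolveOpts` is a hypothetical helper, and the `defaults` object is rebuilt here from the options table so that the snippet runs on its own.

```javascript
// Values taken from the options table in this README.
const defaults = { stringOffset: 0, maxDistance: 1, ignoreWhitespace: true };

// Hypothetical helper: keys you pass override the defaults, the rest fall through.
function resolveOpts(userOpts) {
  return { ...defaults, ...userOpts };
}

console.log(resolveOpts({ maxDistance: 2 }));
// => { stringOffset: 0, maxDistance: 2, ignoreWhitespace: true }

// Calling without options yields a copy of the defaults.
console.log(resolveOpts());
// => { stringOffset: 0, maxDistance: 1, ignoreWhitespace: true }
```

Because the spread builds a new object, neither `defaults` nor the object you pass in is mutated.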
API — version
You can also import `version`.
Further Ideas
Nobody would mistype “owned” as “ewned”: “fat finger” errors occur on neighbouring keys. In this case, “o” can be mistyped as “i” or “p” because those keys are adjacent, while “e” is far away, which makes that typo unrealistic.
In this light, Levenshtein distance is not strictly fit for the purpose. An alternative version of it should be written, in which the algorithm considers both the distance and the neighbouring keys and evaluates accordingly.
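One possible shape for such an alternative is a Levenshtein variant in which substituting a neighbouring key is cheaper than substituting a distant one. Everything in the sketch below is an assumption made for illustration: the QWERTY layout, the rule that only side-by-side keys on the same row count as neighbours, the arbitrary 0.5 weight, and the names `areNeighbours` and `keyboardDistance`. None of it is part of this package.

```javascript
// Assumption: QWERTY layout; only keys side by side on the same row are neighbours.
const ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"];

function areNeighbours(a, b) {
  return ROWS.some((row) => {
    const i = row.indexOf(a);
    const j = row.indexOf(b);
    return i !== -1 && j !== -1 && Math.abs(i - j) === 1;
  });
}

// Levenshtein variant: a neighbouring-key substitution costs 0.5 (arbitrary) instead of 1.
function keyboardDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const x = a[i - 1];
      const y = b[j - 1];
      const cost = x === y ? 0 : areNeighbours(x, y) ? 0.5 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(keyboardDistance("owned", "iwned")); // => 0.5 ("i" sits next to "o")
console.log(keyboardDistance("owned", "pwned")); // => 0.5 ("p" sits next to "o")
console.log(keyboardDistance("owned", "ewned")); // => 1 ("e" is far from "o")
```

With a threshold below 1, plausible slips such as “iwned” would be reported while “ewned” would not.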