§ Quick Take

import { strict as assert } from "assert";
import { det, opts, version } from "detergent";

// with default settings, widow removal and encoding are enabled:
assert.equal(
  det("clean this text £").res,
  "clean this&nbsp;text &pound;"
);

§ Purpose

Detergent.js:

Extra features are:

  • You can skip the HTML encoding of non-Latin language letters. Useful when you are deploying Japanese or Chinese emails because otherwise, everything would be HTML-encoded.
  • Detergent is both XHTML and HTML-friendly. You can set which way you want your <br>'s to appear: with a closing slash, <br/> (XHTML) or without (HTML), <br> — that's to reduce code validator errors.
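
As a rough illustration of the two <br> styles, here is a minimal sketch. The helper name `lineBreaksToBr` and its `useXHTML` parameter are assumptions made for this example. It is not Detergent's implementation, only a picture of the difference the setting makes.

```javascript
// Hypothetical helper, not part of Detergent's API.
// Puts a <br> tag, in the chosen style, in front of every line break.
function lineBreaksToBr(str, useXHTML = true) {
  const br = useXHTML ? "<br/>" : "<br>";
  // Match Windows (\r\n), old Mac (\r) and Unix (\n) line breaks.
  return str.replace(/\r\n|\r|\n/g, (lineBreak) => `${br}${lineBreak}`);
}

console.log(JSON.stringify(lineBreaksToBr("a\nb")));
// => "a<br/>\nb"
console.log(JSON.stringify(lineBreaksToBr("a\nb", false)));
// => "a<br>\nb"
```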

§ API

The main function is exported in a plain object under key det:

const { det } = require("detergent");
// or request everything:
const { det, opts, version } = require("detergent");
// this also gives you a plain object `opts` with the default options. Handy when
// developing front-ends that consume Detergent.

det is the main function. See its API below.

opts is the default options object. You pass it (or a tweaked copy of it) to det.

version is the value of the same-named key in package.json: the version of the particular copy of Detergent you've got.

§ API - det() Input

The det above is a function. You pass two input arguments to it:

| Input argument | Type | Obligatory? | Description |
| --- | --- | --- | --- |
| input | String | yes | The string you want to clean. |
| options | Object | no | Options object. See its key arrangement below. |

§ API - det() second input argument, the options object

| Options object's key | Type of its value | Default | Description |
| --- | --- | --- | --- |
| fixBrokenEntities | Boolean | true | Try to fix any broken named HTML entities, like &nsp; ("b" missing) |
| removeWidows | Boolean | true | Replace the last space in a paragraph with a non-breaking space |
| convertEntities | Boolean | true | Encode all non-ASCII characters |
| convertDashes | Boolean | true | Typographically correct the n/m-dashes |
| convertApostrophes | Boolean | true | Typographically correct the apostrophes |
| replaceLineBreaks | Boolean | true | Replace all line breaks with br's |
| removeLineBreaks | Boolean | false | Put everything on one line (removes any line breaks, inserting a space where necessary) |
| useXHTML | Boolean | true | Add closing slashes on br's |
| dontEncodeNonLatin | Boolean | true | Skip encoding of non-Latin characters (for example, CJK, Alefbet Ivri or Arabic abjad) |
| addMissingSpaces | Boolean | true | Add missing spaces after dots/colons/semicolons, unless it's a URL |
| convertDotsToEllipsis | Boolean | true | Convert three dots into &hellip;, the ellipsis character. When set to false, all encoded ellipses are converted to three dots. |
| stripHtml | Boolean | true | By default, all HTML tags are stripped, except the tags listed in opts.stripHtmlButIgnoreTags (b, strong and so on). Set to false to turn off HTML tag removal completely. |
| stripHtmlButIgnoreTags | Array | ["b", "strong", "i", "em", "br", "sup"] | List of zero or more strings, each a tag name that should not be stripped. For example, ["a", "sup"]. |
| stripHtmlAddNewLine | Array | ["li", "/ul"] | List of zero or more tag names which, if stripped, are replaced with a line break. Closing tags must start with a slash. |
| cb | something falsy or a function | null | Callback function to additionally process characters between tags (like turning letters uppercase) |

Here it is in one place:

det("text to clean", {
  fixBrokenEntities: true,
  removeWidows: true,
  convertEntities: true,
  convertDashes: true,
  convertApostrophes: true,
  replaceLineBreaks: true,
  removeLineBreaks: false,
  useXHTML: true,
  dontEncodeNonLatin: true,
  addMissingSpaces: true,
  convertDotsToEllipsis: true,
  stripHtml: true,
  stripHtmlButIgnoreTags: ["b", "strong", "i", "em", "br", "sup"],
  stripHtmlAddNewLine: ["li", "/ul"],
  cb: null,
});

The default set is a wise choice for the most common scenario - preparing text to be pasted into HTML.

You can also set the boolean options to numeric 0 or 1, which is shorter than true or false.
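
To picture how 0 and 1 can stand in for booleans, here is a minimal sketch of merging custom options over defaults. The `resolveOpts` helper and the trimmed-down `defaults` object are assumptions made for this example; Detergent's own option handling may differ.

```javascript
// Hypothetical helper for illustration; Detergent's internals may differ.
// A shortened defaults object, only three keys, to keep the example small.
const defaults = { removeWidows: true, convertEntities: true, useXHTML: true };

function resolveOpts(custom = {}) {
  // Custom values override the defaults key by key.
  const merged = { ...defaults, ...custom };
  for (const key of Object.keys(merged)) {
    // Treat numeric 0 and 1 as false and true; leave other values alone.
    if (merged[key] === 0 || merged[key] === 1) {
      merged[key] = Boolean(merged[key]);
    }
  }
  return merged;
}

console.log(resolveOpts({ convertEntities: 0 }));
// => { removeWidows: true, convertEntities: false, useXHTML: true }
```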

§ API - det() output - an object

| output object's key | Type of its value | Description |
| --- | --- | --- |
| res | String | The cleaned string |
| applicableOpts | Plain object | A copy of the options object without the keys that have array values. Each remaining key is set to a boolean: is that option applicable to the given input? |

Function det returns a plain object, for example:

{
  res: "abc",
  applicableOpts: {
    fixBrokenEntities: false,
    removeWidows: false,
    convertEntities: false,
    convertDashes: false,
    convertApostrophes: false,
    replaceLineBreaks: false,
    removeLineBreaks: false,
    useXHTML: false,
    dontEncodeNonLatin: false,
    addMissingSpaces: false,
    convertDotsToEllipsis: false,
    stripHtml: false
  }
}

§ applicableOpts

Next-generation web applications are designed to show only the options that are applicable to the given input. This saves the user's time and mental effort: there is no need to read the labels of options that don't apply.

Detergent currently has 15 option keys, 12 of them boolean. That's not a lot, but if you use the tool every day, every optimisation counts.

We got the inspiration for this feature while visiting a competitor application, typograf. It has 110 checkboxes in 12 groups, and the options are hidden twice: first, the sidebar is hidden when you visit the page; second, the option groups are collapsed.

Another example of an overwhelming set of options is Kangax's minifier, html-minifier, which has 26 options with heavy descriptions.

Detergent tackles this problem by changing its algorithm: as it processes the given input, it records whether each option is applicable, regardless of whether that option is enabled. Only if the option is enabled does it change the result value.
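
The idea of recording applicability separately from applying a change can be sketched for a single option. The `ellipsisStep` function below is hypothetical and covers only three-dot detection; Detergent's real pipeline is more involved and is not reproduced here.

```javascript
// Hypothetical single step; not Detergent's actual code.
function ellipsisStep(str, opts = { convertDotsToEllipsis: true }) {
  const applicableOpts = { convertDotsToEllipsis: false };
  let res = str;
  if (str.includes("...")) {
    // Noted regardless of whether the option is switched on.
    applicableOpts.convertDotsToEllipsis = true;
    // The result changes only when the option is enabled.
    if (opts.convertDotsToEllipsis) {
      res = str.replace(/\.\.\./g, "&hellip;");
    }
  }
  return { res, applicableOpts };
}

console.log(ellipsisStep("wait...", { convertDotsToEllipsis: false }));
// => { res: 'wait...', applicableOpts: { convertDotsToEllipsis: true } }
console.log(ellipsisStep("abc"));
// => { res: 'abc', applicableOpts: { convertDotsToEllipsis: false } }
```

With the option disabled, the input comes back unchanged, yet the report still says the option would have had something to do.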

For example, detergent's output might look like this — all options not applicable because there's nothing to do on "abc":

{
  res: "abc",
  applicableOpts: {
    fixBrokenEntities: false,
    removeWidows: false,
    convertEntities: false,
    convertDashes: false,
    convertApostrophes: false,
    replaceLineBreaks: false,
    removeLineBreaks: false,
    useXHTML: false,
    dontEncodeNonLatin: false,
    addMissingSpaces: false,
    convertDotsToEllipsis: false,
    stripHtml: false
  }
}

Option keys whose values are arrays (stripHtmlButIgnoreTags and stripHtmlAddNewLine) are omitted from the applicableOpts report.
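
A front-end could use such a report to decide which controls to render. The sketch below assumes a hypothetical `visibleOptions` helper and a hand-written sample report; neither comes from Detergent itself.

```javascript
// Hypothetical UI helper: names of the options worth showing for this input.
function visibleOptions(applicableOpts) {
  return Object.keys(applicableOpts).filter((key) => applicableOpts[key]);
}

// Hand-written sample report, shortened to four keys for the example.
const report = {
  removeWidows: true,
  convertEntities: true,
  convertDashes: false,
  stripHtml: false,
};

console.log(visibleOptions(report));
// => [ 'removeWidows', 'convertEntities' ]
```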

§ Example

Custom settings object with one custom setting convertEntities (others are left default):

const { det } = require("detergent");
const { res } = det("clean this text £", {
  convertEntities: 0, // <--- zero is like "false", turns off the feature
});
console.log(res);
// > 'clean this text £'

§ opts.cb

One of the unique (and complex) features of this program is HTML tag recognition. Detergent processes only the text and doesn't touch the tags. For example, widow word removal won't add non-breaking spaces within your tags if you choose not to strip the HTML.

opts.cb lets you perform additional operations on all the string characters outside any HTML tags. For example, the uppercase-lowercase functionality on detergent.io relies on opts.cb.

In the example below, the callback turns the letters uppercase, while the <br/> tags that Detergent adds are left as they are:

const { det } = require("detergent");
const { res } = det(`aAa\n\nbBb\n\ncCc`, {
  cb: (str) => str.toUpperCase(),
});
console.log(res);
// => "AAA<br/>\n<br/>\nBBB<br/>\n<br/>\nCCC"
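
To show the general idea of running a callback only on the text between tags, here is a simplified sketch. The `applyOutsideTags` function is hypothetical, and its naive regular expression stands in for real HTML tag recognition, which Detergent handles far more carefully.

```javascript
// Simplified sketch: a naive regex stands in for real HTML tag recognition.
function applyOutsideTags(str, cb) {
  // split() with a capturing group keeps the matched tags in the result,
  // at odd indexes; the text chunks sit at even indexes.
  return str
    .split(/(<[^>]*>)/)
    .map((chunk, i) => (i % 2 === 1 ? chunk : cb(chunk)))
    .join("");
}

console.log(applyOutsideTags("aAa <b>bBb</b> cCc", (s) => s.toUpperCase()));
// => AAA <b>BBB</b> CCC
```

The tag names stay lowercase because the callback never sees them.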

§ Licence

MIT

Copyright © 2015–2020 Roy Revelt and other contributors

Related packages:

📦 html-entities-not-email-friendly 0.2.8
All HTML entities which are not email template friendly
📦 string-apostrophes 1.2.29
Comprehensive, HTML-entities-aware tool to typographically-correct the apostrophes and single/double quotes
📦 string-collapse-white-space 5.2.30
Efficient collapsing of white space with optional outer- and/or line-trimming and HTML tag recognition
📦 string-fix-broken-named-entities 3.0.10
Finds and fixes common and not so common broken named HTML entities, returns ranges array of fixes
📦 string-left-right 2.3.30
Looks up the first non-whitespace character to the left/right of a given index
📦 string-remove-widows 1.6.16
Helps to prevent widow words in a text
📦 string-strip-html 6.0.3
Strips HTML tags from strings. No parser, accepts mixed sources.