string-remove-widows open source npm package

Installation

Install it from npm:

npm i string-remove-widows

Quick Take

Examples

Widow word removal from text within HTML

Idea

This library takes a string and removes widow words by replacing the last space in each paragraph with a non-breaking space.

Can add widow word prevention measures and, if you want, remove them
Tackles both paragraphs and single lines
Recognises existing measures and, if found, skips the operation
Option to encode for HTML, CSS or JS strings, or to insert a raw non-breaking space
Does not mangle line endings: LF (Unix and modern Mac), CR (old-style Mac) or CRLF (Windows)
Customisable minimum number of words per line/paragraph to trigger widow word removal
Can be used at different stages of the workflow: before HTML/CSS/JS encoding or after
Optionally replaces spaces with non-breaking spaces in front of all kinds of dashes
Optionally replaces spaces with non-breaking spaces within UK postcodes
Optionally skips content between templating tags, for example Nunjucks {{ and }}; presets are given for Jinja, Nunjucks, Liquid, Hexo and Hugo
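To make the core idea concrete, here is a minimal sketch. It is a simplified illustration and not the library's implementation: the helper names are hypothetical, every line is treated as its own paragraph, and none of the thresholds or options are applied.

```javascript
const NBSP = "\u00A0"; // raw non-breaking space

// Hypothetical helper: swaps the whitespace in front of the last word
// for a non-breaking space, so the last two words stay together.
function joinLastTwoWords(paragraph) {
  return paragraph.replace(/(\S)\s+(\S+\s*)$/, `$1${NBSP}$2`);
}

// Hypothetical helper: splits on line breaks but keeps them (capture group),
// so LF, CR and CRLF line endings all survive untouched.
function preventWidows(text) {
  return text
    .split(/(\r\n|\r|\n)/)
    .map((chunk) =>
      /^(\r\n|\r|\n)$/.test(chunk) ? chunk : joinLastTwoWords(chunk)
    )
    .join("");
}

const output = preventWidows("one two three\r\nfour five six");
console.log(JSON.stringify(output));
// => "one two three\r\nfour five six" (the last space on each line is now U+00A0)
```

The real program does considerably more (thresholds, encoding, templating tags), which is what the options below control.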

API — `removeWidows()`

The main function removeWidows() is imported like this:

import { removeWidows } from "string-remove-widows";

It’s a function which takes two input arguments:

Input argument	Type	Obligatory	Description
`str`	String	yes	String which we will process
`opts`	Plain object	no	Put options here

The Optional Options Object has the following shape:

Key	Type	Default	Description
`removeWidowPreventionMeasures`	boolean	`false`	If `true`, every non-breaking space that guards against a widow word is replaced with a single space
`convertEntities`	boolean	`true`	If `false`, a raw non-breaking space is inserted. If `true`, it is encoded for the language set in `targetLanguage` (HTML by default)
`targetLanguage`	string	`html`	Choose one of `html`, `css` or `js`; non-breaking spaces will be encoded for that language
`UKPostcodes`	boolean	`false`	If enabled, whitespace between the two parts of a UK postcode is replaced with a non-breaking space
`hyphens`	boolean	`true`	Whitespace in front of hyphens (`-`), n-dashes (`–`) or m-dashes (`—`) is replaced with a non-breaking space
`minWordCount`	natural number; `0` or any falsy value disables the feature	`4`	Minimum word count in a paragraph to trigger widow removal
`minCharCount`	natural number; `0` or any falsy value disables the feature	`20`	Minimum non-whitespace character count in a paragraph to trigger widow removal
`ignore`	array of zero or more strings OR string	`[]`	List templating languages whose heads/tails will be recognised and skipped
`reportProgressFunc`	function or `null`	`null`	If a function is given, it is called with a natural number, the percentage done, as its first input argument
`reportProgressFuncFrom`	natural number or `0`	`0`	Normally `reportProgressFunc()` reports percentages starting from zero, but you can set it to a custom value
`reportProgressFuncTo`	natural number	`100`	Normally `reportProgressFunc()` reports percentages up to `100`, but you can set it to a custom value
`tagRanges`	array of zero or more arrays	`[]`	If you know where the HTML tags are, provide string index ranges here
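As a rough illustration of how the two thresholds, `minWordCount` and `minCharCount`, could gate widow removal, consider the sketch below. The helper `meetsThresholds` is hypothetical, and the assumption that both thresholds must be satisfied together is ours; the table only states that each is a minimum and that a falsy value disables it.

```javascript
// Hypothetical helper, not part of the package's API.
// Assumption: a paragraph qualifies only when BOTH enabled thresholds are met.
function meetsThresholds(
  paragraph,
  { minWordCount = 4, minCharCount = 20 } = {}
) {
  const words = paragraph.trim().split(/\s+/).filter(Boolean);
  // only non-whitespace characters are counted
  const charCount = paragraph.replace(/\s/g, "").length;
  // a falsy threshold (0, null, false) disables that check
  const enoughWords = !minWordCount || words.length >= minWordCount;
  const enoughChars = !minCharCount || charCount >= minCharCount;
  return enoughWords && enoughChars;
}

console.log(meetsThresholds("Lorem ipsum dolor sit amet")); // => true
console.log(meetsThresholds("Hi there")); // => false
console.log(meetsThresholds("Hi there", { minWordCount: 0, minCharCount: 0 })); // => true
```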

Here are all defaults in one place for copying:
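Transcribed from the options table above, the defaults amount to the object below. The merge at the end is only an illustration of "your options win over the defaults"; that it is a shallow merge is an assumption on our part.

```javascript
// Default values as listed in the options table
const defaults = {
  removeWidowPreventionMeasures: false,
  convertEntities: true,
  targetLanguage: "html",
  UKPostcodes: false,
  hyphens: true,
  minWordCount: 4,
  minCharCount: 20,
  ignore: [],
  reportProgressFunc: null,
  reportProgressFuncFrom: 0,
  reportProgressFuncTo: 100,
  tagRanges: [],
};

// Assumption: a shallow merge, where user-supplied keys override defaults
const userOpts = { targetLanguage: "css", minWordCount: 3 };
const resolved = { ...defaults, ...userOpts };

console.log(resolved.targetLanguage); // => "css"
console.log(resolved.minWordCount); // => 3
console.log(resolved.hyphens); // => true
```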

The function returns a plain object:

Key in a returned object	Type	Description
`res`	String	Processed string
`ranges`	Null or Array of one or more Ranges (arrays)	Same Ranges used to produce the `res`
`log`	Plain object	See its format below
`whatWasDone`	Plain object	Whether widow removal, or only decoding, was performed

For example, the output could look like this:

{
  res: "Lorem ipsum dolor sit&nbsp;amet",
  ranges: [
    [21, 27, "&nbsp;"]
  ],
  log: {
    timeTakenInMilliseconds: 42
  },
  whatWasDone: {
    removeWidows: true,
    convertEntities: false
  }
}

API — `defaults`

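You can import defaults:

import { defaults } from "string-remove-widows";

It's a plain object holding the default values listed in the options table above. The main function calculates the options to be used by merging the options you passed with these defaults.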

API — `version`

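You can import version:

import { version } from "string-remove-widows";

It's a string containing the package's version, for example "4.0.3".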

`opts.targetLanguage`

Not all text ends up in HTML. You can also inject content via CSS pseudo-elements, and text might be prepared to be pasted into JSON.

This program lets you choose the target language in which non-breaking spaces are encoded: html, css or js.

Here’s HTML with an HTML-encoded non-breaking space:

Some raw text in a very long&nbsp;line.

Here’s the CSS analogue:

span:before {
  content: "Some raw text in a very long\00A0line.";
}

Here’s the JavaScript analogue:

alert("Some raw text in a very long\u00A0line.");

For example, a minimal application would look like this:

import { removeWidows } from "string-remove-widows";
// second input argument is a plain object, the Optional Options Object:
const result = removeWidows("Some raw text in a very long line.", {
  targetLanguage: "css",
});
// now widow words are prevented, with the non-breaking space encoded for CSS content:
console.log(result.res);
// => "Some raw text in a very long\00A0line."
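The three encodings shown above can be summarised as a small lookup table. The sketch below is only an illustration of that mapping; `encodeNbsp` and `ENCODED_NBSP` are hypothetical names and not part of the package.

```javascript
// How a non-breaking space is written in each target language
const ENCODED_NBSP = {
  html: "&nbsp;",
  css: "\\00A0", // backslash, then 00A0
  js: "\\u00A0", // backslash, then u00A0
};

// Hypothetical helper: replaces every raw non-breaking space (U+00A0)
// with its encoded form for the chosen language.
function encodeNbsp(str, targetLanguage = "html") {
  if (!Object.prototype.hasOwnProperty.call(ENCODED_NBSP, targetLanguage)) {
    throw new Error(`Unknown target language: ${targetLanguage}`);
  }
  return str.split("\u00A0").join(ENCODED_NBSP[targetLanguage]);
}

const raw = "Some raw text in a very long\u00A0line.";
console.log(encodeNbsp(raw, "html")); // => Some raw text in a very long&nbsp;line.
console.log(encodeNbsp(raw, "css")); // => Some raw text in a very long\00A0line.
console.log(encodeNbsp(raw, "js")); // => Some raw text in a very long\u00A0line.
```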

`opts.ignore`

Very often text already contains templating language literals.

For example, this Nunjucks snippet:

Hi{% if data.firstName %} {{ data.firstName }}{% endif %}!

We intend to say either Hi John! to customer John or just Hi! if we don’t know the customer’s name.

But if we run widow word removal on this piece of text, we don’t want &nbsp; inserted into the middle of the endif tag:

Hi{% if data.firstName %} {{ data.firstName }}{% endif&nbsp;%}!
                                                      ^^^^^^

That’s where opts.ignore comes in. You can list heads/tails (chunks from which to start ignoring/where to stop) manually:

import { removeWidows } from "string-remove-widows";
const result = removeWidows("Here is a very long line of text", {
  targetLanguage: "html",
  ignore: [
    {
      heads: "{{",
      tails: "}}",
    },
    {
      heads: ["{% if", "{%- if"],
      tails: ["{% endif", "{%- endif"],
    },
  ],
});

or you can just pick a template:

all
jinja
nunjucks
liquid
hugo
hexo

for example:

import { removeWidows } from "string-remove-widows";
const result = removeWidows("Here is a very long line of text", {
  targetLanguage: "html",
  ignore: "jinja",
});

If you want the widest support of literals, all languages at once, set ignore to "all".
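To picture what heads and tails do, here is a minimal sketch that marks the stretches of a string lying between a head and the tail that follows it. The helpers `findIgnoredRanges` and `isIgnored` are hypothetical, and the real program may match heads and tails differently.

```javascript
// Hypothetical helper: returns [from, to] index ranges (end exclusive),
// each starting at a head and ending right after the tail that follows it.
function findIgnoredRanges(str, heads, tails) {
  const ranges = [];
  let i = 0;
  while (i < str.length) {
    const head = heads.find((h) => str.startsWith(h, i));
    if (head === undefined) {
      i += 1;
      continue;
    }
    // candidate end positions: one per tail found after this head
    const ends = tails
      .map((t) => [str.indexOf(t, i + head.length), t.length])
      .filter(([idx]) => idx !== -1)
      .map(([idx, len]) => idx + len);
    if (ends.length === 0) break; // unclosed head: stop scanning
    const end = Math.min(...ends); // the tail that ends earliest closes the range
    ranges.push([i, end]);
    i = end;
  }
  return ranges;
}

// True if the given index falls inside any ignored range
const isIgnored = (index, ranges) =>
  ranges.some(([from, to]) => index >= from && index < to);

const text = "Hi {{ first name }} and welcome";
const ignored = findIgnoredRanges(text, ["{{"], ["}}"]);
console.log(JSON.stringify(ignored)); // => [[3,19]]
console.log(isIgnored(11, ignored)); // => true (the space inside the braces)
console.log(isIgnored(2, ignored)); // => false (the space after "Hi")
```

A space whose index is ignored would then be left alone when looking for a place to put the non-breaking space.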

`opts.tagRanges`

Sometimes the input string contains HTML tags. We didn’t go as far as coding up full HTML tag recognition, not least because it would duplicate existing libraries, namely string-strip-html.

opts.tagRanges accepts known HTML tag ranges (or, in fact, any “black spots” to skip):

import { stripHtml } from "string-strip-html";
import { removeWidows } from "string-remove-widows";

const input = `something in front here <a style="display: block;">x</a> <b style="display: block;">y</b>`;
// first, gung-ho approach - no tag locations provided:
const res1 = removeWidows(input).res;
console.log(res1);
// => something in front here <a style="display: block;">x</a> <b style="display:&nbsp;block;">y</b>
//                                                                               ^^^^^^
//                                      notice how non-breaking space is wrongly put inside the tag
//
// but, if you provide the tag ranges, program works correctly:
const tagRanges = stripHtml(input, { returnRangesOnly: true });
console.log(JSON.stringify(tagRanges, null, 4));
// => [[24, 51], [52, 56], [57, 84], [85, 89]]
// now, plug the tag ranges into opts.tagRanges:
const res2 = removeWidows(input, { tagRanges }).res;
console.log(res2);
// => something in front here <a style="display: block;">x</a>&nbsp;<b style="display: block;">y</b>
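The effect of the ranges can be pictured with a small stand-alone sketch: walk backwards through the string and replace the last whitespace character that does not fall inside any range. The helper `replaceLastFreeSpace` is hypothetical and much simpler than the real program; it also assumes each range is [from, to] with an exclusive end index.

```javascript
// Hypothetical helper for illustration; not the package's implementation.
// Assumption: each range is [from, to] with an exclusive end index.
function replaceLastFreeSpace(str, tagRanges, nbsp = "&nbsp;") {
  const insideTag = (i) => tagRanges.some(([from, to]) => i >= from && i < to);
  // walk backwards to find the last whitespace character outside every range
  for (let i = str.length - 1; i >= 0; i -= 1) {
    if (/\s/.test(str[i]) && !insideTag(i)) {
      return str.slice(0, i) + nbsp + str.slice(i + 1);
    }
  }
  return str; // nothing safe to replace
}

const input = `something in front here <a style="display: block;">x</a> <b style="display: block;">y</b>`;
// the same ranges as printed in the example above
const tagRanges = [[24, 51], [52, 56], [57, 84], [85, 89]];

console.log(replaceLastFreeSpace(input, tagRanges));
// => something in front here <a style="display: block;">x</a>&nbsp;<b style="display: block;">y</b>
```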

Compared to competition

	This program, `string-remove-widows`	`widow-js`	`@simmo/widower`
Can both add and remove `nbsp`s	✅	❌	❌
Option to choose between raw, HTML, CSS or JS-encoded `nbsp`s	✅	❌	❌
Can replace spaces in front of hyphens, n- and m-dashes	✅	❌	❌
Can prepare UK postcodes	✅	❌	❌
Does not mangle different types of line endings (`LF`, `CRLF`, `CR`)	✅	✅	✅
Customisable minimal word count threshold	✅	✅	❌
Customisable minimal character count threshold	✅	❌	❌
Progress reporting function for web worker web apps	✅	❌	❌
Reports string index ranges of what was done	✅	❌	❌
Non-breaking space location’s whitespace does not necessarily have to be a single space	✅	❌	❌
Presets for Jinja, Nunjucks, Liquid, Hugo and Hexo templating languages	✅	❌	❌
Decoupled API^	✅	❌	✅
CommonJS build	✅	❌	✅
ES Modules build	✅	❌	❌
UMD build for browser	✅	✅	❌
Can process live DOM of a web page	❌	✅	❌
Licence	MIT	ISC	MIT

^ A decoupled API means that, at its core, the program is a “string-in, string-out” function and is not coupled with the DOM, file I/O, network or other unrelated operations. Such an API is easier to test, and it is easier to build many different applications on top of it.

For example, our competitor widow.js has two coupled parts: 1. an API which does string-in, string-out, and 2. DOM processing functions. It could have been two npm libraries. As it stands, people who don’t need DOM operations can’t use it.

One decoupled, “string-in, string-out” library like string-remove-widows might power all of these at once:

a web page DOM-manipulation library
a CLI application to process files or piped streams
an Express REST endpoint on a server
a serverless lambda on AWS
an Electron desktop program

Changelog

Open Changelog

string-remove-widows 4.0.3
