Efficient collapsing of white space with optional outer- and/or line-trimming and HTML tag recognition

§ Quick Take

import { strict as assert } from "assert";
import collapse from "string-collapse-white-space";

// outer whitespace is trimmed, runs of spaces are collapsed
assert.equal(
  collapse("  aaa     bbb    ccc   dddd  "),
  "aaa bbb ccc dddd"
);

// trimming removes tabs too
assert.equal(collapse("   \t\t\t   aaa   \t\t\t   "), "aaa");

assert.equal(
  collapse("   aaa   bbb  \n    ccc   ddd   ", {
    trimLines: false,
  }),
  "aaa bbb \n ccc ddd"
);

assert.equal(
  collapse("   aaa   bbb  \n    ccc   ddd   ", {
    trimLines: true,
  }),
  "aaa bbb\nccc ddd"
);

// \xa0 is an unencoded non-breaking space:
assert.equal(
  collapse(
    "     \xa0    aaa   bbb    \xa0    \n     \xa0     ccc   ddd   \xa0   ",
    { trimLines: true, trimnbsp: true }
  ),
  "aaa bbb\nccc ddd"
);

§ Whitespace Collapsing

Take a string. First, trim the outsides; then collapse every run of two or more spaces into one.

'    aaa    bbbb    ' -> 'aaa bbbb'

Trimming removes any kind of whitespace from the outsides, including tabs, line breaks and so on. Collapsing affects spaces only: non-space whitespace within the text is left as it is.

'   \t\t\t   aaa     \t     bbbb  \t\t\t\t  ' -> 'aaa \t bbbb'
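The difference between trimming and collapsing can be sketched without the package. This is only an illustration under stated assumptions, not the package's implementation: `sketchCollapse` is a hypothetical name, the trimmable set is limited to space, tab, LF and CR, and only the plain space character is collapsed.

```javascript
// Hypothetical helper, not part of string-collapse-white-space.
// Characters that outer trimming removes in this sketch (an assumption).
const TRIMMABLE = " \t\n\r";

function sketchCollapse(str) {
  // 1. Trim the outsides: skip any trimmable whitespace at both ends.
  let start = 0;
  let end = str.length;
  while (start < end && TRIMMABLE.includes(str[start])) start++;
  while (end > start && TRIMMABLE.includes(str[end - 1])) end--;

  // 2. Collapse: copy characters, skipping a space that follows a space.
  let out = "";
  for (let i = start; i < end; i++) {
    if (str[i] === " " && str[i - 1] === " ") continue;
    out += str[i];
  }
  return out;
}

console.log(JSON.stringify(sketchCollapse("    aaa    bbbb    ")));
// "aaa bbbb"
console.log(
  JSON.stringify(sketchCollapse("   \t\t\t   aaa     \t     bbbb  \t\t\t\t  "))
);
// "aaa \t bbbb"
```

The tab in the middle survives because step 2 only ever skips spaces, which matches the behaviour described above.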

(Optional, on by default) Collapse more aggressively within recognised HTML tags:

'text <   span   >    contents   <  /  span   > more text' -> 'text <span> contents </span> more text'
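As a rough illustration of what tightening within recognised tags means, the sketch below handles only attribute-less tags. Everything in it is an assumption made for the example: `tightenBrackets` is a hypothetical helper, `KNOWN_TAGS` is a tiny stand-in for the package's list of tag names, and the space outside the brackets is deliberately left untouched.

```javascript
// Hypothetical helper, not part of string-collapse-white-space.
// A tiny stand-in for the package's list of recognised tag names.
const KNOWN_TAGS = new Set(["b", "div", "p", "span"]);

function tightenBrackets(str) {
  let out = "";
  let i = 0;
  while (i < str.length) {
    // Only an opening bracket with a closing bracket after it is a candidate.
    const close = str[i] === "<" ? str.indexOf(">", i) : -1;
    if (close === -1) {
      out += str[i++];
      continue;
    }
    // Inspect what sits between the brackets.
    const inner = str.slice(i + 1, close).trim();
    const isClosing = inner.startsWith("/");
    const name = (isClosing ? inner.slice(1) : inner).trim();
    if (KNOWN_TAGS.has(name.toLowerCase())) {
      // Recognised tag: emit it without any inner whitespace.
      out += "<" + (isClosing ? "/" : "") + name + ">";
      i = close + 1;
    } else {
      // Not a recognised tag: copy the bracket as it is and move on.
      out += str[i++];
    }
  }
  return out;
}

console.log(tightenBrackets("< div >")); // <div>
console.log(tightenBrackets("<  /  span   >")); // </span>
console.log(tightenBrackets("a > b")); // a > b
```

A real implementation also has to cope with attributes and with false positives such as comparison operators, which this sketch does not attempt.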

(Optional, off by default) Trim each line:

'   aaa   \n   bbb   ' -> 'aaa\nbbb'
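Per-line trimming can be sketched as follows. This is a minimal illustration, not the package's code: `trimEachLine` is a hypothetical name, only LF line breaks are considered, and the second parameter loosely mirrors the idea of treating non-breaking spaces as trimmable.

```javascript
// Hypothetical helper, not part of string-collapse-white-space.
// Trims spaces and tabs (optionally non-breaking spaces) from every line.
function trimEachLine(str, trimNbsp = false) {
  const trimmable = trimNbsp ? " \t\u00a0" : " \t";
  return str
    .split("\n") // assumption: LF line breaks only
    .map((line) => {
      let start = 0;
      let end = line.length;
      while (start < end && trimmable.includes(line[start])) start++;
      while (end > start && trimmable.includes(line[end - 1])) end--;
      return line.slice(start, end);
    })
    .join("\n");
}

console.log(JSON.stringify(trimEachLine("   aaa   \n   bbb   ")));
// "aaa\nbbb"
```

The built-in `String.prototype.trim()` is avoided on purpose: it always strips non-breaking spaces, so it could not show the distinction.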

(Optional, off by default) Delete empty or whitespace-only rows:

'a\n\n\nb' -> 'a\nb'
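Removing empty rows can be sketched like this. The code is illustrative only: `dropEmptyLines` is a hypothetical name, only LF line breaks are handled, and the `limit` parameter is an assumption modelled on the `limitConsecutiveEmptyLinesTo` option listed in the options table.

```javascript
// Hypothetical helper, not part of string-collapse-white-space.
// Drops empty or whitespace-only lines, keeping at most `limit` in a row.
function dropEmptyLines(str, limit = 0) {
  const kept = [];
  let blanks = 0; // consecutive blank lines seen so far
  for (const line of str.split("\n")) {
    if (line.trim() === "") {
      blanks++;
      if (blanks <= limit) kept.push("");
      continue;
    }
    blanks = 0;
    kept.push(line);
  }
  return kept.join("\n");
}

console.log(JSON.stringify(dropEmptyLines("a\n\n\nb"))); // "a\nb"
console.log(JSON.stringify(dropEmptyLines("a\n\n\nb", 1))); // "a\n\nb"
```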


collapse (string, [opts])


  • the first argument - must be a string; anything else will throw.
  • the second argument - an optional options object. Anything other than undefined, null or a plain object will throw.
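The input contract above could be checked along these lines. This is a sketch under assumptions: `validateInput` and `isPlainObject` are hypothetical helpers, and the error type and messages are invented for the example, since the text only states that invalid input throws.

```javascript
// Hypothetical helpers, not part of string-collapse-white-space.
function isPlainObject(value) {
  if (typeof value !== "object" || value === null) return false;
  const proto = Object.getPrototypeOf(value);
  return proto === Object.prototype || proto === null;
}

function validateInput(str, opts) {
  if (typeof str !== "string") {
    // Error type and wording are assumptions.
    throw new TypeError(`first argument must be a string, got ${typeof str}`);
  }
  if (opts !== undefined && opts !== null && !isPlainObject(opts)) {
    throw new TypeError("second argument must be a plain object, null or undefined");
  }
  return true;
}

validateInput("abc"); // fine
validateInput("abc", null); // fine
validateInput("abc", { trimLines: true }); // fine
try {
  validateInput("abc", ["trimLines"]);
} catch (err) {
  console.log(err.message);
}
```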

§ Optional Options Object's API:

| Options object's key | Type | Obligatory? | Default | Description |
| --- | --- | --- | --- | --- |
| trimStart | Boolean | no | true | If false, leading whitespace is only collapsed, not removed. A run of leading spaces, for example, becomes a single space. |
| trimEnd | Boolean | no | true | If false, trailing whitespace is only collapsed, not removed. |
| trimLines | Boolean | no | false | If true, every line is trimmed: spaces, tabs and line breaks of all kinds are deleted, and so are non-breaking spaces if trimnbsp is set to true. |
| trimnbsp | Boolean | no | false | If true, non-breaking spaces are deleted when trimming. This setting also affects the trimLines setting above. |
| recogniseHTML | Boolean | no | true | If true, whitespace directly inside the brackets of any of the 118 recognised HTML tags is collapsed tightly: < div > -> <div>. Other brackets, such as in the string a > b, are not touched. |
| removeEmptyLines | Boolean | no | false | If a line can be trimmed to an empty string, it is removed. |
| returnRangesOnly | Boolean | no | false | If enabled, a ranges array (array of arrays), or null if there was nothing to collapse, is returned instead of a string. |
| limitConsecutiveEmptyLinesTo | Natural number or zero | no | 0 | Set to 1 or more to allow that many blank lines between content. |


Here are all the defaults in one place:

{
  trimStart: true, // otherwise, leading whitespace will be collapsed to a single space
  trimEnd: true, // otherwise, trailing whitespace will be collapsed to a single space
  trimLines: false, // if true, trims on a per-line basis
  trimnbsp: false, // if true, non-breaking spaces are trimmed too
  recogniseHTML: true, // collapses whitespace inside HTML tag brackets
  removeEmptyLines: false, // if a line trim()'s to an empty string, it's removed
  returnRangesOnly: false, // if on, only the ranges array is returned
  limitConsecutiveEmptyLinesTo: 0, // max consecutive empty lines kept (if opts.removeEmptyLines is on)
}
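Since every key is optional, a caller only needs to pass the settings that differ from the defaults. One common way to combine the two is object spread, sketched below. `resolveOptions` is a hypothetical name, and whether the package merges options exactly like this is an assumption.

```javascript
// Defaults as documented above.
const defaults = {
  trimStart: true,
  trimEnd: true,
  trimLines: false,
  trimnbsp: false,
  recogniseHTML: true,
  removeEmptyLines: false,
  returnRangesOnly: false,
  limitConsecutiveEmptyLinesTo: 0,
};

// Hypothetical helper: later keys win, so user settings override defaults.
// Spreading null or undefined is harmless and yields the plain defaults.
function resolveOptions(opts) {
  return { ...defaults, ...opts };
}

const resolved = resolveOptions({ trimLines: true, trimnbsp: true });
console.log(resolved.trimLines); // true
console.log(resolved.recogniseHTML); // true
```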

§ Algorithm

Traverse the string once, gathering a list of ranges that mark the whitespace to delete. Then delete them all in one go and return the new string.

This way the string is traversed only once and the deletion is performed only once. Both Windows (CRLF) and Unix/Linux (LF) line endings are recognised.
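The gather-then-delete idea can be sketched as follows. This is a simplified illustration, not the package's source: `gatherRanges` and `applyRanges` are hypothetical names, and only plain spaces are considered. Each range is a `[from, to]` pair of string indexes, with `to` exclusive.

```javascript
// Hypothetical helpers, not part of string-collapse-white-space.
// Single pass: record which index ranges should be deleted.
function gatherRanges(str) {
  const ranges = [];
  let runStart = null; // index where the current run of spaces began
  for (let i = 0; i <= str.length; i++) {
    if (i < str.length && str[i] === " ") {
      if (runStart === null) runStart = i;
      continue;
    }
    if (runStart !== null) {
      // At the string's edges delete the whole run; elsewhere keep one space.
      const atEdge = runStart === 0 || i === str.length;
      const from = atEdge ? runStart : runStart + 1;
      if (from < i) ranges.push([from, i]);
      runStart = null;
    }
  }
  return ranges;
}

// Single deletion step: stitch together everything outside the ranges.
function applyRanges(str, ranges) {
  let out = "";
  let last = 0;
  for (const [from, to] of ranges) {
    out += str.slice(last, from);
    last = to;
  }
  return out + str.slice(last);
}

const input = "  aaa   bbb  ";
const ranges = gatherRanges(input);
console.log(JSON.stringify(ranges)); // [[0,2],[6,8],[11,13]]
console.log(JSON.stringify(applyRanges(input, ranges))); // "aaa bbb"
```

Keeping the ranges separate from the deletion is also what makes a ranges-only return value possible: the first function's output is useful on its own.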

Optionally (on by default), it can recognise (X)HTML tags (any of the 118) and, for example, collapse < div into <div.

This algorithm does not use regexes.

§ Smart bits

There are some sneaky false-positive cases, for example:

Equations: a < b and c > d, for example.

Notice that the part < b and c > almost matches the description of an HTML tag: it's wrapped in brackets, it starts with a legitimate HTML tag name (one of the 118, in this case b) and a space even follows it. The current version of the algorithm detects false positives by counting the space, equals sign, double quote and line break characters within the suspected tag (the part of the string between the brackets).

The reasoning is: if there are spaces, the suspected tag must have attributes. In that case, there has to be at least one equals sign or a matching (even) number of unescaped double quotes. Otherwise, nothing is collapsed or deleted within that particular tag.
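The check described above might look roughly like this. It is a sketch built on several assumptions: `looksLikeRealTag` is a hypothetical name, the input is the trimmed text between the brackets, the quote rule is read as "a non-zero, even number of unescaped double quotes", and the line break counting mentioned above is left out.

```javascript
// Hypothetical helper, not part of string-collapse-white-space.
// `inner` is the trimmed text between "<" and ">" whose first word
// has already been recognised as a tag name.
function looksLikeRealTag(inner) {
  let spaces = 0;
  let equals = 0;
  let quotes = 0;
  for (let i = 0; i < inner.length; i++) {
    const ch = inner[i];
    if (ch === " ") spaces++;
    else if (ch === "=") equals++;
    else if (ch === '"' && inner[i - 1] !== "\\") quotes++;
  }
  // No spaces means no attributes, so there is nothing more to verify.
  if (spaces === 0) return true;
  // Spaces imply attributes: expect "=" or paired double quotes.
  return equals > 0 || (quotes > 0 && quotes % 2 === 0);
}

console.log(looksLikeRealTag("b and c")); // false
console.log(looksLikeRealTag('div class="x"')); // true
console.log(looksLikeRealTag("span")); // true
```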

§ Licence

MIT

Copyright © 2010–2020 Roy Revelt and other contributors

Related packages:

📦 detergent 5.11.7
Extracts, cleans and encodes text
📦 string-trim-spaces-only 2.8.23
Like String.trim() but you can choose granularly what to trim
📦 string-range-expander 1.11.11
Expands string index ranges within whitespace boundaries until letters are met
📦 string-remove-thousand-separators 3.0.72
Detects and removes thousand separators (dot/comma/quote/space) from string-type digits
📦 string-find-malformed 1.1.16
Search for a malformed string. Think of Levenshtein distance but in search.
📦 string-remove-duplicate-heads-tails 3.0.73
Detect and (recursively) remove head and tail wrappings around the input string
📦 string-find-heads-tails 3.16.16
Finds where are arbitrary templating marker heads and tails located