§ Quick Take

import { strict as assert } from "assert";
import splitByW from "string-split-by-whitespace";

// Split by whitespace is easy - use native String.prototype.split()
assert.deepEqual("abc  def ghi".split(/\s+/), ["abc", "def", "ghi"]);

const source = `\n     \n    a\t \nb    \n      \t`;

// this program is nearly equivalent to a regex-based split:
assert.deepEqual(source.split(/\s+/), ["", "a", "b", ""]);
assert.deepEqual(splitByW(source), ["a", "b"]);
// the regex-based split needs extra filtering of empty strings, but it is a native solution


// this program lets you exclude certain index ranges:
assert.deepEqual(
  splitByW("a b c d e", {
    ignoreRanges: [[0, 2]], // that's "a" and the space after it
  }),
  ["b", "c", "d", "e"]
);

§ Purpose

When String.prototype.split(/\s+/) is not enough, for example, when you need to exclude certain substrings from the split, this program will help.

It splits the string by whitespace, where "whitespace" means "anything that trims to zero-length": tabs, line breaks (CR and LF), the space character and the raw non-breaking space. There are quite a few such characters across the whole Unicode range.
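The "trims to zero-length" definition can be checked with the native String.prototype.trim(). The sketch below only illustrates that definition; it is not taken from the program's source, and the helper name isWhitespace is made up for this example.

```javascript
// Hypothetical helper, named for this example only:
// a character counts as whitespace if trimming it leaves an empty string.
const isWhitespace = (char) => char.trim().length === 0;

const samples = {
  tab: "\t",
  lineFeed: "\n",
  carriageReturn: "\r",
  space: " ",
  nonBreakingSpace: "\u00A0",
  letter: "a",
};

// Map every sample name to the helper's verdict.
const verdicts = Object.fromEntries(
  Object.entries(samples).map(([name, char]) => [name, isWhitespace(char)])
);

console.log(verdicts);
// every entry is true except "letter", which is false
```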


splitByW(str, [opts])

In other words, it's a function which takes two input arguments, the second one being optional (marked by square brackets).

§ API - Input

| Input argument | Type | Obligatory? | Description |
| --- | --- | --- | --- |
| str | String | yes | Source string upon which to perform the operation |
| opts | Plain object | no | Optional Options Object, see below for its API |

§ An Optional Options Object

| Optional Options Object's key | Type of its value | Default | Description |
| --- | --- | --- | --- |
| ignoreRanges | Array of zero or more range arrays | [] | Feed zero or more string slice ranges, each an array of two natural number indexes, like [[1, 5], [6, 10]]. The algorithm will not include these string index ranges in the results. |

The opts.ignoreRanges can be an empty array, but if it contains anything other than arrays, an error will be thrown.
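As a rough illustration of that rule, here is a minimal sketch of such a check. The function name checkIgnoreRanges and the error message are hypothetical; the program's own validation and wording may differ.

```javascript
// Hypothetical validator, written for illustration only.
function checkIgnoreRanges(ignoreRanges = []) {
  const allArrays = ignoreRanges.every((range) => Array.isArray(range));
  if (!allArrays) {
    throw new TypeError(
      "ignoreRanges must contain only arrays, like [[1, 5], [6, 10]]"
    );
  }
  return ignoreRanges;
}

checkIgnoreRanges([]); // passes: an empty array is fine
checkIgnoreRanges([[1, 5], [6, 10]]); // passes: only range arrays inside

let message = null;
try {
  checkIgnoreRanges([[1, 5], "6-10"]); // a string among the ranges
} catch (error) {
  message = error.message;
}
console.log(message);
```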

§ API - Output

The program returns an array of zero or more strings. An empty string yields an empty array.

§ opts.ignoreRanges

Some basics first. When we say "heads" or "tails", we mean the templating literals that wrap a value. "Heads" is the opening part, for example {{ below, and "tails" is the closing part, for example }} below:

Hi {{ firstName }}!

Now imagine that we extracted heads and tails and we know their ranges: [[3, 5], [16, 18]]. (If you count characters from the start of the string, beginning at zero, {{ spans indexes 3 to 5 and }} spans indexes 16 to 18, the ending index being exclusive.)
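Those numbers can be double-checked with native string methods alone. This is just a sanity check for the example string, using String.prototype.indexOf() and String.prototype.slice():

```javascript
const str = "Hi {{ firstName }}!";
const head = "{{";
const tail = "}}";

// Each range is [start index, end index], the end being exclusive,
// just like the arguments of String.prototype.slice().
const headStart = str.indexOf(head);
const tailStart = str.indexOf(tail);
const ranges = [
  [headStart, headStart + head.length],
  [tailStart, tailStart + tail.length],
];

console.log(JSON.stringify(ranges));
// => [[3,5],[16,18]]
console.log(str.slice(...ranges[0]), str.slice(...ranges[1]));
// => {{ }}
```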

Now, imagine, we want to split Hi {{ firstName }}! into array ["Hi", "firstName", "!"].

For that we need to skip two ranges, those of the head and the tail.

That's where opts.ignoreRanges comes in handy.
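To make the idea concrete, here is a simplified stand-in written with native methods only. It treats every ignored range as if it were whitespace and then splits. The function name splitSkippingRanges is hypothetical, and this is not the package's implementation; it may behave differently in edge cases.

```javascript
// Hypothetical stand-in, not the package's implementation.
function splitSkippingRanges(str, ignoreRanges = []) {
  const chars = str.split(""); // UTF-16 code units, so indexes match str
  for (const [from, to] of ignoreRanges) {
    for (let i = from; i < to && i < chars.length; i++) {
      chars[i] = " "; // blank out the ignored slice
    }
  }
  // split on whitespace runs and drop the empty leftovers
  return chars.join("").split(/\s+/).filter(Boolean);
}

console.log(splitSkippingRanges("Hi {{ firstName }}!", [[3, 5], [16, 18]]));
// => [ 'Hi', 'firstName', '!' ]
console.log(splitSkippingRanges("a b c d e", [[0, 2]]));
// => [ 'b', 'c', 'd', 'e' ]
```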

In the example below, we use the library string-find-heads-tails to extract the ranges of the variables' heads and tails in a string, then split by whitespace:

import strFindHeadsTails from "string-find-heads-tails"; // assuming a default export
import split from "string-split-by-whitespace";

const input = "some interesting {{text}} {% and %} {{ some more }} text.";
// find every head/tail pair, then flatten them into [start, end] ranges
const headsAndTails = strFindHeadsTails(
  input,
  ["{{", "{%"],
  ["}}", "%}"]
).reduce((acc, curr) => {
  acc.push([curr.headsStartAt, curr.headsEndAt]);
  acc.push([curr.tailsStartAt, curr.tailsEndAt]);
  return acc;
}, []);
const res1 = split(input, {
  ignoreRanges: headsAndTails,
});
console.log(`res1 = ${JSON.stringify(res1, null, 4)}`);
// => ['some', 'interesting', 'text', 'and', 'some', 'more', 'text.']

You can also ignore whole variables, from heads to tails, including the variables' names:

const input = "some interesting {{text}} {% and %} {{ some more }} text.";
// one range per variable: from the start of its head to the end of its tail
const wholeVariables = strFindHeadsTails(
  input,
  ["{{", "{%"],
  ["}}", "%}"]
).reduce((acc, curr) => {
  acc.push([curr.headsStartAt, curr.tailsEndAt]);
  return acc;
}, []);
const res2 = split(input, {
  ignoreRanges: wholeVariables,
});
console.log(`res2 = ${JSON.stringify(res2, null, 4)}`);
// => ['some', 'interesting', 'text.']

We need Array.prototype.reduce to adapt the string-find-heads-tails output, which is an array of objects in the following format (index values elided here):

[
  {
    headsStartAt: ...,
    headsEndAt: ...,
    tailsStartAt: ...,
    tailsEndAt: ...,
  },
]

and with the help of array.reduce we turn it into our format:

(first example with res1)

[
  [headsStartAt, headsEndAt],
  [tailsStartAt, tailsEndAt],
]

(second example with res2)

[
  [headsStartAt, tailsEndAt],
]
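Both conversions can be tried without installing anything, on a hand-written sample. The found array below is an assumption: it is typed by hand to mimic the object format described above, with index values for the string "Hi {{ firstName }}!", and is not real output captured from string-find-heads-tails.

```javascript
// Hand-written sample shaped like the format described above;
// the index values are for the string "Hi {{ firstName }}!".
const found = [
  { headsStartAt: 3, headsEndAt: 5, tailsStartAt: 16, tailsEndAt: 18 },
];

// First shape: heads and tails as separate ranges.
const headsAndTails = found.reduce((acc, curr) => {
  acc.push([curr.headsStartAt, curr.headsEndAt]);
  acc.push([curr.tailsStartAt, curr.tailsEndAt]);
  return acc;
}, []);

// Second shape: one range per whole variable.
const wholeVariables = found.reduce((acc, curr) => {
  acc.push([curr.headsStartAt, curr.tailsEndAt]);
  return acc;
}, []);

console.log(JSON.stringify(headsAndTails)); // => [[3,5],[16,18]]
console.log(JSON.stringify(wholeVariables)); // => [[3,18]]
```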

§ Changelog

See it in the monorepo, on Sourcehut.

§ Licence

MIT

Copyright © 2010–2020 Roy Revelt and other contributors

Related packages:

📦 detergent 6.1.1
Extracts, cleans and encodes text
📦 string-uglify 1.3.4
Shorten sets of strings deterministically, to be git-friendly
📦 string-fix-broken-named-entities 4.0.1
Finds and fixes common and not so common broken named HTML entities, returns ranges array of fixes
📦 string-convert-indexes 3.0.1
Convert between native JS string character indexes and grapheme-count-based indexes
📦 string-left-right 3.0.1
Looks up the first non-whitespace character to the left/right of a given index
📦 string-overlap-one-on-another 1.6.0
Lay one string on top of another, with an optional offset
📦 string-match-left-right 5.0.0
Match substrings on the left or right of a given index, ignoring whitespace