string-strip-html open source npm package

Installation

Install the package from npm with `npm i string-strip-html`.

Features

Non-parsing, so works on HTML mixed with other languages.
Non-parsing, so works on broken, partial, incomplete or non-valid HTML.
Attempts to format the output nicely.
Full control using a callback if needed.
Can remove or ignore certain tag pairs along with their children tags.
Can be used to generate Email Text versions — href URLs can be retained.
Enabled-by-default but optional Recursive HTML Decoding — nothing will escape!
It won’t strip templating tags (like JSP).

API — `stripHtml()`

The main function `stripHtml()` is a named export, imported with `import { stripHtml } from "string-strip-html";`.

It’s a function which takes two input arguments:

| Input argument | Type | Obligatory | Description |
| --- | --- | --- | --- |
| `input` | String | yes | Strip tags from this string. |
| `opts` | Plain object | no | Optional Options Object. |

The Optional Options Object has the following shape:

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `ignoreTags` | Array of zero or more strings | `[]` | These tags will not be removed. |
| `onlyStripTags` | Array of zero or more strings | `[]` | If one or more tag names are given here, only these tags will be stripped, nothing else. |
| `ignoreTagsWithTheirContents` | Array of zero or more strings | `[]` | Opposite of `stripTogetherWithTheirContents`. |
| `stripTogetherWithTheirContents` | Array of zero or more strings, or something falsy | `['script', 'style', 'xml']` | These tags will be removed along with their children tags. Set it to something falsy to turn it off. You can set it to `["*"]` to strip all tags this way. |
| `skipHtmlDecoding` | Boolean | `false` | By default, all HTML entities, for example `&lt;`, are recursively decoded before HTML-stripping. You can turn that off here if you don’t need it, and gain performance. |
| `trimOnlySpaces` | Boolean | `false` | Ensures non-spaces are not trimmed from the outer edges of a string. It’s used when multiple strings are stripped of tags and then concatenated. |
| `stripRecognisedHTMLOnly` | Boolean | `false` | If the input has templating language tags and you wish to retain them, enable this setting. |
| `dumpLinkHrefsNearby` | Plain object or something falsy | `false` | Customises how link URLs are output: enables the feature and sets the URL location and wrapping. |
| `cb` | Something falsy or a function | `null` | Gives you full control of the output and lets you tweak it. See the dedicated chapter below. |

Here are all defaults in one place for copying:
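Collected from the table above, the defaults would look roughly like the object below. This is a transcription of the Default column, not a copy of the package source, and the constant name `assumedDefaults` is made up for illustration.

```javascript
// Transcribed from the options table; treat as an approximation
const assumedDefaults = {
  ignoreTags: [],
  onlyStripTags: [],
  ignoreTagsWithTheirContents: [],
  stripTogetherWithTheirContents: ["script", "style", "xml"],
  skipHtmlDecoding: false,
  trimOnlySpaces: false,
  stripRecognisedHTMLOnly: false,
  // The table lists this default as `false`; the sub-keys and their defaults
  // are taken from the `opts.dumpLinkHrefsNearby` section of this document.
  dumpLinkHrefsNearby: {
    enabled: false,
    putOnNewLine: false,
    wrapHeads: "",
    wrapTails: "",
  },
  cb: null,
};

console.log(Object.keys(assumedDefaults).length); // 9
```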

The function will return a plain object:

| Key | Type | Description |
| --- | --- | --- |
| `log` | Plain object | For example, `{ timeTakenInMilliseconds: 6 }`. |
| `result` | String | The output string, with all ranges applied to it. |
| `ranges` | Ranges or `null` | For example, if characters from index `0` to `5` and from `30` to `35` were deleted, that would be `[[0, 5], [30, 35]]`. If nothing was found, the value is `null`. |
| `allTagLocations` | Array of zero or more arrays | For example, `[[0, 5], [30, 35]]`. If you `String.slice()` each pair, you’ll get HTML tag values. |
| `filteredTagLocations` | Array of zero or more arrays | Only the tags that ended up stripped are reported here. Takes into account `opts.ignoreTags` and `opts.onlyStripTags`, unlike `allTagLocations` above. For example, `[[0, 5], [30, 35]]`. |
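As a small illustration of the slicing mentioned for `allTagLocations`, the sketch below maps each index pair onto the input string. The `allTagLocations` value is written by hand for this input, under the assumption that each pair runs from the opening bracket to just past the closing bracket; it is not captured program output.

```javascript
const input = "abc<hr>def<br/>";

// Hand-written locations for the two tags in `input` (an assumption)
const allTagLocations = [
  [3, 7],
  [10, 15],
];

// Slice each [from, to] pair out of the original input
const tags = allTagLocations.map(([from, to]) => input.slice(from, to));

console.log(tags); // ["<hr>", "<br/>"]
```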

Using Ranges from the output

The ranges in the output are compatible with the range-ecosystem libraries. Behind the scenes, this program itself operates on Ranges.
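To make the Ranges format concrete, here is a minimal sketch of how such ranges could be applied to a string. `applyRanges` is a hypothetical helper written for this illustration, not a function exported by this package, and the sample ranges are written by hand. It assumes each range is `[from, to]` or `[from, to, whatToInsert]` and that ranges do not overlap.

```javascript
// Hypothetical helper: deletes each [from, to) span and optionally inserts a string
function applyRanges(str, ranges) {
  if (!ranges) return str; // `null` means there is nothing to delete
  // Work from the end of the string so earlier indexes stay valid
  return [...ranges]
    .sort((a, b) => b[0] - a[0])
    .reduce(
      (acc, [from, to, insert]) =>
        acc.slice(0, from) + (insert || "") + acc.slice(to),
      str
    );
}

console.log(applyRanges("abc<hr>def", [[3, 7, " "]])); // "abc def"
console.log(applyRanges("<b>hi</b>", [[0, 3], [5, 9]])); // "hi"
console.log(applyRanges("abc<hr>def", null)); // "abc<hr>def"
```

Processing the ranges back to front is a design choice that avoids recalculating indexes after each deletion.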

`opts.trimOnlySpaces`

With this option on, `Hi&nbsp;` stays `Hi&nbsp;` instead of being trimmed to `Hi`.

Trailing whitespace can be rogue, but it can also be intentional. It’s like rips in jeans. To mark the intention, people put non-breaking spaces around the string. In this context, line breaks, tabs and other whitespace characters count too.

When this setting is turned on, only spaces are trimmed from the outside; the algorithm stops at the first non-space character, in this case a non-breaking space:

"      &nbsp;     Hi! Please <div>shop now</div>!      &nbsp;      "

is turned into:

"&nbsp;     Hi! Please shop now!      &nbsp;"

This setting is disabled by default.
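The difference from ordinary trimming can be sketched in plain JavaScript. The helper `trimOnlySpaces` below is hypothetical and only illustrates the idea of stopping at the first non-space character; it is not the package’s implementation.

```javascript
// Hypothetical helper: removes plain spaces (U+0020) from both ends
// and stops at anything else, including non-breaking spaces and tabs
function trimOnlySpaces(str) {
  return str.replace(/^ +/, "").replace(/ +$/, "");
}

const nbsp = "\u00A0";
const input = `   ${nbsp}  Hi!  ${nbsp}   `;

const spacesOnly = trimOnlySpaces(input);
const everything = input.trim(); // String.prototype.trim also removes U+00A0

console.log(spacesOnly.startsWith(nbsp) && spacesOnly.endsWith(nbsp)); // true
console.log(everything); // "Hi!"
```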

`opts.dumpLinkHrefsNearby`

The purpose of this option is to retain link URLs:

Watch both <a href="https://www.cnn.com" target="_blank">CNN</a> and
<a href="https://www.bbc.co.uk" target="_blank">BBC</a>.

could be turned into:

Watch both CNN https://www.cnn.com and BBC https://www.bbc.co.uk.

The `opts.dumpLinkHrefsNearby` value is a plain object with the following keys:

| Key | Default | Description |
| --- | --- | --- |
| `enabled` | `false` | By default, this feature is disabled and URLs are not inserted nearby. Set it to Boolean `true` to enable it. |
| `putOnNewLine` | `false` | By default, the URL is inserted right after whatever was left after stripping the linked piece of code. If you want, you can force all inserted URLs onto a new line, separated by a blank line. |
| `wrapHeads` | `""` | This string (an empty string by default) will be inserted in front of every URL. Set it to any string you want, for example `[`. |
| `wrapTails` | `""` | This string (an empty string by default) will be inserted straight after every URL. Set it to any string you want, for example `]`. |
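Put together, the keys from the table above could be set as in the fragment below. The values are sample choices, not recommendations. Going by the table, each retained URL would then be wrapped in square brackets.

```javascript
// Options object intended to be passed as the second argument of stripHtml()
const opts = {
  dumpLinkHrefsNearby: {
    enabled: true, // switch the feature on
    putOnNewLine: false, // keep URLs inline, next to the link text
    wrapHeads: "[", // inserted in front of every URL
    wrapTails: "]", // inserted straight after every URL
  },
};

console.log(Object.keys(opts.dumpLinkHrefsNearby));
```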

This feature is aimed at producing Text versions for promotional or transactional email campaigns.

But equally, any link on any tag, even one without text, will be retained:

Codsen
<div>
  <a href="https://codsen.com" target="_blank"
    ><img
      src="logo.png"
      width="100"
      height="100"
      border="0"
      style="display:block;"
      alt="Codsen logo"
  /></a>
</div>

it’s turned into:

Codsen https://codsen.com

This setting is disabled by default.

`opts.stripTogetherWithTheirContents`

This setting is enabled by default and set to strip the tags `['script', 'style', 'xml']` along with all their contents.

TIP: You can use asterisks, for example, `["custom-tag-name-*"]` or even `["*"]` (which would strip all paired tags along with their contents, everything from `div` to `table`).
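For a feel of what asterisk patterns mean, the sketch below matches tag names against a pattern in which `*` stands for any run of characters. `tagMatches` is a hypothetical helper; how the package itself performs the matching is not shown in this document, so treat this as an illustration of the pattern syntax only.

```javascript
// Hypothetical helper: "*" in the pattern matches any run of characters
function tagMatches(tagName, pattern) {
  const source = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters except "*"
    .replace(/\*/g, ".*"); // turn each asterisk into a wildcard
  return new RegExp(`^${source}$`).test(tagName);
}

console.log(tagMatches("custom-tag-name-foo", "custom-tag-name-*")); // true
console.log(tagMatches("div", "*")); // true
console.log(tagMatches("style", "script")); // false
```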

`opts.cb`

Sometimes you want more control over the program: maybe you want to strip only certain tags and write your custom conditions, maybe you want to do something extra on tags which are being ignored, for example, fix whitespace within them?

You can do it by passing a callback function as `opts.cb`. Once the program detects a truthy callback, it stops performing the actions automatically. Instead, it gives you all the data (the tag object with tag details, the proposed deletion range, the proposed string to insert and so on), and you must push the range into `rangesArr` yourself. If you don’t push anything, that tag won’t be deleted.

import { stripHtml } from "string-strip-html";

const cb = ({
  tag,
  deleteFrom,
  deleteTo,
  insert,
  rangesArr,
  proposedReturn,
}) => {
  if (tag) {
    // do something depending on what's in the current tag;
    // nothing is pushed in this branch, so the tag is not deleted
    console.log(JSON.stringify(tag, null, 4));
  } else {
    // default action which does nothing different from normal, non-callback operation
    rangesArr.push(deleteFrom, deleteTo, insert);
    // you might want to do something different, depending on "tag" contents.
  }
};
const { result } = stripHtml("abc<hr>def", { cb });
console.log(result);

The `tag` key contains all the internal data for the particular tag which is being removed. Feel free to `console.log(JSON.stringify(tag, null, 4))` it and tap its contents.

`cb()` example

The point of this callback interface is to hand the final decision-making over to the user (you). The program suggests what it would push to the final ranges array, `rangesArr`, but it’s up to you to do the pushing.

A callback that “does nothing” simply pushes what is proposed by default, `proposedReturn`.

From here, you can add more logic, conditionally push only certain ranges, tweak the ranges that get pushed and so on.
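As a sketch of such conditional logic, the callback below keeps `hr` tags and pushes the proposed range for everything else. To keep the snippet runnable on its own, the program is replaced by a hypothetical stand-in: the `rangesArr` object here only records what gets pushed, and the two calls at the bottom use made-up values.

```javascript
// Hypothetical stand-in for the `rangesArr` object handed to the callback
const rangesArr = {
  pushed: [],
  push(from, to, insert) {
    this.pushed.push([from, to, insert]);
  },
};

// Keep <hr> tags, strip every other tag as proposed
const cb = ({ tag, deleteFrom, deleteTo, insert, rangesArr }) => {
  if (tag.name === "hr") {
    return; // nothing pushed, so this tag is not deleted
  }
  rangesArr.push(deleteFrom, deleteTo, insert);
};

// Simulated calls with made-up values, in place of the real program
cb({ tag: { name: "hr" }, deleteFrom: 3, deleteTo: 7, insert: " ", rangesArr });
cb({ tag: { name: "br" }, deleteFrom: 10, deleteTo: 15, insert: " ", rangesArr });

console.log(rangesArr.pushed.length); // 1
```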

The `tag` key contains everything the program has gathered about the tag currently being stripped. It looks like this:

{
  "attributes": [],
  "slashPresent": false,
  "leftOuterWhitespace": 3,
  "onlyPlausible": false,
  "nameStarts": 4,
  "nameContainsLetters": true,
  "nameEnds": 6,
  "name": "hr",
  "lastOpeningBracketAt": 3,
  "lastClosingBracketAt": 6
}

For example, a strict bracket-to-bracket range would be `[tag.lastOpeningBracketAt, tag.lastClosingBracketAt + 1]`.
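A quick check of that formula, assuming the tag object above was gathered from the input `"abc<hr>def"` used earlier (the document does not state the input explicitly):

```javascript
const input = "abc<hr>def"; // assumed source string for the tag object above

// Subset of the tag object shown above
const tag = { name: "hr", lastOpeningBracketAt: 3, lastClosingBracketAt: 6 };

// The end index is exclusive, hence the + 1
const strictRange = [tag.lastOpeningBracketAt, tag.lastClosingBracketAt + 1];

console.log(strictRange); // [3, 7]
console.log(input.slice(...strictRange)); // "<hr>"
```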

API — `defaults`

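You can import `defaults` with `import { defaults } from "string-strip-html";`.

It’s a plain object that holds the default value of every option.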

The main function calculates the options to be used by merging the options you passed with these defaults.
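The merging step can be pictured as a shallow object merge in which the caller’s keys win. This is a sketch under that assumption; the package may merge differently. `sampleDefaults` is a trimmed-down, hypothetical stand-in for the real defaults object.

```javascript
// Hypothetical, trimmed-down defaults (three keys only, for illustration)
const sampleDefaults = {
  ignoreTags: [],
  skipHtmlDecoding: false,
  trimOnlySpaces: false,
};

// Shallow merge: keys passed by the caller override the defaults
function resolveOpts(userOpts) {
  return { ...sampleDefaults, ...userOpts };
}

const resolved = resolveOpts({ trimOnlySpaces: true });
console.log(resolved.trimOnlySpaces); // true
console.log(resolved.skipHtmlDecoding); // false
```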

API — `version`

You can import `version` with `import { version } from "string-strip-html";`. It holds the package version as a string.

Algorithm

Speaking scientifically, it works at the lexer level; it’s a scannerless parser.

In simple language, this program does not build an AST. It processes the input string as text. Whatever the algorithm doesn’t understand (errors, broken code, non-HTML and so on), it skips.
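The flavour of text-level processing can be shown with a toy scanner. It is far simpler than the real program and is not derived from its source: it only looks for an opening bracket followed by a letter or a slash, records the range up to the next closing bracket, and leaves anything it cannot make sense of untouched. The function name `findTagRanges` is made up.

```javascript
// Toy scanner: collects [from, to] ranges of things that look like tags
function findTagRanges(str) {
  const ranges = [];
  let i = 0;
  while (i < str.length) {
    const looksLikeTag = str[i] === "<" && /[a-zA-Z\/]/.test(str[i + 1] || "");
    if (looksLikeTag) {
      const end = str.indexOf(">", i + 1);
      if (end === -1) break; // unclosed bracket: leave the rest as it is
      ranges.push([i, end + 1]);
      i = end + 1;
    } else {
      i += 1; // plain text or a lone "<": move on
    }
  }
  return ranges;
}

console.log(findTagRanges("abc<hr>def")); // [[3, 7]]
console.log(findTagRanges("a < b")); // []
console.log(findTagRanges("x <b>bold</b")); // [[2, 5]]
```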

string-strip-html 13.0.3
