Installation
Quick Take
Examples
- Ignores code tags and their contents
- A bypass callback and a do-nothing callback
- Extract HTML <head> contents
- Retain href and link label
- Leave only HTML
- Leave only opening td tags
- Leave only td tags
- Minimal example using Ranges
- Remove all HTML from a string
- Strip HTML from a raw JSON string
- Set the title case using title package
- Orphan removal from text within HTML
Features
- Non-parsing, so works on HTML mixed with other languages.
- Non-parsing, so works on broken, partial, incomplete or non-valid HTML.
- Attempts to format the output nicely.
- Full control using a callback if needed.
- Can remove or ignore certain tag pairs along with their children tags.
- Can be used to generate Email Text versions: href URLs can be retained.
- Enabled-by-default but optional Recursive HTML Decoding: nothing will escape!
- It won’t strip templating tags (like JSP).
API — stripHtml()
The main function, stripHtml(), is imported like this:
It’s a function which takes two input arguments:
Input argument | Type | Obligatory | Description |
---|---|---|---|
input | String | yes | Strip tags from this string. |
opts | Plain object | no | Optional Options Object. |
The Optional Options Object has the following shape:
Key | Type | Default | Description |
---|---|---|---|
ignoreTags | Array of zero or more strings | [] | These tags will not be removed. |
onlyStripTags | Array of zero or more strings | [] | If one or more tag names are given here, only these tags will be stripped, nothing else. |
ignoreTagsWithTheirContents | Array of zero or more strings | [] | The opposite of stripTogetherWithTheirContents: these tags are left in place along with everything between them. |
stripTogetherWithTheirContents | Array of zero or more strings, or something falsy | ['script', 'style', 'xml'] | These tags will be removed along with their children tags. Set it to something falsy to turn it off. You can set it to ["*"] to strip all tags this way. |
skipHtmlDecoding | Boolean | false | By default, all HTML entities (for example, &lt;) are recursively decoded before HTML-stripping. Turn decoding off here if you don’t need it, to gain performance. |
trimOnlySpaces | Boolean | false | Ensures non-space characters are not trimmed from the outer edges of a string. Useful when multiple strings are stripped of tags and then concatenated. |
stripRecognisedHTMLOnly | Boolean | false | If the input has templating language tags and you wish to retain them, enable this setting. |
dumpLinkHrefsNearby | Plain object or something falsy | false | Customises how link URLs are output: enables the feature and sets the URL location and wrapping. |
cb | Something falsy or a function | null | Gives you full control of the output and lets you tweak it. See the dedicated chapter below. |
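To illustrate what "recursively decoded" means for skipHtmlDecoding, here is a minimal sketch. It is not the package's decoder: the entity table is a deliberately tiny, hypothetical one, and the helper names decodeOnce and decodeRecursively are made up for this example. The idea is only that decoding repeats until the string stops changing, so double-encoded tags cannot slip through.

```javascript
// Hypothetical, deliberately tiny entity table; a real decoder covers far more.
const ENTITIES = { "&lt;": "<", "&gt;": ">", "&quot;": '"', "&amp;": "&" };

// One decoding pass: replace every known entity once.
function decodeOnce(str) {
  return str.replace(/&(?:lt|gt|quot|amp);/g, (entity) => ENTITIES[entity]);
}

// Keep decoding until nothing changes, so double-encoded input
// such as "&amp;lt;" ends up as "<".
function decodeRecursively(str) {
  let previous;
  let current = str;
  do {
    previous = current;
    current = decodeOnce(current);
  } while (current !== previous);
  return current;
}

console.log(decodeRecursively("&amp;lt;div&amp;gt;")); // prints "<div>"
```

A single pass over the same input would only produce "&lt;div&gt;", which a tag stripper would not recognise as a tag.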
Here are all defaults in one place for copying:
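The object below is reconstructed from the options table above, so treat it as a sketch rather than the package's exported defaults. The variable name assumedDefaults is made up. One assumption to note: the table lists the dumpLinkHrefsNearby default as false, while the opts.dumpLinkHrefsNearby section describes per-key defaults; the nested form is used here.

```javascript
// Reconstructed from the options table; not copied from the package source.
const assumedDefaults = {
  ignoreTags: [],
  onlyStripTags: [],
  ignoreTagsWithTheirContents: [],
  stripTogetherWithTheirContents: ["script", "style", "xml"],
  skipHtmlDecoding: false,
  trimOnlySpaces: false,
  stripRecognisedHTMLOnly: false,
  // Assumption: nested per-key defaults, as described in the
  // opts.dumpLinkHrefsNearby section (the table itself says "false").
  dumpLinkHrefsNearby: {
    enabled: false,
    putOnNewLine: false,
    wrapHeads: "",
    wrapTails: "",
  },
  cb: null,
};

console.log(Object.keys(assumedDefaults).length); // prints 9
```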
The function will return a plain object:
Key | Type | Description |
---|---|---|
log | Plain object | For example, { timeTakenInMilliseconds: 6 } |
result | String | The output string, with all ranges applied to it. |
ranges | Ranges or null | For example, if characters from index 0 to 5 and 30 to 35 were deleted, that would be [[0, 5], [30, 35]]. If nothing was found, this is null. |
allTagLocations | Array of zero or more arrays | For example, [[0, 5], [30, 35]]. If you String.slice() each pair, you’ll get HTML tag values. |
filteredTagLocations | Array of zero or more arrays | Only the tags that ended up stripped are reported here. Takes into account opts.ignoreTags and opts.onlyStripTags, unlike allTagLocations above. For example, [[0, 5], [30, 35]]. |
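As a small illustration of the String.slice() remark for allTagLocations, the sketch below slices each pair out of an input string. The locations are written by hand for this input (an assumption about what would be reported), not produced by running the package.

```javascript
const input = "<b>Hi</b>";

// Assumed locations for the input above: one [from, to] pair per tag.
const allTagLocations = [
  [0, 3],
  [5, 9],
];

// Slicing each pair yields the raw tag text.
const tagTexts = allTagLocations.map(([from, to]) => input.slice(from, to));

console.log(tagTexts); // prints [ '<b>', '</b>' ]
```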
Using Ranges from the output
The ranges from the output are compatible with range-ecosystem libraries, see example. Behind the scenes, this program actually operates on Ranges.
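To show what "applying" Ranges to a string involves, here is a hand-rolled stand-in. It assumes each range is [from, to] or [from, to, insert] and that ranges do not overlap; the function name applyRanges is hypothetical and this is not the range-ecosystem implementation.

```javascript
// Apply [from, to] or [from, to, insert] ranges to a string.
// Assumes the ranges do not overlap.
function applyRanges(str, ranges) {
  if (!ranges) return str; // null means there is nothing to change

  // Sort by start index, then work backwards so that earlier
  // indexes stay valid while later parts of the string change.
  const sorted = [...ranges].sort((a, b) => a[0] - b[0]);
  let out = str;
  for (let i = sorted.length - 1; i >= 0; i--) {
    const [from, to, insert] = sorted[i];
    out = out.slice(0, from) + (insert ?? "") + out.slice(to);
  }
  return out;
}

console.log(applyRanges("abc<hr>def", [[3, 7, " "]])); // prints "abc def"
console.log(applyRanges("<b>Hi</b>", [[0, 3], [5, 9]])); // prints "Hi"
```

Working from the last range to the first avoids having to recalculate indexes after each replacement.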
opts.trimOnlySpaces
"Hi&nbsp;" → "Hi&nbsp;" instead of "Hi&nbsp;" → "Hi"
Trailing whitespace can be accidental, but it can also be intentional, like rips in jeans. To mark the intention, people put non-breaking spaces around the string. In this context, line breaks, tabs and other non-space whitespace characters are treated as intentional too.
When this setting is turned on, only spaces are trimmed from the outer edges; trimming stops at the first non-space character, in this case a non-breaking space:
" Hi! Please <div>shop now</div>! "
is turned into:
" Hi! Please shop now! "
This setting is disabled by default.
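The trimming rule described above can be sketched as follows. This is an illustration only, not the package's code: the helper name trimOnlySpaces is borrowed from the option for readability, and "space" is taken to mean the plain U+0020 character.

```javascript
// Trim only plain spaces (U+0020) from both ends; stop at anything else,
// including non-breaking spaces, tabs and line breaks.
function trimOnlySpaces(str) {
  let start = 0;
  let end = str.length;
  while (start < end && str[start] === " ") start++;
  while (end > start && str[end - 1] === " ") end--;
  return str.slice(start, end);
}

const NBSP = "\u00A0";
const padded = `  ${NBSP} Hi! ${NBSP}  `;

// Only the outer plain spaces are removed; the non-breaking spaces survive.
console.log(trimOnlySpaces(padded) === `${NBSP} Hi! ${NBSP}`); // prints true

// For comparison, the built-in trim() treats NBSP as whitespace too.
console.log(padded.trim() === "Hi!"); // prints true
```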
opts.dumpLinkHrefsNearby
The purpose of this option is to retain link URLs:
Watch both <a href="https://www.cnn.com" target="_blank">CNN</a> and
<a href="https://www.bbc.co.uk" target="_blank">BBC</a>.
could be turned into:
Watch both CNN https://www.cnn.com and BBC https://www.bbc.co.uk.
The opts.dumpLinkHrefsNearby value is a plain object with these keys:
Key | Default | Description |
---|---|---|
enabled | false | By default, this feature is disabled and URLs are not inserted nearby. Set it to Boolean true to enable it. |
putOnNewLine | false | By default, the URL is inserted after whatever was left after stripping the particular linked piece of code. If you want, you can force all inserted URLs onto a new line, separated by a blank line. |
wrapHeads | "" | This string (default is an empty string) will be inserted in front of every URL. Set it to any string you want, for example [ . |
wrapTails | "" | This string (default is an empty string) will be inserted straight after every URL. Set it to any string you want, for example ] . |
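A possible options object built from the keys above, together with a sketch of how the four keys could interact. The renderLink helper is hypothetical and exists only to make the keys concrete; the package composes its output internally and may differ in whitespace details.

```javascript
// Options object using the keys from the table above.
const linkOpts = {
  enabled: true,
  putOnNewLine: false,
  wrapHeads: "[",
  wrapTails: "]",
};

// Hypothetical helper, not part of the package: combines the text left
// after stripping a link with its URL, following the table's descriptions.
function renderLink(label, url, opts) {
  if (!opts || !opts.enabled) return label;
  const wrapped = `${opts.wrapHeads}${url}${opts.wrapTails}`;
  // "On a new line, separated by a blank line" is modelled as two line breaks.
  const separator = opts.putOnNewLine ? "\n\n" : " ";
  return label ? `${label}${separator}${wrapped}` : wrapped;
}

console.log(renderLink("CNN", "https://www.cnn.com", linkOpts));
// prints "CNN [https://www.cnn.com]"
```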
This feature is aimed at producing Text versions for promotional or transactional email campaigns.
But equally, any link on any tag, even one without text, will be retained:
Codsen
<div>
<a href="https://codsen.com" target="_blank"
><img
src="logo.png"
width="100"
height="100"
border="0"
style="display:block;"
alt="Codsen logo"
/></a>
</div>
is turned into:
Codsen https://codsen.com
This setting is disabled by default.
opts.stripTogetherWithTheirContents
This setting is enabled by default and strips these tags along with all their contents: ['script', 'style', 'xml'].
TIP: You can use asterisks, for example, ["custom-tag-name-*"] or even ["*"] (which would strip all paired tags along with their contents, everything from div to table).
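The asterisk patterns can be pictured as simple globs. The sketch below shows one way such matching could work; it is an assumption for illustration, not the matcher the package uses, and the helper names globToRegExp and matchesAny are made up. Case-insensitive matching is also an assumption.

```javascript
// Turn a pattern such as "custom-tag-name-*" into an anchored RegExp:
// escape regex metacharacters, then let "*" match any run of characters.
function globToRegExp(pattern) {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
    .replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`, "i"); // assumption: case-insensitive
}

// True if the tag name matches at least one pattern in the list.
function matchesAny(tagName, patterns) {
  return patterns.some((pattern) => globToRegExp(pattern).test(tagName));
}

console.log(matchesAny("custom-tag-name-one", ["custom-tag-name-*"])); // prints true
console.log(matchesAny("div", ["custom-tag-name-*"])); // prints false
console.log(matchesAny("table", ["*"])); // prints true
```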
opts.cb
Sometimes you want more control over the program: maybe you want to strip only certain tags and write your own conditions, or do something extra on tags which are being ignored, for example, fix whitespace within them.
You can do it by passing a callback function in opts.cb. Once the program detects a truthy callback, it stops performing the actions automatically. Instead, it gives you all the data: the tag object with tag details, the proposed deletion range, the proposed string to insert and so on. You must then push the range into rangesArr yourself. If you don’t push anything, that tag won’t be deleted.
import { stripHtml } from "string-strip-html";

const cb = ({
  tag,
  deleteFrom,
  deleteTo,
  insert,
  rangesArr,
  proposedReturn,
}) => {
  if (tag) {
    // do something depending on what's in the current tag;
    // nothing is pushed in this branch, so this tag is left in place
    console.log(JSON.stringify(tag, null, 4));
  } else {
    // default action which does nothing different from normal, non-callback operation
    rangesArr.push(deleteFrom, deleteTo, insert);
    // you might want to do something different, depending on "tag" contents.
  }
};
const { result } = stripHtml("abc<hr>def", { cb });
console.log(result);
The tag key contains all the internal data for the particular tag which is being removed. Feel free to console.log(JSON.stringify(tag, null, 4)) it and inspect its contents.
cb() example
The point of this callback interface is to hand the final decision over to the user (you). The program suggests what it would push to the final ranges array, rangesArr, but it’s up to you to do the pushing.
Here’s an example where the callback “does nothing”: it pushes what is proposed by default, proposedReturn.
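A sketch of such a do-nothing callback follows. Two assumptions are made and should be checked against the package: proposedReturn is taken to be a [from, to, insert] array (or something falsy), and rangesArr.push is taken to accept those three values as arguments, as in the earlier snippet. To keep the sketch runnable without the package, the callback is invoked by hand with a minimal stand-in for rangesArr and illustrative values.

```javascript
// The "do-nothing" callback: push exactly what the program proposes.
// Assumption: proposedReturn is a [from, to, insert] array or falsy.
const doNothingCb = ({ rangesArr, proposedReturn }) => {
  if (proposedReturn) {
    rangesArr.push(...proposedReturn);
  }
};

// Minimal stand-in for the ranges collector, only for this sketch.
const collected = [];
const fakeRangesArr = {
  push: (from, to, insert) => collected.push([from, to, insert]),
};

// Simulate a single invocation for the <hr> in "abc<hr>def".
// The numbers are illustrative, not captured from a real run.
doNothingCb({
  tag: { name: "hr" },
  deleteFrom: 3,
  deleteTo: 7,
  insert: " ",
  rangesArr: fakeRangesArr,
  proposedReturn: [3, 7, " "],
});

console.log(JSON.stringify(collected)); // prints [[3,7," "]]
```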
From here, you can add more logic, conditionally push only certain ranges, tweak the ranges that get pushed and so on.
The tag key contains all the information the program has gathered about the tag currently being stripped. It looks like this:
{
  "attributes": [],
  "slashPresent": false,
  "leftOuterWhitespace": 3,
  "onlyPlausible": false,
  "nameStarts": 4,
  "nameContainsLetters": true,
  "nameEnds": 6,
  "name": "hr",
  "lastOpeningBracketAt": 3,
  "lastClosingBracketAt": 6
}
For example, the strict bracket-to-bracket range would be [tag.lastOpeningBracketAt, tag.lastClosingBracketAt + 1].
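Using the tag object shown above with the input "abc<hr>def", the bracket-to-bracket range can be checked with a plain slice. Only the keys needed for the check are copied into the sketch.

```javascript
const source = "abc<hr>def";

// Subset of the tag object shown above.
const tag = {
  name: "hr",
  nameStarts: 4,
  nameEnds: 6,
  lastOpeningBracketAt: 3,
  lastClosingBracketAt: 6,
};

// The +1 is needed because slice() excludes the end index.
const strictRange = [tag.lastOpeningBracketAt, tag.lastClosingBracketAt + 1];

console.log(source.slice(...strictRange)); // prints "<hr>"
console.log(source.slice(tag.nameStarts, tag.nameEnds)); // prints "hr"
```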
API — defaults
You can import defaults:
It's a plain object:
The main function calculates the options to be used by merging the options you passed with these defaults.
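The merge can be pictured with object spread. This is a sketch under assumptions: it uses a hand-written subset of the defaults, a hypothetical resolveOpts helper, and a shallow merge. How the package merges nested values such as dumpLinkHrefsNearby is not stated here.

```javascript
// Hand-written subset of the defaults, for illustration only.
const defaultsSubset = {
  ignoreTags: [],
  skipHtmlDecoding: false,
  trimOnlySpaces: false,
};

// Hypothetical helper: user-supplied keys win, missing keys fall back
// to the defaults. Shallow merge is an assumption.
function resolveOpts(userOpts) {
  return { ...defaultsSubset, ...userOpts };
}

const resolved = resolveOpts({ trimOnlySpaces: true });
console.log(resolved.trimOnlySpaces); // prints true
console.log(resolved.skipHtmlDecoding); // prints false
```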
API — version
You can import version:
Algorithm
Speaking scientifically, it works at the lexer level; it’s a scannerless parser.
In simple language, this program does not parse the input or build an AST. It processes the input string as text. Whatever the algorithm doesn’t understand (errors, broken code, non-HTML and so on), it skips.