import { strict as assert } from "assert";
import { stripHtml } from "string-strip-html";

assert.equal(
  stripHtml(`Some text <b>and</b> text.`).result,
  `Some text and text.`
);

// prevents accidental string concatenation
assert.equal(stripHtml(`aaa<div>bbb</div>ccc`).result, `aaa bbb ccc`);

// tag pairs with content, upon request
assert.equal(
  stripHtml(`a <pre><code>void a;</code></pre> b`, {
    stripTogetherWithTheirContents: [
      "script", // default
      "style", // default
      "xml", // default
      "pre", // <-- custom-added
    ],
  }).result,
  `a b`
);

// detects raw, legit brackets:
assert.equal(stripHtml(`a < b and c > d`).result, `a < b and c > d`);
Adds or removes the whitespace to make the output presentable.
Removes tag pairs along with the content inside (handy for script).
Works on broken, partial, incomplete, non-valid HTML.
Works on HTML mixed with other languages (because it does not parse).
Can be used to generate plain-text versions of emails, optionally placing each link's URL next to the linked text.
It can detect and skip false positives, for example, a < b and c > d.
Enabled-by-default but optional Recursive HTML Decoding — nothing will escape!
It won't strip JSP tags.
PS. We have stristri which also strips HTML. It can strip not only HTML but also CSS, text and templating tags. But it has less granular control over whitespace.
The output string, with all ranges applied to it.
ranges
ranges: an array of one or more arrays containing from-to string index ranges OR null
For example, if characters from index 0 to 5 and 30 to 35 were deleted, that would be [[0, 5], [30, 35]]. If nothing was found, the value is null.
allTagLocations
Array of zero or more arrays
For example, [[0, 5], [30, 35]]. If you String.slice() each pair, you'll get HTML tag values.
filteredTagLocations
Array of zero or more arrays
Only the tags that ended up stripped will be reported here. Takes into account opts.ignoreTags and opts.onlyStripTags, unlike allTagLocations above. For example, [[0, 5], [30, 35]].
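To make the shape of these values concrete, here is a minimal sketch of how ranges and tag locations relate to the input string. The `applyRanges` helper is hypothetical and written only for illustration; it is not the library's implementation. It assumes ranges are sorted, non-overlapping, and may carry an optional third element holding a replacement string.

```javascript
// Hypothetical helper: applies [from, to, insert?] ranges to a string.
// Assumes the ranges are sorted and do not overlap.
function applyRanges(str, ranges) {
  if (!ranges) return str; // null means nothing was found
  let out = "";
  let cursor = 0;
  for (const [from, to, insert] of ranges) {
    // keep the text before the range, then add the replacement (if any)
    out += str.slice(cursor, from) + (insert || "");
    cursor = to;
  }
  return out + str.slice(cursor);
}

const source = "abc<hr>def";

// Slicing a tag location pair returns the tag itself.
const tagLocations = [[3, 7]];
const tags = tagLocations.map(([from, to]) => source.slice(from, to));
console.log(tags); // [ '<hr>' ]

// Applying a range that replaces the tag with a single space.
console.log(applyRanges(source, [[3, 7, " "]])); // abc def
```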
If one or more tag names are given here, only these tags will be stripped and nothing else.
stripTogetherWithTheirContents
Array of zero or more strings, or something falsy
['script', 'style', 'xml']
These tags will be removed from the opening tag up to closing tag, including content in-between opening and closing tags. Set it to something falsy to turn it off. You can set it to ["*"] to include all tags.
skipHtmlDecoding
Boolean
false
By default, all escaped HTML entities in the input, for example &pound;, will be recursively decoded before HTML-stripping. You can turn it off here if you don't need it.
trimOnlySpaces
Boolean
false
Used mainly in automated setups. It ensures non-spaces are not trimmed from the outer edges of a string.
dumpLinkHrefsNearby
Plain object or something falsy
false
Customises how link URLs are output: it enables the feature and controls where each URL is placed and how it is wrapped.
cb
Something falsy or a function
null
Gives you full control of the output and lets you tweak it. See the dedicated chapter below.
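The recursive decoding that opts.skipHtmlDecoding turns off can be pictured with the sketch below. It is an illustration only, not the library's decoder: the `ENTITIES` map is a tiny hypothetical subset, and `decodeOnce` and `decodeRecursively` are made-up names. The point is that decoding repeats until the string stops changing, so multiply-escaped input is fully unwrapped before any stripping happens.

```javascript
// Hypothetical, deliberately tiny entity table (real decoders know far more).
const ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', pound: "£", nbsp: "\u00A0" };

// Decodes each known named entity exactly once.
function decodeOnce(str) {
  return str.replace(/&([a-z]+);/gi, (match, name) =>
    Object.prototype.hasOwnProperty.call(ENTITIES, name) ? ENTITIES[name] : match
  );
}

// Keeps decoding until a pass produces no further change.
function decodeRecursively(str) {
  let prev;
  let curr = str;
  do {
    prev = curr;
    curr = decodeOnce(prev);
  } while (curr !== prev);
  return curr;
}

console.log(decodeRecursively("&amp;amp;pound;")); // £
console.log(decodeRecursively("&amp;lt;b&amp;gt;")); // <b>
```

Once decoded, a tag such as the second example's is visible as a tag and can be stripped like any other.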
The Optional Options Object is not validated; please take care of what values and of what type you pass.
Here is the Optional Options Object in one place (in case you ever want to copy it whole):
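A partial sketch of that object, limited to the options and defaults described in this section. Other options exist and are left out rather than guessed. The `enabled` key name under `dumpLinkHrefsNearby` is an assumption, since only its description appears in this section, and the `partialDefaults` variable name is purely illustrative.

```javascript
// Only the defaults stated in this section; not the complete options object.
const partialDefaults = {
  stripTogetherWithTheirContents: ["script", "style", "xml"],
  skipHtmlDecoding: false,
  trimOnlySpaces: false,
  dumpLinkHrefsNearby: {
    enabled: false, // assumed key name; the feature is off by default
    putOnNewLine: false,
    wrapHeads: "",
    wrapTails: "",
  },
  cb: null,
};

console.log(JSON.stringify(partialDefaults, null, 2));
```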
In automated setups, a single string value can be split over multiple JSON paths. In those cases, joining spaces or non-breaking spaces are intended and often placed around the values. Normally, we would treat surrounding whitespace as a rogue, but not in these cases.
This setting allows us to distinguish between the two cases.
For example, imagine we "stitch" the sentence: Hi John! Welcome to our club. out of these pieces: Hi + John + ! + Welcome to our club.. In this case, spaces between the chunks would be added by your templating engine. Now imagine the text is set in a fairly large font size, and there's a risk of words wrapping in the wrong places. A client asks you to ensure that Hi and John are never split between the lines.
What do you do?
You remove the space between Hi and John from the template and move it to data-level. You hard-code the non-breaking space after Hi: Hi&nbsp;.
As you know, this library trims the string before returning it, and recursive HTML decoding is on by default. On default settings, this library would remove your non-breaking space from Hi&nbsp;. That's where you need to set opts.trimOnlySpaces to true.
In this particular case, you can either turn off HTML decoding OR, even better, use this opts.trimOnlySpaces setting.
In either case, whitespace between the detected tags will still be aggressively trimmed - text <div>\n \t \r\n <br>\t \t \t</div> here → text here.
When this setting is on, only spaces are trimmed from the outer edges; the algorithm stops at the first non-space character, in this case a non-breaking space:
"   &nbsp;   Hi! Please <div>shop now</div>!   &nbsp;   "
is turned into:
"&nbsp;   Hi! Please shop now!   &nbsp;"
Notice how the space chunks between the nbsp's and the text are retained when opts.trimOnlySpaces is set to true. The default is false, so this feature is off by default.
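The difference between the two trimming modes can be sketched in plain JavaScript. This is an illustration under stated assumptions, not the library's code: `trimSpacesOnly` is a hypothetical helper that removes only U+0020 spaces from both ends, while the built-in `String.prototype.trim()` also treats the non-breaking space (U+00A0) as whitespace.

```javascript
// Hypothetical helper: trims only ordinary spaces (U+0020) from both ends,
// stopping at the first character that is not a space.
function trimSpacesOnly(str) {
  return str.replace(/^ +| +$/g, "");
}

// \u00A0 is the non-breaking space that &nbsp; decodes to.
const stitched = "   \u00A0   Hi!   \u00A0   ";

// Built-in trim removes the non-breaking spaces as well.
console.log(JSON.stringify(stitched.trim()));

// Trimming only spaces keeps the nbsp's and the spaces inside them.
console.log(JSON.stringify(trimSpacesOnly(stitched)));
```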
By default, this feature is disabled and URLs are not inserted nearby. Set it to Boolean true to enable it.
putOnNewLine
false
By default, the URL is inserted right after whatever remains of the linked piece of code once it has been stripped. If you want, you can force all inserted URLs onto a new line, separated by a blank line.
wrapHeads
""
This string (default is an empty string) will be inserted in front of every URL. Set it to any string you want, for example [.
wrapTails
""
This string (default is an empty string) will be inserted straight after every URL. Set it to any string you want, for example ].
This feature is aimed at producing Text versions for promotional or transactional email campaigns.
If the input string has linked text, the URL will be put after it:
We watch both <a href="https://www.rt.com" target="_blank">RT</a> and <a href="https://www.bbc.co.uk" target="_blank">BBC</a>.
it's turned into:
We watch both RT https://www.rt.com and BBC https://www.bbc.co.uk.
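How the three sub-options shape the inserted URL can be sketched as follows. The `formatLink` helper is hypothetical and exists only to illustrate the described behaviour; the library works on the whole string and handles the surrounding whitespace itself.

```javascript
// Hypothetical helper: combines the text left after stripping a link
// with its URL, following the option descriptions above.
function formatLink(
  text,
  href,
  { putOnNewLine = false, wrapHeads = "", wrapTails = "" } = {}
) {
  const url = `${wrapHeads}${href}${wrapTails}`;
  if (!text) return url; // a link without text still keeps its URL
  // a new line plus a blank line, or a single space
  return putOnNewLine ? `${text}\n\n${url}` : `${text} ${url}`;
}

console.log(formatLink("RT", "https://www.rt.com"));
// RT https://www.rt.com

console.log(
  formatLink("RT", "https://www.rt.com", { wrapHeads: "[", wrapTails: "]" })
);
// RT [https://www.rt.com]
```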
But equally, any link on any tag, even one without text, will be retained:
Sometimes you want to strip only certain HTML tag or tags. It would be impractical to ignore all other known HTML tags and leave those you want. Option opts.onlyStripTags allows inverting the setting: whatever tags you list will be the only tags removed.
opts.onlyStripTags is an array. When the program starts, it filters out any empty strings and strings that can be String.trim()'ed to a zero-length string. This is necessary because the presence of just one string in opts.onlyStripTags switches the program to delete-only-these mode, and it would be bad if an empty, falsy or whitespace-only value caused that accidentally.
This option can work in combination with opts.ignoreTags. Any tags listed in opts.ignoreTags will be removed from the tags listed in opts.onlyStripTags. If one or more tags were listed in opts.onlyStripTags, the delete-only-these mode will be on and will be respected, even if there are no tags left to remove because all of them were excluded via opts.ignoreTags.
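The interplay of the two options can be sketched like this. `resolveStripList` and the field names it returns are hypothetical, chosen for illustration; the sketch only mirrors the rules described above and makes no claim about the library's internals.

```javascript
// Hypothetical helper mirroring the rules above.
function resolveStripList(onlyStripTags = [], ignoreTags = []) {
  // drop empty and whitespace-only entries before deciding on the mode
  const cleaned = onlyStripTags.filter(
    (tag) => typeof tag === "string" && tag.trim().length > 0
  );
  const ignored = new Set(ignoreTags);
  return {
    // the mode depends on the cleaned list, not on what survives ignoreTags
    deleteOnlyTheseMode: cleaned.length > 0,
    tagsToStrip: cleaned.filter((tag) => !ignored.has(tag)),
  };
}

console.log(resolveStripList(["", "  "]));
// { deleteOnlyTheseMode: false, tagsToStrip: [] }

console.log(resolveStripList(["b", "i"], ["b", "i"]));
// { deleteOnlyTheseMode: true, tagsToStrip: [] }
```

In the second call the mode stays on although nothing is left to strip, which matches the last sentence above.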
This program not only strips HTML and returns a string. It also returns string index locations of removals. This way, you can use this program to extract ranges of indexes which would later be used to skip operations on a string.
For example, npm package title capitalises the titles as per The Chicago Manual of Style. But if input source can contain HTML code, we need to skip processing the HTML tags.
The idea is, it sets opts.stripTogetherWithTheirContents to ["*"] — asterisk or wildcard meaning to "strip" all paired tags (including <code>/</code> in titles, for example). Then we take the locations of all tags and supplement it with locations of what's been whitelisted (using ranges-regex). Finally, we invert the ranges and supplement them with replacement value, third array element, coming from title. Here's the source code.
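The inversion step can be sketched as below: given the ranges occupied by tags, it returns the ranges of everything else, which is the text that is safe to process. `invertRanges` is a hypothetical helper written for this illustration; it assumes the input ranges are sorted by their start index.

```javascript
// Hypothetical helper: returns the gaps between the given ranges,
// within a string of length strLength. Assumes sorted input.
function invertRanges(ranges, strLength) {
  const inverted = [];
  let cursor = 0;
  for (const [from, to] of ranges || []) {
    if (from > cursor) inverted.push([cursor, from]);
    cursor = Math.max(cursor, to);
  }
  if (cursor < strLength) inverted.push([cursor, strLength]);
  return inverted;
}

const titleSource = "abc<hr>def";
const tagRanges = [[3, 7]]; // where the <hr> tag sits

const textRanges = invertRanges(tagRanges, titleSource.length);
console.log(textRanges); // [ [ 0, 3 ], [ 7, 10 ] ]
console.log(textRanges.map(([from, to]) => titleSource.slice(from, to)));
// [ 'abc', 'def' ]
```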
Sometimes you want more control over the program: maybe you want to strip only certain tags and write your custom conditions, maybe you want to do something extra on tags which are being ignored, for example, fix whitespace within them?
You can get this level of control using opts.cb. In the options object, set the cb key to a function. Whenever this program wants to do something, it will call your function, Array.forEach(key => {})-style. Instead of key, you get a plain object with the following keys:
import { stripHtml } from "string-strip-html";

const cb = ({
  tag,
  deleteFrom,
  deleteTo,
  insert,
  rangesArr,
  proposedReturn,
}) => {
  if (tag) {
    // do something depending on what's in the current tag
    console.log(JSON.stringify(tag, null, 4));
  }
  // default action which does nothing different from normal, non-callback operation
  rangesArr.push(deleteFrom, deleteTo, insert);
  // you might want to do something different, depending on "tag" contents.
};

const { result } = stripHtml("abc<hr>def", { cb });
console.log(result);
The tag key contains all the internal data for the particular tag which is being removed. Feel free to console.log(JSON.stringify(tag, null, 4)) it and tap its contents.
The point of this callback interface is to hand the pushing of ranges over to the user instead of the program. The program will suggest what it would push to the final ranges array, but it's up to you to perform the pushing.
Below, the program "does nothing", that is, you push what it proposes, "proposedReturn" array:
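A minimal sketch of such a pass-through callback, exercised without the library so it can run on its own. It assumes proposedReturn is either null or an array in the same from, to, insert order that rangesArr.push accepts. The `fakeRangesArr` collector and the sample argument values are hypothetical stand-ins for what the program would pass in.

```javascript
// Pass-through callback: push exactly what the program proposes.
const passThroughCb = ({ rangesArr, proposedReturn }) => {
  if (proposedReturn) {
    rangesArr.push(...proposedReturn);
  }
};

// Hypothetical stand-in for the ranges collector passed as rangesArr.
const collected = [];
const fakeRangesArr = {
  push: (from, to, insert) => collected.push([from, to, insert]),
};

// Simulates one call for the <hr> tag in "abc<hr>def".
passThroughCb({
  tag: { name: "hr" },
  deleteFrom: 3,
  deleteTo: 7,
  insert: " ",
  rangesArr: fakeRangesArr,
  proposedReturn: [3, 7, " "],
});

console.log(JSON.stringify(collected)); // [[3,7," "]]
```

With the real library, the same callback would be passed as the cb option in place of the simulated call.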
Speaking scientifically, it works at the lexer level; it's a scannerless parser.
In simple language, this program does not use parsing and AST trees. It processes the input string as text. Whatever the algorithm doesn't understand — errors, broken code, non-HTML, etc. — it skips.
For an exported function, string-in, string-out API is awesome because it's simple. The problem happens later when you want to add more to the output, for example, a log with time spent. Or an alternative output, like locations of string indexes. Or the version from package.json.