string-convert-indexes 4.1.0

Convert between native JS string character indexes and grapheme-count-based indexes

Quick Take

import { strict as assert } from "assert";
import {
  nativeToUnicode,
  unicodeToNative,
} from "string-convert-indexes";

// CONVERTING NATIVE JS INDEXES TO UNICODE-CHAR-COUNT-BASED
// 𝌆 - \uD834\uDF06

// at index 1, we have low surrogate, that's still grapheme index zero
assert.equal(
  nativeToUnicode("\uD834\uDF06aa", "1"),
  "0"
);
// notice it's retained as string. The same type as input is retained!

// at index 3, we have the second letter a - that's index 2, counting graphemes
assert.equal(nativeToUnicode("\uD834\uDF06aa", 3), 2);

// convert many indexes at once - any nested data structure is fine:
assert.deepEqual(
  nativeToUnicode("\uD834\uDF06aa", [1, 0, 2, 3]),
  [0, 0, 1, 2]
);

// numbers from an AST-like complex structure are still picked out and converted:
assert.deepEqual(
  nativeToUnicode("\uD834\uDF06aa", [
    1,
    "0",
    [[[2]]],
    3,
  ]),
  [
    0, // notice matching type is retained
    "0", // notice matching type is retained
    [[[1]]],
    2,
  ]
);

// CONVERTING UNICODE-CHAR-COUNT-BASED TO NATIVE JS INDEXES
// 𝌆 - \uD834\uDF06

assert.deepEqual(
  unicodeToNative("\uD834\uDF06aa", [0, 1, 2]),
  [0, 2, 3]
);

assert.deepEqual(
  unicodeToNative("\uD834\uDF06aa", [1, 0, 2]),
  [2, 0, 3]
);

assert.throws(() =>
  unicodeToNative("\uD834\uDF06aa", [1, 0, 2, 3])
);
// throws an error!
// that's because there's no character (counting Unicode characters) with index 3
// we have only three Unicode characters, so indexes go only up until 2

Idea

The native JS string index system is not based on grapheme count: while the length of "a" is one, the emoji "🧢" is two characters long, because it actually consists of two UTF-16 code units, \uD83E and \uDDE2.

In an ideal world, the JS string index system would count an emoji as one character long. That's the so-called grapheme-based index system. The letter "a" and the cap emoji "🧢" are both graphemes.

This program converts between the two systems. It's based on grapheme-splitter.
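To make the difference concrete, here is a small sketch that uses only built-in JS. It compares the native length with a code point count. Note the assumption: spreading a string iterates code points, which matches the grapheme count for these two examples but not in general, as combined sequences would still count as several.

```javascript
// Native length counts UTF-16 code units:
const letter = "a";
const cap = "\uD83E\uDDE2"; // the cap emoji, U+1F9E2

console.log(letter.length); // 1
console.log(cap.length); // 2

// Spreading iterates by code point. This is only an approximation of
// graphemes: it is enough for a single emoji like this one.
console.log([...cap].length); // 1
```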

API

This program exports two functions:

nativeToUnicode(str, indexes)

It converts JS native indexes (the kind used in, let's say, String.slice()) to indexes based on grapheme count.

... and ...

unicodeToNative(str, indexes)

It converts grapheme count-based indexes to JS native indexes.
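The two directions can be illustrated with a minimal sketch for a single numeric index. This is not the package's implementation: the helper names sketchNativeToUnicode and sketchUnicodeToNative are hypothetical, and the sketch counts code points through string iteration rather than true graphemes through grapheme-splitter.

```javascript
// Hypothetical helpers, NOT the package's implementation.
// Assumption: code points stand in for graphemes.

function sketchNativeToUnicode(str, nativeIdx) {
  if (!Number.isInteger(nativeIdx) || nativeIdx < 0) {
    throw new RangeError(`Invalid native index: ${nativeIdx}`);
  }
  let native = 0; // UTF-16 offset where the current code point starts
  let unicode = 0; // position of the current code point
  for (const ch of str) {
    // ch.length is 2 for a surrogate pair, 1 otherwise
    if (nativeIdx < native + ch.length) {
      return unicode;
    }
    native += ch.length;
    unicode += 1;
  }
  throw new RangeError(`Native index ${nativeIdx} is out of bounds`);
}

function sketchUnicodeToNative(str, unicodeIdx) {
  let native = 0;
  let unicode = 0;
  for (const ch of str) {
    if (unicode === unicodeIdx) {
      return native; // offset of the first code unit of this code point
    }
    native += ch.length;
    unicode += 1;
  }
  throw new RangeError(`Unicode index ${unicodeIdx} is out of bounds`);
}

const sample = "\uD834\uDF06aa";
// native 0, 1, 2, 3 map to 0, 0, 1, 2
console.log([0, 1, 2, 3].map((i) => sketchNativeToUnicode(sample, i)));
// unicode 0, 1, 2 map to 0, 2, 3
console.log([0, 1, 2].map((i) => sketchUnicodeToNative(sample, i)));
```

Both helpers walk the string once per call, which keeps the sketch short. Converting many indexes would be cheaper with a lookup table built in a single pass.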

API - Input

The API for both functions, nativeToUnicode() and unicodeToNative(), is the same:

Input argument | Type     | Obligatory? | Description
---------------|----------|-------------|------------
str            | String   | yes         | The string in which you want to perform a search
indexes        | Whatever | yes         | Normally a natural number or zero, but it can also be a numeric string or a nested structure (AST) of these.
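How a nested indexes input could be walked while keeping each value's type can be sketched as follows. The helper mapIndexes is hypothetical and is not part of the package's API. The handling of plain objects is an assumption made for illustration, and the stand-in converter is just a lookup table for the string "\uD834\uDF06aa".

```javascript
// Hypothetical helper, NOT part of the package's API.
// It applies `convert` to every index found in a nested structure.
function mapIndexes(input, convert) {
  if (Array.isArray(input)) {
    return input.map((item) => mapIndexes(item, convert));
  }
  if (typeof input === "number") {
    return convert(input); // number in, number out
  }
  if (typeof input === "string" && /^\d+$/.test(input)) {
    return String(convert(Number(input))); // numeric string in, string out
  }
  if (input !== null && typeof input === "object") {
    // Assumption: plain objects are traversed value by value
    return Object.fromEntries(
      Object.entries(input).map(([key, val]) => [key, mapIndexes(val, convert)])
    );
  }
  return input; // anything else is left untouched
}

// Stand-in converter: native index -> grapheme index for "\uD834\uDF06aa"
const lookup = [0, 0, 1, 2];
const toGraphemeIndex = (n) => lookup[n];

const converted = mapIndexes([1, "0", [[[2]]], 3], toGraphemeIndex);
console.log(JSON.stringify(converted)); // [0,"0",[[[1]]],2]
```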

Changelog

See it in the monorepo, on GitHub.

Contributing

To report bugs or request features or assistance, raise an issue on GitHub.

Any code contributions are welcome! All Pull Requests will be dealt with promptly.

Licence

MIT

Copyright © 2010–2021 Roy Revelt and other contributors

Related packages:

📦 grapheme-splitter
A JavaScript library that breaks strings into their individual user-perceived characters. It supports emojis!
📦 ast-monkey-traverse 2.1.0
Utility library to traverse AST
📦 string-uglify 1.5.0
Shorten sets of strings deterministically, to be git-friendly
📦 string-split-by-whitespace 2.1.0
Split string into array by chunks of whitespace
📦 string-apostrophes 1.5.0
Comprehensive, HTML-entities-aware tool to typographically-correct the apostrophes and single/double quotes
📦 string-unfancy 4.1.0
Replace all n/m dashes, curly quotes with their simpler equivalents
📦 string-fix-broken-named-entities 5.4.0
Finds and fixes common and not so common broken named HTML entities, returns ranges array of fixes