API

charex is designed primarily for use as a command-line script. However, it also provides an API for use from Python code.

General Character Information

class charex.Character(value: bytes | int | str)

A Unicode character.

Parameters:

value – The address of the Unicode character, given as a str, bytes, or int. See the address formats below.

Returns:

The character as a charex.Character.

Return type:

charex.Character

Usage:

To create a charex.Character object for a single character string:

>>> value = 'a'
>>> char = Character(value)
>>> char.value
'a'

To create a charex.Character object for a Unicode code point:

>>> value = 'U+0061'
>>> char = Character(value)
>>> char.value
'a'

To create a charex.Character object for a binary string:

>>> value = '0b01100001'
>>> char = Character(value)
>>> char.value
'a'

To create a charex.Character object for an octal string:

>>> value = '0o141'
>>> char = Character(value)
>>> char.value
'a'

To create a charex.Character object for a decimal string:

>>> value = '0d97'
>>> char = Character(value)
>>> char.value
'a'

To create a charex.Character object for a hex string:

>>> value = '0x61'
>>> char = Character(value)
>>> char.value
'a'

Beyond the declared properties and methods described below, most Unicode properties for the character are available by accessing the property's alias as an attribute of charex.Character:

>>> value = 'a'
>>> char = Character(value)
>>> char.na
'LATIN SMALL LETTER A'
>>> char.blk
'Basic Latin'
>>> char.sc
'Latn'
>>> char.suc
'0041'

Address formats:

The understood str-based formats for manual input of addresses are:

Character: A string with length equal to one.

Code Point: The prefix “U+” followed by a hexadecimal number.

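Binary String: The prefix “0b” followed by a binary number.

Octal String: The prefix “0o” followed by an octal number.

Decimal String: The prefix “0d” followed by a decimal number.

Hex String: The prefix “0x” followed by a hexadecimal number.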

The following formats are available for use through the API:

Bytes: A bytes.

Integer: An int.
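
As a rough illustration of how prefixed address strings map onto characters, the following sketch converts an address into a one-character str using only built-in functions. The name parse_address is hypothetical and is not part of charex. The sketch covers only str and int input, and it accepts the “0o” and “0d” prefixes shown in the usage examples above. How charex itself parses addresses may differ.

```python
from typing import Union

# Hypothetical helper, not part of charex: a sketch of address parsing.
# Maps each two-character prefix to the numeric base of the digits after it.
PREFIX_BASES = {'u+': 16, '0b': 2, '0o': 8, '0d': 10, '0x': 16}


def parse_address(value: Union[int, str]) -> str:
    """Return the one-character string that an address refers to."""
    if isinstance(value, int):
        return chr(value)
    base = PREFIX_BASES.get(value[:2].lower())
    if base is not None and len(value) > 2:
        # Strip the prefix and read the remaining digits in that base.
        return chr(int(value[2:], base))
    if len(value) == 1:
        # A single character is already its own address.
        return value
    raise ValueError(f'Unrecognized address: {value!r}')


for address in ('a', 'U+0061', '0b01100001', '0o141', '0d97', '0x61', 97):
    print(repr(address), '->', parse_address(address))
```

Every address in the loop resolves to the same character, 'a'.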

property code_point: str

The address for the character in the Unicode database.

denormalize(form: str) → tuple[str, ...]

Return the characters that normalize to the character using the given form.

Parameters:

form – The normalization form to check against.

Returns:

The denormalization results in a tuple.

Return type:

tuple

Usage:

To denormalize the character for the given form:

>>> # Create the character object.
>>> value = '<'
>>> char = Character(value)
>>>
>>> # Get the denormalizations for the character.
>>> form = 'nfkc'
>>> char.denormalize(form)
('﹤', '＜')
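
To make the idea of denormalization concrete, here is a minimal brute-force sketch that uses only the standard library's unicodedata module. It is not how charex is implemented, and the function name denormalize_char is hypothetical. Note that unicodedata expects upper-case form names such as 'NFKC', and the results depend on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata


def denormalize_char(char: str, form: str = 'NFKC') -> tuple:
    """Find every other code point that normalizes to ``char``."""
    results = []
    for n in range(sys.maxunicode + 1):
        # Skip the surrogate range, which holds no real characters.
        if 0xD800 <= n <= 0xDFFF:
            continue
        candidate = chr(n)
        if candidate == char:
            continue
        if unicodedata.normalize(form, candidate) == char:
            results.append(candidate)
    return tuple(results)


# Includes U+FE64 SMALL LESS-THAN SIGN and U+FF1C FULLWIDTH LESS-THAN SIGN.
variants = denormalize_char('<')
```

Scanning every code point on each call is slow. A reverse lookup table built once would be the obvious refinement.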

encode(codec: str) → str

The hexadecimal value for the character in the given character set.

Parameters:

codec – The codec to use when encoding to a hexadecimal string.

Returns:

A str with the encoded character.

Return type:

str

Usage:

To encode the character with the given character set:

>>> value = 'å'
>>> char = Character(value)
>>>
>>> codec = 'utf8'
>>> char.encode(codec)
'C3 A5'
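
For comparison, the same space-separated hexadecimal form can be produced with built-in str and bytes methods. This is only a sketch of the idea, and encode_hex is a hypothetical name rather than part of charex.

```python
def encode_hex(char: str, codec: str = 'utf8') -> str:
    """Encode the character, then format each byte as an upper-case hex pair."""
    return char.encode(codec).hex(' ').upper()


utf8_hex = encode_hex('\u00e5')               # 'C3 A5'
utf16_hex = encode_hex('\u00e5', 'utf_16_be')  # '00 E5'
```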

escape(scheme: str, codec: str = 'utf8') → str

The escaped version of the character.

Parameters:

scheme – The escape scheme to use.
codec – The codec to use when escaping to a hexadecimal string.

Returns:

A str with the escaped character.

Return type:

str

Usage:

To escape the character with the given scheme:

>>> value = '<'
>>> char = Character(value)
>>>
>>> scheme = 'html'
>>> char.escape(scheme)
'&lt;'
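
A named HTML character reference can be looked up with the standard library's html.entities module. The sketch below shows the general technique only. The function name escape_html_sketch and the fallback to a numeric reference are assumptions, not charex behavior.

```python
from html.entities import codepoint2name


def escape_html_sketch(char: str) -> str:
    """Return a named character reference, or a numeric one as a fallback."""
    name = codepoint2name.get(ord(char))
    if name is None:
        # Assumption: fall back to a decimal numeric character reference.
        return f'&#{ord(char)};'
    return f'&{name};'


less_than = escape_html_sketch('<')       # '&lt;'
a_ring = escape_html_sketch('\u00e5')     # '&aring;'
```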

is_normal(form: str) → bool

Is the character normalized to the given form?

Parameters:

form – The normalization form to check against.

Returns:

A bool indicating whether the character is normalized.

Return type:

bool

Usage:

To determine whether the character is already normalized for the given form:

>>> value = 'å'
>>> char = Character(value)
>>>
>>> form = 'nfc'
>>> char.is_normal(form)
True
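
The same check can be made with unicodedata.is_normalized from the standard library (Python 3.8 or later). The sketch below contrasts a precomposed character with its decomposed sequence. The wrapper is_normal_sketch is a hypothetical name, and upper-casing the form key is an assumption made so that lower-case keys like those on this page can be passed in.

```python
import unicodedata

composed = '\u00e5'       # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = 'a\u030a'    # 'a' followed by COMBINING RING ABOVE


def is_normal_sketch(text: str, form: str) -> bool:
    """Report whether the text is already in the given normalization form."""
    return unicodedata.is_normalized(form.upper(), text)


checks = {
    ('composed', 'nfc'): is_normal_sketch(composed, 'nfc'),      # True
    ('composed', 'nfd'): is_normal_sketch(composed, 'nfd'),      # False
    ('decomposed', 'nfc'): is_normal_sketch(decomposed, 'nfc'),  # False
    ('decomposed', 'nfd'): is_normal_sketch(decomposed, 'nfd'),  # True
}
```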

normalize(form: str) → str

Normalize the character using the given form.

Parameters:

form – The normalization form to use.

Returns:

The normalization result as a str.

Return type:

str

Usage:

To normalize the character for the given form:

>>> value = '＜'
>>> char = Character(value)
>>>
>>> form = 'nfkc'
>>> char.normalize(form)
'<'

summarize() → str

Return a summary of the character’s information.

Returns:

The character information as a str.

Return type:

str

Usage:

To summarize the character:

>>> value = 'å'
>>> char = Character(value)
>>>
>>> char.summarize()
'å U+00E5 (LATIN SMALL LETTER A WITH RING ABOVE)'
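
A summary in the same layout can be assembled from the standard library, as in the sketch below. The function name summarize_char and the '<unnamed>' placeholder are assumptions. unicodedata.name has no names for control characters, so charex evidently uses a different source for labels such as <NULL>.

```python
import unicodedata


def summarize_char(char: str) -> str:
    """Format a character as 'char U+XXXX (NAME)'."""
    # Assumption: use a placeholder when unicodedata has no name.
    name = unicodedata.name(char, '<unnamed>')
    return f'{char} U+{ord(char):04X} ({name})'


summary = summarize_char('\u00e5')
```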

property value: str

The Unicode character as a string.

charex.filter_by_property(prop: str, value: str, chars: Sequence[Character] | None = None, insensitive: bool = False, regex: bool = False) → Generator[Character, None, None]

Return all the characters with the given property value.

Parameters:

prop – The property to filter on.
value – The pattern to filter on.
chars – (Optional.) The characters to filter. Defaults to filtering all Unicode characters.
insensitive – (Optional.) Whether the matching should be case insensitive. Defaults to false.
regex – (Optional.) Whether the value should be used as a regular expression for the matching. Defaults to false.

Returns:

The filtered characters as a collections.abc.Generator.

Return type:

collections.abc.Generator

Usage:

To get a generator that produces the Emoji modifiers:

>>> prop = 'emod'
>>> value = 'Y'
>>> gen = filter_by_property(prop, value)
>>> for char in gen:
...     print(char.summarize())
...
🏻 U+1F3FB (EMOJI MODIFIER FITZPATRICK TYPE-1-2)
🏼 U+1F3FC (EMOJI MODIFIER FITZPATRICK TYPE-3)
🏽 U+1F3FD (EMOJI MODIFIER FITZPATRICK TYPE-4)
🏾 U+1F3FE (EMOJI MODIFIER FITZPATRICK TYPE-5)
🏿 U+1F3FF (EMOJI MODIFIER FITZPATRICK TYPE-6)

You can limit the number of characters being searched with the chars parameter:

>>> prop = 'gc'
>>> value = 'Cc'
>>> chars = [Character(chr(n)) for n in range(128)]
>>> gen = filter_by_property(prop, value, chars)
>>> for char in gen:
...     print(char.summarize())
...
␀ U+0000 (<NULL>)
␁ U+0001 (<START OF HEADING>)
␂ U+0002 (<START OF TEXT>)
␃ U+0003 (<END OF TEXT>)
␄ U+0004 (<END OF TRANSMISSION>)
␅ U+0005 (<ENQUIRY>)
␆ U+0006 (<ACKNOWLEDGE>)
␇ U+0007 (<BELL>)
␈ U+0008 (<BACKSPACE>)
␉ U+0009 (<CHARACTER TABULATION>)
␊ U+000A (<LINE FEED (LF)>)
␋ U+000B (<LINE TABULATION>)
␌ U+000C (<FORM FEED (FF)>)
␍ U+000D (<CARRIAGE RETURN (CR)>)
␎ U+000E (<SHIFT OUT>)
␏ U+000F (<SHIFT IN>)
␐ U+0010 (<DATA LINK ESCAPE>)
␑ U+0011 (<DEVICE CONTROL ONE>)
␒ U+0012 (<DEVICE CONTROL TWO>)
␓ U+0013 (<DEVICE CONTROL THREE>)
␔ U+0014 (<DEVICE CONTROL FOUR>)
␕ U+0015 (<NEGATIVE ACKNOWLEDGE>)
␖ U+0016 (<SYNCHRONOUS IDLE>)
␗ U+0017 (<END OF TRANSMISSION BLOCK>)
␘ U+0018 (<CANCEL>)
␙ U+0019 (<END OF MEDIUM>)
␚ U+001A (<SUBSTITUTE>)
␛ U+001B (<ESCAPE>)
␜ U+001C (<INFORMATION SEPARATOR FOUR>)
␝ U+001D (<INFORMATION SEPARATOR THREE>)
␞ U+001E (<INFORMATION SEPARATOR TWO>)
␟ U+001F (<INFORMATION SEPARATOR ONE>)
⑿ U+007F (<DELETE>)

You can set the insensitive parameter to do case insensitive matching:

>>> prop = 'emod'
>>> value = 'y'
>>> insensitive = True
>>> gen = filter_by_property(prop, value, insensitive=insensitive)
>>> for char in gen:
...     print(char.summarize())
...
🏻 U+1F3FB (EMOJI MODIFIER FITZPATRICK TYPE-1-2)
🏼 U+1F3FC (EMOJI MODIFIER FITZPATRICK TYPE-3)
🏽 U+1F3FD (EMOJI MODIFIER FITZPATRICK TYPE-4)
🏾 U+1F3FE (EMOJI MODIFIER FITZPATRICK TYPE-5)
🏿 U+1F3FF (EMOJI MODIFIER FITZPATRICK TYPE-6)

If you set the regex parameter, you can search using regular expressions:

>>> prop = 'na'
>>> value = '.*EYE$'
>>> regex = True
>>> gen = filter_by_property(prop, value, regex=regex)
>>> for char in gen:
...     print(char.summarize())
...
◉ U+25C9 (FISHEYE)
◎ U+25CE (BULLSEYE)
⺫ U+2EAB (CJK RADICAL EYE)
⽬ U+2F6C (KANGXI RADICAL EYE)
👁 U+1F441 (EYE)
😜 U+1F61C (FACE WITH STUCK-OUT TONGUE AND WINKING EYE)
🤪 U+1F92A (GRINNING FACE WITH ONE LARGE AND ONE SMALL EYE)
🫣 U+1FAE3 (FACE WITH PEEKING EYE)
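
The filtering pattern itself, a generator that lazily tests one property of each candidate character, can be sketched with the standard library. The example below handles only the character name, because unicodedata exposes far fewer properties than charex. The name filter_by_name is hypothetical.

```python
import re
import sys
import unicodedata
from typing import Iterable, Iterator, Optional


def filter_by_name(
    pattern: str,
    chars: Optional[Iterable[str]] = None,
    insensitive: bool = False,
) -> Iterator[str]:
    """Yield each character whose Unicode name matches the pattern."""
    flags = re.IGNORECASE if insensitive else 0
    regex = re.compile(pattern, flags)
    if chars is None:
        # Default to every code point, mirroring the "all characters" default.
        chars = (chr(n) for n in range(sys.maxunicode + 1))
    for char in chars:
        name = unicodedata.name(char, '')
        if name and regex.fullmatch(name):
            yield char


# Restrict the search to three candidates and ignore case.
matches = list(filter_by_name('.*eye', chars='\u25c9\u25cea', insensitive=True))
```

Because the function is a generator, nothing is scanned until the caller starts iterating, which matters when the default of every code point is used.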

Character Set Information

charex.get_codecs() → tuple[str, ...]

Return the keys of the registered codecs.

Returns:

The keys of the codecs as a tuple.

Return type:

tuple

Usage:

To get a tuple containing the keys of the registered codecs:

>>> get_codecs()                        # doctest: +ELLIPSIS
('ascii', 'big5', 'big5hkscs', 'cp037'... 'utf_8', 'utf_8_sig')

charex.multidecode(value: int | str | bytes, codecs_: Iterator[str] | None = None) → dict[str, str]

Provide the character for the given address for each of the given character sets.

Parameters:

value – The address to decode.
codecs_ – (Optional.) The codecs to decode with. Defaults to all registered codecs.

Returns:

The decoded value for each character set as a dict.

Return type:

dict

Usage:

To get the character for the given address for each of the registered codecs:

>>> address = '0x61'
>>> multidecode(address)                # doctest: +ELLIPSIS
{'ascii': 'a', 'big5': 'a'... 'utf_8_sig': 'a'}

If you just want the UTF-8 character:

>>> value = 'a'
>>> codecs_ = ('utf_8',)
>>> multidecode(value, codecs_)
{'utf_8': 'a'}

Address formats:

The understood str formats for manual input are:

Character: A string with length equal to one.

Code Point: The prefix “U+” followed by a hexadecimal number.

Binary String: The prefix “0b” followed by a binary number.

Hex String: The prefix “0x” followed by a hexadecimal number.

The following formats are available for use through the API:

Bytes: A bytes.

Integer: An int.
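
The decoding step can be sketched with built-in methods: turn the address into bytes, then try each codec in turn. This is an illustration under stated assumptions, not the charex implementation. The name multidecode_sketch is hypothetical, only int addresses are handled, and mapping a failed decode to an empty string is an assumption.

```python
from typing import Dict, Iterable


def multidecode_sketch(address: int, codecs_: Iterable[str]) -> Dict[str, str]:
    """Decode one integer address with each of the given codecs."""
    # Use the fewest big-endian bytes that can hold the address.
    size = max(1, (address.bit_length() + 7) // 8)
    raw = address.to_bytes(size, 'big')
    results = {}
    for codec in codecs_:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            # Assumption: report an undecodable address as an empty string.
            results[codec] = ''
    return results


ascii_range = multidecode_sketch(0x61, ('ascii', 'utf_8'))
high_byte = multidecode_sketch(0xE5, ('utf_8', 'latin_1'))
```

The second call shows why the result is keyed by codec: the byte 0xE5 is a complete character in Latin-1 but an incomplete sequence in UTF-8.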

charex.multiencode(value: bytes | int | str, codecs_: Iterator[str] | None = None) → dict[str, bytes]

Provide the address for the given character for each of the given character sets.

Parameters:

value – The character to encode.
codecs_ – (Optional.) The codecs to encode to. Defaults to all registered codecs.

Returns:

The encoded value for each character set as a dict.

Return type:

dict

Usage:

To encode a one character str with all registered codecs:

>>> value = 'a'
>>> multiencode(value)                  # doctest: +ELLIPSIS
{'ascii': b'a', 'big5': b'a'... 'utf_8_sig': b'\xef\xbb\xbfa'}

If you just want the UTF-8 address:

>>> value = 'a'
>>> codecs_ = ('utf_8',)
>>> multiencode(value, codecs_)
{'utf_8': b'a'}

Character formats:

The understood str formats available for manual input are (all formats are big endian unless otherwise stated):

Character: A string with length equal to one.

Code Point: The prefix “U+” followed by a hexadecimal number.

Binary String: The prefix “0b” followed by a binary number.

Octal String: The prefix “0o” followed by an octal number.

Decimal String: The prefix “0d” followed by a decimal number.

Hex String: The prefix “0x” followed by a hexadecimal number.

The following formats are available for use through the API:

Bytes: A bytes that decodes to a valid UTF-8 character.

Integer: An int within the range 0x00 <= x <= 0x10FFFF.
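
The encoding direction can be sketched the same way with str.encode. The name multiencode_sketch is hypothetical, only one-character str input is handled, and mapping a failed encode to empty bytes is an assumption rather than documented charex behavior.

```python
from typing import Dict, Iterable


def multiencode_sketch(char: str, codecs_: Iterable[str]) -> Dict[str, bytes]:
    """Encode one character with each of the given codecs."""
    results = {}
    for codec in codecs_:
        try:
            results[codec] = char.encode(codec)
        except UnicodeEncodeError:
            # Assumption: report an unencodable character as empty bytes.
            results[codec] = b''
    return results


plain = multiencode_sketch('a', ('ascii', 'utf_8_sig', 'utf_16_be'))
accented = multiencode_sketch('\u00e5', ('ascii', 'latin_1', 'utf_8'))
```

In the first result, the utf_8_sig entry starts with the three-byte UTF-8 byte order mark, b'\xef\xbb\xbf', before the encoded character.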

Character Escaping

charex.escape_text(s: str, schemekey: str, codec: str = 'utf8') → str

Escape the string with the scheme.

Parameters:

s – The string to escape.
schemekey – The key in the schemes dict to use for the escaping.
codec – The character set codec to use when escaping the characters.

Returns:

The escaped str.

Return type:

str

charex.get_schemes() → tuple[str, ...]

Return the keys of the registered escape schemes.

Returns:

The scheme keys as a tuple.

Return type:

tuple

class charex.reg_escape(key: str)

A decorator for registering escape schemes.

Parameters:

key – The name the escape scheme is registered under.

Usage:

To register a new escape scheme:

>>> @reg_escape('double')
... def double(char: str, codec: str) -> str:
...     '''Double the character.'''
...     return char + char
...
>>> # Demonstrate the registration worked.
>>> 'double' in get_schemes()
True
>>> escape_text('spam', 'double')
'ssppaamm'
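
For readers unfamiliar with registration decorators, the sketch below shows one common way such a registry can work: a class-based decorator stores each function in a dict under its key, and a dispatcher looks the function up by key. All names here (SCHEMES, reg_scheme_sketch, escape_text_sketch) are hypothetical. The actual charex internals may be organized differently.

```python
from typing import Callable, Dict

# Hypothetical registry mapping scheme keys to escape functions.
SCHEMES: Dict[str, Callable[[str, str], str]] = {}


class reg_scheme_sketch:
    """A decorator that registers a function under the given key."""

    def __init__(self, key: str) -> None:
        self.key = key

    def __call__(self, fn: Callable[[str, str], str]) -> Callable[[str, str], str]:
        SCHEMES[self.key] = fn
        return fn  # Return the function unchanged so it stays callable.


@reg_scheme_sketch('double')
def double(char: str, codec: str) -> str:
    """Double the character."""
    return char + char


def escape_text_sketch(s: str, schemekey: str, codec: str = 'utf8') -> str:
    """Apply the registered scheme to each character of the string."""
    scheme = SCHEMES[schemekey]
    return ''.join(scheme(char, codec) for char in s)


escaped = escape_text_sketch('spam', 'double')  # 'ssppaamm'
```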

Normalization and Denormalization

charex.count_denormalizations(base: str, form: str, maxdepth: int | None = None) → int

Determine the number of denormalizations that exist for the string.

Parameters:

base – The str to denormalize.
form – The Unicode normalization form to denormalize from. Valid values are: casefold, nfc, nfd, nfkc, nfkd.
maxdepth – (Optional.) How many individual characters to use when denormalizing the base. This is used to limit the total number of denormalizations of the overall base.

Returns:

The number of denormalizations as an int.

Return type:

int

Usage:

To count the number of possible denormalizations for a given string and form:

>>> base = '<->'
>>> form = 'nfkc'
>>> count_denormalizations(base, form)
8
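
The count in the example above is the product of the number of denormalizations of each character: two each for '<', '-' and '>', giving 2 × 2 × 2 = 8. The sketch below reproduces that arithmetic with a small hand-written table of NFKC variants. The table, the name count_sketch, and the choice to count a character with no known variants as one option are all assumptions made for illustration.

```python
from math import prod

# Hand-written NFKC variants (small and fullwidth forms) for three characters.
VARIANTS = {
    '<': ('\ufe64', '\uff1c'),
    '-': ('\ufe63', '\uff0d'),
    '>': ('\ufe65', '\uff1e'),
}


def count_sketch(base: str) -> int:
    """Multiply together the number of variants of each character."""
    # Assumption: a character with no known variants counts as one option.
    return prod(len(VARIANTS.get(char, (char,))) for char in base)


total = count_sketch('<->')  # 8
```

Because the count grows multiplicatively with the length of the string, even short strings of Latin letters can have very large totals.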

charex.denormalize(base: str, form: str, maxdepth: int = 0, maxresults: int | None = None, random: bool = False, seed_: bytes | int | str = '') → tuple[str, ...]

Denormalize a string.

Parameters:

base – The str to denormalize.
form – The Unicode normalization form to denormalize from. Valid values are: casefold, nfc, nfd, nfkc, nfkd.
maxdepth – (Optional.) How many denormalizations per character in the base string to use when denormalizing the base. This is used to limit the total number of denormalizations of the overall base. If maxdepth is zero, the number of denormalizations to use per character is not limited.
maxresults – (Optional.) The maximum number of results to return. Default behavior varies based on the random parameter. If random is False, the default is to return all possible denormalizations. Otherwise, the default is to return one.
random – (Optional.) Whether to pick randomly from the possible denormalization results. Defaults to false.
seed_ – (Optional.) A seed value for the random number generator. Defaults to not seeding the generator.

Returns:

The denormalizations as a tuple.

Return type:

tuple

Usage:

To denormalize a given string with the given form:

>>> base = '<>'
>>> form = 'nfkc'
>>> denormalize(base, form)
('﹤﹥', '﹤＞', '＜﹥', '＜＞')

The maxdepth parameter can be used to limit the number of denormalizations per character in the base string. This is useful when you want just a few denormalizations of a string with a very large number of denormalizations:

>>> base = 'hi'
>>> form = 'nfkc'
>>> maxdepth = 2
>>> denormalize(base, form, maxdepth)
('ʰᵢ', 'ʰⁱ', 'ₕᵢ', 'ₕⁱ')
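
Conceptually, the results are the Cartesian product of the per-character denormalizations, with maxdepth truncating each character's list before the product is taken. The sketch below illustrates this with itertools.product and a two-entry variant table. The table and the name denormalize_sketch are hypothetical, and charex may build its results differently.

```python
from itertools import product

# Hand-written NFKC variants for two characters, in code point order.
VARIANTS = {
    '<': ('\ufe64', '\uff1c'),
    '>': ('\ufe65', '\uff1e'),
}


def denormalize_sketch(base: str, maxdepth: int = 0) -> tuple:
    """Combine the variants of each character in every possible way."""
    pools = []
    for char in base:
        # Assumption: a character with no known variants stands for itself.
        options = VARIANTS.get(char, (char,))
        if maxdepth > 0:
            # Keep only the first maxdepth variants of this character.
            options = options[:maxdepth]
        pools.append(options)
    return tuple(''.join(combo) for combo in product(*pools))


everything = denormalize_sketch('<>')       # four results
limited = denormalize_sketch('<>', 1)       # one result
```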

charex.gen_denormalize(base: str, form: str, maxdepth: int = 0) → Generator[str, None, None]

Denormalize a string, yielding the results as they are generated.

Parameters:

base – The str to denormalize.
form – The Unicode normalization form to denormalize from. Valid values are: casefold, nfc, nfd, nfkc, nfkd.
maxdepth – (Optional.) How many denormalizations per character in the base string to use when denormalizing the base. This is used to limit the total number of denormalizations of the overall base. If maxdepth is zero, the number of denormalizations to use per character is not limited.

Returns:

A collections.abc.Generator that yields the denormalization results.

Return type:

collections.abc.Generator

Usage:

To generate denormalizations for a given string with a given form:

>>> base = '<>'
>>> form = 'nfkc'
>>> dngen = gen_denormalize(base, form)
>>> [result for result in dngen]
['﹤﹥', '﹤＞', '＜﹥', '＜＞']

>>> base = 'hi'
>>> form = 'nfkc'
>>> maxdepth = 2
>>> dngen = gen_denormalize(base, form, maxdepth)
>>> [result for result in dngen]
['ʰᵢ', 'ʰⁱ', 'ₕᵢ', 'ₕⁱ']

charex.gen_random_denormalize(base: str, form: str, maxresults: int = 1, seed_: bytes | int | str = '') → Generator[str, None, None]

Randomly denormalize a string, yielding the results as they are generated. This is useful when returning all results for a denormalization is unreasonably large, as can easily happen when denormalizing strings containing Latin letters.

Parameters:

base – The str to denormalize.
form – The Unicode normalization form to denormalize from. Valid values are: nfc, nfd, nfkc, nfkd.
maxresults – (Optional.) The maximum number of results to return. The default is to return one.
seed_ – (Optional.) A seed value for the random number generator. Defaults to not seeding the generator.

Returns:

A collections.abc.Generator that yields the random denormalization results.

Return type:

collections.abc.Generator

Usage:

To generate a random denormalization of a given string with a given form:

>>> base = '<script>'
>>> form = 'nfkc'
>>> dngrd = gen_random_denormalize(base, form)
>>> [result for result in dngrd]
['﹤𝓈ᶜ𝕣𝚒𝙥𝙩＞']

The maxresults parameter tells the generator to return the given number of results:

>>> base = '<script>'
>>> form = 'nfkc'
>>> maxresults = 3
>>> dngrd = gen_random_denormalize(base, form, maxresults)
>>> [result for result in dngrd]
['﹤𝓈ᶜ𝕣𝚒𝙥𝙩＞', '＜𝖘ᶜ𝓇𝕚ᵖ𝓉＞', '﹤𝙨𝚌𝑟𝗂𝐩ｔ＞']
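
Random denormalization avoids building the full product by picking one variant per character for each result. The sketch below shows that idea with random.Random, where passing a seed makes the output repeatable. The variant table and the name random_denormalize_sketch are hypothetical, and the specific strings produced will not match charex output.

```python
import random
import unicodedata
from typing import Iterator, Optional, Union

# Hand-written NFKC variants for three characters.
VARIANTS = {
    '<': ('\ufe64', '\uff1c'),
    '-': ('\ufe63', '\uff0d'),
    '>': ('\ufe65', '\uff1e'),
}


def random_denormalize_sketch(
    base: str,
    maxresults: int = 1,
    seed: Optional[Union[bytes, int, str]] = None,
) -> Iterator[str]:
    """Yield randomly chosen denormalizations of the base string."""
    # A private generator keeps the seed from affecting other random calls.
    rng = random.Random(seed)
    for _ in range(maxresults):
        # Assumption: a character with no known variants stands for itself.
        yield ''.join(rng.choice(VARIANTS.get(c, (c,))) for c in base)


first_run = list(random_denormalize_sketch('<->', 3, seed='spam'))
second_run = list(random_denormalize_sketch('<->', 3, seed='spam'))

# Every result normalizes back to the original string.
round_trips = [unicodedata.normalize('NFKC', s) for s in first_run]
```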

charex.get_forms() → tuple[str, ...]

Return the keys of the registered normalization forms.

Returns:

The names of the normalization forms as a tuple.

Return type:

tuple

Usage:

To get a tuple of the registered normalization forms:

>>> get_forms()
('casefold', 'nfc', 'nfd', 'nfkc', 'nfkd')

charex.normalize(formkey: str, base: str) → str

Normalize the base string with the form.

Parameters:

formkey – The key of a registered normalization form.
base – The string to normalize.

Returns:

The normalized str.

Return type:

str

Usage:

To normalize a string using the given form:

>>> value = 'SPAM'
>>> form = 'casefold'
>>> normalize(form, value)
'spam'
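
A key-to-function lookup like this one can be sketched with a plain dict over standard library functions. The FORMS table and normalize_sketch below are hypothetical names. They reuse the form keys listed by get_forms() above, but they are not the charex registry.

```python
import unicodedata
from functools import partial

# Hypothetical table mapping each form key to a str -> str function.
FORMS = {
    'casefold': str.casefold,
    'nfc': partial(unicodedata.normalize, 'NFC'),
    'nfd': partial(unicodedata.normalize, 'NFD'),
    'nfkc': partial(unicodedata.normalize, 'NFKC'),
    'nfkd': partial(unicodedata.normalize, 'NFKD'),
}


def normalize_sketch(formkey: str, base: str) -> str:
    """Normalize the base string with the form registered under the key."""
    return FORMS[formkey](base)


folded = normalize_sketch('casefold', 'SPAM')      # 'spam'
narrowed = normalize_sketch('nfkc', '\uff1c')      # '<'
```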

class charex.reg_form(key: str)

A decorator for registering normalization forms.

Parameters:

key – The name the normalization form is registered under.

Returns:

A charex.reg_form object.

Return type:

charex.reg_form

Usage:

To register a normalization form:

>>> from charex import *
>>>
>>> @reg_form('a')
... def form_a(base: str) -> str:
...     '''Make all strings into the letter A.'''
...     return 'A'
...
>>> # Demonstrate the registration worked.
>>> 'a' in get_forms()
True
>>> normalize('a', 'spam')
'A'

Unicode Information

charex.alias_property(longname: str, space: bool = True) → str

Translate the long name of a Unicode property into the alias for that property.

Parameters:

longname – The long name for the property.
space – (Optional.) Whether to replace spaces in the long name with underscores. Defaults to True.

Returns:

The alias as a str.

Return type:

str

Usage:

To get the alias of a Unicode property:

>>> longname = 'Case Folding'
>>> alias_property(longname)
'cf'

charex.get_properties() → tuple[str, ...]

Get the valid Unicode properties.

Returns:

The properties as a tuple.

Return type:

tuple

Usage:

To get the list of Unicode properties:

>>> get_properties()
('age', 'ahex',... 'xo_nfkd')

charex.get_property_values(prop: str) → tuple[str, ...]

Get the valid property value aliases for a property.

Parameters:

prop – The short name of the property.

Returns:

The valid values for the property as a tuple.

Return type:

tuple

Usage:

To get the valid property values:

>>> prop = 'gc'
>>> get_property_values(prop)
('C', 'Cc', 'Cf', 'Cn', 'Co', 'Cs', 'L',... 'Zs')

charex.expand_property(prop: str) → str

Translate the short name of a Unicode property into the long name for that property.

Parameters:

prop – The short name of the property.

Returns:

The long name as a str.

Return type:

str

Usage:

To get the long name of a Unicode property:

>>> prop = 'cf'
>>> expand_property(prop)
'Case Folding'
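
Translation between short and long property names amounts to a two-way lookup. The sketch below builds one from a few lines written in the layout of the Unicode Character Database file PropertyAliases.txt (short name, semicolon, long name with underscores). Whether charex reads that file is an assumption, and load_aliases, alias_sketch and expand_sketch are hypothetical names.

```python
# A few entries in the layout used by the UCD's PropertyAliases.txt.
SAMPLE = '''
# short ; long
blk ; Block
cf  ; Case_Folding
gc  ; General_Category
na  ; Name
sc  ; Script
'''


def load_aliases(text: str) -> dict:
    """Parse 'short ; long' lines into a short-to-long mapping."""
    short_to_long = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # Drop comments and blanks.
        if not line:
            continue
        fields = [field.strip() for field in line.split(';')]
        short_to_long[fields[0].casefold()] = fields[1]
    return short_to_long


SHORT_TO_LONG = load_aliases(SAMPLE)
LONG_TO_SHORT = {long.casefold(): short for short, long in SHORT_TO_LONG.items()}


def alias_sketch(longname: str, space: bool = True) -> str:
    """Translate a long property name into its short alias."""
    if space:
        # The data uses underscores where the long name has spaces.
        longname = longname.replace(' ', '_')
    return LONG_TO_SHORT[longname.casefold()]


def expand_sketch(prop: str) -> str:
    """Translate a short alias into a long name with spaces."""
    return SHORT_TO_LONG[prop.casefold()].replace('_', ' ')


short = alias_sketch('Case Folding')   # 'cf'
long_name = expand_sketch('cf')        # 'Case Folding'
```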

charex.expand_property_value(prop: str, alias: str) → str

Translate the short name of a Unicode property value into the long name for that value.

Parameters:

prop – The type of property.
alias – The short name to translate.

Returns:

The long name of the property value as a str.

Return type:

str

Usage:

To get the long name for a property value:

>>> alias = 'Cc'
>>> prop = 'gc'
>>> expand_property_value(prop, alias)
'Control'