API
charex is designed primarily as a command-line script, but it
also provides an API for use from Python code.
General Character Information
- class charex.Character(value: bytes | int | str)
A Unicode character.
- Parameters:
value – A character address string for the Unicode character. See below.
- Returns:
The character as a charex.Character.
- Return type:
charex.Character
- Usage:
To create a charex.Character object for a single character string:
    >>> value = 'a'
    >>> char = Character(value)
    >>> char.value
    'a'
To create a charex.Character object for a Unicode code point:
    >>> value = 'U+0061'
    >>> char = Character(value)
    >>> char.value
    'a'
To create a charex.Character object for a binary string:
    >>> value = '0b01100001'
    >>> char = Character(value)
    >>> char.value
    'a'
To create a charex.Character object for an octal string:
    >>> value = '0o141'
    >>> char = Character(value)
    >>> char.value
    'a'
To create a charex.Character object for a decimal string:
    >>> value = '0d97'
    >>> char = Character(value)
    >>> char.value
    'a'
To create a charex.Character object for a hex string:
    >>> value = '0x61'
    >>> char = Character(value)
    >>> char.value
    'a'
Beyond the declared properties and methods described below, most Unicode properties for the character are available by accessing their alias as an attribute of charex.Character:
    >>> value = 'a'
    >>> char = Character(value)
    >>> char.na
    'LATIN SMALL LETTER A'
    >>> char.blk
    'Basic Latin'
    >>> char.sc
    'Latn'
    >>> char.suc
    '0041'
- Address formats:
The understood str-based formats for manual input of addresses are:
Character: A string with length equal to one.
Code Point: The prefix “U+” followed by a hexadecimal number.
Binary String: The prefix “0b” followed by a binary number.
Octal String: The prefix “0o” followed by an octal number.
Decimal String: The prefix “0d” followed by a decimal number.
Hex String: The prefix “0x” followed by a hexadecimal number.
In addition to str values in these formats, the API accepts bytes and int values, as the signature above shows.
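The address formats above can be illustrated with a small standard-library sketch. The helper name parse_address is hypothetical and this is not charex's own parser; it only shows how each prefix maps to a numeric base.

```python
def parse_address(address: str) -> str:
    """Turn an address string into a one-character str (hypothetical helper)."""
    # Map each prefix to its numeric base. "0d" is not a Python literal
    # prefix, so every prefix is stripped by hand before calling int().
    bases = {'u+': 16, '0b': 2, '0o': 8, '0d': 10, '0x': 16}
    if len(address) == 1:
        # A single character is already its own address.
        return address
    prefix, digits = address[:2].lower(), address[2:]
    if prefix not in bases:
        raise ValueError(f'Unrecognized address format: {address!r}')
    return chr(int(digits, bases[prefix]))


addresses = ['a', 'U+0061', '0b01100001', '0o141', '0d97', '0x61']
print([parse_address(a) for a in addresses])  # six copies of 'a'
```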
- denormalize(form: str) -> tuple[str, ...]
Return the characters that normalize to the character using the given form.
- Parameters:
form – The normalization form to check against.
- Returns:
The denormalization results in a tuple.
- Return type:
tuple
- Usage:
To denormalize the character for the given form:
    >>> # Create the character object.
    >>> value = '<'
    >>> char = Character(value)
    >>>
    >>> # Get the denormalizations for the character.
    >>> form = 'nfkc'
    >>> char.denormalize(form)
    ('﹤', '＜')
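To make the idea of denormalization concrete, here is a minimal sketch that uses only the standard library's unicodedata module. The helper find_denormalizations is hypothetical, and the brute-force scan is an assumption for illustration, not how charex looks the results up. Note that unicodedata expects upper-case form names such as 'NFKC'.

```python
import sys
import unicodedata


def find_denormalizations(target: str, form: str = 'NFKC') -> tuple[str, ...]:
    """Return the other characters that normalize to target (hypothetical helper)."""
    found = []
    # Scan every code point and keep those that normalize to the target.
    for n in range(sys.maxunicode + 1):
        char = chr(n)
        if char != target and unicodedata.normalize(form, char) == target:
            found.append(char)
    return tuple(found)


results = find_denormalizations('<')
print([f'U+{ord(c):04X}' for c in results])
```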
- escape(scheme: str, codec: str = 'utf8') -> str
The escaped version of the character.
- Parameters:
scheme – The escape scheme to use.
codec – The codec to use when escaping to a hexadecimal string.
- Returns:
A str with the escaped character.
- Return type:
str
- Usage:
To escape the character with the given scheme:
    >>> value = '<'
    >>> char = Character(value)
    >>>
    >>> scheme = 'html'
    >>> char.escape(scheme)
    '&nvlt;'
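The codec parameter matters for schemes that escape to hexadecimal bytes, because the same character encodes to different bytes in different codecs. The sketch below is a hypothetical percent-style escape written with the standard library only; it is not one of charex's registered schemes.

```python
def percent_escape(char: str, codec: str = 'utf8') -> str:
    """Escape a character as %XX byte values in the given codec (illustrative only)."""
    return ''.join(f'%{byte:02X}' for byte in char.encode(codec))


print(percent_escape('å', 'utf8'))     # %C3%A5
print(percent_escape('å', 'latin_1'))  # %E5
```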
- is_normal(form: str) -> bool
Is the character normalized to the given form?
- Parameters:
form – The normalization form to check against.
- Returns:
A bool indicating whether the character is normalized.
- Return type:
bool
- Usage:
To determine whether the character is already normalized for the given form:
    >>> value = 'å'
    >>> char = Character(value)
    >>>
    >>> form = 'nfc'
    >>> char.is_normal(form)
    True
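The same check can be reproduced with the standard library for comparison. This sketch uses unicodedata.is_normalized (Python 3.8 or later) rather than charex, and contrasts the precomposed å with its decomposed spelling.

```python
import unicodedata

composed = '\u00e5'     # å as a single code point
decomposed = 'a\u030a'  # a followed by COMBINING RING ABOVE

print(unicodedata.is_normalized('NFC', composed))            # True
print(unicodedata.is_normalized('NFC', decomposed))          # False
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
```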
- charex.filter_by_property(prop: str, value: str, chars: Sequence[Character] | None = None, insensitive: bool = False, regex: bool = False) -> Generator[Character, None, None]
Return all the characters with the given property value.
- Parameters:
prop – The property to filter on.
value – The pattern to filter on.
chars – (Optional.) The characters to filter. Defaults to filtering all Unicode characters.
insensitive – (Optional.) Whether the matching should be case insensitive. Defaults to false.
regex – (Optional.) Whether the value should be used as a regular expression for the matching. Defaults to false.
- Returns:
The filtered characters as a collections.abc.Generator.
- Return type:
collections.abc.Generator
- Usage:
To get a generator that produces the Emoji modifiers:
    >>> prop = 'emod'
    >>> value = 'Y'
    >>> gen = filter_by_property(prop, value)
    >>> for char in gen:
    ...     print(char.summarize())
    ...
    🏻 U+1F3FB (EMOJI MODIFIER FITZPATRICK TYPE-1-2)
    🏼 U+1F3FC (EMOJI MODIFIER FITZPATRICK TYPE-3)
    🏽 U+1F3FD (EMOJI MODIFIER FITZPATRICK TYPE-4)
    🏾 U+1F3FE (EMOJI MODIFIER FITZPATRICK TYPE-5)
    🏿 U+1F3FF (EMOJI MODIFIER FITZPATRICK TYPE-6)
You can limit the number of characters being searched with the chars parameter:
    >>> prop = 'gc'
    >>> value = 'Cc'
    >>> chars = [Character(chr(n)) for n in range(128)]
    >>> gen = filter_by_property(prop, value, chars)
    >>> for char in gen:
    ...     print(char.summarize())
    ...
    ␀ U+0000 (<NULL>)
    ␁ U+0001 (<START OF HEADING>)
    ␂ U+0002 (<START OF TEXT>)
    ␃ U+0003 (<END OF TEXT>)
    ␄ U+0004 (<END OF TRANSMISSION>)
    ␅ U+0005 (<ENQUIRY>)
    ␆ U+0006 (<ACKNOWLEDGE>)
    ␇ U+0007 (<BELL>)
    ␈ U+0008 (<BACKSPACE>)
    ␉ U+0009 (<CHARACTER TABULATION>)
    ␊ U+000A (<LINE FEED (LF)>)
    ␋ U+000B (<LINE TABULATION>)
    ␌ U+000C (<FORM FEED (FF)>)
    ␍ U+000D (<CARRIAGE RETURN (CR)>)
    ␎ U+000E (<SHIFT OUT>)
    ␏ U+000F (<SHIFT IN>)
    ␐ U+0010 (<DATA LINK ESCAPE>)
    ␑ U+0011 (<DEVICE CONTROL ONE>)
    ␒ U+0012 (<DEVICE CONTROL TWO>)
    ␓ U+0013 (<DEVICE CONTROL THREE>)
    ␔ U+0014 (<DEVICE CONTROL FOUR>)
    ␕ U+0015 (<NEGATIVE ACKNOWLEDGE>)
    ␖ U+0016 (<SYNCHRONOUS IDLE>)
    ␗ U+0017 (<END OF TRANSMISSION BLOCK>)
    ␘ U+0018 (<CANCEL>)
    ␙ U+0019 (<END OF MEDIUM>)
    ␚ U+001A (<SUBSTITUTE>)
    ␛ U+001B (<ESCAPE>)
    ␜ U+001C (<INFORMATION SEPARATOR FOUR>)
    ␝ U+001D (<INFORMATION SEPARATOR THREE>)
    ␞ U+001E (<INFORMATION SEPARATOR TWO>)
    ␟ U+001F (<INFORMATION SEPARATOR ONE>)
    ⑿ U+007F (<DELETE>)
You can set the insensitive parameter to do case insensitive matching:
    >>> prop = 'emod'
    >>> value = 'y'
    >>> insensitive = True
    >>> gen = filter_by_property(prop, value, insensitive=insensitive)
    >>> for char in gen:
    ...     print(char.summarize())
    ...
    🏻 U+1F3FB (EMOJI MODIFIER FITZPATRICK TYPE-1-2)
    🏼 U+1F3FC (EMOJI MODIFIER FITZPATRICK TYPE-3)
    🏽 U+1F3FD (EMOJI MODIFIER FITZPATRICK TYPE-4)
    🏾 U+1F3FE (EMOJI MODIFIER FITZPATRICK TYPE-5)
    🏿 U+1F3FF (EMOJI MODIFIER FITZPATRICK TYPE-6)
If you set the regex parameter, you can search using regular expressions:
    >>> prop = 'na'
    >>> value = '.*EYE$'
    >>> regex = True
    >>> gen = filter_by_property(prop, value, regex=regex)
    >>> for char in gen:
    ...     print(char.summarize())
    ...
    ◉ U+25C9 (FISHEYE)
    ◎ U+25CE (BULLSEYE)
    ⺫ U+2EAB (CJK RADICAL EYE)
    ⽬ U+2F6C (KANGXI RADICAL EYE)
    👁 U+1F441 (EYE)
    😜 U+1F61C (FACE WITH STUCK-OUT TONGUE AND WINKING EYE)
    🤪 U+1F92A (GRINNING FACE WITH ONE LARGE AND ONE SMALL EYE)
    🫣 U+1FAE3 (FACE WITH PEEKING EYE)
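The filtering options above (a restricted character list, case-insensitive matching, and regular expressions) can be sketched with the standard library. The function filter_by_category is hypothetical and only handles the General_Category property through unicodedata.category; matching the whole value with re.fullmatch is an assumption, not a statement about how charex matches.

```python
import re
import unicodedata
from collections.abc import Generator, Iterable


def filter_by_category(
    chars: Iterable[str],
    value: str,
    insensitive: bool = False,
    regex: bool = False,
) -> Generator[str, None, None]:
    """Yield the characters whose General_Category matches value."""
    flags = re.IGNORECASE if insensitive else 0
    # Without regex, the value is escaped so it is matched literally.
    pattern = value if regex else re.escape(value)
    for char in chars:
        if re.fullmatch(pattern, unicodedata.category(char), flags):
            yield char


ascii_chars = [chr(n) for n in range(128)]
controls = list(filter_by_category(ascii_chars, 'Cc'))
letters = list(filter_by_category(ascii_chars, 'l[ul]', insensitive=True, regex=True))
print(len(controls), len(letters))  # 33 52
```

Because the function is a generator, nothing is evaluated until the results are consumed, which keeps a scan over all of Unicode from building a large list up front.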
Character Set Information
- charex.multidecode(value: int | str | bytes, codecs_: Iterator[str] | None = None) -> dict[str, str]
Provide the character for the given address for each of the given character sets.
- Parameters:
value – The address to decode.
codecs_ – (Optional.) The codecs to decode to.
- Returns:
The decoded value for each character set as a dict.
- Return type:
dict
- Usage:
To get the character for the given address for each of the registered codecs:
    >>> address = '0x61'
    >>> multidecode(address) # +ELLIPSIS
    {'ascii': 'a', 'big5': 'a'... 'utf_8_sig': 'a'}
If you just want the UTF-8 character:
    >>> value = 'a'
    >>> codecs_ = ('utf_8',)
    >>> multidecode(value, codecs_)
    {'utf_8': 'a'}
- Address formats:
The understood str formats for manual input are:
Character: A string with length equal to one.
Code Point: The prefix “U+” followed by a hexadecimal number.
Binary String: The prefix “0b” followed by a binary number.
Hex String: The prefix “0x” followed by a hexadecimal number.
In addition to str values in these formats, the API accepts int and bytes values, as the signature above shows.
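As a rough illustration of decoding one address under several character sets, the sketch below uses only bytes.decode from the standard library. The helper decode_with_each is hypothetical, and skipping codecs that cannot decode the bytes is an assumption; the source does not say how charex reports such failures.

```python
def decode_with_each(data: bytes, codecs_: tuple[str, ...]) -> dict[str, str]:
    """Decode the same bytes with each codec (hypothetical helper)."""
    results = {}
    for codec in codecs_:
        try:
            results[codec] = data.decode(codec)
        except UnicodeDecodeError:
            # Assumption: codecs that cannot decode the bytes are skipped.
            continue
    return results


# 0xE9 is not valid ASCII and is an incomplete UTF-8 sequence.
decoded = decode_with_each(b'\xe9', ('ascii', 'latin_1', 'cp1252', 'utf_8'))
print(decoded)  # {'latin_1': 'é', 'cp1252': 'é'}
```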
- charex.multiencode(value: bytes | int | str, codecs_: Iterator[str] | None = None) -> dict[str, bytes]
Provide the address for the given character for each of the given character sets.
- Parameters:
value – The character to encode.
codecs_ – (Optional.) The codecs to encode to.
- Returns:
The encoded value for each character set as a dict.
- Return type:
dict
- Usage:
To encode a one character str with all registered codecs:
    >>> value = 'a'
    >>> multiencode(value) # +ELLIPSIS
    {'ascii': b'a', 'big5': b'a'... 'utf_8_sig': b'a'}
If you just want the UTF-8 address:
    >>> value = 'a'
    >>> codecs_ = ('utf_8',)
    >>> multiencode(value, codecs_)
    {'utf_8': b'a'}
- Character formats:
The understood str formats available for manual input are (all formats are big endian unless otherwise stated):
Character: A string with length equal to one.
Code Point: The prefix “U+” followed by a hexadecimal number.
Binary String: The prefix “0b” followed by a binary number.
Octal String: The prefix “0o” followed by an octal number.
Decimal String: The prefix “0d” followed by a decimal number.
Hex String: The prefix “0x” followed by a hexadecimal number.
In addition to str values in these formats, the API accepts bytes and int values, as the signature above shows.
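The reverse direction, and the effect of byte order, can be sketched the same way with str.encode. The helper encode_with_each is hypothetical, and dropping codecs that cannot represent the character is an assumption rather than documented charex behavior.

```python
def encode_with_each(char: str, codecs_: tuple[str, ...]) -> dict[str, bytes]:
    """Encode the same character with each codec (hypothetical helper)."""
    results = {}
    for codec in codecs_:
        try:
            results[codec] = char.encode(codec)
        except UnicodeEncodeError:
            # Assumption: codecs that cannot represent the character are skipped.
            continue
    return results


encoded = encode_with_each('é', ('ascii', 'latin_1', 'utf_8', 'utf_16_be', 'utf_16_le'))
for codec, address in encoded.items():
    print(codec, address.hex())
```

Here utf_16_be yields 00e9 while utf_16_le yields e900, which is why it helps to know that the manual input formats are read as big endian.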
Character Escaping
- charex.escape_text(s: str, schemekey: str, codec: str = 'utf8') -> str
Escape the string with the scheme.
- class charex.reg_escape(key: str)
A decorator for registering escape schemes.
- Parameters:
key – The name the escape sequence is registered under.
- Usage:
To register a new escape scheme:
    >>> @reg_escape('double')
    ... def double(char: str, codec: str) -> str:
    ...     '''Double the character.'''
    ...     return char + char
    ...
    >>> # Demonstrate the registration worked.
    >>> 'double' in get_schemes()
    True
    >>> escape_text('spam', 'double')
    'ssppaamm'
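For readers unfamiliar with registration decorators, the following is a minimal, self-contained sketch of the general pattern. The names SCHEMES, register, and escape_with are hypothetical stand-ins; this shows one common way to build such a registry and is not charex's actual implementation.

```python
from typing import Callable

EscapeFn = Callable[[str, str], str]

# Hypothetical registry mapping scheme names to escape functions.
SCHEMES: dict[str, EscapeFn] = {}


class register:
    """Class-based decorator that stores a function under a key."""

    def __init__(self, key: str) -> None:
        self.key = key

    def __call__(self, fn: EscapeFn) -> EscapeFn:
        SCHEMES[self.key] = fn
        # Return the function unchanged so it stays directly callable.
        return fn


def escape_with(s: str, schemekey: str, codec: str = 'utf8') -> str:
    """Apply the registered scheme to each character of the string."""
    fn = SCHEMES[schemekey]
    return ''.join(fn(char, codec) for char in s)


@register('double')
def double(char: str, codec: str) -> str:
    """Double the character."""
    return char + char


print(escape_with('spam', 'double'))  # ssppaamm
```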
Normalization and Denormalization
- charex.count_denormalizations(base: str, form: str, maxdepth: int | None = None) -> int
Determine the number of denormalizations that exist for the string.
- Parameters:
base – The str to denormalize.
form – The Unicode normalization form to denormalize from. Valid values are: casefold, nfc, nfd, nfkc, nfkd.
maxdepth – (Optional.) How many individual characters to use when denormalizing the base. This is used to limit the total number of denormalizations of the overall base.
- Returns:
The number of denormalizations as an int.
- Return type:
int
- Usage:
To count the number of possible denormalizations for a given string and form:
    >>> base = '<->'
    >>> form = 'nfkc'
    >>> count_denormalizations(base, form)
    8
- charex.denormalize(base: str, form: str, maxdepth: int = 0, maxresults: int | None = None, random: bool = False, seed_: bytes | int | str = '') -> tuple[str, ...]
Denormalize a string.
- Parameters:
base – The str to denormalize.
form – The Unicode normalization form to denormalize from. Valid values are: casefold, nfc, nfd, nfkc, nfkd.
maxdepth – (Optional.) How many denormalizations per character in the base string to use when denormalizing the base. This is used to limit the total number of denormalizations of the overall base. If maxdepth is zero, the number of denormalizations to use per character is not limited.
maxresults – (Optional.) The maximum number of results to return. Default behavior varies based on the random parameter. If random is False, the default is to return all possible denormalizations. Otherwise, the default is to return one.
random – (Optional.) Whether to pick randomly from the possible denormalization results. Defaults to false.
seed_ – (Optional.) A seed value for the random number generator. Defaults to not seeding the generator.
- Returns:
The denormalizations as a tuple.
- Return type:
tuple
- Usage:
To denormalize a given string with the given form:
    >>> base = '<>'
    >>> form = 'nfkc'
    >>> denormalize(base, form)
    ('﹤﹥', '﹤＞', '＜﹥', '＜＞')
The maxdepth parameter can be used to limit the number of denormalizations per character in the base string. This is useful when you want just a few denormalizations of a string with a very large number of denormalizations:
    >>> base = 'hi'
    >>> form = 'nfkc'
    >>> maxdepth = 2
    >>> denormalize(base, form, maxdepth)
    ('ʰᵢ', 'ʰⁱ', 'ₕᵢ', 'ₕⁱ')
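One way to picture how the results are assembled is as a Cartesian product of the per-character options, with maxdepth trimming each option list first. The sketch below uses itertools.product over a hand-written table; the helper expand and the table OPTIONS are hypothetical, and this is not presented as charex's algorithm.

```python
from itertools import product

# Hand-written sample of characters that NFKC maps to '<' and '>'.
OPTIONS = {
    '<': ('\ufe64', '\uff1c'),
    '>': ('\ufe65', '\uff1e'),
}


def expand(base: str, options: dict, maxdepth: int = 0) -> tuple[str, ...]:
    """Build every combination of per-character options."""
    per_char = []
    for char in base:
        # Assumption: characters without options are kept unchanged.
        choices = options.get(char, (char,))
        if maxdepth:
            # Keep only the first maxdepth options for this character.
            choices = choices[:maxdepth]
        per_char.append(choices)
    return tuple(''.join(combo) for combo in product(*per_char))


print(expand('<>', OPTIONS))              # four combinations
print(expand('<>', OPTIONS, maxdepth=1))  # one combination
```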
- charex.gen_denormalize(base: str, form: str, maxdepth: int = 0) -> Generator[str, None, None]
Denormalize a string, yielding the results as they are generated.
- Parameters:
base – The str to denormalize.
form – The Unicode normalization form to denormalize from. Valid values are: casefold, nfc, nfd, nfkc, nfkd.
maxdepth – (Optional.) How many denormalizations per character in the base string to use when denormalizing the base. This is used to limit the total number of denormalizations of the overall base. If maxdepth is zero, the number of denormalizations to use per character is not limited.
- Returns:
A collections.abc.Generator that yields the denormalization results.
- Return type:
collections.abc.Generator
- Usage:
To generate denormalizations for a given string with a given form:
    >>> base = '<>'
    >>> form = 'nfkc'
    >>> dngen = gen_denormalize(base, form)
    >>> [result for result in dngen]
    ['﹤﹥', '﹤＞', '＜﹥', '＜＞']
The maxdepth parameter can be used to limit the number of denormalizations per character in the base string. This is useful when you want just a few denormalizations of a string with a very large number of denormalizations:
    >>> base = 'hi'
    >>> form = 'nfkc'
    >>> maxdepth = 2
    >>> dngen = gen_denormalize(base, form, maxdepth)
    >>> [result for result in dngen]
    ['ʰᵢ', 'ʰⁱ', 'ₕᵢ', 'ₕⁱ']
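The practical benefit of a generator is that a caller can stop early without building every result. The sketch below shows this with itertools.islice. The helper gen_expand and the placeholder option table are hypothetical and use plain letters instead of real denormalization data.

```python
from collections.abc import Generator
from itertools import islice, product


def gen_expand(base: str, options: dict) -> Generator[str, None, None]:
    """Yield combinations one at a time instead of building a tuple."""
    per_char = [options.get(char, (char,)) for char in base]
    for combo in product(*per_char):
        yield ''.join(combo)


# Placeholder data: three options at each of 20 positions gives
# 3 ** 20 possible results, far too many to want in memory at once.
options = {'a': ('x', 'y', 'z')}
first_five = list(islice(gen_expand('a' * 20, options), 5))
print(first_five[0])
print(len(first_five))
```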
- charex.gen_random_denormalize(base: str, form: str, maxresults: int = 1, seed_: bytes | int | str = '') -> Generator[str, None, None]
Randomly denormalize a string, yielding the results as they are generated. This is useful when returning all results for a denormalization is unreasonably large, as can easily happen when denormalizing strings containing Latin letters.
- Parameters:
base – The str to denormalize.
form – The Unicode normalization form to denormalize from. Valid values are: nfc, nfd, nfkc, nfkd.
maxresults – (Optional.) The maximum number of results to return. The default is to return one.
seed_ – (Optional.) A seed value for the random number generator. Defaults to not seeding the generator.
- Returns:
A collections.abc.Generator that yields the random denormalization results.
- Return type:
collections.abc.Generator
- Usage:
To generate a random denormalization of a given string with a given form:
    >>> base = '<script>'
    >>> form = 'nfkc'
    >>> dngrd = gen_random_denormalize(base, form)
    >>> [result for result in dngrd]
    ['﹤𝓈ᶜ𝕣𝚒𝙥𝙩＞']
The maxresults parameter tells the generator to return the given number of results:
    >>> base = '<script>'
    >>> form = 'nfkc'
    >>> maxresults = 3
    >>> dngrd = gen_random_denormalize(base, form, maxresults)
    >>> [result for result in dngrd]
    ['﹤𝓈ᶜ𝕣𝚒𝙥𝙩＞', '＜𝖘ᶜ𝓇𝕚ᵖ𝓉＞', '﹤𝙨𝚌𝑟𝗂𝐩ｔ＞']
- class charex.reg_form(key: str)
A decorator for registering normalization forms.
- Parameters:
key – The name the normalization form is registered under.
- Returns:
A charex.reg_form object.
- Return type:
charex.reg_form
- Usage:
To register a normalization form:
    >>> from charex import *
    >>>
    >>> @reg_form('a')
    ... def form_a(base: str) -> str:
    ...     '''Make all strings into the letter A.'''
    ...     return 'A'
    ...
    >>> # Demonstrate the registration worked.
    >>> 'a' in get_forms()
    True
    >>> normalize('a', 'spam')
    'A'
Unicode Information
- charex.alias_property(longname: str, space: bool = True) -> str
Translate the long name of a Unicode property into the alias for that property.
- Parameters:
longname – The long name for the property.
space – (Optional.) Whether to replace spaces in the long name with underscores. Defaults to True.
- Returns:
The alias as a str.
- Return type:
str
- Usage:
To get the alias of a Unicode property:
    >>> longname = 'Case Folding'
    >>> alias_property(longname)
    'cf'
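The relationship between long names and aliases can be sketched with a small lookup table. The five pairs below follow the Unicode PropertyAliases.txt naming. The helpers alias_for and expand_alias are hypothetical, and a real table would cover every property.

```python
# A few long names and their aliases (sample only).
ALIASES = {
    'Block': 'blk',
    'Case_Folding': 'cf',
    'General_Category': 'gc',
    'Name': 'na',
    'Script': 'sc',
}
# Reverse table for turning an alias back into its long name.
LONG_NAMES = {alias: longname for longname, alias in ALIASES.items()}


def alias_for(longname: str, space: bool = True) -> str:
    """Look up the alias, optionally converting spaces to underscores."""
    if space:
        longname = longname.replace(' ', '_')
    return ALIASES[longname]


def expand_alias(alias: str) -> str:
    """Look up the long name for an alias."""
    return LONG_NAMES[alias]


print(alias_for('Case Folding'))  # cf
print(expand_alias('gc'))         # General_Category
```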
- charex.get_property_values(prop: str) -> tuple[str, ...]
Get the valid property value aliases for a property.
- charex.expand_property(prop: str) -> str
Translate the short name of a Unicode property into the long name for that property.