Package xmlschema_acue :: Module codepoints

Module codepoints

This module defines Unicode character categories and blocks as sets of code points.

Classes
  UnicodeSubset
Represents a subset of Unicode code points, implemented as an ordered list of integer values and ranges.
Functions
 
code_point_order(cp)
Ordering function for code points.
 
code_point_reverse_order(cp)
Reverse ordering function for code points.
 
iter_code_points(code_points, reverse=False)
Iterates over a sequence of code points.
 
check_code_point(cp)
Checks a code point or code point range.
 
code_point_repr(cp)
Returns the string representation of a code point.
 
iterparse_character_group(s, expand_ranges=False)
Parses part of a regex character group, generating a sequence of code points and code point ranges.
 
get_unicodedata_categories()
Extracts Unicode category information from the `unicodedata` library.
 
save_unicode_categories(filename=None)
Saves Unicode categories to a JSON file.
 
build_unicode_categories(filename=None)
Builds the Unicode categories as `UnicodeSubset` instances.
Variables
  CHARACTER_GROUP_ESCAPED = {ord(c) for c in r'-|.^?*+{}()[]\'}
Code points of the characters that must be escaped in a character group.
  UCS4_MAXUNICODE = 1114111
  UNICODE_CATEGORIES = build_unicode_categories()
  UNICODE_BLOCKS = {'IsBasicLatin': UnicodeSubset('-'), 'IsLat...
Function Details

iter_code_points(code_points, reverse=False)

Iterates over a sequence of code points. Contiguous code points are merged into ranges.

:param code_points: an iterable with code points and code point ranges.
:param reverse: if `True` reverses the order of the sequence.
:return: yields code points or code point ranges.
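
The merging of contiguous code points described above can be sketched as follows. This is an illustration only, not the module's implementation: the name `merge_code_points` is hypothetical, and ranges are assumed to be inclusive `(start, end)` tuples, a convention this page does not state.

```python
def merge_code_points(code_points, reverse=False):
    """Yield ints for isolated code points and (start, end) tuples for runs.

    Assumption: ranges are inclusive (start, end) tuples.
    """
    # Normalize every item to an inclusive (start, end) pair and sort.
    pairs = sorted(
        (cp, cp) if isinstance(cp, int) else (cp[0], cp[1])
        for cp in code_points
    )
    merged = []
    for start, end in pairs:
        if merged and start <= merged[-1][1] + 1:
            # Adjacent to or overlapping the previous run: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    if reverse:
        merged.reverse()
    for start, end in merged:
        yield start if start == end else (start, end)
```

For example, `list(merge_code_points([99, 97, 98, 200]))` gives `[(97, 99), 200]`.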

check_code_point(cp)

Checks a code point or code point range.

:return: a valid code point range.
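
A minimal sketch of this kind of check is shown below. The function name `validate_code_point`, the inclusive `(start, end)` convention, and the use of `ValueError` are assumptions made for illustration; the module's actual behaviour may differ.

```python
import sys


def validate_code_point(cp):
    """Return cp as an inclusive (start, end) range, or raise ValueError.

    Assumption: a range is a tuple whose first two items are its bounds.
    """
    if isinstance(cp, int):
        start, end = cp, cp
    else:
        start, end = cp[0], cp[1]
    if not (0 <= start <= end <= sys.maxunicode):
        raise ValueError("not a valid code point range: %r" % (cp,))
    return start, end
```

Normalizing a single code point to a one-element range lets callers handle both input forms with the same code path.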

code_point_repr(cp)

Returns the string representation of a code point.

:param cp: an integer or a tuple with at least two integers. Values must be in the interval [0, sys.maxunicode].
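
As a rough illustration, a representation in the `start-end` style seen in the `UNICODE_BLOCKS` values could be produced like this. The name `code_point_to_text` is hypothetical, ranges are assumed to be inclusive, and the escaping of special characters that a real implementation would need is deliberately left out.

```python
def code_point_to_text(cp):
    """Return 'x' for a single code point or 'x-y' for a range.

    Assumptions: ranges are inclusive (start, end) tuples; no escaping
    of characters such as '-' or ']' is performed in this sketch.
    """
    if isinstance(cp, int):
        return chr(cp)
    start, end = cp[0], cp[1]
    if start == end:
        return chr(start)
    return "%s-%s" % (chr(start), chr(end))
```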

iterparse_character_group(s, expand_ranges=False)

Parses part of a regex character group, generating a sequence of code points and code point ranges. An unescaped hyphen (-) that is not at the start or at the end is interpreted as a range specifier.

:param s: a string representing a character group part.
:param expand_ranges: if set to `True` then expands character ranges.
:return: yields integers or couples of integers.
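
The hyphen rule can be sketched as below. This is a simplified illustration rather than the module's parser: the name `parse_character_group` is hypothetical, ranges are assumed to be inclusive, and a backslash is treated as making the next character literal, so escapes such as `\n` or `\d` are not interpreted.

```python
def parse_character_group(s, expand_ranges=False):
    """Yield code points and (start, end) ranges from a character group part.

    Assumptions: inclusive ranges; a backslash only makes the following
    character literal (no special handling of \\n, \\d, and similar).
    """
    # First pass: tokenize into (code_point, escaped) pairs.
    tokens = []
    i = 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            tokens.append((ord(s[i + 1]), True))
            i += 2
        else:
            tokens.append((ord(s[i]), False))
            i += 1

    # Second pass: an unescaped hyphen with a token on each side is a range.
    hyphen = (ord("-"), False)
    k = 0
    while k < len(tokens):
        start = tokens[k][0]
        if k + 2 < len(tokens) and tokens[k + 1] == hyphen:
            end = tokens[k + 2][0]
            if end < start:
                raise ValueError("bad character range in %r" % s)
            if expand_ranges:
                for cp in range(start, end + 1):
                    yield cp
            else:
                yield (start, end)
            k += 3
        else:
            yield start
            k += 1
```

With this sketch, `'a-z0'` yields the range `(97, 122)` followed by `48`, while a leading, trailing, or escaped hyphen is yielded as the plain code point `45`.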

get_unicodedata_categories()

Extracts Unicode category information from the `unicodedata` library. Each category is represented by an ordered list containing code points and code point ranges.

:return: a dictionary with category names as keys and lists as values.
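
One way such a dictionary might be built with the standard `unicodedata.category()` function is sketched here. The name `collect_categories` and its `max_code_point` parameter are hypothetical (the parameter exists only to keep small experiments fast), and ranges are assumed to be inclusive `(start, end)` tuples.

```python
import sys
from unicodedata import category


def collect_categories(max_code_point=sys.maxunicode):
    """Map each general category name to an ordered list of code points
    and inclusive (start, end) ranges.

    Assumption: max_code_point is an illustrative knob, not a parameter
    of the documented function.
    """
    runs = {}
    for cp in range(max_code_point + 1):
        name = category(chr(cp))
        ranges = runs.setdefault(name, [])
        if ranges and ranges[-1][1] == cp - 1:
            # Same category as the previous code point: extend the run.
            ranges[-1][1] = cp
        else:
            ranges.append([cp, cp])
    return {
        name: [s if s == e else (s, e) for s, e in ranges]
        for name, ranges in runs.items()
    }
```

Restricted to ASCII with `collect_categories(0x7F)`, the `'Lu'` entry is `[(65, 90)]` and the `'Nd'` entry is `[(48, 57)]`.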

save_unicode_categories(filename=None)

Saves Unicode categories to a JSON file.

:param filename: the JSON file to save. If it is `None`, the predefined filename 'unicode_categories.json' is used and the function tries to save the file in the directory of this module.

build_unicode_categories(filename=None)

Builds the Unicode categories as `UnicodeSubset` instances. For faster building, a pre-built JSON file with Unicode categories data can be used. If the JSON file is missing or not accessible, the categories data is rebuilt using the `unicodedata.category()` API.

:param filename: the name of the JSON file to load for fast building of the categories. If not provided, the predefined filename 'unicode_categories.json' is used.
:return: a dictionary that associates Unicode category names with `UnicodeSubset` instances.
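
The load-or-rebuild behaviour can be sketched as below. The names `load_categories` and `rebuild` are hypothetical, and the sketch returns plain lists instead of wrapping them in `UnicodeSubset`, whose constructor is not documented on this page.

```python
import json


def load_categories(filename, rebuild):
    """Load category data from a JSON file, calling rebuild() as a fallback.

    Assumptions: rebuild is a zero-argument callable returning a dict of
    category names to lists; ranges are two-item sequences.
    """
    try:
        with open(filename) as fp:
            data = json.load(fp)
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: rebuild from scratch.
        data = rebuild()
    # JSON stores ranges as two-item lists; turn them back into tuples.
    return {
        name: [tuple(cp) if isinstance(cp, list) else cp for cp in items]
        for name, items in data.items()
    }
```

Catching `ValueError` also covers `json.JSONDecodeError`, so a corrupted cache file triggers the same fallback as a missing one.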

Variables Details

UNICODE_BLOCKS

Value:
{'IsBasicLatin': UnicodeSubset('-'), 'IsLatin-1Supplement': UnicodeS\
ubset('€-ÿ'), 'IsLatinExtended-A': UnicodeSubset('Ā-ſ'), 'IsLatinExten\
ded-B': UnicodeSubset('ƀ-ɏ'), 'IsIPAExtensions': UnicodeSubset('ɐ-ʯ'),\
 'IsSpacingModifierLetters': UnicodeSubset('ʰ-˿'), 'IsCombiningDiacrit\
icalMarks': UnicodeSubset('̀-ͯ'), 'IsGreek': UnicodeSubset('Ͱ-Ͽ'), 'Is\
Cyrillic': UnicodeSubset('Ѐ-ӿ'), 'IsArmenian': UnicodeSubset('԰-֏'), '\
IsHebrew': UnicodeSubset('֐-׿'), 'IsArabic': UnicodeSubset('؀-ۿ'), 'Is\
Syriac': UnicodeSubset('܀-ݏ'), 'IsThaana': UnicodeSubset('ހ-޿'), 'IsDe\
...