The text validator checks Myanmar text for structural correctness — invalid character ordering, encoding artifacts (Zawgyi remnants), doubled diacritics, and other issues that indicate malformed input rather than spelling errors.

Overview

from myspellchecker.text.validator import validate_text, validate_word, ValidationIssue

# Validate a word (returns bool)
is_valid = validate_word("ကျောင်း")
if is_valid:
    print("Word is valid")

# Validate full text (returns ValidationResult)
result = validate_text("မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။")
if not result.is_valid:
    for issue, description in result.issues:
        print(f"{issue.value}: {description}")

ValidationIssue Enum

All validation issues are categorized using the ValidationIssue enum:
| Issue | Description | Category |
| --- | --- | --- |
| EXTENDED_MYANMAR | Contains Extended Myanmar/Shan/Mon/Karen characters (U+1050-U+109F, Extended-A/B) | Encoding |
| ZAWGYI_YA_ASAT | Zawgyi ya-medial used as pseudo-asat (e.g., ငျး) | Encoding |
| ZAWGYI_YA_TERMINAL | Zawgyi ya-medial at word-final position | Encoding |
| ZAWGYI_YA_RA | Zawgyi ya+ra medial combination | Encoding |
| ASAT_BEFORE_VOWEL | Asat (်) appears before a vowel sign (invalid ordering) | Structural |
| INCOMPLETE_VOWEL | Incomplete vowel pattern (e.g., vowel before asat, missing u-vowel in O-vowel) | Structural |
| DIGIT_TONE | Myanmar digit followed by tone mark | Structural |
| SCRAMBLED_ORDER | Scrambled character sequence (e.g., vowel-asat-vowel) | Structural |
| INVALID_START | Word starts with invalid character (not consonant, independent vowel, or digit) | Structural |
| DOUBLED_DIACRITIC | Doubled vowel, medial, or invalid tone sequence | Structural |
| VIRAMA_AT_END | Virama (္) at end of word (incomplete stacking) | Structural |
| EMPTY_OR_WHITESPACE | Empty or whitespace-only input | Structural |
| KNOWN_INVALID | Word is in the curated known-invalid words list | Quality |
| FRAGMENT_PATTERN | Segmentation fragment (consonant + asat/tone only) | Segmentation |
| DOUBLE_ENDING | Double-ending artifact (e.g., valid word + fragment merged) | Segmentation |
| INCOMPLETE_WORD | Incomplete word (ends with medial, incomplete stacking, or bare consonant after medial) | Segmentation |
| MIXED_LETTER_NUMERAL | Mixed Myanmar letter and numeral (should be split) | Quality |
| ASAT_INITIAL | Asat-initial fragment (consonant+asat at word start) | Segmentation |
| COMPOUND_TRUNCATED | Compound word with truncated ending | Quality |
| MISSING_E_VOWEL | Missing ေ in ောင pattern (common typo) | Quality |
| PURE_NUMERAL | Pure Myanmar numeral sequence (not a word) | Quality |
| DOUBLED_CONSONANT | Two identical consonants only (segmentation artifact) | Quality |
| INVALID_VOWEL_SEQUENCE_SYLLABLE | Invalid vowel sequence (e.g., doubled i-vowels, ာု) | Structural |
| BARE_CONSONANT_END | Word ends with bare consonant without asat | Segmentation |
| STACKED_CONSONANT_START | Word starts with stacked consonant marker (္) | Segmentation |
| MEDIAL_START | Word starts with a medial (ျ ြ ွ ှ) | Segmentation |
| DEPENDENT_VOWEL_START | Word starts with a dependent vowel sign | Segmentation |
| GREAT_SA_START | Word starts with Great Sa (ဿ) | Segmentation |
| ASAT_ANUSVARA_SEQUENCE | Contains phonetically impossible ်ံ sequence | Segmentation |
| DOUBLED_INDEPENDENT_VOWEL | Two identical independent vowels (OCR error) | Segmentation |
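Since each issue belongs to one of four categories, reporting code can bucket a result's issues by category. The sketch below is illustrative, not part of the library: `ISSUE_CATEGORIES` is a hand-built mapping transcribed from the table above (only a few entries shown), and issues are keyed by name for simplicity.

```python
from collections import defaultdict

# Category for each issue name, transcribed from the table above
# (abbreviated; the remaining entries follow the same pattern).
ISSUE_CATEGORIES = {
    "EXTENDED_MYANMAR": "Encoding",
    "ZAWGYI_YA_ASAT": "Encoding",
    "ASAT_BEFORE_VOWEL": "Structural",
    "FRAGMENT_PATTERN": "Segmentation",
    "KNOWN_INVALID": "Quality",
}

def group_by_category(issues):
    """Group (issue_name, description) pairs by their category."""
    grouped = defaultdict(list)
    for name, description in issues:
        grouped[ISSUE_CATEGORIES.get(name, "Unknown")].append(description)
    return dict(grouped)

issues = [
    ("ASAT_BEFORE_VOWEL", "Asat before vowel"),
    ("FRAGMENT_PATTERN", "Fragment: consonant + asat only"),
]
print(group_by_category(issues))
# {'Structural': ['Asat before vowel'], 'Segmentation': ['Fragment: consonant + asat only']}
```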

Core Functions

validate_word

Quick boolean validation check for a single word:
from myspellchecker.text.validator import validate_word

# Check a valid word (returns bool)
is_valid = validate_word("မြန်မာ")
print(is_valid)  # True

# Check word with Zawgyi artifacts
is_valid = validate_word("ေကာင္း")  # Zawgyi encoding
print(is_valid)  # False

# Check invalid syllable
is_valid = validate_word("ျက")  # Invalid start
print(is_valid)  # False

validate_text

Validates text and returns detailed issue information:
from myspellchecker.text.validator import validate_text

text = "မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။"
result = validate_text(text)

# ValidationResult has: is_valid, issues, cleaned_text
if not result.is_valid:
    for issue, description in result.issues:
        print(f"{issue.name}: {description}")

Validation Categories

Structural Validation

Checks Myanmar character structure rules using validate_text for detailed issues:
from myspellchecker.text.validator import validate_text

# Asat before vowel check
result = validate_text("ကျွန်ုပ်")  # Asat before vowel sign
# result.issues may contain (ValidationIssue.ASAT_BEFORE_VOWEL, "Asat before vowel: ်ု")

# Doubled diacritic check
result = validate_text("ကာါ")  # Doubled vowel signs
# result.issues may contain (ValidationIssue.DOUBLED_DIACRITIC, "Doubled vowel: ...")

# Virama at end of word
result = validate_text("က္")  # Incomplete stacking
# result.issues may contain (ValidationIssue.VIRAMA_AT_END, "Virama at word end")

Encoding Detection

Detects legacy Zawgyi encoding:
# Illustrative Zawgyi detection patterns (a sketch, not the exact internals):
ZAWGYI_PATTERNS = [
    r"\u1031[\u1000-\u1021]",   # e-vowel (ေ) before a consonant: Zawgyi vowel-first order
    r"\u1039[^\u1000-\u1021]",  # virama (္) followed by a non-consonant: invalid stacking
    "\u1033",                   # Zawgyi-specific codepoint
]

is_valid = validate_word("ေကာင္း")
# Returns: False (contains Zawgyi artifacts)

Quality Filters

Detects low-quality or incomplete words:
from myspellchecker.text.validator import (
    is_fragment_pattern,
    is_incomplete_word,
    is_truncated_word,
    is_quality_word,
)

# Fragment detection (returns Tuple[bool, Optional[str]])
is_frag, reason = is_fragment_pattern("င်း")  # (True, "description")

# Incomplete word detection (returns Tuple[bool, Optional[str]])
is_inc, reason = is_incomplete_word("ကျော")  # (True, "description")

# Truncation detection (frequency-based; second arg is a callable that
# returns a word's corpus frequency, here backed by an existing freq_dict)
is_trunc, suggestion = is_truncated_word("ချိန", lambda word: freq_dict.get(word, 0))
# (True, 'ချိန်')

# Overall quality check
is_quality_word("ကျောင်း")  # True - high quality

Known Invalid Words

A curated list of ~50 verified invalid words that commonly appear in corpora:
from myspellchecker.text.validator import KNOWN_INVALID_WORDS

# A few entries from the imported set (~50 total):
#   "သည်မ"     - truncated
#   "သည်င်း"   - invalid merge
#   "ကို့"      - invalid tone
#   "တွင့်"     - invalid ending

# Check if word is known invalid
if word in KNOWN_INVALID_WORDS:
    issues.append(ValidationIssue.KNOWN_INVALID)

Valid Pali/Sanskrit Endings

Whitelist of ~80 words with valid bare consonant endings (Pali/Sanskrit loanwords):
from myspellchecker.text.validator import VALID_PALI_BARE_ENDINGS

# A few entries from the imported set (~80 total):
#   "ဗုဒ္ဓ"   - Buddha
#   "သံဃာ"   - Sangha
#   "ဓမ္မ"    - Dhamma

# Used to avoid false positives on religious/formal terms
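To illustrate how such a whitelist prevents false positives, the sketch below checks membership before flagging a bare-consonant ending. `flag_bare_ending` and `ends_with_bare_consonant` are hypothetical helpers, and the set is a two-entry stand-in for the library's ~80-word export.

```python
# Stand-in for the real VALID_PALI_BARE_ENDINGS (~80 entries in the library).
VALID_PALI_BARE_ENDINGS = {"ဗုဒ္ဓ", "ဓမ္မ"}

def ends_with_bare_consonant(word: str) -> bool:
    """True if the word ends on a Myanmar consonant (U+1000-U+1021) with no asat."""
    return bool(word) and "\u1000" <= word[-1] <= "\u1021"

def flag_bare_ending(word: str) -> bool:
    """Flag a bare-consonant ending, but skip known Pali/Sanskrit loanwords."""
    if word in VALID_PALI_BARE_ENDINGS:
        return False
    return ends_with_bare_consonant(word)

print(flag_bare_ending("ဗုဒ္ဓ"))  # False - whitelisted loanword
print(flag_bare_ending("ချိန"))   # True - bare consonant, likely truncated
```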

Extended Myanmar Detection

Detects Myanmar Extended-A and Extended-B characters:
# Extended ranges (range() end is exclusive)
EXTENDED_A = range(0xAA60, 0xAA80)  # U+AA60-U+AA7F (Myanmar Extended-A)
EXTENDED_B = range(0xA9E0, 0xAA00)  # U+A9E0-U+A9FF (Myanmar Extended-B)

# These are used in minority languages (Shan, Mon, etc.)
is_valid = validate_word("ꩮꩯꩰ")
# Returns: False (contains Extended Myanmar characters)

# Use validate_text for detailed issue information
result = validate_text("ꩮꩯꩰ")
# result.issues contains (ValidationIssue.EXTENDED_MYANMAR, "Extended Myanmar char: ...")

Integration with SpellChecker

The validation module integrates with the main spell checker:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.validation_configs import ValidationConfig

config = SpellCheckerConfig(
    use_rule_based_validation=True,  # Enable structural validation
    validation=ValidationConfig(
        use_zawgyi_detection=True,  # Enable Zawgyi detection
        strict_validation=True,     # Enable strict validation
    )
)

checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")

# Structural issues are reported in result.errors
for error in result.errors:
    if "structural" in str(error.error_type):
        print(f"Structural issue: {error.text}")

Data Pipeline Integration

Used in the data pipeline to filter corpus words:
from myspellchecker.text.validator import validate_word, validate_text, is_quality_word

def filter_corpus(words):
    """Filter corpus to only include quality words."""
    quality_words = []
    for word in words:
        # validate_word returns bool (True if valid)
        is_valid = validate_word(word)

        if is_valid and is_quality_word(word):
            quality_words.append(word)

    return quality_words

Performance

| Operation | Time | Notes |
| --- | --- | --- |
| validate_word | <1ms | Single word validation |
| validate_text | ~10ms/1K words | Batch validation |
| Pattern matching | <0.1ms | Compiled regex |
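The figures above are indicative; timings vary by machine and input. A quick way to measure locally is a `timeit` micro-benchmark. The sketch below uses a trivial stand-in function so it runs standalone; swap in the real `validate_word` to benchmark the library itself.

```python
import timeit

def validate_word(word: str) -> bool:
    """Trivial stand-in so this sketch runs standalone; replace with the real function."""
    return bool(word.strip())

N = 10_000
elapsed = timeit.timeit(lambda: validate_word("မြန်မာ"), number=N)
print(f"validate_word: {elapsed / N * 1e6:.2f} us/call")
```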

Use Cases

Corpus Cleaning

# Clean corpus before building dictionary
from myspellchecker.text.validator import validate_word, validate_text, ValidationIssue

def clean_corpus(words):
    cleaned = []
    for word in words:
        # validate_word returns bool; use validate_text for detailed issues
        if validate_word(word):
            cleaned.append(word)
        else:
            # For finer control, use validate_text to inspect specific issues
            result = validate_text(word)
            low_severity_only = all(
                issue in {ValidationIssue.EXTENDED_MYANMAR, ValidationIssue.PURE_NUMERAL}
                for issue, _ in result.issues
            )
            if low_severity_only:
                cleaned.append(word)

    return cleaned

Quality Reporting

from collections import Counter
from myspellchecker.text.validator import validate_text

def quality_report(text):
    # validate_text returns a ValidationResult with is_valid and issues
    result = validate_text(text)

    issue_counts = Counter()
    for issue, description in result.issues:
        issue_counts[issue.name] += 1

    print("Quality Report:")
    for issue_name, count in issue_counts.most_common():
        print(f"  {issue_name}: {count}")

See Also