The text validator checks Myanmar text for structural correctness — invalid character ordering, encoding artifacts (Zawgyi remnants), doubled diacritics, and other issues that indicate malformed input rather than spelling errors.

Overview

from myspellchecker.text.validator import validate_text, validate_word, ValidationIssue

# Validate a word (returns bool)
is_valid = validate_word("ကျောင်း")
if is_valid:
    print("Word is valid")

# Validate full text (returns ValidationResult)
result = validate_text("မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။")
if not result.is_valid:
    for issue, description in result.issues:
        print(f"{issue.value}: {description}")

ValidationIssue Enum

All validation issues are categorized using the ValidationIssue enum:
| Issue | Description | Category |
| --- | --- | --- |
| EXTENDED_MYANMAR | Contains Extended Myanmar/Shan/Mon/Karen characters (U+1050-U+109F, Extended-A/B) | Encoding |
| ZAWGYI_YA_ASAT | Zawgyi ya-medial used as pseudo-asat (e.g., ငျး) | Encoding |
| ZAWGYI_YA_TERMINAL | Zawgyi ya-medial at word-final position | Encoding |
| ZAWGYI_YA_RA | Zawgyi ya+ra medial combination | Encoding |
| ASAT_BEFORE_VOWEL | Asat (်) appears before a vowel sign (invalid ordering) | Structural |
| INCOMPLETE_VOWEL | Incomplete vowel pattern (e.g., vowel before asat, missing u-vowel in O-vowel) | Structural |
| DIGIT_TONE | Myanmar digit followed by tone mark | Structural |
| SCRAMBLED_ORDER | Scrambled character sequence (e.g., vowel-asat-vowel) | Structural |
| INVALID_START | Word starts with invalid character (not consonant, independent vowel, or digit) | Structural |
| DOUBLED_DIACRITIC | Doubled vowel, medial, or invalid tone sequence | Structural |
| VIRAMA_AT_END | Virama (္) at end of word (incomplete stacking) | Structural |
| EMPTY_OR_WHITESPACE | Empty or whitespace-only input | Structural |
| KNOWN_INVALID | Word is in the curated known-invalid words list | Quality |
| FRAGMENT_PATTERN | Segmentation fragment (consonant + asat/tone only) | Segmentation |
| DOUBLE_ENDING | Double-ending artifact (e.g., valid word + fragment merged) | Segmentation |
| INCOMPLETE_WORD | Incomplete word (ends with medial, incomplete stacking, or bare consonant after medial) | Segmentation |
| MIXED_LETTER_NUMERAL | Mixed Myanmar letter and numeral (should be split) | Quality |
| ASAT_INITIAL | Asat-initial fragment (consonant+asat at word start) | Segmentation |
| COMPOUND_TRUNCATED | Compound word with truncated ending | Quality |
| MISSING_E_VOWEL | Missing ေ in ောင pattern (common typo) | Quality |
| PURE_NUMERAL | Pure Myanmar numeral sequence (not a word) | Quality |
| DOUBLED_CONSONANT | Two identical consonants only (segmentation artifact) | Quality |
| INVALID_VOWEL_SEQUENCE_SYLLABLE | Invalid vowel sequence (e.g., doubled i-vowels, ာု) | Structural |
| BARE_CONSONANT_END | Word ends with bare consonant without asat | Segmentation |
| STACKED_CONSONANT_START | Word starts with stacked consonant marker (္) | Segmentation |
| MEDIAL_START | Word starts with a medial (ျ ြ ွ ှ) | Segmentation |
| DEPENDENT_VOWEL_START | Word starts with a dependent vowel sign | Segmentation |
| GREAT_SA_START | Word starts with Great Sa (ဿ) | Segmentation |
| ASAT_ANUSVARA_SEQUENCE | Contains phonetically impossible ်ံ sequence | Segmentation |
| DOUBLED_INDEPENDENT_VOWEL | Two identical independent vowels (OCR error) | Segmentation |
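Since each issue belongs to one of four categories, reporting code can bucket a result's issues by category. The sketch below is illustrative, not part of the library: `ISSUE_CATEGORIES` is a hand-built mapping transcribed from the table above (only a few entries shown), and issues are keyed by name for simplicity.

```python
from collections import defaultdict

# Category for each issue name, transcribed from the table above
# (abbreviated; the remaining entries follow the same pattern).
ISSUE_CATEGORIES = {
    "EXTENDED_MYANMAR": "Encoding",
    "ZAWGYI_YA_ASAT": "Encoding",
    "ASAT_BEFORE_VOWEL": "Structural",
    "FRAGMENT_PATTERN": "Segmentation",
    "KNOWN_INVALID": "Quality",
}

def group_by_category(issues):
    """Group (issue_name, description) pairs by their category."""
    grouped = defaultdict(list)
    for name, description in issues:
        grouped[ISSUE_CATEGORIES.get(name, "Unknown")].append(description)
    return dict(grouped)

issues = [
    ("ASAT_BEFORE_VOWEL", "Asat before vowel"),
    ("FRAGMENT_PATTERN", "Fragment: consonant + asat only"),
]
print(group_by_category(issues))
# {'Structural': ['Asat before vowel'], 'Segmentation': ['Fragment: consonant + asat only']}
```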

Core Functions

validate_word

Quick boolean validation check for a single word:
from myspellchecker.text.validator import validate_word

# Check a valid word (returns bool)
is_valid = validate_word("မြန်မာ")
print(is_valid)  # True

# Check word with Zawgyi artifacts
is_valid = validate_word("ေကာင္း")  # Zawgyi encoding
print(is_valid)  # False

# Check invalid syllable
is_valid = validate_word("ျက")  # Invalid start
print(is_valid)  # False

validate_text

Validates text and returns detailed issue information:
from myspellchecker.text.validator import validate_text

text = "မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင် တည်ရှိသည်။"
result = validate_text(text)

# ValidationResult has: is_valid, issues, cleaned_text
if not result.is_valid:
    for issue, description in result.issues:
        print(f"{issue.name}: {description}")

Validation Categories

Structural Validation

Checks Myanmar character structure rules using validate_text for detailed issues:
from myspellchecker.text.validator import validate_text

# Asat before vowel check
result = validate_text("ကျွန်ုပ်")  # Asat before vowel sign
# result.issues may contain (ValidationIssue.ASAT_BEFORE_VOWEL, "Asat before vowel: ်ု")

# Doubled diacritic check
result = validate_text("ကာါ")  # Doubled vowel signs
# result.issues may contain (ValidationIssue.DOUBLED_DIACRITIC, "Doubled vowel: ...")

# Virama at end of word
result = validate_text("က္")  # Incomplete stacking
# result.issues may contain (ValidationIssue.VIRAMA_AT_END, "Virama at word end")

Encoding Detection

Detects legacy Zawgyi encoding:
# Illustrative Zawgyi detection patterns (a sketch, not the exact internals):
ZAWGYI_PATTERNS = [
    r"\u1031[\u1000-\u1021]",   # e-vowel (ေ) before a consonant: Zawgyi vowel-first order
    r"\u1039[^\u1000-\u1021]",  # virama (္) followed by a non-consonant: invalid stacking
    "\u1033",                   # Zawgyi-specific codepoint
]

is_valid = validate_word("ေကာင္း")
# Returns: False (contains Zawgyi artifacts)

Quality Filters

Detects low-quality or incomplete words:
from myspellchecker.text.validator import (
    is_fragment_pattern,
    is_incomplete_word,
    is_truncated_word,
    is_quality_word,
)

# Fragment detection (returns Tuple[bool, Optional[str]])
is_frag, reason = is_fragment_pattern("င်း")  # (True, "description")

# Incomplete word detection (returns Tuple[bool, Optional[str]])
is_inc, reason = is_incomplete_word("ကျော")  # (True, "description")

# Truncation detection (frequency-based; second arg is a callable that
# returns a word's corpus frequency, here backed by an existing freq_dict)
is_trunc, suggestion = is_truncated_word("ချိန", lambda word: freq_dict.get(word, 0))
# (True, 'ချိန်')

# Overall quality check
is_quality_word("ကျောင်း")  # True - high quality

Known Invalid Words

A curated list of ~50 verified invalid words that commonly appear in corpora:
from myspellchecker.text.validator import KNOWN_INVALID_WORDS

# A few entries from the imported set (~50 total):
#   "သည်မ"     - truncated
#   "သည်င်း"   - invalid merge
#   "ကို့"      - invalid tone
#   "တွင့်"     - invalid ending

# Check if word is known invalid
if word in KNOWN_INVALID_WORDS:
    issues.append(ValidationIssue.KNOWN_INVALID)

Valid Pali/Sanskrit Endings

Whitelist of ~80 words with valid bare consonant endings (Pali/Sanskrit loanwords):
from myspellchecker.text.validator import VALID_PALI_BARE_ENDINGS

# A few entries from the imported set (~80 total):
#   "ဗုဒ္ဓ"   - Buddha
#   "သံဃာ"   - Sangha
#   "ဓမ္မ"    - Dhamma

# Used to avoid false positives on religious/formal terms
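To illustrate how such a whitelist prevents false positives, the sketch below checks membership before flagging a bare-consonant ending. `flag_bare_ending` and `ends_with_bare_consonant` are hypothetical helpers, and the set is a two-entry stand-in for the library's ~80-word export.

```python
# Stand-in for the real VALID_PALI_BARE_ENDINGS (~80 entries in the library).
VALID_PALI_BARE_ENDINGS = {"ဗုဒ္ဓ", "ဓမ္မ"}

def ends_with_bare_consonant(word: str) -> bool:
    """True if the word ends on a Myanmar consonant (U+1000-U+1021) with no asat."""
    return bool(word) and "\u1000" <= word[-1] <= "\u1021"

def flag_bare_ending(word: str) -> bool:
    """Flag a bare-consonant ending, but skip known Pali/Sanskrit loanwords."""
    if word in VALID_PALI_BARE_ENDINGS:
        return False
    return ends_with_bare_consonant(word)

print(flag_bare_ending("ဗုဒ္ဓ"))  # False - whitelisted loanword
print(flag_bare_ending("ချိန"))   # True - bare consonant, likely truncated
```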

Extended Myanmar Detection

Detects Myanmar Extended-A and Extended-B characters:
# Extended ranges (range() end is exclusive)
EXTENDED_A = range(0xAA60, 0xAA80)  # U+AA60-U+AA7F (Myanmar Extended-A)
EXTENDED_B = range(0xA9E0, 0xAA00)  # U+A9E0-U+A9FF (Myanmar Extended-B)

# These are used in minority languages (Shan, Mon, etc.)
is_valid = validate_word("ꩮꩯꩰ")
# Returns: False (contains Extended Myanmar characters)

# Use validate_text for detailed issue information
result = validate_text("ꩮꩯꩰ")
# result.issues contains (ValidationIssue.EXTENDED_MYANMAR, "Extended Myanmar char: ...")

Integration with SpellChecker

The validation module integrates with the main spell checker:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.validation_configs import ValidationConfig

config = SpellCheckerConfig(
    use_rule_based_validation=True,  # Enable structural validation
    validation=ValidationConfig(
        use_zawgyi_detection=True,  # Enable Zawgyi detection
        strict_validation=True,     # Enable strict validation
    )
)

checker = SpellChecker(config=config)
result = checker.check("မြန်မာစာ")

# Structural issues are reported in result.errors
for error in result.errors:
    if "structural" in str(error.error_type):
        print(f"Structural issue: {error.text}")

Data Pipeline Integration

Used in the data pipeline to filter corpus words:
from myspellchecker.text.validator import validate_word, validate_text, is_quality_word

def filter_corpus(words):
    """Filter corpus to only include quality words."""
    quality_words = []
    for word in words:
        # validate_word returns bool (True if valid)
        is_valid = validate_word(word)

        if is_valid and is_quality_word(word):
            quality_words.append(word)

    return quality_words

Performance

| Operation | Time | Notes |
| --- | --- | --- |
| validate_word | <1ms | Single word validation |
| validate_text | ~10ms/1K words | Batch validation |
| Pattern matching | <0.1ms | Compiled regex |
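The figures above are indicative; timings vary by machine and input. A quick way to measure locally is a `timeit` micro-benchmark. The sketch below uses a trivial stand-in function so it runs standalone; swap in the real `validate_word` to benchmark the library itself.

```python
import timeit

def validate_word(word: str) -> bool:
    """Trivial stand-in so this sketch runs standalone; replace with the real function."""
    return bool(word.strip())

N = 10_000
elapsed = timeit.timeit(lambda: validate_word("မြန်မာ"), number=N)
print(f"validate_word: {elapsed / N * 1e6:.2f} us/call")
```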

Use Cases

Corpus Cleaning

# Clean corpus before building dictionary
from myspellchecker.text.validator import validate_word, validate_text, ValidationIssue

def clean_corpus(words):
    cleaned = []
    for word in words:
        # validate_word returns bool; use validate_text for detailed issues
        if validate_word(word):
            cleaned.append(word)
        else:
            # For finer control, use validate_text to inspect specific issues
            result = validate_text(word)
            low_severity_only = all(
                issue in {ValidationIssue.EXTENDED_MYANMAR, ValidationIssue.PURE_NUMERAL}
                for issue, _ in result.issues
            )
            if low_severity_only:
                cleaned.append(word)

    return cleaned

Quality Reporting

from collections import Counter
from myspellchecker.text.validator import validate_text

def quality_report(text):
    # validate_text returns a ValidationResult with is_valid and issues
    result = validate_text(text)

    issue_counts = Counter()
    for issue, description in result.issues:
        issue_counts[issue.name] += 1

    print("Quality Report:")
    for issue_name, count in issue_counts.most_common():
        print(f"  {issue_name}: {count}")

See Also