The validation pipeline is composed of independent strategies, each targeting a specific error type — from tone mark disambiguation to AI-powered semantic analysis. Strategies execute in priority order and share context so later strategies can skip positions already flagged by earlier ones.

Overview

The validation pipeline processes text through multiple strategies, each checking for different error types:
| Strategy | Priority | Purpose | Error Type |
| --- | --- | --- | --- |
| ToneValidationStrategy | 10 | Tone mark disambiguation | tone_ambiguity |
| OrthographyValidationStrategy | 15 | Medial order and compatibility | medial_order_error |
| SyntacticValidationStrategy | 20 | Grammar rule checking | syntax_error |
| POSSequenceValidationStrategy | 30 | POS sequence validation | pos_sequence_error |
| QuestionStructureValidationStrategy | 40 | Question structure | question_structure |
| HomophoneValidationStrategy | 45 | Homophone detection | homophone_error |
| NgramContextValidationStrategy | 50 | N-gram probability | context_probability |
| ErrorDetectionStrategy | 65 | AI token classification (opt-in) | ai_detected |
| SemanticValidationStrategy | 70 | AI-powered semantic (opt-in) | semantic_error |
Lower priority values run first.

ValidationContext

All strategies receive a shared ValidationContext containing sentence-level information:
from myspellchecker.core.validation_strategies.base import ValidationContext

context = ValidationContext(
    sentence="သူ သွား ကျောင်း",
    words=["သူ", "သွား", "ကျောင်း"],
    word_positions=[0, 6, 15],
    is_name_mask=[False, False, False],
    existing_errors=set(),  # Positions with errors from previous strategies
    sentence_type="statement",  # statement, question, command
    pos_tags=["PRON", "V", "N"]  # POS tags if available
)

Context Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| sentence | str | Full original sentence |
| words | List[str] | Tokenized words |
| word_positions | List[int] | Character position of each word |
| is_name_mask | List[bool] | True if word is a proper name |
| existing_errors | Set[int] | Word positions already flagged |
| sentence_type | str | Sentence type for context |
| pos_tags | List[str] | POS tags (if available) |

Strategy Implementations

ToneValidationStrategy (Priority: 10)

Handles tone mark disambiguation using context.
from myspellchecker.core.validation_strategies.tone_strategy import ToneValidationStrategy
from myspellchecker.text.tone import ToneDisambiguator

disambiguator = ToneDisambiguator()
strategy = ToneValidationStrategy(
    tone_disambiguator=disambiguator,
    confidence_threshold=0.5
)

errors = strategy.validate(context)
Detection:
  • Missing tone marks (ငါ → ငါး in number context)
  • Wrong tone marks based on context
  • Ambiguous words resolved by surrounding words

OrthographyValidationStrategy (Priority: 15)

Validates medial consonant ordering and compatibility (UTN #11 rules) at the word level. Automatically included in the default pipeline.
Detection:
  • Incorrect medial consonant order (e.g., ွ before ျ)
  • Incompatible medial-consonant combinations
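The ordering rule can be illustrated with a minimal, self-contained sketch. This is not the library's implementation; MEDIAL_RANK and medials_in_order are hypothetical names, and the ranks follow the canonical UTN #11 medial order (ya/ra, then wa, then ha):

```python
# Hypothetical sketch of the UTN #11 medial-ordering check (not library code).
# Medial signs must appear in canonical order: ya/ra, then wa, then ha.
MEDIAL_RANK = {
    "\u103B": 0,  # MEDIAL YA (ျ)
    "\u103C": 0,  # MEDIAL RA (ြ) -- shares a slot with ya; they cannot co-occur
    "\u103D": 1,  # MEDIAL WA (ွ)
    "\u103E": 2,  # MEDIAL HA (ှ)
}

def medials_in_order(cluster: str) -> bool:
    """Return True if the medial signs in a cluster follow canonical order."""
    ranks = [MEDIAL_RANK[ch] for ch in cluster if ch in MEDIAL_RANK]
    # Each slot may be used at most once, so ranks must strictly increase.
    return all(a < b for a, b in zip(ranks, ranks[1:]))

print(medials_in_order("ကျွ"))  # ya then wa: canonical order
print(medials_in_order("ကွျ"))  # wa before ya: order error
```

Because ya and ra share rank 0, the same check also rejects the incompatible ya+ra combination.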

SyntacticValidationStrategy (Priority: 20)

Validates grammar rules and particle usage.
from myspellchecker.core.validation_strategies.syntactic_strategy import SyntacticValidationStrategy

strategy = SyntacticValidationStrategy(
    syntactic_rule_checker=syntactic_checker,
    confidence=0.80
)

errors = strategy.validate(context)
Detection:
  • Particle errors (မှာ vs မှ)
  • Medial confusion (ျ vs ြ)
  • Missing particles
  • Invalid word combinations

POSSequenceValidationStrategy (Priority: 30)

Validates POS tag sequences against expected patterns.
from myspellchecker.core.validation_strategies.pos_sequence_strategy import POSSequenceValidationStrategy

strategy = POSSequenceValidationStrategy(
    viterbi_tagger=pos_tagger,
    confidence=0.70
)

errors = strategy.validate(context)
Detection:
  • P-P: Consecutive particles → error (always flagged)
  • N-N: Consecutive nouns without particle → warning (logged, not surfaced as error)
  • V-V: Consecutive verbs → info (serial verb constructions are usually valid)
Serial Verb Support: Myanmar is a serial verb language where verb-verb (V-V) sequences are often valid. The strategy recognizes valid serial verb constructions:
  • Auxiliary verbs: နေ (progressive), ထား (resultative), လိုက် (action manner)
  • Modal verbs: နိုင် (ability), ချင် (desire), ရ (permission)
  • Directional verbs: သွား (away), လာ (toward)
# "စားသွား" (eat+go = go eat) is a valid V-V sequence
# Strategy checks is_valid_verb_sequence() before flagging V-V as error
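The serial-verb check can be sketched as a simple set lookup. This is an illustrative stand-in, not the library's is_valid_verb_sequence(); the verb sets are the examples listed above:

```python
# Illustrative sketch of a serial-verb check (stand-in for the real one).
# A V-V sequence is accepted when the second verb is a known auxiliary,
# modal, or directional verb.
AUXILIARY = {"နေ", "ထား", "လိုက်"}   # progressive, resultative, action manner
MODAL = {"နိုင်", "ချင်", "ရ"}        # ability, desire, permission
DIRECTIONAL = {"သွား", "လာ"}         # away, toward

def is_valid_verb_sequence(v1: str, v2: str) -> bool:
    """Return True if verb v2 can legitimately follow verb v1."""
    return v2 in AUXILIARY or v2 in MODAL or v2 in DIRECTIONAL

print(is_valid_verb_sequence("စား", "သွား"))  # eat + go: valid serial verb
```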

QuestionStructureValidationStrategy (Priority: 40)

Validates question sentence structure.
from myspellchecker.core.validation_strategies.question_strategy import QuestionStructureValidationStrategy

strategy = QuestionStructureValidationStrategy(
    confidence=0.75
)

errors = strategy.validate(context)
Detection:
  • Missing question particles (လား, သလဲ)
  • Wrong question particle for context
  • Question word agreement
Enclitic Question Particles: The strategy detects question particles attached directly to verbs (enclitics):
# "သွားလား" (go+question = did you go?) is recognized as a proper question
# No error generated for verb+particle combinations
Negative Indefinite Handling: The strategy correctly identifies negative indefinite constructions as statements, not questions:
# "ဘယ်သူမှ မလာဘူး" = "Nobody came" (statement, NOT question)
# Question word + "မှ" suffix + negative verb = statement pattern
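Both patterns above can be sketched together. This is a simplified stand-in for the strategy's logic, using the particles and question words from the examples (is_question and QUESTION_WORDS are hypothetical names):

```python
# Simplified sketch of the two question-structure patterns (not library code).
QUESTION_PARTICLES = ("လား", "သလဲ")
QUESTION_WORDS = {"ဘယ်သူ", "ဘာ", "ဘယ်"}  # who, what, which (illustrative)

def is_question(words: list[str]) -> bool:
    # Enclitic pattern: a question particle attached to the final verb.
    if words and words[-1].endswith(QUESTION_PARTICLES):
        return True
    # Negative indefinite: question word + မှ + negative verb is a statement.
    for i, w in enumerate(words[:-1]):
        if any(w.startswith(q) for q in QUESTION_WORDS) and w.endswith("မှ"):
            nxt = words[i + 1]
            if nxt.startswith("မ") and nxt.endswith("ဘူး"):
                return False
    return any(w in QUESTION_WORDS for w in words)

print(is_question(["သွားလား"]))             # enclitic question
print(is_question(["ဘယ်သူမှ", "မလာဘူး"]))  # "Nobody came": a statement
```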

NgramContextValidationStrategy (Priority: 50)

Uses bigram/trigram probabilities to detect unlikely sequences.
from myspellchecker.core.validation_strategies.ngram_strategy import NgramContextValidationStrategy

strategy = NgramContextValidationStrategy(
    context_checker=ngram_checker,
    provider=provider,
    confidence_high=0.75,
    confidence_low=0.6,
    max_suggestions=5,
    edit_distance=2
)

errors = strategy.validate(context)
Detection:
  • Low probability word pairs
  • Unusual word combinations
  • Real-word errors (correct spelling, wrong context)
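The core idea can be shown with a toy probability table (the probabilities and helper name below are made up for illustration; the real strategy queries the n-gram provider):

```python
# Toy sketch of n-gram context checking. Probabilities are invented;
# the real strategy reads them from the configured n-gram provider.
BIGRAM_PROB = {
    ("သူ", "သွား"): 0.08,
    ("သွား", "ကျောင်း"): 0.05,
    ("သူ", "ကျောင်း"): 0.0002,  # grammatical words, unlikely pairing
}

def low_probability_pairs(words, threshold=0.001):
    """Yield (index, pair) for adjacent word pairs below the threshold."""
    for i in range(len(words) - 1):
        pair = (words[i], words[i + 1])
        if BIGRAM_PROB.get(pair, 0.0) < threshold:
            yield i, pair

print(list(low_probability_pairs(["သူ", "ကျောင်း"])))      # flagged
print(list(low_probability_pairs(["သူ", "သွား", "ကျောင်း"])))  # clean
```

This is how real-word errors surface: each word is spelled correctly, but the pair is statistically unlikely.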

HomophoneValidationStrategy (Priority: 45)

Detects homophone confusion based on context.
from myspellchecker.core.validation_strategies.homophone_strategy import HomophoneValidationStrategy

strategy = HomophoneValidationStrategy(
    homophone_checker=homophone_checker,
    provider=ngram_provider,
    confidence=0.80,
    improvement_ratio=5.0,  # Require 5x better probability
    min_probability=0.001    # Minimum threshold to prevent false positives
)

errors = strategy.validate(context)
Detection:
  • Homophone pairs (ကား/ကာ, သာ/သား)
  • Context-based correct form selection
  • Sound-alike word confusion
Minimum Probability Threshold: The min_probability parameter prevents false positives from infrequent n-gram occurrences. When the current word has zero probability (unseen n-gram), a homophone is only suggested if its probability exceeds this threshold:
# With min_probability=0.001:
# - Homophone with prob 0.01 → suggested (above threshold)
# - Homophone with prob 0.0001 → NOT suggested (below threshold)
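The suggestion gate combining improvement_ratio and min_probability can be sketched as follows (a simplified stand-in, not the strategy's actual code):

```python
# Simplified sketch of the homophone suggestion gate described above.
def should_suggest(current_prob: float, homophone_prob: float,
                   improvement_ratio: float = 5.0,
                   min_probability: float = 0.001) -> bool:
    if current_prob == 0.0:
        # Unseen n-gram: only suggest if the homophone clears the floor.
        return homophone_prob >= min_probability
    # Seen n-gram: the homophone must be substantially more likely.
    return homophone_prob / current_prob >= improvement_ratio

print(should_suggest(0.0, 0.01))    # above the floor -> suggested
print(should_suggest(0.0, 0.0001))  # below the floor -> not suggested
```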

ErrorDetectionStrategy (Priority: 65) — Opt-in Required

AI-powered error detection using token classification. Unlike the MLM-based SemanticChecker (which masks each word and requires N forward passes), this strategy classifies all tokens in a single forward pass (~10ms), making it practical for real-time use. This strategy is not active by default. You must train a detector model first, then configure ErrorDetectorConfig with the model path.
from myspellchecker.core.validation_strategies.error_detection_strategy import ErrorDetectionStrategy

strategy = ErrorDetectionStrategy(error_detector=error_detector)

errors = strategy.validate(context)
Detection:
  • Token-level error classification (CORRECT vs ERROR)
  • Single forward pass for entire sentence
  • Complements N-gram and semantic strategies
How it differs from SemanticValidationStrategy:
| Aspect | ErrorDetectionStrategy (65) | SemanticValidationStrategy (70) |
| --- | --- | --- |
| Approach | Token classification | Masked Language Modeling |
| Speed | ~10ms (single pass) | ~50-150ms × N words |
| Output | Error flags only | Error flags + suggestions |
| Model | Fine-tuned XLM-RoBERTa | User-trained RoBERTa/BERT |
| Training | Requires train-detector | Requires train-model |
Empty suggestions: ErrorDetector only detects errors, not corrections. When both ErrorDetector and SemanticChecker are configured, the SemanticChecker at priority 70 can provide suggestions for AI-detected positions.

SemanticValidationStrategy (Priority: 70) — Opt-in Required

AI-powered validation using ONNX models. This strategy is not active by default. You must train a semantic model first, then configure SemanticConfig with the model path and set use_proactive_scanning=True.
from myspellchecker.core.validation_strategies.semantic_strategy import SemanticValidationStrategy

strategy = SemanticValidationStrategy(
    semantic_checker=semantic_checker,
    proactive_confidence_threshold=0.6
)

errors = strategy.validate(context)
Detection:
  • Semantic anomalies
  • Word meaning in context
  • Deep contextual errors

Creating Custom Strategies

Implement the ValidationStrategy abstract base class:
from myspellchecker.core.validation_strategies.base import (
    ValidationStrategy,
    ValidationContext
)
from myspellchecker.core.response import Error, ContextError

class CustomValidationStrategy(ValidationStrategy):
    """Custom validation strategy."""

    def __init__(self, config: dict):
        self.config = config

    def validate(self, context: ValidationContext) -> list[Error]:
        """Validate and return errors."""
        errors = []

        for i, word in enumerate(context.words):
            # Skip if already has an error
            if context.word_positions[i] in context.existing_errors:
                continue

            # Skip proper names
            if i < len(context.is_name_mask) and context.is_name_mask[i]:
                continue

            # Your validation logic
            if self._is_invalid(word, context):
                errors.append(ContextError(
                    text=word,
                    position=context.word_positions[i],
                    error_type="custom_error",
                    suggestions=self._get_suggestions(word),
                    confidence=0.80,
                    probability=0.0,
                    prev_word=context.words[i-1] if i > 0 else ""
                ))

                # Mark as having error
                context.existing_errors.add(context.word_positions[i])

        return errors

    def priority(self) -> int:
        """Return priority (lower runs first)."""
        return 45  # Between POS and N-gram

    def _is_invalid(self, word: str, context: ValidationContext) -> bool:
        # Implement validation logic
        return False

    def _get_suggestions(self, word: str) -> list[str]:
        # Generate suggestions
        return []

Strategy Composition

In the default pipeline, SpellChecker coordinates validation directly through its validators:
  1. SyllableValidator — validates each syllable (layer 1)
  2. WordValidator — validates words via SymSpell (layer 2)
  3. ContextValidator — orchestrates validation strategies (layer 3)
The ContextValidator receives a list of strategies built by SpellCheckerBuilder and executes them in priority order within each sentence.
from myspellchecker.core.builder import SpellCheckerBuilder

# Builder wires strategies automatically based on config
checker = SpellCheckerBuilder(config).with_provider(provider).build()
result = checker.check("မြန်မာ စာ")

Execution Order

  1. Strategies are sorted by priority (ascending)
  2. Each strategy receives the shared ValidationContext
  3. Strategies can check existing_errors to skip already-flagged words
  4. Strategies add their flagged positions to existing_errors
  5. Errors from all strategies are collected and returned
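The five steps above can be sketched with minimal stand-in classes (EchoStrategy and the dict-based context are illustrative only; the real pipeline uses ValidationStrategy and ValidationContext):

```python
# Illustrative sketch of the execution order, with stand-in classes.
class EchoStrategy:
    def __init__(self, prio, flags):
        self._prio, self._flags = prio, flags

    def priority(self):
        return self._prio

    def validate(self, context):
        errors = []
        for pos in sorted(self._flags):
            if pos in context["existing_errors"]:
                continue  # step 3: skip already-flagged positions
            context["existing_errors"].add(pos)  # step 4: mark as flagged
            errors.append((pos, self._prio))
        return errors

context = {"existing_errors": set()}  # step 2: shared context
strategies = sorted([EchoStrategy(50, {6}), EchoStrategy(10, {0, 6})],
                    key=lambda s: s.priority())  # step 1: ascending priority
all_errors = [e for s in strategies for e in s.validate(context)]  # step 5
print(all_errors)  # position 6 is flagged once, by the priority-10 strategy
```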

Configuration

Enable/disable strategies via configuration:
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.validation_configs import ValidationConfig
# Import path assumed for the opt-in configs used below
from myspellchecker.core.config import SemanticConfig, ErrorDetectorConfig

config = SpellCheckerConfig(
    use_context_checker=True,  # Enable N-gram strategy
    use_phonetic=True,         # Enable homophone detection
    validation=ValidationConfig(
        use_homophone_detection=True,      # Toggle homophone strategy (default: True)
        use_orthography_validation=True,   # Toggle orthography strategy (default: True)
        enable_strategy_timing=False,      # Per-strategy timing at DEBUG level (default: False)
    ),
    # Semantic config enables semantic strategy (opt-in, requires trained model)
    semantic=SemanticConfig(
        model_path="./my-model/model.onnx",       # Your trained model
        use_proactive_scanning=True,
    ),
    # Error detector config enables error detection strategy (opt-in, requires trained model)
    error_detector=ErrorDetectorConfig(
        model_path="./detector/onnx/model.onnx",   # Your trained detector
        tokenizer_path="./detector/onnx",
        confidence_threshold=0.7,
    ),
)

Error Types

Each strategy produces specific error types:
| Error Type | Strategy | Description |
| --- | --- | --- |
| tone_ambiguity | Tone | Tone mark disambiguation |
| medial_order_error | Orthography | Medial order/compatibility violation |
| syntax_error | Syntactic | Grammar rule violation |
| pos_sequence_error | POS | Invalid POS sequence (P-P) |
| question_structure | Question | Question structure issue |
| homophone_error | Homophone | Sound-alike confusion |
| context_probability | N-gram | Low probability sequence |
| ai_detected | Error Detection | AI-flagged error (token classification) |
| semantic_error | Semantic | AI-detected anomaly (opt-in) |

Best Practices

  1. Priority Selection: Choose priorities that make sense for your validation order
  2. Skip Flagged Words: Always check existing_errors to avoid duplicate errors
  3. Skip Names: Respect the is_name_mask to avoid flagging proper names
  4. Confidence Scores: Use appropriate confidence levels for your error type
  5. Performance: Heavy validations (semantic) should run last

See Also