The validation pipeline is composed of independent strategies, each targeting a specific error type — from tone mark disambiguation to AI-powered semantic analysis. Strategies execute in priority order and share context so later strategies can skip positions already flagged by earlier ones.

Overview

The validation pipeline processes text through multiple strategies, each checking for different error types:
| Strategy | Priority | Purpose | Error Type |
| --- | --- | --- | --- |
| ToneValidationStrategy | 10 | Tone mark disambiguation | tone_ambiguity |
| OrthographyValidationStrategy | 15 | Medial order and compatibility | medial_order_error |
| SyntacticValidationStrategy | 20 | Grammar rule checking | syntax_error |
| POSSequenceValidationStrategy | 30 | POS sequence validation | pos_sequence_error |
| QuestionStructureValidationStrategy | 40 | Question structure | question_structure |
| HomophoneValidationStrategy | 45 | Homophone detection | homophone_error |
| NgramContextValidationStrategy | 50 | N-gram probability | context_probability |
| ErrorDetectionStrategy | 65 | AI token classification (opt-in) | ai_detected |
| SemanticValidationStrategy | 70 | AI-powered semantic (opt-in) | semantic_error |
Lower priority values run first.

ValidationContext

All strategies receive a shared ValidationContext containing sentence-level information:
from myspellchecker.core.validation_strategies.base import ValidationContext

context = ValidationContext(
    sentence="သူ သွား ကျောင်း",
    words=["သူ", "သွား", "ကျောင်း"],
    word_positions=[0, 6, 15],
    is_name_mask=[False, False, False],
    existing_errors=set(),  # Positions with errors from previous strategies
    sentence_type="statement",  # statement, question, command
    pos_tags=["PRON", "V", "N"]  # POS tags if available
)

Context Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| sentence | str | Full original sentence |
| words | List[str] | Tokenized words |
| word_positions | List[int] | Character position of each word |
| is_name_mask | List[bool] | True if word is a proper name |
| existing_errors | Set[int] | Word positions already flagged |
| sentence_type | str | Sentence type for context |
| pos_tags | List[str] | POS tags (if available) |

Strategy Implementations

ToneValidationStrategy (Priority: 10)

Handles tone mark disambiguation using context.
from myspellchecker.core.validation_strategies.tone_strategy import ToneValidationStrategy
from myspellchecker.text.tone import ToneDisambiguator

disambiguator = ToneDisambiguator()
strategy = ToneValidationStrategy(
    tone_disambiguator=disambiguator,
    confidence_threshold=0.5
)

errors = strategy.validate(context)
Detection:
  • Missing tone marks (ငါ → ငါး in number context)
  • Wrong tone marks based on context
  • Ambiguous words resolved by surrounding words

OrthographyValidationStrategy (Priority: 15)

Validates medial consonant ordering and compatibility (UTN #11 rules) at the word level. Automatically included in the default pipeline.
Detection:
  • Incorrect medial consonant order (e.g., ွ before ျ)
  • Incompatible medial-consonant combinations
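The ordering rule can be illustrated with a minimal, self-contained sketch. This is not the library's implementation; MEDIAL_RANK and medials_in_order are hypothetical names, and the ranks follow the canonical UTN #11 medial order (ya/ra, then wa, then ha):

```python
# Hypothetical sketch of the UTN #11 medial-ordering check (not library code).
# Medial signs must appear in canonical order: ya/ra, then wa, then ha.
MEDIAL_RANK = {
    "\u103B": 0,  # MEDIAL YA (ျ)
    "\u103C": 0,  # MEDIAL RA (ြ) -- shares a slot with ya; they cannot co-occur
    "\u103D": 1,  # MEDIAL WA (ွ)
    "\u103E": 2,  # MEDIAL HA (ှ)
}

def medials_in_order(cluster: str) -> bool:
    """Return True if the medial signs in a cluster follow canonical order."""
    ranks = [MEDIAL_RANK[ch] for ch in cluster if ch in MEDIAL_RANK]
    # Each slot may be used at most once, so ranks must strictly increase.
    return all(a < b for a, b in zip(ranks, ranks[1:]))

print(medials_in_order("ကျွ"))  # ya then wa: canonical order
print(medials_in_order("ကွျ"))  # wa before ya: order error
```

Because ya and ra share rank 0, the same check also rejects the incompatible ya+ra combination.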

SyntacticValidationStrategy (Priority: 20)

Validates grammar rules and particle usage.
from myspellchecker.core.validation_strategies.syntactic_strategy import SyntacticValidationStrategy

strategy = SyntacticValidationStrategy(
    syntactic_rule_checker=syntactic_checker,
    confidence=0.80
)

errors = strategy.validate(context)
Detection:
  • Particle errors (မှာ vs မှ)
  • Medial confusion (ျ vs ြ)
  • Missing particles
  • Invalid word combinations

POSSequenceValidationStrategy (Priority: 30)

Validates POS tag sequences against expected patterns.
from myspellchecker.core.validation_strategies.pos_sequence_strategy import POSSequenceValidationStrategy

strategy = POSSequenceValidationStrategy(
    viterbi_tagger=pos_tagger,
    confidence=0.70
)

errors = strategy.validate(context)
Detection:
  • P-P: Consecutive particles → error (always flagged)
  • N-N: Consecutive nouns without particle → warning (logged, not surfaced as error)
  • V-V: Consecutive verbs → info (serial verb constructions are usually valid)
Serial Verb Support: Myanmar is a serial verb language where verb-verb (V-V) sequences are often valid. The strategy recognizes valid serial verb constructions:
  • Auxiliary verbs: နေ (progressive), ထား (resultative), လိုက် (action manner)
  • Modal verbs: နိုင် (ability), ချင် (desire), ရ (permission)
  • Directional verbs: သွား (away), လာ (toward)
# "စားသွား" (eat+go = go eat) is a valid V-V sequence
# Strategy checks is_valid_verb_sequence() before flagging V-V as error
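The serial-verb check can be sketched as a simple set lookup. This is an illustrative stand-in, not the library's is_valid_verb_sequence(); the verb sets are the examples listed above:

```python
# Illustrative sketch of a serial-verb check (stand-in for the real one).
# A V-V sequence is accepted when the second verb is a known auxiliary,
# modal, or directional verb.
AUXILIARY = {"နေ", "ထား", "လိုက်"}   # progressive, resultative, action manner
MODAL = {"နိုင်", "ချင်", "ရ"}        # ability, desire, permission
DIRECTIONAL = {"သွား", "လာ"}         # away, toward

def is_valid_verb_sequence(v1: str, v2: str) -> bool:
    """Return True if verb v2 can legitimately follow verb v1."""
    return v2 in AUXILIARY or v2 in MODAL or v2 in DIRECTIONAL

print(is_valid_verb_sequence("စား", "သွား"))  # eat + go: valid serial verb
```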

QuestionStructureValidationStrategy (Priority: 40)

Validates question sentence structure.
from myspellchecker.core.validation_strategies.question_strategy import QuestionStructureValidationStrategy

strategy = QuestionStructureValidationStrategy(
    confidence=0.75
)

errors = strategy.validate(context)
Detection:
  • Missing question particles (လား, သလဲ)
  • Wrong question particle for context
  • Question word agreement
Enclitic Question Particles: The strategy detects question particles attached directly to verbs (enclitics):
# "သွားလား" (go+question = did you go?) is recognized as a proper question
# No error generated for verb+particle combinations
Negative Indefinite Handling: The strategy correctly identifies negative indefinite constructions as statements, not questions:
# "ဘယ်သူမှ မလာဘူး" = "Nobody came" (statement, NOT question)
# Question word + "မှ" suffix + negative verb = statement pattern
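Both patterns above can be sketched together. This is a simplified stand-in for the strategy's logic, using the particles and question words from the examples (is_question and QUESTION_WORDS are hypothetical names):

```python
# Simplified sketch of the two question-structure patterns (not library code).
QUESTION_PARTICLES = ("လား", "သလဲ")
QUESTION_WORDS = {"ဘယ်သူ", "ဘာ", "ဘယ်"}  # who, what, which (illustrative)

def is_question(words: list[str]) -> bool:
    # Enclitic pattern: a question particle attached to the final verb.
    if words and words[-1].endswith(QUESTION_PARTICLES):
        return True
    # Negative indefinite: question word + မှ + negative verb is a statement.
    for i, w in enumerate(words[:-1]):
        if any(w.startswith(q) for q in QUESTION_WORDS) and w.endswith("မှ"):
            nxt = words[i + 1]
            if nxt.startswith("မ") and nxt.endswith("ဘူး"):
                return False
    return any(w in QUESTION_WORDS for w in words)

print(is_question(["သွားလား"]))             # enclitic question
print(is_question(["ဘယ်သူမှ", "မလာဘူး"]))  # "Nobody came": a statement
```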

NgramContextValidationStrategy (Priority: 50)

Uses bigram/trigram probabilities to detect unlikely sequences.
from myspellchecker.core.validation_strategies.ngram_strategy import NgramContextValidationStrategy

strategy = NgramContextValidationStrategy(
    context_checker=ngram_checker,
    provider=provider,
    confidence_high=0.75,
    confidence_low=0.6,
    max_suggestions=5,
    edit_distance=2
)

errors = strategy.validate(context)
Detection:
  • Low probability word pairs
  • Unusual word combinations
  • Real-word errors (correct spelling, wrong context)
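The core idea can be shown with a toy probability table (the probabilities and helper name below are made up for illustration; the real strategy queries the n-gram provider):

```python
# Toy sketch of n-gram context checking. Probabilities are invented;
# the real strategy reads them from the configured n-gram provider.
BIGRAM_PROB = {
    ("သူ", "သွား"): 0.08,
    ("သွား", "ကျောင်း"): 0.05,
    ("သူ", "ကျောင်း"): 0.0002,  # grammatical words, unlikely pairing
}

def low_probability_pairs(words, threshold=0.001):
    """Yield (index, pair) for adjacent word pairs below the threshold."""
    for i in range(len(words) - 1):
        pair = (words[i], words[i + 1])
        if BIGRAM_PROB.get(pair, 0.0) < threshold:
            yield i, pair

print(list(low_probability_pairs(["သူ", "ကျောင်း"])))      # flagged
print(list(low_probability_pairs(["သူ", "သွား", "ကျောင်း"])))  # clean
```

This is how real-word errors surface: each word is spelled correctly, but the pair is statistically unlikely.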

HomophoneValidationStrategy (Priority: 45)

Detects homophone confusion based on context.
from myspellchecker.core.validation_strategies.homophone_strategy import HomophoneValidationStrategy

strategy = HomophoneValidationStrategy(
    homophone_checker=homophone_checker,
    provider=ngram_provider,
    confidence=0.80,
    improvement_ratio=5.0,  # Require 5x better probability
    min_probability=0.001    # Minimum threshold to prevent false positives
)

errors = strategy.validate(context)
Detection:
  • Homophone pairs (ကား/ကာ, သာ/သား)
  • Context-based correct form selection
  • Sound-alike word confusion
Minimum Probability Threshold: The min_probability parameter prevents false positives from infrequent n-gram occurrences. When the current word has zero probability (unseen n-gram), a homophone is only suggested if its probability exceeds this threshold:
# With min_probability=0.001:
# - Homophone with prob 0.01 → suggested (above threshold)
# - Homophone with prob 0.0001 → NOT suggested (below threshold)
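The suggestion gate combining improvement_ratio and min_probability can be sketched as follows (a simplified stand-in, not the strategy's actual code):

```python
# Simplified sketch of the homophone suggestion gate described above.
def should_suggest(current_prob: float, homophone_prob: float,
                   improvement_ratio: float = 5.0,
                   min_probability: float = 0.001) -> bool:
    if current_prob == 0.0:
        # Unseen n-gram: only suggest if the homophone clears the floor.
        return homophone_prob >= min_probability
    # Seen n-gram: the homophone must be substantially more likely.
    return homophone_prob / current_prob >= improvement_ratio

print(should_suggest(0.0, 0.01))    # above the floor -> suggested
print(should_suggest(0.0, 0.0001))  # below the floor -> not suggested
```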

ErrorDetectionStrategy (Priority: 65) — Opt-in Required

AI-powered error detection using token classification. Unlike the MLM-based SemanticChecker (which masks each word and requires N forward passes), this strategy classifies all tokens in a single forward pass (~10ms), making it practical for real-time use. This strategy is not active by default. You must train a detector model first, then configure ErrorDetectorConfig with the model path.
from myspellchecker.core.validation_strategies.error_detection_strategy import ErrorDetectionStrategy

strategy = ErrorDetectionStrategy(error_detector=error_detector)

errors = strategy.validate(context)
Detection:
  • Token-level error classification (CORRECT vs ERROR)
  • Single forward pass for entire sentence
  • Complements N-gram and semantic strategies
How it differs from SemanticValidationStrategy:
| Aspect | ErrorDetectionStrategy (65) | SemanticValidationStrategy (70) |
| --- | --- | --- |
| Approach | Token classification | Masked Language Modeling |
| Speed | ~10ms (single pass) | ~50-150ms × N words |
| Output | Error flags only | Error flags + suggestions |
| Model | Fine-tuned XLM-RoBERTa | User-trained RoBERTa/BERT |
| Training | Requires train-detector | Requires train-model |
Empty suggestions: ErrorDetector only detects errors, not corrections. When both ErrorDetector and SemanticChecker are configured, the SemanticChecker at priority 70 can provide suggestions for AI-detected positions.

SemanticValidationStrategy (Priority: 70) — Opt-in Required

AI-powered validation using ONNX models. This strategy is not active by default. You must train a semantic model first, then configure SemanticConfig with the model path and set use_proactive_scanning=True.
from myspellchecker.core.validation_strategies.semantic_strategy import SemanticValidationStrategy

strategy = SemanticValidationStrategy(
    semantic_checker=semantic_checker,
    proactive_confidence_threshold=0.6
)

errors = strategy.validate(context)
Detection:
  • Semantic anomalies
  • Word meaning in context
  • Deep contextual errors

Creating Custom Strategies

Implement the ValidationStrategy abstract base class:
from myspellchecker.core.validation_strategies.base import (
    ValidationStrategy,
    ValidationContext
)
from myspellchecker.core.response import Error, ContextError

class CustomValidationStrategy(ValidationStrategy):
    """Custom validation strategy."""

    def __init__(self, config: dict):
        self.config = config

    def validate(self, context: ValidationContext) -> list[Error]:
        """Validate and return errors."""
        errors = []

        for i, word in enumerate(context.words):
            # Skip if already has an error
            if context.word_positions[i] in context.existing_errors:
                continue

            # Skip proper names
            if i < len(context.is_name_mask) and context.is_name_mask[i]:
                continue

            # Your validation logic
            if self._is_invalid(word, context):
                errors.append(ContextError(
                    text=word,
                    position=context.word_positions[i],
                    error_type="custom_error",
                    suggestions=self._get_suggestions(word),
                    confidence=0.80,
                    probability=0.0,
                    prev_word=context.words[i-1] if i > 0 else ""
                ))

                # Mark as having error
                context.existing_errors.add(context.word_positions[i])

        return errors

    def priority(self) -> int:
        """Return priority (lower runs first)."""
        return 45  # Between POS and N-gram

    def _is_invalid(self, word: str, context: ValidationContext) -> bool:
        # Implement validation logic
        return False

    def _get_suggestions(self, word: str) -> list[str]:
        # Generate suggestions
        return []

Strategy Composition

In the default pipeline, SpellChecker coordinates validation directly through its validators:
  1. SyllableValidator — validates each syllable (layer 1)
  2. WordValidator — validates words via SymSpell (layer 2)
  3. ContextValidator — orchestrates validation strategies (layer 3)
The ContextValidator receives a list of strategies built by SpellCheckerBuilder and executes them in priority order within each sentence.
from myspellchecker.core.builder import SpellCheckerBuilder

# Builder wires strategies automatically based on config
checker = SpellCheckerBuilder(config).with_provider(provider).build()
result = checker.check("မြန်မာ စာ")

Execution Order

  1. Strategies are sorted by priority (ascending)
  2. Each strategy receives the shared ValidationContext
  3. Strategies can check existing_errors to skip already-flagged words
  4. Strategies add their flagged positions to existing_errors
  5. Errors from all strategies are collected and returned
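The five steps above can be sketched with minimal stand-in classes (EchoStrategy and the dict-based context are illustrative only; the real pipeline uses ValidationStrategy and ValidationContext):

```python
# Illustrative sketch of the execution order, with stand-in classes.
class EchoStrategy:
    def __init__(self, prio, flags):
        self._prio, self._flags = prio, flags

    def priority(self):
        return self._prio

    def validate(self, context):
        errors = []
        for pos in sorted(self._flags):
            if pos in context["existing_errors"]:
                continue  # step 3: skip already-flagged positions
            context["existing_errors"].add(pos)  # step 4: mark as flagged
            errors.append((pos, self._prio))
        return errors

context = {"existing_errors": set()}  # step 2: shared context
strategies = sorted([EchoStrategy(50, {6}), EchoStrategy(10, {0, 6})],
                    key=lambda s: s.priority())  # step 1: ascending priority
all_errors = [e for s in strategies for e in s.validate(context)]  # step 5
print(all_errors)  # position 6 is flagged once, by the priority-10 strategy
```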

Configuration

Enable/disable strategies via configuration:
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.core.config.validation_configs import ValidationConfig
# Import path assumed for the opt-in configs used below
from myspellchecker.core.config import SemanticConfig, ErrorDetectorConfig

config = SpellCheckerConfig(
    use_context_checker=True,  # Enable N-gram strategy
    use_phonetic=True,         # Enable homophone detection
    validation=ValidationConfig(
        use_homophone_detection=True,      # Toggle homophone strategy (default: True)
        use_orthography_validation=True,   # Toggle orthography strategy (default: True)
        enable_strategy_timing=False,      # Per-strategy timing at DEBUG level (default: False)
    ),
    # Semantic config enables semantic strategy (opt-in, requires trained model)
    semantic=SemanticConfig(
        model_path="./my-model/model.onnx",       # Your trained model
        use_proactive_scanning=True,
    ),
    # Error detector config enables error detection strategy (opt-in, requires trained model)
    error_detector=ErrorDetectorConfig(
        model_path="./detector/onnx/model.onnx",   # Your trained detector
        tokenizer_path="./detector/onnx",
        confidence_threshold=0.7,
    ),
)

Error Types

Each strategy produces specific error types:
| Error Type | Strategy | Description |
| --- | --- | --- |
| tone_ambiguity | Tone | Tone mark disambiguation |
| medial_order_error | Orthography | Medial order/compatibility violation |
| syntax_error | Syntactic | Grammar rule violation |
| pos_sequence_error | POS | Invalid POS sequence (P-P) |
| question_structure | Question | Question structure issue |
| homophone_error | Homophone | Sound-alike confusion |
| context_probability | N-gram | Low probability sequence |
| ai_detected | Error Detection | AI-flagged error (token classification) |
| semantic_error | Semantic | AI-detected anomaly (opt-in) |

Best Practices

  1. Priority Selection: Choose priorities that make sense for your validation order
  2. Skip Flagged Words: Always check existing_errors to avoid duplicate errors
  3. Skip Names: Respect the is_name_mask to avoid flagging proper names
  4. Confidence Scores: Use appropriate confidence levels for your error type
  5. Performance: Heavy validations (semantic) should run last

See Also