mySpellChecker uses YAML configuration files to define linguistic rules for the Myanmar language. These files live in src/myspellchecker/rules/ and can be customized for specific use cases.

Rule Files Overview

File | Purpose | Entries
particles.yaml | Linguistic particles with POS tags | 91
typo_corrections.yaml | Common typo patterns | 68
morphology.yaml | Suffix/prefix patterns | 100
grammar_rules.yaml | Grammar validation rules | 45
aspects.yaml | Verb aspect markers | 48
compounds.yaml | Compound word patterns | 63
classifiers.yaml | Numeral classifiers | 74
negation.yaml | Negation patterns | 55
register.yaml | Formal/colloquial mappings | 38
tone_rules.yaml | Tone mark rules | 57
homophones.yaml | Homophone pairs | 115
ambiguous_words.yaml | Multi-POS words | 54
pos_inference.yaml | POS inference patterns | 94
pronouns.yaml | Pronoun definitions | 30

File Structure

All rule files follow a common structure:
version: "1.0.0"
category: "category_name"
description: "Description of the rule file"

metadata:
  created_date: "2025-12-30"
  last_updated: "2025-12-30"
  total_entries: 100
  source: "Source description"

# Main content section
rules:
  - ...
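
As an illustration of the shared header, the following minimal sketch checks a parsed rule file for the fields shown above. It assumes the YAML has already been loaded into a plain dict (for example with a YAML parser of your choice), and it assumes every header field is required; the source does not state which fields are mandatory. The name missing_header_fields is hypothetical and is not part of mySpellChecker.

```python
from __future__ import annotations

# Assumption: all header fields shown in the common structure are required.
REQUIRED_TOP_LEVEL = ("version", "category", "description", "metadata")
REQUIRED_METADATA = ("created_date", "last_updated", "total_entries", "source")


def missing_header_fields(rule_file: dict) -> list[str]:
    """Return dotted names of header fields absent from a parsed rule file."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in rule_file]
    metadata = rule_file.get("metadata") or {}
    missing += [f"metadata.{key}" for key in REQUIRED_METADATA if key not in metadata]
    return missing


# A parsed rule file whose metadata block lacks the "source" field.
example = {
    "version": "1.0.0",
    "category": "category_name",
    "description": "Description of the rule file",
    "metadata": {
        "created_date": "2025-12-30",
        "last_updated": "2025-12-30",
        "total_entries": 100,
    },
}

print(missing_header_fields(example))  # ['metadata.source']
```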

Particles (particles.yaml)

Defines Myanmar linguistic particles organized by syntactic function.

Structure

particles:
  verbs:
    tense:
      - particle: "ခဲ့"
        pos_tag: "P_PAST"
        type: "past_tense"
        meaning: "Past tense marker"
        formality: "neutral"
        examples:
          - correct: "သွားခဲ့တယ်"
            translation: "went"
        confidence: 0.98

    aspect:
      - particle: "နေ"
        pos_tag: "P_PROG"
        type: "progressive"
        meaning: "Progressive aspect"
        formality: "neutral"
        confidence: 0.98

POS Tags

Tag | Description | Example
P_PAST | Past tense | ခဲ့
P_FUT | Future tense | မယ်, မည်
P_PROG | Progressive | နေ
P_PERF | Perfective | ပြီ
P_SUBJ | Subject marker | က
P_OBJ | Object marker | ကို
P_LOC | Locative | မှာ, တွင်
P_SENT | Sentence ending | တယ်, သည်
P_MOD | Modifier | ရဲ့, ၏

Formality Levels

  • colloquial - Spoken/informal
  • neutral - Both formal and informal
  • formal - Written/formal
  • polite - Respectful register
  • literary - Literary style
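
To show how the nested particle structure might be consumed, here is a small sketch that flattens it into a lookup table keyed by particle. It works on a plain dict mirroring the YAML above (only a subset of the fields is reproduced); index_particles is a hypothetical helper, not a mySpellChecker function.

```python
from __future__ import annotations

# Subset of the particles.yaml structure shown above, as a parsed dict.
particles = {
    "verbs": {
        "tense": [
            {"particle": "ခဲ့", "pos_tag": "P_PAST", "formality": "neutral", "confidence": 0.98},
        ],
        "aspect": [
            {"particle": "နေ", "pos_tag": "P_PROG", "formality": "neutral", "confidence": 0.98},
        ],
    },
}


def index_particles(tree: dict) -> dict[str, dict]:
    """Flatten the group -> function -> entries nesting into {particle: entry}."""
    index = {}
    for group in tree.values():
        for entries in group.values():
            for entry in entries:
                index[entry["particle"]] = entry
    return index


index = index_particles(particles)
neutral = [p for p, entry in index.items() if entry["formality"] == "neutral"]
print(len(index), len(neutral))  # 2 2
```

A flat index like this makes per-token lookups cheap, at the cost of losing the syntactic grouping; keep the original tree around if the grouping matters.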

Typo Corrections (typo_corrections.yaml)

Defines common Myanmar typo patterns with corrections.

Structure

corrections:
  particles:
    - incorrect: "မာ"
      correct: "မှာ"
      error_type: "missing_ha_htoe"
      context: "after_noun"
      excluded_pos: ["ADJ"]
      meaning: "Location particle"
      confidence: 0.92
      examples:
        incorrect: "အိမ်မာ ရှိတယ်"
        correct: "အိမ်မှာ ရှိတယ်"

  medial_confusions:
    - incorrect: "ကျောင်း"
      correct: "ကြောင်း"
      error_type: "ya_pin_ra_yit"
      context: "after_verb"
      pos_constraint:
        preceding: ["V"]
      confidence: 0.95

Error Types

Type | Description
missing_ha_htoe | Missing ှ modifier
character_confusion | Similar looking characters
ya_pin_ra_yit | ျ vs ြ confusion
missing_asat | Missing ် marker
tone_mark_error | Wrong or missing tone mark
visual_similar | OCR-type errors

Context Types

  • after_noun - Follows a noun
  • after_verb - Follows a verb
  • context_dependent - Requires context analysis
  • standalone - Independent of context
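
The sketch below illustrates one way the context, excluded_pos, and pos_constraint fields could gate a correction. The interpretation is an assumption: after_noun and after_verb are mapped to a preceding POS of "N" and "V", and excluded_pos is taken to refer to the POS of the token being corrected. The functions rule_applies and correct_token are hypothetical and do not reflect how mySpellChecker evaluates these rules internally.

```python
from __future__ import annotations

# Assumed mapping from context names to the POS of the preceding token.
CONTEXT_TO_POS = {"after_noun": "N", "after_verb": "V"}

rules = [
    {"incorrect": "မာ", "correct": "မှာ", "context": "after_noun",
     "excluded_pos": ["ADJ"], "confidence": 0.92},
    {"incorrect": "ကျောင်း", "correct": "ကြောင်း", "context": "after_verb",
     "pos_constraint": {"preceding": ["V"]}, "confidence": 0.95},
]


def rule_applies(rule: dict, prev_pos: str | None, word_pos: str | None = None) -> bool:
    """Check the rule's positional and POS constraints against the context."""
    required = CONTEXT_TO_POS.get(rule.get("context"))
    if required is not None and prev_pos != required:
        return False
    if word_pos is not None and word_pos in rule.get("excluded_pos", []):
        return False
    allowed = rule.get("pos_constraint", {}).get("preceding")
    if allowed is not None and prev_pos not in allowed:
        return False
    return True


def correct_token(token: str, prev_pos: str | None, word_pos: str | None = None) -> str:
    """Return the corrected form if a matching rule applies, else the token unchanged."""
    for rule in rules:
        if token == rule["incorrect"] and rule_applies(rule, prev_pos, word_pos):
            return rule["correct"]
    return token


# The locative typo is corrected after a noun but left alone after a verb.
print(correct_token("မာ", prev_pos="N"))
print(correct_token("မာ", prev_pos="V"))
```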

Morphology (morphology.yaml)

Defines suffix and prefix patterns for POS inference.

Structure

suffixes:
  verb_suffixes:
    - suffix: "ခဲ့"
      pos: "V"
      meaning: "past tense"
      confidence: 0.9

    - suffix: "သည်"
      pos: "P_SENT"
      meaning: "formal sentence ending"
      confidence: 0.95

  noun_suffixes:
    - suffix: "များ"
      pos: "N"
      meaning: "plural"
      confidence: 0.9

  adverb_suffixes:
    - suffix: "စွာ"
      pos: "ADV"
      meaning: "manner"
      confidence: 0.85
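
As a rough illustration of suffix-based POS inference, the following sketch picks the longest matching suffix and breaks ties by confidence. That tie-breaking policy is an assumption, as is the decision to ignore a word that consists of the suffix alone; infer_pos is a hypothetical helper operating on a flattened copy of the entries above.

```python
from __future__ import annotations

# Entries from the morphology.yaml example, flattened into one list.
suffix_rules = [
    {"suffix": "ခဲ့", "pos": "V", "confidence": 0.9},
    {"suffix": "သည်", "pos": "P_SENT", "confidence": 0.95},
    {"suffix": "များ", "pos": "N", "confidence": 0.9},
    {"suffix": "စွာ", "pos": "ADV", "confidence": 0.85},
]


def infer_pos(word: str, rules: list[dict]) -> tuple[str, float] | None:
    """Guess a POS from the word's ending; prefer longer suffixes, then confidence."""
    matches = [
        rule for rule in rules
        if word.endswith(rule["suffix"]) and word != rule["suffix"]
    ]
    if not matches:
        return None
    best = max(matches, key=lambda rule: (len(rule["suffix"]), rule["confidence"]))
    return best["pos"], best["confidence"]


# "သွား" (go) followed by the past-tense suffix.
print(infer_pos("သွား" + "ခဲ့", suffix_rules))  # ('V', 0.9)
print(infer_pos("သွား", suffix_rules))          # None
```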

Aspects (aspects.yaml)

Defines verb aspect markers.

Structure

markers:
  - marker: "ပြီ"
    category: "completion"
    description: "Action completed"
    can_combine: false
    register: "neutral"
    is_final: true

  - marker: "နေ"
    category: "progressive"
    description: "Ongoing action"
    can_combine: true
    register: "neutral"
    is_final: false

combinations:
  - sequence: ["ပြီး", "သွား"]
    description: "Completed and went"

invalid_sequences:
  - sequence: ["ပြီ", "ပြီ"]
    reason: "Duplicate completion marker"

typos:
  - incorrect: "ပရီ"
    correct: "ပြီ"
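
The marker flags and invalid_sequences list lend themselves to a simple sequence check. The sketch below assumes that is_final means no other marker may follow, and it does not model can_combine or the combinations list; check_sequence is a hypothetical name, and the real validation logic may differ.

```python
from __future__ import annotations

# Parsed subset of aspects.yaml, keyed by marker for quick lookup.
markers = {
    "ပြီ": {"category": "completion", "can_combine": False, "is_final": True},
    "နေ": {"category": "progressive", "can_combine": True, "is_final": False},
}
invalid_sequences = [
    {"sequence": ["ပြီ", "ပြီ"], "reason": "Duplicate completion marker"},
]


def check_sequence(sequence: list[str]) -> list[str]:
    """Return human-readable problems found in a sequence of aspect markers."""
    problems = []
    # Adjacent pairs that are explicitly listed as invalid.
    for first, second in zip(sequence, sequence[1:]):
        for rule in invalid_sequences:
            if rule["sequence"] == [first, second]:
                problems.append(rule["reason"])
    # Assumption: a marker flagged is_final must be the last one.
    for marker in sequence[:-1]:
        if markers.get(marker, {}).get("is_final"):
            problems.append(f"{marker} is final but is followed by another marker")
    return problems


print(check_sequence(["နေ", "ပြီ"]))  # []
print(check_sequence(["ပြီ", "ပြီ"]))
```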

Classifiers (classifiers.yaml)

Defines numeral classifiers for counting.

Structure

classifiers:
  people:
    - word: "ယောက်"
      description: "For people"
      examples: ["လူ", "ကလေး", "လူကြီး"]

  animals:
    - word: "ကောင်"
      description: "For animals"
      examples: ["ခွေး", "ကြောင်", "ငါး"]

  flat_objects:
    - word: "ရွက်"
      description: "For flat objects"
      examples: ["စာရွက်", "အရွက်"]

  round_objects:
    - word: "လုံး"
      description: "For round objects"
      examples: ["ပန်းသီး", "ဘောလုံး"]
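
For lookups in the other direction (from a noun to its classifier), a reverse index over the examples lists is one option. This is a minimal sketch on a dict mirroring the YAML above; build_noun_index is a hypothetical helper, and real coverage would depend on how many example nouns each entry lists.

```python
from __future__ import annotations

# Parsed classifiers.yaml content from the example above.
classifiers = {
    "people": [{"word": "ယောက်", "examples": ["လူ", "ကလေး", "လူကြီး"]}],
    "animals": [{"word": "ကောင်", "examples": ["ခွေး", "ကြောင်", "ငါး"]}],
    "flat_objects": [{"word": "ရွက်", "examples": ["စာရွက်", "အရွက်"]}],
    "round_objects": [{"word": "လုံး", "examples": ["ပန်းသီး", "ဘောလုံး"]}],
}


def build_noun_index(table: dict) -> dict[str, tuple[str, str]]:
    """Map each example noun to (classifier word, category); first entry wins."""
    index = {}
    for category, entries in table.items():
        for entry in entries:
            for noun in entry["examples"]:
                index.setdefault(noun, (entry["word"], category))
    return index


noun_index = build_noun_index(classifiers)
print(noun_index["ခွေး"][1])  # animals
```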

Register (register.yaml)

Maps formal and colloquial equivalents.

Structure

register_pairs:
  - formal: "သည်"
    colloquial: "တယ်"
    category: "sentence_ending"

  - formal: "၏"
    colloquial: "ရဲ့"
    category: "possessive"

  - formal: "တွင်"
    colloquial: "မှာ"
    category: "locative"

formal_words:
  - "သည်"
  - "၏"
  - "နှင့်"

colloquial_words:
  - "တယ်"
  - "ရဲ့"
  - "နဲ့"
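
One plausible use of these mappings is flagging text that mixes registers and normalizing it to a single one. The sketch below assumes the input is already tokenized; registers_used and to_colloquial are hypothetical helpers, not part of the library.

```python
from __future__ import annotations

# Parsed register.yaml content from the example above.
register_pairs = [
    {"formal": "သည်", "colloquial": "တယ်", "category": "sentence_ending"},
    {"formal": "၏", "colloquial": "ရဲ့", "category": "possessive"},
    {"formal": "တွင်", "colloquial": "မှာ", "category": "locative"},
]
formal_words = {"သည်", "၏", "နှင့်"}
colloquial_words = {"တယ်", "ရဲ့", "နဲ့"}


def registers_used(tokens: list[str]) -> set[str]:
    """Report which registers appear among the tokens."""
    used = set()
    if any(token in formal_words for token in tokens):
        used.add("formal")
    if any(token in colloquial_words for token in tokens):
        used.add("colloquial")
    return used


def to_colloquial(tokens: list[str]) -> list[str]:
    """Replace formal tokens that have a listed colloquial equivalent."""
    mapping = {pair["formal"]: pair["colloquial"] for pair in register_pairs}
    return [mapping.get(token, token) for token in tokens]


mixed = ["သည်", "ရဲ့"]  # formal sentence ending next to a colloquial possessive
print(sorted(registers_used(mixed)))  # ['colloquial', 'formal']
print(sorted(registers_used(to_colloquial(mixed))))  # ['colloquial']
```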

Negation (negation.yaml)

Defines negation patterns.

Structure

prefix: "မ"

endings:
  ဘူး:
    type: "standard_negative"
    description: "Colloquial negative ending"
    register: "colloquial"

  ပါဘူး:
    type: "polite_negative"
    description: "Polite negative ending"
    register: "polite"

  နဲ့:
    type: "prohibition"
    description: "Don't! (prohibition)"
    register: "colloquial"

  ပါ:
    type: "formal_negative"
    description: "Formal negative ending"
    register: "formal"

typo_map:
  ဘူ: "ဘူး"
  ဘုး: "ဘူး"

auxiliaries:
  ချင်:
    meaning: "want to"
  နိုင်:
    meaning: "can"
:
    meaning: "able to"
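
To make the interplay of prefix, endings, and typo_map concrete, here is a sketch that repairs a mistyped ending and then classifies the negation. It assumes pre-segmented tokens of the form prefix, verb, ending, which is a simplification; repair_ending and negation_type are hypothetical names, and only part of the endings table is reproduced.

```python
from __future__ import annotations

# Parsed subset of negation.yaml from the example above.
PREFIX = "မ"
endings = {
    "ဘူး": {"type": "standard_negative", "register": "colloquial"},
    "ပါဘူး": {"type": "polite_negative", "register": "polite"},
    "နဲ့": {"type": "prohibition", "register": "colloquial"},
}
typo_map = {"ဘူ": "ဘူး", "ဘုး": "ဘူး"}


def repair_ending(tokens: list[str]) -> list[str]:
    """If the clause starts with the negation prefix, fix a mistyped final ending."""
    if len(tokens) < 2 or tokens[0] != PREFIX:
        return list(tokens)
    return tokens[:-1] + [typo_map.get(tokens[-1], tokens[-1])]


def negation_type(tokens: list[str]) -> str | None:
    """Classify a prefix + verb + ending clause, or return None if it is not one."""
    if len(tokens) >= 3 and tokens[0] == PREFIX and tokens[-1] in endings:
        return endings[tokens[-1]]["type"]
    return None


clause = ["မ", "သွား", "ဘူ"]  # mistyped colloquial negative ending
print(negation_type(clause))                 # None
print(negation_type(repair_ending(clause)))  # standard_negative
```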

Homophones (homophones.yaml)

Defines homophone pairs for context checking.

Structure

homophones:
  - words: ["ကား", "ကာ"]
    meanings:
      ကား: "car"
      ကာ: "shield/screen"
    disambiguation_context:
      ကား: ["စီး", "မောင်း", "ဝယ်"]
      ကာ: ["ရိုက်", "ခုံ", "မျက်နှာ"]

  - words: ["သာ", "သား"]
    meanings:
      သာ: "merely/pleasant"
      သား: "son"
    disambiguation_context:
      သာ: ["သာသာ", "ယာ", "သာမန်"]
      သား: ["သမီး", "မိသား", "အဖေ"]
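
A simple way to use disambiguation_context is to count how many nearby tokens appear in each candidate's context list. The sketch below does exactly that and declines to choose on a tie or when nothing matches; this scoring scheme is an assumption for illustration, and pick_homophone is a hypothetical helper.

```python
from __future__ import annotations

# One parsed entry from homophones.yaml, as shown above.
entry = {
    "words": ["ကား", "ကာ"],
    "meanings": {"ကား": "car", "ကာ": "shield/screen"},
    "disambiguation_context": {
        "ကား": ["စီး", "မောင်း", "ဝယ်"],
        "ကာ": ["ရိုက်", "ခုံ", "မျက်နှာ"],
    },
}


def pick_homophone(entry: dict, context_tokens: list[str]) -> str | None:
    """Return the candidate whose context words overlap most, or None if unclear."""
    scores = {
        word: sum(
            1 for token in context_tokens
            if token in entry["disambiguation_context"].get(word, [])
        )
        for word in entry["words"]
    }
    best = max(scores, key=scores.get)
    top = scores[best]
    # No evidence, or several candidates tied for the top score.
    if top == 0 or list(scores.values()).count(top) > 1:
        return None
    return best


choice = pick_homophone(entry, ["မောင်း"])  # context word meaning "drive"
print(entry["meanings"][choice])  # car
```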

Compounds (compounds.yaml)

Defines compound word formations.

Structure

prefixes:
  - prefix: "အ"
    type: "nominalization"
    description: "Noun-forming prefix"

suffixes:
  - suffix: "သူ"
    type: "agent"
    description: "Person who does X"

noun_compounds:
  - components: ["ပန်း", "ခြံ"]
    compound: "ပန်းခြံ"
    meaning: "flower garden"

verb_compounds:
  - components: ["စား", "သောက်"]
    compound: "စားသောက်"
    meaning: "dine/eat and drink"

reduplication:
  - base: "ဖြေး"
    reduplicated: "ဖြေးဖြေး"
    meaning: "slowly"
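
The compound and reduplication tables can be applied to a token stream in a single pass. The following sketch merges adjacent tokens that form a listed two-part compound and detects exact doubling; it handles only two-component compounds, and both function names are hypothetical.

```python
from __future__ import annotations

# Parsed subset of compounds.yaml from the example above.
compounds = [
    {"components": ["ပန်း", "ခြံ"], "compound": "ပန်းခြံ", "meaning": "flower garden"},
    {"components": ["စား", "သောက်"], "compound": "စားသောက်", "meaning": "dine/eat and drink"},
]


def merge_compounds(tokens: list[str], table: list[dict]) -> list[str]:
    """Greedily join adjacent token pairs that match a listed compound."""
    known = {tuple(item["components"]): item["compound"] for item in table}
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in known:
            merged.append(known[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def is_reduplication(word: str) -> bool:
    """True when the word is some non-empty base repeated exactly twice."""
    half, remainder = divmod(len(word), 2)
    return remainder == 0 and half > 0 and word[:half] == word[half:]


print(len(merge_compounds(["ပန်း", "ခြံ", "စား"], compounds)))  # 2
print(is_reduplication("ဖြေး" * 2))  # True
```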

Custom Configuration

Loading Custom Rules

from myspellchecker.grammar.config import GrammarRuleConfig

# Load from custom YAML file path
config = GrammarRuleConfig(config_path="/path/to/custom/grammar_rules.yaml")

# Access rule data
particles = config.particle_tags
morphology = config.morphology_config
aspects = config.aspects_config

Extending Rules

Add custom entries by creating additional YAML files:
# custom_particles.yaml
version: "1.0.0"
category: "particles"

particles:
  custom:
    - particle: "မိ"
      pos_tag: "P_CUSTOM"
      type: "custom_type"
      meaning: "Custom particle"
      confidence: 0.80
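
The source does not describe how mySpellChecker combines a custom file with the built-in rules, so the following is only a standalone sketch of one reasonable merge policy: nested mappings are merged recursively, lists are concatenated, and anything else is overridden by the custom value. merge_rules is a hypothetical helper working on already-parsed dicts.

```python
from __future__ import annotations

import copy


def merge_rules(base: dict, extra: dict) -> dict:
    """Return a new dict: mappings merged recursively, lists concatenated."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        current = merged.get(key)
        if isinstance(value, dict) and isinstance(current, dict):
            merged[key] = merge_rules(current, value)
        elif isinstance(value, list) and isinstance(current, list):
            merged[key] = current + copy.deepcopy(value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "particles": {
        "verbs": {"tense": [{"particle": "ခဲ့", "pos_tag": "P_PAST"}]},
    },
}
custom = {
    "particles": {
        "custom": [{"particle": "မိ", "pos_tag": "P_CUSTOM", "confidence": 0.80}],
    },
}

combined = merge_rules(base, custom)
print(sorted(combined["particles"]))  # ['custom', 'verbs']
```

Returning a fresh dict leaves the built-in rules untouched, which makes it easy to compare behavior with and without the custom entries.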

Schema Validation

Rule files are validated against JSON schemas in src/myspellchecker/schemas/:
  • grammar_rules.schema.json
  • morphology.schema.json
  • particles.schema.json
  • typo_corrections.schema.json
  • _common.schema.json

Best Practices

  1. Confidence scores: Use 0.9 or higher for high-certainty rules, 0.7 up to 0.9 for moderate-certainty rules, and below 0.7 for context-dependent rules
  2. Context constraints: Always specify context when rules are position-dependent
  3. Examples: Include examples for documentation and testing
  4. Version control: Update metadata.last_updated when modifying rules
  5. Testing: Test rule changes with representative corpus data
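
The confidence guideline in the first item can be turned into a quick lint for rule entries. In this sketch the boundary value 0.9 is assigned to the high tier, which is an assumption where the ranges meet; confidence_tier is a hypothetical helper.

```python
def confidence_tier(score: float) -> str:
    """Bucket a rule's confidence score according to the guideline above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {score}")
    if score >= 0.9:
        return "high"
    if score >= 0.7:
        return "moderate"
    return "context-dependent"


for score in (0.98, 0.85, 0.5):
    print(score, confidence_tier(score))
```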
