- “သူ စား ချင်တယ်။” (He wants to eat.): Correct
- “သူ စာ ချင်တယ်။” (He letter wants-to): Ungrammatical, because ချင် requires a preceding verb. Spelling correct, context incorrect.
Another example:
- “ကျောင်း သား တစ်ယောက်” (A student): Correct
- “ကျောင်း သာ တစ်ယောက်” (School only one person): Grammatically awkward
Syntactic Grammar Checking
This layer uses Part-of-Speech (POS) tagging and deterministic rules to catch grammatical errors that statistical models might miss due to data sparsity.
How it Works
- POS Tagging: Every word in the dictionary can optionally have a POS tag (e.g., `N` for Noun, `V` for Verb, `P` for Particle).
- Rule Engine: A set of linguistic rules defines valid and invalid sequences.
Example Rules
- Verb + Particle Agreement:
  - Invalid: သွား (Go/Verb) + ကျောင်း (School/Noun) → “Go School” (grammatically awkward, missing particle)
  - Correction: ကျောင်းသွား or ကျောင်းကို သွား → “Go to school”
- Nominalizer Particle ကြောင်း vs Noun ကျောင်း:
  - သွားကြောင်း ပြောတယ် (Said that [he] went): Correct. ကြောင်း nominalizes the verb.
  - သွားကျောင်း ပြောတယ်: Invalid. ကျောင်း (school) cannot follow a verb directly.
- Particle Selection (မှာ vs မှ):
  - ရုံးမှာ ရှိတယ် (Is at the office): Correct. မှာ indicates location.
  - ရုံးမှ လာတယ် (Came from the office): Correct. မှ indicates origin.
  - Rule: After a noun, မှာ typically means “at/in”; မှ typically means “from” or marks a conditional.
- Subject Marker Agreement (က vs ကို):
  - သူက စာအုပ်ကို ဖတ်တယ် (He reads the book): Correct
  - သူကို စာအုပ်က ဖတ်တယ် (The book reads him): Semantically incorrect
  - Rule: Animate subjects typically take က; objects take ကို.
- Question Particle Matching:
  - ဘယ်သူလဲ (Who is it?): Correct. ဘယ် question word + လဲ particle.
  - ဘယ်သူလား: Also valid but with a different nuance (softer question).
  - ဘာလဲ vs ဘာပဲ: Different meanings, “What?” vs “Whatever”.
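
The POS-tag rules above could be sketched roughly as follows. This is a minimal illustration, not mySpellChecker's actual rule engine: the `POS_TAGS` lexicon, the `INVALID_TAG_BIGRAMS` table, and the `check_sequence` function are all hypothetical names, and only the "noun directly after verb" rule is modeled.

```python
from typing import Dict, List, Tuple

# Hypothetical POS lexicon (N = Noun, V = Verb, P = Particle).
POS_TAGS: Dict[str, str] = {
    "ကျောင်း": "N",    # school
    "သွား": "V",       # go
    "ပြောတယ်": "V",    # said
    "ကို": "P",        # object marker
    "ကြောင်း": "P",    # nominalizer
}

# Hypothetical rule table: tag pairs that are treated as invalid.
INVALID_TAG_BIGRAMS: Dict[Tuple[str, str], str] = {
    ("V", "N"): "a noun cannot directly follow a verb",
}


def check_sequence(words: List[str]) -> List[Tuple[int, str]]:
    """Return (index, message) for every word that breaks a tag-pair rule."""
    issues: List[Tuple[int, str]] = []
    for i in range(1, len(words)):
        prev_tag = POS_TAGS.get(words[i - 1])
        tag = POS_TAGS.get(words[i])
        if prev_tag is None or tag is None:
            continue  # POS tags are optional, so untagged words are skipped
        message = INVALID_TAG_BIGRAMS.get((prev_tag, tag))
        if message is not None:
            issues.append((i, message))
    return issues
```

Under these assumptions, `check_sequence(["သွား", "ကျောင်း", "ပြောတယ်"])` reports the noun at index 1, while `["သွား", "ကြောင်း", "ပြောတယ်"]` and `["ကျောင်း", "ကို", "သွား"]` pass without issues.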
N-gram Probability
mySpellChecker uses N-gram models (bigrams and trigrams) to calculate the probability of word sequences.
- Bigram: Probability of word B following word A, $P(B \mid A)$.
- Trigram: Probability of word C following A and B, $P(C \mid A, B)$.
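
As a rough illustration of these two quantities, the sketch below estimates them from raw counts using the standard maximum-likelihood ratio. The function names and the toy corpus are assumptions made for this example; the source does not say how mySpellChecker stores or normalizes its counts.

```python
from collections import Counter
from typing import List, Tuple


def train_counts(sentences: List[List[str]]) -> Tuple[Counter, Counter, Counter]:
    """Count unigrams, bigrams, and trigrams in pre-segmented sentences."""
    unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        trigrams.update(zip(words, words[1:], words[2:]))
    return unigrams, bigrams, trigrams


def bigram_prob(unigrams: Counter, bigrams: Counter, a: str, b: str) -> float:
    """Estimate P(b | a) as count(a, b) / count(a); 0.0 if a is unseen."""
    return bigrams[(a, b)] / unigrams[a] if unigrams[a] else 0.0


def trigram_prob(bigrams: Counter, trigrams: Counter, a: str, b: str, c: str) -> float:
    """Estimate P(c | a, b) as count(a, b, c) / count(a, b); 0.0 if (a, b) is unseen."""
    return trigrams[(a, b, c)] / bigrams[(a, b)] if bigrams[(a, b)] else 0.0


# Toy corpus, invented for illustration only.
corpus = [
    ["သူ", "စား", "ချင်တယ်"],
    ["သူ", "စား", "တယ်"],
    ["သူ", "သွား", "တယ်"],
]
uni, bi, tri = train_counts(corpus)
p_bigram = bigram_prob(uni, bi, "သူ", "စား")               # 2 of 3 occurrences of သူ
p_trigram = trigram_prob(bi, tri, "သူ", "စား", "ချင်တယ်")  # 1 of 2 occurrences of (သူ, စား)
```

Dividing by the plain unigram count is a simplification, because it also counts occurrences of a word at the end of a sentence, where nothing follows it.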
The Algorithm
- Detection: When the checker encounters a sequence of words, it queries the database for the frequency of that sequence. If the bigram probability $P(B \mid A)$ is below `bigram_threshold`, the word is flagged as suspicious.
- Correction: The system generates candidates for the suspicious word (using SymSpell or phonetic matching). It then recalculates probabilities for each candidate in the sentence.
Example: Input sentence “သူ စာ ချင်တယ်” (suspicious word: စာ)
- Candidate “စား” (eat): high probability (common verb after a pronoun)
- Candidate “စာ” (letter): low probability (less common as a standalone word)
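
The detection and correction steps might look roughly like the sketch below. The probability table, the threshold value, and the function names are invented for illustration. The candidate list is passed in directly, whereas the real system generates it with SymSpell or phonetic matching.

```python
from typing import Dict, List, Tuple

# Hypothetical bigram probabilities P(word | previous word).
BIGRAM_PROB: Dict[Tuple[str, str], float] = {
    ("သူ", "စား"): 0.15,
    ("သူ", "စာ"): 0.001,
    ("စား", "ချင်တယ်"): 0.20,
}
BIGRAM_THRESHOLD = 0.01  # illustrative value, not the library default


def prob(prev: str, word: str) -> float:
    """Look up P(word | prev); unseen pairs get probability 0.0."""
    return BIGRAM_PROB.get((prev, word), 0.0)


def detect(words: List[str], threshold: float = BIGRAM_THRESHOLD) -> List[int]:
    """Return indices of words whose bigram probability falls below the threshold."""
    return [i for i in range(1, len(words)) if prob(words[i - 1], words[i]) < threshold]


def sentence_score(words: List[str]) -> float:
    """Multiply the bigram probabilities of all adjacent pairs."""
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= prob(prev, word)
    return score


def correct(words: List[str], index: int, candidates: List[str]) -> str:
    """Pick the candidate that gives the highest sentence score at `index`."""
    def score_with(candidate: str) -> float:
        return sentence_score(words[:index] + [candidate] + words[index + 1:])
    return max(candidates, key=score_with)


words = ["သူ", "စာ", "ချင်တယ်"]
flagged = detect(words)
best = correct(words, 1, ["စား", "စာ"])
```

With this toy table, `flagged` is `[1, 2]` and `best` is `"စား"`. Index 2 is flagged as well because the pair that follows the wrong word is also unseen, so a real checker needs some policy for deciding which flagged position to repair first.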
Advanced Strategies
The N-gram checker employs several heuristics to handle unseen data and improve accuracy:
1. Backoff Smoothing (Unigram Check)
If a bigram probability is zero (unseen sequence), the checker looks at the unigram frequency of the word.
- If the word is very common globally (high unigram frequency), it is assumed to be correct but used in a novel context. It is not flagged as an error.
- If the word is rare, it is more likely to be a typo.
Example:
- Input: မြန်မာ ဂီတ (Myanmar music), a bigram unseen in the corpus
- ဂီတ has a high unigram frequency (common word for “music”)
- Result: Not flagged as an error; assumed to be a valid novel combination
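
A minimal sketch of this unigram backoff, assuming simple in-memory count tables. The counts, the `COMMON_WORD_MIN_COUNT` cutoff, and the function name are hypothetical, and the separate threshold check for bigrams that were seen is left out.

```python
from typing import Dict, Tuple

# Hypothetical corpus counts, invented for illustration.
UNIGRAM_COUNT: Dict[str, int] = {"မြန်မာ": 5000, "ဂီတ": 1200, "စာအုပ်": 900}
BIGRAM_COUNT: Dict[Tuple[str, str], int] = {("မြန်မာ", "စကား"): 800}
COMMON_WORD_MIN_COUNT = 100  # hypothetical cutoff for "very common"


def is_suspicious(prev: str, word: str) -> bool:
    """Apply the unigram backoff to a (prev, word) pair."""
    if BIGRAM_COUNT.get((prev, word), 0) > 0:
        return False  # bigram was seen; the threshold check is handled elsewhere
    # Unseen bigram: fall back to how common the word is on its own.
    return UNIGRAM_COUNT.get(word, 0) < COMMON_WORD_MIN_COUNT
```

Here `is_suspicious("မြန်မာ", "ဂီတ")` returns `False` because ဂီတ is common on its own, whereas an unseen pair ending in a word with no unigram count is reported as suspicious.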
2. Typo Heuristic
For unseen rare words, the checker searches for “neighbors” (words with Edit Distance = 1) that fit the context with high probability.
- If a neighbor has a high bigram probability with the preceding word, the current word is assumed to be a typo of that neighbor and is flagged.
Example:
- Input: စာအုပ် ဖတ်တတ် (rare/unseen word ဖတ်တတ်)
- Neighbor found: ဖတ်တယ် (reads), Edit Distance = 1
- ဖတ်တယ် after စာအုပ် has a high bigram probability
- Result: Flag ဖတ်တတ် as a likely typo of ဖတ်တယ်
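
The neighbor search could be sketched as below. The vocabulary, the probability table, and the `NEIGHBOR_MIN_PROB` cutoff are hypothetical. Edit distance is measured over Unicode code points here, which is an assumption; the actual checker may compare syllables or use SymSpell's precomputed deletes instead of scanning the vocabulary.

```python
from typing import Dict, Optional, Set, Tuple


def is_one_edit_apart(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # exactly one substitution
    return a[i:] == b[i + 1:]          # exactly one extra character in b


# Hypothetical data, invented for illustration.
VOCAB: Set[str] = {"စာအုပ်", "ဖတ်တယ်", "ဖတ်"}
BIGRAM_PROB: Dict[Tuple[str, str], float] = {("စာအုပ်", "ဖတ်တယ်"): 0.12}
NEIGHBOR_MIN_PROB = 0.05  # hypothetical cutoff for "high" bigram probability


def find_typo_source(prev: str, word: str) -> Optional[str]:
    """Return the best edit-distance-1 neighbor that fits after `prev`, if any."""
    best, best_p = None, 0.0
    for candidate in VOCAB:
        if not is_one_edit_apart(word, candidate):
            continue
        p = BIGRAM_PROB.get((prev, candidate), 0.0)
        if p >= NEIGHBOR_MIN_PROB and p > best_p:
            best, best_p = candidate, p
    return best
```

With this toy data, `find_typo_source("စာအုပ်", "ဖတ်တတ်")` returns `"ဖတ်တယ်"`, matching the example above, and a word with no close neighbor returns `None`.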
Tone Disambiguation
In the Myanmar language, tone marks ( ့, း) drastically change the meaning of a word. Many spelling errors involve missing or incorrect tone marks (e.g., ငါ vs ငါး).
The `ToneDisambiguator` module uses a specialized context window to resolve these ambiguities.
How it Works
It maintains a list of ambiguous groups (e.g., the “Three Tones of Ka”). When it encounters a word from such a group, it checks the surrounding ±3 words against a set of context patterns.
Example 1: သံ (Sound/Iron) vs သုံး (Three)
- Input: သံ ယောက် (Iron person?)
- Context: ယောက် (classifier for people) follows the word.
- Pattern Match: The pattern ("ယောက်", "ခု", "လုံး") is associated with the number သုံး (Three).
- Correction: သုံး ယောက် (Three people).
Example 2: ငါ (I/me) vs ငါး (Fish/Five)
- Input: ငါ ကောင် (I animal?)
- Context: ကောင် (classifier for animals) follows the word.
- Pattern Match: Classifiers for counting animals/fish follow numbers.
- Correction: ငါး ကောင် (Five animals/fish).
Example 3: စ (Beginning) vs စ့ (Pierce) vs စာ (Letter)
- Input: အစာ စား vs အစ စား
- Context: စား (eat) follows, and eating requires food (အစာ).
- Correction: အစာ စား (Eat food), not အစ စား (Eat beginning).
Example 4: ကြ (Plural marker) vs ကြီး (Big)
- Input: သူတို့ သွားကြီး (They go big?)
- Context: သူတို့ (they) is a plural pronoun and expects a plural verb marker.
- Correction: သူတို့ သွားကြ (They go), with the plural marker ကြ after the verb.
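
The context-window lookup described in this section might be sketched as follows. The `CONTEXT_RULES` table and the `disambiguate` function are hypothetical and cover only the first two examples. The sketch searches the whole ±3 window on both sides, which is looser than the "classifier follows the word" condition in those examples.

```python
from typing import Dict, List, Tuple

WINDOW = 3  # number of words inspected on each side

# Hypothetical rules: ambiguous word -> list of (trigger words, replacement).
CONTEXT_RULES: Dict[str, List[Tuple[Tuple[str, ...], str]]] = {
    "သံ": [(("ယောက်", "ခု", "လုံး"), "သုံး")],  # classifier nearby suggests "three"
    "ငါ": [(("ကောင်",), "ငါး")],               # animal classifier suggests "five/fish"
}


def disambiguate(words: List[str]) -> List[str]:
    """Replace ambiguous words when a trigger appears within the context window."""
    result = list(words)
    for i, word in enumerate(words):
        # Neighbors within WINDOW positions, excluding the word itself.
        window = words[max(0, i - WINDOW):i] + words[i + 1:i + 1 + WINDOW]
        for triggers, replacement in CONTEXT_RULES.get(word, []):
            if any(trigger in window for trigger in triggers):
                result[i] = replacement
                break
    return result
```

Under these assumptions, `disambiguate(["သံ", "ယောက်"])` yields `["သုံး", "ယောက်"]` and `disambiguate(["ငါ", "ကောင်"])` yields `["ငါး", "ကောင်"]`, while a sentence with no trigger in range is returned unchanged.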
Common Myanmar Grammar Errors Detected
| Error Type | Example (Incorrect) | Correction | Rule Applied |
|---|---|---|---|
| Homophone confusion | စာ ချင်တယ် | စား ချင်တယ် | Context: verb pattern |
| Missing particle | ကျောင်း သွား | ကျောင်းကို သွား | Verb requires object marker |
| Wrong particle | ရုံးမှာ လာတယ် | ရုံးမှ လာတယ် | Motion verb needs မှ (from) |
| Tone mark error | သုံ ယောက် | သုံး ယောက် | Classifier context |
| Plural marker | သူတို့ သွားတယ် | သူတို့ သွားကြတယ် | Plural subject agreement |
| Nominalizer | သွားကျောင်း | သွားကြောင်း | Verb nominalization |