Context-Aware & Grammar Validation

Validating individual words is not enough. Errors often occur when a word is spelled correctly but used incorrectly in context. mySpellChecker employs two strategies for this: Syntactic Grammar Checking (Layer 2.5) and N-gram Probability (Layer 3).

Example (Myanmar homophone confusion):

- “သူ စား ချင်တယ်။” (He wants to eat.): Correct
- “သူ စာ ချင်တယ်။” (He letter wants-to): Spelling correct, context incorrect. The sentence is ungrammatical because ချင် requires a preceding verb.

Another example:

- “ကျောင်း သား တစ်ယောက်” (A student): Correct
- “ကျောင်း သာ တစ်ယောက်” (School only one person): Grammatically awkward

Syntactic Grammar Checking

This layer uses Part-of-Speech (POS) tagging and deterministic rules to catch grammatical errors that statistical models might miss due to data sparsity.

How it Works

POS Tagging: Every word in the dictionary can optionally have a POS tag (e.g., N for Noun, V for Verb, P for Particle).
Rule Engine: A set of linguistic rules defines valid and invalid sequences.

Example Rules

Verb + Particle Agreement:
- Invalid: သွား (Go/Verb) + ကျောင်း (School/Noun) → “Go School” (Grammatically awkward, missing particle)
- Correction: ကျောင်းသွား or ကျောင်းကို သွား → “Go to school”
Nominalizer Particle ကြောင်း vs Noun ကျောင်း:
- သွားကြောင်း ပြောတယ် (Said that [he] went) — Correct: ကြောင်း nominalizes the verb
- သွားကျောင်း ပြောတယ် — Invalid: ကျောင်း (school) cannot follow a verb directly
Particle Selection (မှာ vs မှ):
- ရုံးမှာ ရှိတယ် (Is at the office) — Correct: မှာ indicates location
- ရုံးမှ လာတယ် (Came from the office) — Correct: မှ indicates origin
- Rule: After Noun, မှာ typically means “at/in”. မှ typically means “from” or marks conditional.
Subject Marker Agreement (က vs ကို):
- သူက စာအုပ်ကို ဖတ်တယ် (He reads the book) — Correct
- သူကို စာအုပ်က ဖတ်တယ် (The book reads him) — Semantically incorrect
- Rule: Animate subjects typically take က; objects take ကို
Question Particle Matching:
- ဘယ်သူလဲ (Who is it?) — Correct: ဘယ် question word + လဲ particle
- ဘယ်သူလား — Also valid but different nuance (softer question)
- ဘာလဲ vs ဘာပဲ — Different meanings: “What?” vs “Whatever”
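
As a rough illustration of how a tag-based rule could catch the ကျောင်း / ကြောင်း confusion above, the sketch below checks adjacent POS tags against a table of disallowed pairs. The lexicon `POS_TAGS`, the table `INVALID_TAG_BIGRAMS`, and the function `check_tag_sequence` are hypothetical names invented for this example and are not mySpellChecker's API. The input is assumed to be segmented into words already, and the single Verb-then-Noun rule is deliberately over-simplified.

```python
from typing import Dict, List, Tuple

# Hypothetical POS lexicon using the N / V / P convention described above.
POS_TAGS: Dict[str, str] = {
    "သွား": "V",      # go
    "ကျောင်း": "N",    # school
    "ကြောင်း": "P",    # nominalizer particle
    "ပြောတယ်": "V",   # said
}

# Hypothetical rule table: adjacent tag pairs treated as invalid.
INVALID_TAG_BIGRAMS: Dict[Tuple[str, str], str] = {
    ("V", "N"): "noun cannot directly follow a verb",
}


def check_tag_sequence(words: List[str]) -> List[Tuple[int, str]]:
    """Return (index, message) for every word that breaks a tag-pair rule."""
    issues: List[Tuple[int, str]] = []
    for i in range(1, len(words)):
        prev_tag = POS_TAGS.get(words[i - 1])
        curr_tag = POS_TAGS.get(words[i])
        # POS tags are optional, so untagged words are simply skipped.
        if prev_tag is None or curr_tag is None:
            continue
        message = INVALID_TAG_BIGRAMS.get((prev_tag, curr_tag))
        if message is not None:
            issues.append((i, message))
    return issues


# The invalid and valid sequences from the nominalizer example.
invalid_issues = check_tag_sequence(["သွား", "ကျောင်း", "ပြောတယ်"])
valid_issues = check_tag_sequence(["သွား", "ကြောင်း", "ပြောတယ်"])
```

Here `invalid_issues` holds one entry pointing at the noun in position 1, while `valid_issues` is empty. A real rule set would need exceptions and more context than a single preceding tag.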

N-gram Probability

mySpellChecker uses N-gram models (Bigrams and Trigrams) to calculate the probability of word sequences.

Bigram: Probability of Word B following Word A ($P(B|A)$).
Trigram: Probability of Word C following A and B ($P(C|A,B)$).
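
One common way to obtain such a conditional probability is the maximum-likelihood estimate, count(A, B) / count(A). The sketch below assumes that estimate and a tiny in-memory corpus. How mySpellChecker actually stores and normalizes its frequencies is not specified here, and `bigram_probability` and `toy_corpus` are hypothetical names with made-up data.

```python
from collections import Counter
from typing import List


def bigram_probability(sentences: List[List[str]], prev: str, word: str) -> float:
    """Estimate P(word | prev) as count(prev, word) / count(prev as a context)."""
    context_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for tokens in sentences:
        # Pair every token with its successor.
        for a, b in zip(tokens, tokens[1:]):
            context_counts[a] += 1
            bigram_counts[(a, b)] += 1
    if context_counts[prev] == 0:
        return 0.0  # unseen context: no estimate available
    return bigram_counts[(prev, word)] / context_counts[prev]


# Toy corpus, invented for illustration only.
toy_corpus = [["သူ", "စား"], ["သူ", "စား"], ["သူ", "စာ"]]
p_eat = bigram_probability(toy_corpus, "သူ", "စား")  # 2 of 3 continuations
```

A trigram estimate follows the same pattern with a two-word context in place of `prev`.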

The Algorithm

Detection: When the checker encounters a sequence of words, it queries the database for the frequency of that sequence. If $P(\text{Word}_i | \text{Word}_{i-1})$ is below `bigram_threshold`, the word is flagged as suspicious.
Correction: The system generates candidates for the suspicious word (using SymSpell or Phonetic matching). It then recalculates the probability with each candidate substituted into the sentence.

Example: Input sentence “သူ စာ ချင်တယ်” (suspicious word: စာ)
- Candidate “စား” (eat): $P(\text{စား} | \text{သူ}) = 0.08$ (High: common verb after pronoun)
- Candidate “စာ” (letter): $P(\text{စာ} | \text{သူ}) = 0.002$ (Low: less common as standalone)
The system suggests “စား” because it has the higher probability in this context and fits the verb-wanting pattern “ချင်တယ်”.
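
The detection and correction steps can be sketched as follows. This is a minimal illustration rather than the library's implementation: the table `BIGRAM_PROB` reuses the two figures from the example, the value of `BIGRAM_THRESHOLD` is an assumption (the actual default is not stated here), and candidate generation via SymSpell or phonetic matching is replaced by a hand-written list.

```python
from typing import Dict, List, Optional, Tuple

# Toy probability table holding the two figures from the example above.
BIGRAM_PROB: Dict[Tuple[str, str], float] = {
    ("သူ", "စား"): 0.08,
    ("သူ", "စာ"): 0.002,
}

BIGRAM_THRESHOLD = 0.01  # assumed value, for illustration only


def is_suspicious(prev: str, word: str) -> bool:
    """Detection: flag the word if P(word | prev) is below the threshold."""
    return BIGRAM_PROB.get((prev, word), 0.0) < BIGRAM_THRESHOLD


def best_candidate(prev: str, candidates: List[str]) -> Optional[str]:
    """Correction: pick the candidate with the highest P(candidate | prev)."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: BIGRAM_PROB.get((prev, c), 0.0))


flagged = is_suspicious("သူ", "စာ")
suggestion = best_candidate("သူ", ["စာ", "စား"])
```

With these figures `flagged` is true and `suggestion` is the verb “စား”. A fuller version would also score each candidate against the following word, which is what ties the choice to “ချင်တယ်”.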

Advanced Strategies

The N-gram checker employs several heuristics to handle unseen data and improve accuracy:

1. Backoff Smoothing (Unigram Check)

If a bigram probability is zero (unseen sequence), the checker looks at the unigram frequency of the word.

If the word is very common globally (high unigram frequency), we assume it is likely correct but used in a novel context. It is not flagged as an error.
If the word is rare, it is more likely to be a typo.

Example:

Input: မြန်မာ ဂီတ (Myanmar music) — bigram unseen in corpus
ဂီတ has high unigram frequency (common word for “music”)
Result: Not flagged as error, assumed to be valid novel combination
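
The decision above can be expressed as a small predicate. This is a sketch under stated assumptions: the function name `needs_typo_check`, the cutoff `common_min_count`, and the counts in the example are all invented for illustration, and the library's actual frequency cutoff is not documented in this section.

```python
from typing import Dict, Tuple


def needs_typo_check(
    prev: str,
    word: str,
    bigram_counts: Dict[Tuple[str, str], int],
    unigram_counts: Dict[str, int],
    common_min_count: int = 1000,  # assumed cutoff for "very common"
) -> bool:
    """Return True when an unseen bigram ends in a rare word."""
    if bigram_counts.get((prev, word), 0) > 0:
        return False  # the bigram was seen, so the backoff does not apply
    # Unseen bigram: fall back to the unigram frequency of the word itself.
    return unigram_counts.get(word, 0) < common_min_count


# Unseen bigram ending in a common word (the count is made up).
flag_music = needs_typo_check("မြန်မာ", "ဂီတ", {}, {"ဂီတ": 5000})
```

Here `flag_music` is false, matching the example: the pair is unseen, but ဂီတ is common enough to be accepted as a novel combination.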

2. Typo Heuristic

For unseen rare words, the checker searches for “neighbors” (words with Edit Distance = 1) that fit the context with high probability.

If a neighbor has a high bigram probability ($P > \text{threshold} \times 10$), we assume the current word is a typo of that neighbor and flag it.

Example:

Input: စာအုပ် ဖတ်တတ် (rare/unseen word ဖတ်တတ်)
Neighbor found: ဖတ်တယ် (reads) — Edit Distance = 1
$P(\text{ဖတ်တယ်} | \text{စာအုပ်}) = 0.15$ (high bigram probability)
Result: Flag ဖတ်တတ် as likely typo of ဖတ်တယ်
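
A minimal sketch of this heuristic is shown below. It measures edit distance over Unicode code points, which is an assumption: the library may compare syllables or grapheme clusters instead. The names `edit_distance` and `likely_typo_of` are hypothetical, and the `threshold` of 0.01 is an assumed value.

```python
from typing import Dict, Iterable, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed over Unicode code points."""
    prev_row = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        row = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution
        prev_row = row
    return prev_row[-1]


def likely_typo_of(
    prev: str,
    word: str,
    vocabulary: Iterable[str],
    bigram_prob: Dict[Tuple[str, str], float],
    threshold: float,
) -> Optional[str]:
    """Return the best distance-1 neighbor with P(neighbor | prev) > threshold * 10."""
    best, best_p = None, threshold * 10
    for candidate in vocabulary:
        if edit_distance(word, candidate) != 1:
            continue
        p = bigram_prob.get((prev, candidate), 0.0)
        if p > best_p:
            best, best_p = candidate, p
    return best


# Values from the example above; the threshold is assumed.
vocabulary = ["ဖတ်တယ်"]
bigram_prob = {("စာအုပ်", "ဖတ်တယ်"): 0.15}
suggestion = likely_typo_of("စာအုပ်", "ဖတ်တတ်", vocabulary, bigram_prob, threshold=0.01)
```

With a threshold of 0.01 the cutoff is 0.1, so the neighbor at 0.15 qualifies and is returned. Scanning the whole vocabulary is fine for a sketch, but a production checker would use an index such as SymSpell's to find neighbors.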

Tone Disambiguation

In the Myanmar language, tone marks ( ့, း) drastically change the meaning of a word. Many spelling errors involve missing or incorrect tone marks (e.g., ငါ vs ငါး). The ToneDisambiguator module uses a specialized context window to resolve these ambiguities.

How it Works

It maintains a list of Ambiguous Groups (e.g., the “Three Tones of Ka”). When it encounters a word from such a group, it checks the surrounding +/- 3 words against a set of context patterns.

Example 1: သံ (Sound/Iron) vs သုံး (Three)

Input: သံ ယောက် (Iron person?)
Context: ယောက် (classifier for people) follows the word.
Pattern Match: The pattern ("ယောက်", "ခု", "လုံး") is associated with the number သုံး (Three).
Correction: သုံး ယောက် (Three people).

Example 2: ငါ (I/me) vs ငါး (Fish/Five)

Input: ငါ ကောင် (I animal?)
Context: ကောင် (classifier for animals) follows the word.
Pattern Match: Classifiers for counting animals/fish follow numbers.
Correction: ငါး ကောင် (Five animals/fish).

Example 3: စ (Beginning) vs စ့ (Pierce) vs စာ (Letter)

Input: အစာ စား vs အစ စား
Context: စား (eat) follows — eating requires food (အစာ)
Correction: အစာ စား (Eat food) — not အစ စား (Eat beginning)

Example 4: ကြ (Plural marker) vs ကြီး (Big)

Input: သူတို့ သွားကြီး (They go big?)
Context: သူတို့ (they) is a plural pronoun, expects plural verb marker
Correction: သူတို့ သွားကြ (They go) — plural marker ကြ after verb

This system operates alongside the N-gram checker but provides higher confidence for specific, well-known ambiguity patterns.
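
The window-and-pattern lookup described in this section might look roughly like the sketch below. It does not reproduce the ToneDisambiguator module: the table `CONTEXT_PATTERNS`, the function `disambiguate`, and the data layout are assumptions made for illustration. Only the first two examples are encoded, and the window is treated as symmetric, whereas real patterns are likely position-sensitive.

```python
from typing import List, Set, Tuple

WINDOW = 3  # look at up to 3 words on each side, as described above

# Hypothetical pattern table: (ambiguous group, cue words, preferred member).
CONTEXT_PATTERNS: List[Tuple[Set[str], Set[str], str]] = [
    ({"သံ", "သုံး"}, {"ယောက်", "ခု", "လုံး"}, "သုံး"),  # classifiers imply a number
    ({"ငါ", "ငါး"}, {"ကောင်"}, "ငါး"),                 # animal classifier
]


def disambiguate(words: List[str]) -> List[str]:
    """Replace ambiguous words when a cue word appears within the window."""
    result = list(words)
    for i, word in enumerate(words):
        start = max(0, i - WINDOW)
        end = min(len(words), i + WINDOW + 1)
        # Context is every word in the window except the word itself.
        context = set(words[start:i]) | set(words[i + 1:end])
        for group, cues, preferred in CONTEXT_PATTERNS:
            if word in group and context & cues:
                result[i] = preferred
                break
    return result


three_people = disambiguate(["သံ", "ယောက်"])
five_fish = disambiguate(["ငါ", "ကောင်"])
```

Both inputs are rewritten to the corrections shown in Examples 1 and 2, and a sentence with no cue word in range is returned unchanged.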

Common Myanmar Grammar Errors Detected

| Error Type | Example (Incorrect) | Correction | Rule Applied |
| --- | --- | --- | --- |
| Homophone confusion | `စာ ချင်တယ်` | `စား ချင်တယ်` | Context: verb pattern |
| Missing particle | `ကျောင်း သွား` | `ကျောင်းကို သွား` | Verb requires object marker |
| Wrong particle | `ရုံးမှာ လာတယ်` | `ရုံးမှ လာတယ်` | Motion verb needs `မှ` (from) |
| Tone mark error | `သုံ ယောက်` | `သုံး ယောက်` | Classifier context |
| Plural marker | `သူတို့ သွားတယ်` | `သူတို့ သွားကြတယ်` | Plural subject agreement |
| Nominalizer | `သွားကျောင်း` | `သွားကြောင်း` | Verb nominalization |
