Spell checkers often struggle with proper nouns (names of people, places, and organizations). These words are rarely in the standard dictionary, which leads to false positives (red underlines on valid names). myspellchecker includes a lightweight, heuristic-based NER system that detects potential named entities and skips spell checking for them.

Approach: Heuristic & Rule-Based

Unlike heavy machine learning models (like BERT/CRF) that require massive labeled datasets, our NER system relies on linguistic cues specific to the Myanmar language. This keeps it fast and makes it effective for the narrow task of “ignoring names”.

1. Honorific Detection

Myanmar names are almost always preceded by an honorific (title). Detecting these titles is a high-precision signal that the next word is a name.

Supported Honorifics:
  • ဦး (U) - Mr. (formal/older)
  • ဒေါ် (Daw) - Ms./Mrs.
  • ကို (Ko) - Mr. (younger/informal)
  • မ (Ma) - Ms.
  • မောင် (Maung) - Master (young male)
  • ဒေါက်တာ (Dr.)
  • ရှင် (Shin) - Monk/Novice
  • ဗိုလ် (Bo) - Officer
Logic: If word $W_i$ is an honorific, then $W_{i+1}$ (and potentially $W_{i+2}$) is treated as a name and excluded from error flagging.
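
The honorific rule can be sketched in a few lines. This is a minimal illustration, not the library's implementation: `HONORIFICS`, `name_positions`, and the `span` parameter are hypothetical names, and the input is assumed to be already segmented into words.

```python
# Hypothetical sketch of the honorific rule; names here are not part of myspellchecker.
HONORIFICS = {"ဦး", "ဒေါ်", "ကို", "မ", "မောင်", "ဒေါက်တာ", "ရှင်", "ဗိုလ်"}


def name_positions(tokens, span=1):
    """Return the indices of tokens that follow an honorific.

    span=1 covers only the next word; span=2 also covers the word after it.
    """
    skipped = set()
    for i, tok in enumerate(tokens):
        if tok in HONORIFICS:
            # Mark up to `span` following tokens, without running past the end.
            for j in range(i + 1, min(i + 1 + span, len(tokens))):
                skipped.add(j)
    return skipped
```

A caller would then leave any token whose index is in the returned set out of error flagging. Always skipping a fixed number of tokens is a simplification; the text only says the second word is "potentially" included.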

2. Regex Patterns

The system also whitelists non-dictionary tokens that follow specific patterns:
  • English Words: [A-Za-z]+ (e.g., “GPS”, “TV”)
  • Numbers: Standard digits (0-9) and Myanmar digits (၀-၉).
  • Dates/Symbols: 12/12/2024, ISO-8859, etc.
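
As a rough sketch of the pattern whitelist, the snippet below checks a token against a handful of regular expressions. Only `[A-Za-z]+` comes from the list above; the other expressions and the name `is_whitelisted` are assumptions made for illustration, and the library's actual patterns may differ.

```python
import re

# Illustrative patterns only; everything except ENGLISH_WORD is an assumption.
ENGLISH_WORD = re.compile(r"[A-Za-z]+")
NUMBER = re.compile(r"[0-9\u1040-\u1049]+")  # ASCII digits and Myanmar digits ၀-၉
DATE = re.compile(r"[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}")  # e.g. 12/12/2024
CODE_LIKE = re.compile(r"[A-Za-z]+-[0-9]+")  # e.g. ISO-8859

PATTERNS = (ENGLISH_WORD, NUMBER, DATE, CODE_LIKE)


def is_whitelisted(token):
    """True if the whole token matches one of the whitelist patterns."""
    return any(p.fullmatch(token) for p in PATTERNS)
```

`fullmatch` is used rather than `search` so that a Myanmar word that merely contains a digit or a Latin letter is still spell checked.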

3. Common Name Syllables

A “soft” heuristic checks for syllables that appear frequently in names but rarely in other contexts (e.g., သီ, နွယ်, ဆွေ in certain positions). This is used in conjunction with context checks.
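
One way to picture a "soft" signal of this kind is a score that only counts when a context check agrees. The sketch below is an assumption throughout: the function names, the 0.5 threshold, and the use of "not in the dictionary" as the context check are hypothetical, and it ignores syllable position. Only the three example syllables come from the text.

```python
# Hypothetical sketch; the real syllable list and scoring are not documented here.
NAME_SYLLABLES = {"သီ", "နွယ်", "ဆွေ"}


def name_syllable_score(syllables):
    """Fraction of a word's syllables that are common in names."""
    if not syllables:
        return 0.0
    hits = sum(1 for s in syllables if s in NAME_SYLLABLES)
    return hits / len(syllables)


def looks_like_name(syllables, in_dictionary, threshold=0.5):
    """Combine the soft score with a context check (here: the word is not in the dictionary)."""
    return (not in_dictionary) and name_syllable_score(syllables) >= threshold
```

The input is assumed to be a word already split into syllables. Because the score alone is weak evidence, the sketch never lets it override a dictionary hit.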

Usage

NER is enabled by default in SpellCheckerConfig.
from myspellchecker.core.config import SpellCheckerConfig

# Enabled by default
config = SpellCheckerConfig(use_ner=True)

# Disable if you want to strictly validate every word
config = SpellCheckerConfig(use_ner=False)

Internal Class: NameHeuristic

For advanced usage or direct access:
from myspellchecker.text.ner import NameHeuristic

ner = NameHeuristic()

# Check a word given its previous context
is_name = ner.is_potential_name(word="အောင်", prev_word="မောင်")
# True (because 'မောင်' is an honorific)

is_name = ner.is_potential_name(word="စာ", prev_word="မောင်")
# False (common word 'စာ' is unlikely to be a name here, 
# though this logic is simplified for speed)
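
To apply a check like this across a sentence, each word has to be paired with its predecessor. The helper below is a hypothetical sketch (`collect_names` is not part of the library). It accepts any callable with the `word=` / `prev_word=` keyword signature shown above, so it can be tried with a stand-in predicate before passing `ner.is_potential_name`.

```python
def collect_names(tokens, is_potential_name):
    """Return the tokens that the predicate treats as names.

    The first token is skipped because it has no previous word; whether the
    library accepts prev_word=None is not assumed here.
    """
    names = []
    prev = None
    for tok in tokens:
        if prev is not None and is_potential_name(word=tok, prev_word=prev):
            names.append(tok)
        prev = tok
    return names


# Stand-in predicate for experimentation: anything after "မောင်" counts as a name.
def demo_predicate(word, prev_word):
    return prev_word == "မောင်"
```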

Limitations

  • Recall vs. Precision: This system prioritizes precision (rarely treating a genuine misspelling as a name) over recall (finding every single name). It catches roughly 80% of common personal names.
  • Places/Organizations: It is less effective at detecting place names or company names unless they are in the dictionary or look like English.