mySpellChecker includes a Named Entity Recognition (NER) module to reduce false positives by identifying names, locations, and organizations in Myanmar text.

Overview

Named entities such as personal names and place names often appear as “unknown words” to spell checkers. The NER module identifies these entities so that the spell checker does not flag them as errors.

Entity Types Supported:
  • PER - Personal names (e.g., ကိုအောင်)
  • LOC - Locations (e.g., ရန်ကုန်မြို့)
  • ORG - Organizations (e.g., မြန်မာ့လေကြောင်း)
  • DATE - Date expressions
  • NUM - Numbers and numeric expressions
  • TIME - Time expressions
  • MISC - Miscellaneous named entities

NER Implementations

mySpellChecker provides three NER implementations with different accuracy/speed trade-offs:
| Implementation | Accuracy | Speed | Dependencies |
| --- | --- | --- | --- |
| HeuristicNER | ~70% | Fast | None |
| TransformerNER | ~93% | Slow | transformers, torch |
| HybridNER | ~93% | Adaptive | transformers (optional) |

HeuristicNER

Fast, rule-based NER using patterns and whitelists. Ideal for real-time applications.

Features:
  • Honorific-based name detection (ဦး, ဒေါ်, ကို, မ)
  • Location suffix detection (မြို့, ရွာ, ပြည်နယ်)
  • Organization pattern matching (ကုမ္ပဏီ, ဘဏ်, တက္ကသိုလ်)
  • Whitelist support for known entities
  • No external dependencies
from myspellchecker.text.ner_model import HeuristicNER, NERConfig

# Basic usage
ner = HeuristicNER()
entities = ner.extract_entities("ဦးအောင်သည် ရန်ကုန်မြို့တွင် နေသည်။")

for entity in entities:
    print(f"{entity.text}: {entity.label.value} ({entity.confidence:.2f})")
# Output:
# အောင်: PER (0.70)
# ရန်ကုန်မြို့: LOC (0.70)
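
To make the idea behind the honorific and suffix rules concrete, here is a minimal, self-contained sketch. It is not HeuristicNER's actual implementation: the function name `guess_label`, the rule order, and the return values are assumptions chosen for illustration, and only the marker strings come from the feature list above.

```python
# Hypothetical sketch of prefix/suffix heuristics; not myspellchecker's code.
HONORIFICS = ("ဦး", "ဒေါ်", "ကို", "မ")
LOCATION_SUFFIXES = ("မြို့", "ရွာ", "ပြည်နယ်")
ORG_MARKERS = ("ကုမ္ပဏီ", "ဘဏ်", "တက္ကသိုလ်")


def guess_label(token, whitelist=frozenset()):
    """Return "PER", "LOC", "ORG", or None for a single token."""
    if token in whitelist:
        return "PER"  # whitelisted names are always accepted
    # Suffix rules run first so that a place name which happens to start
    # with a short honorific (for example မ) is still labelled LOC.
    if any(token.endswith(s) and len(token) > len(s) for s in LOCATION_SUFFIXES):
        return "LOC"
    if any(token.endswith(m) and len(token) > len(m) for m in ORG_MARKERS):
        return "ORG"
    if any(token.startswith(h) and len(token) > len(h) for h in HONORIFICS):
        return "PER"
    return None


for token in ("ဦးအောင်", "ရန်ကုန်မြို့", "စာအုပ်"):
    print(token, guess_label(token))
```

A rule set this small over-matches, most visibly on the single-syllable honorific မ, which is one reason a whitelist and an explicit confidence value are useful alongside the patterns.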

TransformerNER

High-accuracy NER using HuggingFace transformer models.

Features:
  • State-of-the-art accuracy (~93%)
  • BIO tagging for multi-word entities
  • Confidence scores for each prediction
  • Batch processing support
  • LRU result caching for performance
from myspellchecker.text.ner_model import TransformerNER, NERConfig

# Using factory method
ner = TransformerNER.from_pretrained(
    "chuuhtetnaing/myanmar-ner-model",
    device=0,  # GPU (use -1 for CPU)
    confidence_threshold=0.7
)

entities = ner.extract_entities("ကိုအောင်သည် ရန်ကုန်မြို့တွင် နေသည်။")
for entity in entities:
    print(f"{entity.text}: {entity.label.value} ({entity.confidence:.2f})")
Requirements:
pip install myspellchecker[transformers]
# or
pip install transformers torch
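
For readers unfamiliar with BIO tagging, the sketch below shows how per-token B-/I-/O tags can be merged into multi-word entities. It is a generic illustration under stated assumptions, not TransformerNER's internal decoder: the function name `decode_bio` and the token segmentation in the example are hypothetical.

```python
def decode_bio(tokens, tags):
    """Merge BIO-tagged tokens into a list of (label, text) entities.

    A stray I- tag with no matching open span starts a new span.
    """
    spans = []
    current = None  # [label, start, end) of the span being built
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == label:
            current[2] = i + 1  # extend the open span by one token
            continue
        if current is not None:
            spans.append(current)  # close the previous span
            current = None
        if prefix in ("B", "I"):
            current = [label, i, i + 1]
    if current is not None:
        spans.append(current)
    # Myanmar text is written without spaces, so tokens are joined directly.
    return [(label, "".join(tokens[start:end])) for label, start, end in spans]


tokens = ["ဦး", "အောင်", "မြင့်", "သည်", "မန္တလေး", "မြို့", "တွင်"]
tags = ["O", "B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
for label, text in decode_bio(tokens, tags):
    print(label, text)
```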

HybridNER

Combines transformer and heuristic approaches. Uses the transformer as the primary model, with automatic fallback to heuristics.

Features:
  • Best of both approaches
  • Graceful degradation if transformer unavailable
  • Automatic fallback on transformer errors
  • Configurable fallback behavior
from myspellchecker.text.ner_model import NERFactory, NERConfig

# HybridNER via factory
config = NERConfig(
    model_type="transformer",
    model_name="chuuhtetnaing/myanmar-ner-model",
    fallback_to_heuristic=True  # Use heuristics if transformer fails
)
ner = NERFactory.create(config)

entities = ner.extract_entities("ဦးအောင်မြင့်သည် မန္တလေးမြို့တွင် နေသည်။")
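
The fallback behaviour described above can be pictured as a thin wrapper around two extractors. The following is a sketch only: `FallbackNER`, `broken_transformer`, and `simple_heuristic` are hypothetical names, and the real HybridNER may decide when to fall back differently.

```python
import logging

logger = logging.getLogger(__name__)


class FallbackNER:
    """Hypothetical wrapper: call `primary`, use `fallback` if it raises."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback

    def extract_entities(self, text):
        try:
            return self.primary(text)
        except Exception as exc:  # e.g. model not installed or failed to load
            logger.warning("primary NER failed (%s); using fallback", exc)
            return self.fallback(text)


def broken_transformer(text):
    # Stand-in for a transformer backend that is unavailable.
    raise RuntimeError("model not loaded")


def simple_heuristic(text):
    # Stand-in for a rule-based extractor.
    return [("PER", text)]


hybrid = FallbackNER(broken_transformer, simple_heuristic)
print(hybrid.extract_entities("ဦးအောင်"))
```

Catching a broad `Exception` keeps the checker usable when the model is missing, at the cost of hiding real bugs; logging the failure, as done here, is one way to keep them visible.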

Integration with SpellChecker

NER is fully integrated into the SpellChecker pipeline. When enabled, the NER model:
  1. Provides name masks to the ContextValidator (for strategies to skip named entities)
  2. Filters errors post-validation — any error overlapping a detected entity is removed
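
The post-validation step amounts to an interval-overlap test between error spans and entity spans. Below is a minimal sketch of that test, assuming half-open `(start, end)` character ranges; the function names are illustrative and are not part of the myspellchecker API.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def drop_errors_in_entities(error_spans, entity_spans):
    """Keep only the error spans that do not overlap any entity span."""
    return [
        err
        for err in error_spans
        if not any(overlaps(err[0], err[1], ent[0], ent[1]) for ent in entity_spans)
    ]


errors = [(0, 5), (10, 14), (20, 25)]
entities = [(3, 8), (25, 30)]
print(drop_errors_in_entities(errors, entities))  # [(10, 14), (20, 25)]
```

With half-open ranges, an error that ends exactly where an entity begins, such as `(20, 25)` next to `(25, 30)`, is not treated as overlapping and is kept.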

Basic Usage (Heuristic NER)

from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig
from myspellchecker.providers import SQLiteProvider

# Heuristic NER is enabled by default via use_ner=True
config = SpellCheckerConfig(use_ner=True)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

result = checker.check("ဦးအောင်သည် စာအုပ်ဖတ်သည်။")
# "အောင်" will not be flagged as an error

With Transformer NER

For highest accuracy, configure NERConfig with a transformer model:
from myspellchecker import SpellChecker
from myspellchecker.core.config import SpellCheckerConfig, NERConfig
from myspellchecker.providers import SQLiteProvider

config = SpellCheckerConfig(
    ner=NERConfig(
        model_type="transformer",
        model_name="chuuhtetnaing/myanmar-ner-model",
        device=0,  # GPU index, -1 for CPU
        fallback_to_heuristic=True,  # Graceful degradation
    ),
)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

CLI Usage

# Check with default heuristic NER (enabled by default)
myspellchecker check input.txt

# Check with transformer NER model
myspellchecker check input.txt --ner-model chuuhtetnaing/myanmar-ner-model

# Check with transformer NER on GPU
myspellchecker check input.txt --ner-model chuuhtetnaing/myanmar-ner-model --ner-device 0

# Disable NER entirely
myspellchecker check input.txt --no-ner

Disabling NER

# Disable NER for speed
config = SpellCheckerConfig(use_ner=False)
provider = SQLiteProvider(database_path="path/to/dictionary.db")
checker = SpellChecker(config=config, provider=provider)

NERConfig Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | True | Enable/disable NER |
| model_type | str | "heuristic" | "heuristic" or "transformer" |
| model_name | str | "chuuhtetnaing/myanmar-ner-model" | HuggingFace model name |
| device | int | -1 | Device index (-1 = CPU, 0 or higher = GPU) |
| confidence_threshold | float | 0.5 | Minimum confidence to accept |
| heuristic_confidence | float | 0.7 | Confidence for heuristic results |
| batch_size | int | 32 | Batch size for transformer |
| cache_size | int | 1000 | LRU cache size |
| fallback_to_heuristic | bool | True | Use heuristics if transformer fails |

Entity Data Structure

The Entity dataclass represents detected entities:
@dataclass
class Entity:
    text: str          # Entity text
    label: EntityType  # PER, LOC, ORG, DATE, NUM, TIME, MISC
    start: int         # Start character position
    end: int           # End character position
    confidence: float  # 0.0 to 1.0
    metadata: dict     # Additional info (source, pattern, etc.)
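
To show how the fields relate to each other, the stand-in below mirrors the dataclass with hypothetical names (`EntitySketch`, `EntityTypeSketch`, `find_entity`). It assumes that `start` and `end` are Python string offsets with an exclusive end, and that `metadata` defaults to an empty dict; neither assumption is confirmed by the definition above.

```python
from dataclasses import dataclass, field
from enum import Enum


class EntityTypeSketch(Enum):
    # Member names are illustrative; only the string values come from the docs.
    PERSON = "PER"
    LOCATION = "LOC"


@dataclass
class EntitySketch:
    text: str
    label: EntityTypeSketch
    start: int
    end: int
    confidence: float
    metadata: dict = field(default_factory=dict)  # assumed default


def find_entity(source, surface, label, confidence):
    """Build an EntitySketch for the first occurrence of `surface` in `source`."""
    start = source.index(surface)
    return EntitySketch(surface, label, start, start + len(surface), confidence)


sentence = "ဦးအောင်သည် ရန်ကုန်မြို့တွင် နေသည်။"
loc = find_entity(sentence, "ရန်ကုန်မြို့", EntityTypeSketch.LOCATION, 0.7)
# Under the exclusive-end assumption, slicing the source recovers the entity text.
assert sentence[loc.start:loc.end] == loc.text
print(loc.label.value, loc.confidence)
```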

Advanced Usage

Batch Processing

Process multiple texts efficiently:
texts = [
    "ဦးအောင်သည် ရန်ကုန်တွင် နေသည်။",
    "ဒေါ်မြင့်မြင့်သည် မန္တလေးသို့ သွားသည်။",
    "ကိုဇော်ဇော်သည် ပုဂံမြို့နယ်တွင် အလုပ်လုပ်သည်။"
]

# `ner` is an NER instance created earlier, e.g. with TransformerNER.from_pretrained(...)
all_entities = ner.extract_entities_batch(texts)
for i, entities in enumerate(all_entities):
    print(f"Text {i+1}: {[e.text for e in entities]}")

Custom Whitelist

Add known names to reduce false negatives:
from myspellchecker.text.ner import NameHeuristic

# Create heuristic with custom whitelist
whitelist = {"ရွှေစာ", "ချစ်စုလှိုင်", "မောင်မောင်"}
heuristic = NameHeuristic(whitelist=whitelist)

# These will always be recognized as names
is_name = heuristic.is_potential_name("ရွှေစာ")  # True

Filter Errors Using NER

Manually filter spell checking errors:
from myspellchecker.text.ner_model import HeuristicNER, filter_entities_from_errors

# Get spell checking errors
# (`checker` is a configured SpellChecker and `text` is the input string)
result = checker.check(text)
errors = result.errors

# Filter out entities
ner = HeuristicNER()
filtered_errors = filter_entities_from_errors(text, errors, ner)

Mock NER for Testing

Use MockTransformerNER for unit tests:
from myspellchecker.text.ner_model import MockTransformerNER, Entity, EntityType

# Create mock with predefined responses
predictions = {
    "ကိုအောင်သည်": [
        Entity("အောင်", EntityType.PERSON, 3, 9, 0.95)
    ]
}
mock_ner = MockTransformerNER(predictions)

# Or use factory methods
mock = MockTransformerNER.with_default_predictions()
mock = MockTransformerNER.returning([Entity(...)])  # Always return these
mock = MockTransformerNER.empty()  # Return no entities

# Track calls for assertions
entities = mock.extract_entities("test")
assert mock.call_count == 1
assert mock.call_history == ["test"]

Performance Tips

  1. Real-time typing: Use HeuristicNER for fastest response
  2. Document checking: Use HybridNER for balance
  3. Batch processing: Use TransformerNER with batching
  4. High throughput: Enable result caching
# High-performance configuration
config = NERConfig(
    model_type="transformer",
    model_name="chuuhtetnaing/myanmar-ner-model",
    batch_size=64,      # Larger batches for throughput
    cache_size=5000,    # Larger cache for repeated texts
    device=0            # Use GPU if available
)
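
The effect of result caching on repeated texts can be demonstrated with the standard library's `functools.lru_cache`. This is a generic sketch, not the library's cache: `slow_extract` is a hypothetical stand-in for an expensive model call, and `maxsize=1000` simply mirrors the documented default for `cache_size`.

```python
from functools import lru_cache


def slow_extract(text):
    """Stand-in for an expensive NER call; here it just splits on whitespace."""
    return tuple(text.split())


@lru_cache(maxsize=1000)
def cached_extract(text):
    # Returning a tuple keeps callers from mutating the cached value.
    return slow_extract(text)


cached_extract("ဦးအောင်သည် နေသည်။")
cached_extract("ဦးအောင်သည် နေသည်။")  # identical text: served from the cache
print(cached_extract.cache_info())
```

A cache keyed on the full input string only helps when identical texts recur, for example repeated sentences in a batch or re-checks of an unchanged paragraph.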

See Also