POS Inference Manager

Corpus-derived dictionaries often leave many words without POS tags because the words were absent from the seed data, are domain-specific, or are newly encountered. This module fills that gap by applying suffix patterns, prefix patterns, numeral detection, proper-noun patterns, and an ambiguous-word registry to infer POS tags with confidence scores.

Overview

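import sqlite3

from myspellchecker.data_pipeline.pos_inference_manager import POSInferenceManager

# Illustrative setup: the database path is a placeholder
conn = sqlite3.connect("dictionary.db")
cursor = conn.cursor()

# console is optional; a default PipelineConsole is created when omitted
manager = POSInferenceManager(conn, cursor)

# Apply POS inference to untagged words
stats = manager.apply_inferred_pos(min_frequency=5)
print(f"Inferred POS for {stats['inferred']} words")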

Purpose

During dictionary building, many words lack POS tags because:

They weren’t in the POS seed data
They’re domain-specific terms
They’re newly encountered words

The POSInferenceManager fills this gap using morphological rules.

POSInferenceManager Class

class POSInferenceManager:
    """Manages POS inference for the database.

    Responsibilities:
    - Apply rule-based POS inference to words
    - Track POS coverage statistics
    - Report inference progress
    """

    def __init__(
        self,
        conn: sqlite3.Connection,
        cursor: sqlite3.Cursor,
        console: Optional[PipelineConsole] = None,
    ):
        self.conn = conn
        self.cursor = cursor
        self.console = console or PipelineConsole()

Applying POS Inference

Basic Usage

manager = POSInferenceManager(conn, cursor)

# Apply inference to all untagged words
stats = manager.apply_inferred_pos()

With Options

stats = manager.apply_inferred_pos(
    min_frequency=5,        # Only infer for words with freq >= 5
    skip_tagged=True,       # Skip words that already have pos_tag
    min_confidence=0.6,     # Only apply if confidence >= 0.6
    in_transaction=False,   # Commit after updates
)

Parameters

Parameter	Default	Description
`min_frequency`	0	Minimum word frequency threshold
`skip_tagged`	True	Skip words with existing pos_tag
`min_confidence`	0.0	Minimum confidence for inference
`in_transaction`	False	If True, skip the commit so the caller manages the transaction
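
The `in_transaction` flag matters when inference is one step of a larger batch that should commit or roll back as a unit. The following is a minimal sketch of that caller-managed pattern using only `sqlite3`. The function `fake_apply_inferred_pos` is a hypothetical stand-in for the manager call, and its table layout and update rule are assumptions made for illustration, not the library's implementation.

```python
import sqlite3


def fake_apply_inferred_pos(cursor, in_transaction=False):
    """Hypothetical stand-in for manager.apply_inferred_pos()."""
    # Assumed rule for the demo: tag every untagged word as a noun.
    cursor.execute("UPDATE words SET inferred_pos = 'N' WHERE pos_tag IS NULL")
    updated = cursor.rowcount
    if not in_transaction:
        # Default behaviour described in the table: commit after updates.
        cursor.connection.commit()
    return {"inferred": updated}


conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE words (word TEXT, pos_tag TEXT, inferred_pos TEXT)")
cursor.executemany(
    "INSERT INTO words VALUES (?, ?, NULL)", [("a", None), ("b", "V")]
)
conn.commit()

try:
    # The caller owns the transaction, so nothing is committed inside the call.
    stats = fake_apply_inferred_pos(cursor, in_transaction=True)
    # ... further updates that belong to the same unit of work ...
    conn.commit()
except sqlite3.Error:
    conn.rollback()
    raise
```

With `in_transaction=True`, a failure in any later step rolls back the inferred tags together with the rest of the batch.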

Return Statistics

stats = manager.apply_inferred_pos()

print(stats)
# {
#     "total_words": 50000,
#     "inferred": 35000,
#     "skipped_tagged": 10000,
#     "skipped_low_conf": 2000,
#     "ambiguous": 5000,
#     "by_source": {
#         "suffix_pattern": 20000,
#         "prefix_pattern": 5000,
#         "numeral_detection": 1000,
#         "proper_noun_suffix": 3000,
#         "ambiguous_registry": 6000,
#     }
# }

Statistics Fields

Field	Description
`total_words`	Total words processed
`inferred`	Words with successful inference
`skipped_tagged`	Words skipped (already had pos_tag)
`skipped_low_conf`	Words skipped due to low confidence
`ambiguous`	Words with multi-POS (e.g., "N|V")
`by_source`	Breakdown by inference source
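
To see which strategy contributes most, the `by_source` breakdown can be turned into shares of all inferred words. This is a small sketch over the example dictionary shown above; `source_shares` is a hypothetical helper, and it assumes the `by_source` counts add up to `inferred`, which holds for the example figures but is not guaranteed by the source.

```python
def source_shares(stats):
    """Return each inference source's share of all inferred words (0.0 to 1.0)."""
    inferred = stats.get("inferred", 0)
    if inferred == 0:
        return {}
    return {src: count / inferred for src, count in stats.get("by_source", {}).items()}


# Figures copied from the example return value above.
example = {
    "total_words": 50000,
    "inferred": 35000,
    "by_source": {
        "suffix_pattern": 20000,
        "prefix_pattern": 5000,
        "numeral_detection": 1000,
        "proper_noun_suffix": 3000,
        "ambiguous_registry": 6000,
    },
}

shares = source_shares(example)
# Print sources from largest to smallest contribution.
for src, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{src:<20} {share:6.1%}")
```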

Inference Sources

The POSInferenceEngine uses multiple strategies:

1. Suffix Patterns

# Words ending in common suffixes
"စားခဲ့သည်" → "V"  # Verb ending -သည်
"ကျောင်းသား" → "N" # Noun ending -သား

2. Prefix Patterns

# Words starting with common prefixes
"အလုပ်" → "N"  # အ- prefix (nominalization)
"မသွား" → "V"  # မ- prefix (negation)

3. Numeral Detection

# Numeric patterns
"၁၂၃" → "NUM"
"တစ်ရာ" → "NUM"

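Numeral detection of the kind shown above could be sketched as follows. Myanmar digits occupy the Unicode range U+1040 to U+1049, so digit-only strings can be recognised without a lexicon. Spelled-out numerals need a word list; `NUMBER_WORDS` here is a hypothetical lexicon containing only the example from this page, and `looks_like_numeral` is an illustrative name rather than a library function.

```python
# Myanmar digits zero through nine, U+1040..U+1049.
MYANMAR_DIGITS = {chr(cp) for cp in range(0x1040, 0x104A)}

# Hypothetical lexicon of spelled-out numerals; only this page's example is listed.
NUMBER_WORDS = {"တစ်ရာ"}


def looks_like_numeral(word: str) -> bool:
    """Return True for digit-only strings or known number words."""
    if not word:
        return False
    if all(ch in MYANMAR_DIGITS for ch in word):
        return True
    return word in NUMBER_WORDS


for candidate in ["\u1041\u1042\u1043", "တစ်ရာ", "အလုပ်"]:
    print(candidate, looks_like_numeral(candidate))
```
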
4. Proper Noun Patterns

# Naming patterns such as title + name (Myanmar script has no capitalization)
"ကိုမောင်" → "N"  # Title + name

5. Ambiguous Words Registry

# Known multi-POS words
"ကြီး" → "ADJ|N|V"  # Registered as ambiguous

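One way the strategies above might be combined is a fixed priority order in which the first matching rule wins. The sketch below is not the engine's actual logic: the rule tables reuse only the examples from this page, and the confidence values, the ordering, and the names `Inference` and `infer_pos` are all assumptions.

```python
from typing import NamedTuple, Optional


class Inference(NamedTuple):
    pos: str
    confidence: float
    source: str


# All rules and confidence values below are illustrative assumptions.
AMBIGUOUS = {"ကြီး": "ADJ|N|V"}
SUFFIX_RULES = [("သည်", "V", 0.9), ("သား", "N", 0.8)]
PREFIX_RULES = [("အ", "N", 0.7), ("မ", "V", 0.7)]


def infer_pos(word: str) -> Optional[Inference]:
    """Return the first matching inference, or None if no rule applies."""
    # Registry entries take priority because they are explicit.
    if word in AMBIGUOUS:
        return Inference(AMBIGUOUS[word], 1.0, "ambiguous_registry")
    # A suffix must leave a non-empty stem to count as a match.
    for suffix, pos, conf in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return Inference(pos, conf, "suffix_pattern")
    for prefix, pos, conf in PREFIX_RULES:
        if word.startswith(prefix) and len(word) > len(prefix):
            return Inference(pos, conf, "prefix_pattern")
    return None


for w in ["စားခဲ့သည်", "ကျောင်းသား", "အလုပ်", "မသွား", "ကြီး"]:
    print(w, infer_pos(w))
```

Returning the source alongside the tag is what allows per-source statistics and a `min_confidence` filter to be applied afterwards.
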
POS Coverage Statistics

Check POS tag coverage in the database:

stats = manager.get_pos_coverage_stats()

print(stats)
# {
#     "total_words": 100000,
#     "with_pos_tag": 30000,      # From seed data
#     "with_inferred_pos": 45000, # From inference
#     "combined_coverage": 65000, # Either source
#     "no_pos": 35000,            # No POS info
#     "ambiguous": 5000,          # Multi-POS words
# }

Coverage Calculation

coverage_pct = (stats["combined_coverage"] / stats["total_words"]) * 100
print(f"POS Coverage: {coverage_pct:.1f}%")
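
The calculation above divides by `total_words`, which fails on an empty database. A guarded variant is sketched below; it also derives how many words carry both a seed tag and an inferred tag, assuming `combined_coverage` counts words with either source exactly once. `coverage_report` is a hypothetical helper, not part of the manager.

```python
def coverage_report(stats: dict) -> dict:
    """Compute coverage percentage and seed/inferred overlap from coverage stats."""
    total = stats.get("total_words", 0)
    if total <= 0:
        # Empty database: report zero instead of dividing by zero.
        return {"coverage_pct": 0.0, "overlap": 0}
    combined = stats.get("combined_coverage", 0)
    # Inclusion-exclusion: words counted under both sources.
    overlap = stats.get("with_pos_tag", 0) + stats.get("with_inferred_pos", 0) - combined
    return {"coverage_pct": combined / total * 100, "overlap": overlap}


# Figures copied from the example coverage stats above.
sample = {
    "total_words": 100000,
    "with_pos_tag": 30000,
    "with_inferred_pos": 45000,
    "combined_coverage": 65000,
}
report = coverage_report(sample)
print(f"POS Coverage: {report['coverage_pct']:.1f}%")
print(f"Words with both seed and inferred tags: {report['overlap']}")
```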

Database Schema

The manager updates these columns:

-- Words table columns for inferred POS
ALTER TABLE words ADD COLUMN inferred_pos TEXT;
ALTER TABLE words ADD COLUMN inferred_confidence REAL;
ALTER TABLE words ADD COLUMN inferred_source TEXT;

Column Usage

Column	Description	Example
`pos_tag`	From seed data	"N"
`inferred_pos`	From inference	"N|V"
`inferred_confidence`	Confidence score	0.85
`inferred_source`	Inference method	"suffix_pattern"
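
Because seed tags and inferred tags live in separate columns, a consumer has to decide which one to read. The sketch below builds the columns in an in-memory SQLite database and resolves an effective tag with `COALESCE`, preferring the seed tag. The base table definition, the `frequency` column, the sample rows, and the seed-first preference are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed minimal base table; the real schema may differ.
    CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER, pos_tag TEXT);
    ALTER TABLE words ADD COLUMN inferred_pos TEXT;
    ALTER TABLE words ADD COLUMN inferred_confidence REAL;
    ALTER TABLE words ADD COLUMN inferred_source TEXT;
    """
)
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("w1", 10, "N", None, None, None),               # seed tag only
        ("w2", 7, None, "N|V", 0.85, "suffix_pattern"),  # inferred tag only
        ("w3", 1, None, None, None, None),               # no POS information
    ],
)

# Prefer the seed tag; fall back to the inferred tag; NULL if neither exists.
rows = conn.execute(
    "SELECT word, COALESCE(pos_tag, inferred_pos) AS pos FROM words ORDER BY word"
).fetchall()
print(rows)
```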

Integration with Pipeline

# Pipeline delegates POS inference to DatabasePackager,
# which internally creates and manages POSInferenceManager.
# The Pipeline does NOT access conn/cursor directly.

from myspellchecker.data_pipeline import Pipeline

# During pipeline.run(), the packager stage handles POS inference:
# packager.apply_inferred_pos() is called internally
# which creates POSInferenceManager with the packager's own connection

Best Practices

1. Run After Data Loading

# Correct order in pipeline
pipeline.load_seed_data()      # Load POS seed first
pipeline.load_corpus_data()    # Load corpus
pipeline.apply_pos_inference() # Then infer missing POS

2. Use Appropriate Thresholds

# High-frequency words: stricter thresholds for more reliable tags
manager.apply_inferred_pos(
    min_frequency=10,
    min_confidence=0.7,
)

# Low-frequency words: relaxed thresholds to extend coverage
manager.apply_inferred_pos(
    min_frequency=2,
    min_confidence=0.5,
)

3. Check Coverage After Inference

import logging

logger = logging.getLogger(__name__)

stats = manager.get_pos_coverage_stats()

if stats["no_pos"] > stats["total_words"] * 0.5:
    logger.warning("More than 50% of words have no POS tag")
