Overview
Purpose
During dictionary building, many words lack POS tags because:
- They weren’t in the POS seed data
- They’re domain-specific terms
- They’re newly encountered words
POSInferenceManager Class
Applying POS Inference
Basic Usage
With Options
Parameters
| Parameter | Default | Description |
|---|---|---|
| min_frequency | 0 | Minimum word frequency threshold |
| skip_tagged | True | Skip words with existing pos_tag |
| min_confidence | 0.0 | Minimum confidence for inference |
| in_transaction | False | Don’t commit (caller manages transaction) |
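As a rough illustration of how these four parameters could interact, here is a minimal sketch built on the standard `sqlite3` module. It is not the actual `POSInferenceManager` code: the `words` table, its `frequency` column, the `apply_inference` function, and the `infer` callable are all assumptions made for the example.

```python
import sqlite3

def apply_inference(conn, infer, min_frequency=0, skip_tagged=True,
                    min_confidence=0.0, in_transaction=False):
    """Hypothetical sketch. `infer(word)` returns (pos, confidence, source) or None."""
    query = "SELECT word FROM words WHERE frequency >= ?"
    if skip_tagged:
        query += " AND pos_tag IS NULL"  # leave seed-tagged words alone
    updated = 0
    for (word,) in conn.execute(query, (min_frequency,)).fetchall():
        result = infer(word)
        if result is None or result[1] < min_confidence:
            continue  # no inference, or confidence below the threshold
        conn.execute(
            "UPDATE words SET inferred_pos = ?, inferred_confidence = ?, "
            "inferred_source = ? WHERE word = ?",
            (*result, word),
        )
        updated += 1
    if not in_transaction:
        conn.commit()  # with in_transaction=True the caller commits instead
    return updated

# Assumed table layout, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words (word TEXT PRIMARY KEY, frequency INTEGER, pos_tag TEXT, "
    "inferred_pos TEXT, inferred_confidence REAL, inferred_source TEXT)"
)
conn.executemany(
    "INSERT INTO words (word, frequency, pos_tag) VALUES (?, ?, ?)",
    [("running", 12, None), ("cat", 40, "N"), ("zzz", 1, None)],
)
conn.commit()

def toy_infer(word):
    # Stand-in for a real engine: one made-up suffix rule.
    return ("V", 0.7, "suffix_pattern") if word.endswith("ing") else None

# Caller-managed transaction: nothing is committed, so a rollback undoes the update.
updated = apply_inference(conn, toy_infer, min_frequency=5,
                          min_confidence=0.5, in_transaction=True)
conn.rollback()
after_rollback = conn.execute(
    "SELECT inferred_pos FROM words WHERE word = 'running'").fetchone()[0]

# Default behaviour: the function commits on its own.
apply_inference(conn, toy_infer, min_frequency=5, min_confidence=0.5)
after_commit = conn.execute(
    "SELECT inferred_pos FROM words WHERE word = 'running'").fetchone()[0]
```

Passing `in_transaction=True` is useful when inference is one step in a larger unit of work that should succeed or fail as a whole.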
Return Statistics
Statistics Fields
| Field | Description |
|---|---|
| total_words | Total words processed |
| inferred | Words with successful inference |
| skipped_tagged | Words skipped (already had pos_tag) |
| skipped_low_conf | Words skipped due to low confidence |
| ambiguous | Words with multi-POS (e.g., "N\|V") |
| by_source | Breakdown by inference source |
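To make the fields concrete, the sketch below tallies them from a handful of made-up results. The `summarize` function and its input shape are hypothetical, and the source names other than `suffix_pattern` are invented for the example; only the field names come from the table above.

```python
from collections import Counter

def summarize(results, min_confidence=0.0):
    """Hypothetical tally. `results` holds (existing_pos_tag, inferred) pairs,
    where `inferred` is None or a (pos, confidence, source) tuple."""
    stats = {
        "total_words": 0, "inferred": 0, "skipped_tagged": 0,
        "skipped_low_conf": 0, "ambiguous": 0, "by_source": Counter(),
    }
    for pos_tag, inferred in results:
        stats["total_words"] += 1
        if pos_tag is not None:
            stats["skipped_tagged"] += 1   # already tagged from seed data
            continue
        if inferred is None:
            continue                       # no strategy matched
        pos, confidence, source = inferred
        if confidence < min_confidence:
            stats["skipped_low_conf"] += 1
            continue
        stats["inferred"] += 1
        stats["by_source"][source] += 1
        if "|" in pos:
            stats["ambiguous"] += 1        # multi-POS result
    return stats

sample = [
    ("N", None),                             # seed-tagged word
    (None, ("V", 0.9, "suffix_pattern")),
    (None, ("N|V", 0.6, "ambiguous_registry")),
    (None, ("N", 0.2, "prefix_pattern")),    # below the threshold
    (None, None),                            # nothing inferred
]
stats = summarize(sample, min_confidence=0.5)
```

In this sketch ambiguous words are counted as a subset of `inferred`; whether the real manager counts them that way is an assumption.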
Inference Sources
The POSInferenceEngine uses multiple strategies:
1. Suffix Patterns
2. Prefix Patterns
3. Numeral Detection
4. Proper Noun Patterns
5. Ambiguous Words Registry
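The five strategies can be pictured as a chain of checks that each return a tag, a confidence score, and a source label. The following is a toy sketch only: the English rules, the tag names, the confidence values, the order of the checks, and every source label except `suffix_pattern` are assumptions, not the behaviour of the real `POSInferenceEngine`.

```python
import re

# Made-up rule tables for illustration.
SUFFIX_RULES = [("ness", "N", 0.9), ("ly", "ADV", 0.8), ("ing", "V", 0.7)]
PREFIX_RULES = [("un", "ADJ", 0.5)]
AMBIGUOUS = {"run": "N|V"}

def infer_pos(word):
    """Return (pos, confidence, source) for the first matching strategy, else None."""
    if word.lower() in AMBIGUOUS:                       # ambiguous words registry
        return AMBIGUOUS[word.lower()], 1.0, "ambiguous_registry"
    if re.fullmatch(r"\d+([.,]\d+)?", word):            # numeral detection
        return "NUM", 0.95, "numeral"
    if word[:1].isupper() and word[1:].islower():       # proper noun pattern
        return "PROPN", 0.6, "proper_noun"
    for suffix, pos, conf in SUFFIX_RULES:              # suffix patterns
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return pos, conf, "suffix_pattern"
    for prefix, pos, conf in PREFIX_RULES:              # prefix patterns
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            return pos, conf, "prefix_pattern"
    return None

examples = {w: infer_pos(w) for w in ["kindness", "42", "Paris", "run", "unkind", "xyz"]}
```

Returning the source alongside the tag is what makes a per-source breakdown and a confidence threshold possible downstream.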
POS Coverage Statistics
Check POS tag coverage in the database:
Coverage Calculation
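One plausible way to compute coverage is the share of words that have either a seed tag or an inferred tag. The sketch below assumes a `words` table with `pos_tag` and `inferred_pos` columns; the `pos_coverage` function and the exact formula are illustrative guesses rather than the manager's own implementation.

```python
import sqlite3

def pos_coverage(conn):
    """Hypothetical coverage summary over an assumed `words` table."""
    total, seeded, inferred_only = conn.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(pos_tag IS NOT NULL), 0), "
        "COALESCE(SUM(pos_tag IS NULL AND inferred_pos IS NOT NULL), 0) "
        "FROM words"
    ).fetchone()
    covered = seeded + inferred_only
    return {
        "total": total,
        "with_pos_tag": seeded,
        "with_inferred_only": inferred_only,
        # Guard against division by zero on an empty table.
        "coverage_pct": round(100.0 * covered / total, 1) if total else 0.0,
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, pos_tag TEXT, inferred_pos TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    [("cat", "N", None), ("running", None, "V"), ("run", None, "N|V"), ("zzz", None, None)],
)
coverage = pos_coverage(conn)
```

Reporting seed-tagged and inferred-only counts separately shows how much of the coverage depends on inference.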
Database Schema
The manager updates these columns:
Column Usage
| Column | Description | Example |
|---|---|---|
| pos_tag | From seed data | "N" |
| inferred_pos | From inference | "N\|V" |
| inferred_confidence | Confidence score | 0.85 |
| inferred_source | Inference method | "suffix_pattern" |
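If an existing database predates the inference columns, they could be added with an idempotent migration along these lines. This is a sketch under assumptions: the `words` table name, the column types (`TEXT`, `REAL`, `TEXT`), and the `ensure_inference_columns` helper are not taken from the project.

```python
import sqlite3

# Assumed SQL types for the three inference columns.
INFERENCE_COLUMNS = {
    "inferred_pos": "TEXT",
    "inferred_confidence": "REAL",
    "inferred_source": "TEXT",
}

def ensure_inference_columns(conn, table="words"):
    """Add any missing inference columns; return the names that were added.
    `table` is interpolated into SQL, so pass only trusted identifiers."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in INFERENCE_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    return added

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, pos_tag TEXT)")
conn.execute("INSERT INTO words (word) VALUES ('running')")

first_run = ensure_inference_columns(conn)   # adds all three columns
second_run = ensure_inference_columns(conn)  # nothing left to add

# Store one inference result using the example values from the table above.
conn.execute(
    "UPDATE words SET inferred_pos = ?, inferred_confidence = ?, inferred_source = ? "
    "WHERE word = ?",
    ("V", 0.85, "suffix_pattern", "running"),
)
row = conn.execute(
    "SELECT pos_tag, inferred_pos, inferred_confidence, inferred_source "
    "FROM words WHERE word = 'running'"
).fetchone()
```

Keeping inferred values in their own columns leaves `pos_tag` untouched, so seed data and inference results stay distinguishable.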
Integration with Pipeline
Best Practices
1. Run After Data Loading
2. Use Appropriate Thresholds
3. Check Coverage After Inference
See Also
- POS Tagging - POS tagging overview
- POS Disambiguator - Disambiguation rules
- Schema Management - Database schema
- Data Pipeline - Full pipeline documentation