After ingestion converts raw corpus files into Arrow shards, the processing stage runs normalization and segmentation over every record — breaking continuous Myanmar text into syllables and words that downstream stages can count and index.
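The syllable step can be illustrated with a simplified rule-based breaker (a sketch only, not the library's actual segmenter): a new syllable typically starts at a Myanmar consonant that is neither stacked (preceded by virama U+1039) nor syllable-final (followed by asat U+103A).

```python
import re

# Simplified Myanmar syllable breaker (illustrative sketch, not the
# library's algorithm). A break is inserted before each consonant
# (U+1000-U+1021) that is not the second half of a stacked pair
# (preceded by virama U+1039) and is not syllable-final (followed by
# asat U+103A or virama U+1039).
_SYLLABLE_BREAK = re.compile(r"(?<!\u1039)([\u1000-\u1021])(?![\u103A\u1039])")

def break_syllables(text: str) -> list[str]:
    return _SYLLABLE_BREAK.sub(r" \1", text).split()

print(break_syllables("မြန်မာစာ"))  # ['မြန်', 'မာ', 'စာ']
```

This handles only the common consonant range; a production breaker also covers independent vowels and digits.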

Overview

Arrow Shard Files


┌─────────────────────────┐
│   CorpusSegmenter       │
│  - Normalize text       │
│  - Segment syllables    │
│  - Segment words        │
│  - Parallel processing  │
└─────────────────────────┘


segmented_corpus.arrow

Components

CorpusSegmenter

The CorpusSegmenter processes Arrow shards and produces segmented output:

```python
from myspellchecker.data_pipeline import CorpusSegmenter

segmenter = CorpusSegmenter(
    output_dir="intermediate/",
    word_engine="crf",  # "myword" or "crf"
)

# Segment corpus from Arrow shards
segmented_path = segmenter.segment_corpus(raw_shards_dir)
```
For most use cases, use the Pipeline class instead:

```python
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    word_engine="crf",   # Segmentation engine
    num_workers=4,       # Parallel workers
)

pipeline = Pipeline(config=config)
pipeline.build_database(input_files, database_path)
```

Configuration

```python
from myspellchecker.data_pipeline import PipelineConfig, SegmenterConfig

# Via PipelineConfig (recommended)
config = PipelineConfig(
    word_engine="crf",       # "myword" or "crf"
    num_workers=4,           # Parallel workers (None = auto)
    batch_size=10000,        # Records per batch
)

# Segmenter-specific configuration
segmenter_config = SegmenterConfig(
    batch_size=10000,
    word_engine="crf",
    num_workers=4,
    enable_pos_tagging=True,
    chunk_size=50000,        # Lines per chunk for parallel processing
)
```

Options

| Option | Default | Description |
|---|---|---|
| `num_workers` | `None` | Parallel workers (`None` = auto) |
| `batch_size` | `10000` | Records per batch |
| `word_engine` | `"crf"` | Segmentation engine |
| `enable_pos_tagging` | `True` | Enable POS tagging during segmentation |
| `chunk_size` | `50000` | Lines per chunk for parallel processing |

Segmentation Engines

CRF (PipelineConfig Default)

Conditional Random Fields (CRF) offer a good balance of speed and accuracy. This is the default for PipelineConfig and Pipeline.build_database(); note that the CLI --word-engine flag defaults to "myword" instead.

```python
config = PipelineConfig(
    word_engine="crf",
)
```

MyWord

High-accuracy segmentation using the myword library:

```python
config = PipelineConfig(
    word_engine="myword",
)
```

Comparison

| Engine | Speed | Accuracy | Dependencies |
|---|---|---|---|
| MyWord | Medium | ~98% | `myword` |
| CRF | Fast | ~95% | `sklearn-crfsuite` |
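Both engines solve the same problem: grouping syllables into words. As a concept sketch only (neither engine's actual method; both are statistical models), a greedy longest-match segmenter against a lexicon looks like this:

```python
# Toy longest-match word segmenter over syllable lists (illustrative
# only; the real "myword" and "crf" engines use statistical models).
def max_match(syllables: list[str], lexicon: set[str], max_len: int = 4) -> list[str]:
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words

# Latin placeholders stand in for Myanmar syllables here.
print(max_match(["spell", "check", "er"], {"spellcheck", "checker"}))
# ['spellcheck', 'er']
```

The greedy strategy shows why statistical engines win: it commits to "spellcheck" and can never recover "checker", whereas a CRF scores whole segmentations.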

Parallel Processing

Worker Configuration

Configure parallel workers via PipelineConfig:

```python
# Auto-detect CPU cores
config = PipelineConfig(num_workers=None)

# Manual setting
config = PipelineConfig(num_workers=8)
```

macOS Note

OpenMP requires libomp on macOS:

```bash
brew install libomp
```

Performance Optimization

Batch Size

Larger batches improve throughput at the cost of higher memory use:

```python
# Small files
config = PipelineConfig(batch_size=1000)

# Large files
config = PipelineConfig(batch_size=50000)
```
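Conceptually, batching groups the record stream into fixed-size chunks so each unit of work touches batch_size records at once. A stdlib sketch of that grouping (not the library's internals):

```python
from itertools import islice

# Sketch: split a record stream into batch_size-sized chunks,
# as the batch_size option conceptually does.
def batched(records, batch_size):
    it = iter(records)
    while chunk := list(islice(it, batch_size)):
        yield chunk

print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```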

Memory Management

For memory-constrained environments:

```python
config = PipelineConfig(
    batch_size=5000,  # Smaller batches
    num_workers=2,    # Fewer workers
)
```

Benchmarks

| Batch Size | Workers | Throughput |
|---|---|---|
| 1,000 | 1 | ~10K rec/s |
| 10,000 | 4 | ~50K rec/s |
| 50,000 | 8 | ~100K rec/s |
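Figures like these can be reproduced by timing a run and dividing records by elapsed seconds; a minimal harness sketch (the benchmark methodology above is assumed, not documented):

```python
import time

# Sketch: measure processing throughput in records/second.
def throughput(process, records) -> float:
    start = time.perf_counter()
    count = 0
    for record in records:
        process(record)
        count += 1
    return count / (time.perf_counter() - start)

print(f"{throughput(str.upper, ['a'] * 10000):.0f} rec/s")
```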

Integration with Pipeline

```python
from myspellchecker.data_pipeline import Pipeline, PipelineConfig

config = PipelineConfig(
    word_engine="crf",
    num_workers=4,
    batch_size=10000,
)

pipeline = Pipeline(config=config)
pipeline.build_database(
    input_files=["corpus.txt"],
    database_path="dict.db",
)
```

See Also