## Overview

## Components

### CorpusSegmenter
The `CorpusSegmenter` processes Arrow shards and produces segmented output.
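The code example that originally followed here appears to have been lost in extraction. As a stand-in, a toy sketch of the shard-in, token-column-out shape — the real `CorpusSegmenter` API is not shown on this page, and `segment_shard` plus the whitespace tokenizer are placeholders:

```python
# Toy placeholder, NOT the real CorpusSegmenter API: takes a shard of
# records and returns records with an added "tokens" column.
def segment_shard(records, segment=str.split):
    return [{**rec, "tokens": segment(rec["text"])} for rec in records]

shard = [
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "one two three"},
]
print(segment_shard(shard)[1]["tokens"])  # ['one', 'two', 'three']
```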
### Segmentation via Pipeline (Recommended)
For most use cases, use the `Pipeline` class.
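The snippet that followed here was also stripped. The sketch below defines a minimal stand-in so it runs on its own: `Pipeline` and `build_database` are named elsewhere on this page, but the constructor and argument names are assumptions.

```python
# Minimal stand-in for the Pipeline class described on this page; the
# real class would segment every Arrow shard and write a database.
class Pipeline:
    def __init__(self, config=None):
        self.config = config  # e.g. a PipelineConfig instance

    def build_database(self, input_dir, output_dir):
        # Placeholder body; the real method does the heavy lifting.
        return f"segmented {input_dir} -> {output_dir}"

pipeline = Pipeline()
print(pipeline.build_database("shards/", "db/"))
```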
## Configuration

### Options
| Option | Default | Description |
|---|---|---|
| `num_workers` | `None` | Parallel workers (`None` = auto) |
| `batch_size` | `10000` | Records per batch |
| `word_engine` | `"crf"` | Segmentation engine |
| `enable_pos_tagging` | `True` | Enable POS tagging during segmentation |
| `chunk_size` | `50000` | Lines per chunk for parallel processing |
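The options above can be mirrored as a dataclass. This is an illustrative stand-in for `PipelineConfig`, not its actual definition; only the field names and defaults come from the table.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in mirroring the options table; the real
# PipelineConfig may define more fields or different types.
@dataclass
class PipelineConfig:
    num_workers: Optional[int] = None  # None = choose automatically
    batch_size: int = 10_000           # records per batch
    word_engine: str = "crf"           # "crf" or "myword"
    enable_pos_tagging: bool = True
    chunk_size: int = 50_000           # lines per chunk

config = PipelineConfig(word_engine="myword", num_workers=4)
print(config.word_engine, config.batch_size)  # myword 10000
```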
## Segmentation Engines

### CRF (PipelineConfig Default)
Conditional Random Fields offer a good balance of speed and accuracy. This is the default for `PipelineConfig` and `Pipeline.build_database()`; the CLI `--word-engine` flag, by contrast, defaults to `"myword"`.
### MyWord (Recommended)
High-accuracy segmentation using the myword library.

### Comparison
| Engine | Speed | Accuracy | Dependencies |
|---|---|---|---|
| MyWord | Medium | ~98% | myword |
| CRF | Fast | ~95% | sklearn-crfsuite |
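Independent of which engine wins, the `word_engine` option implies a dispatch on the engine name. Everything below is a hypothetical sketch: the placeholders do whitespace splitting, not real CRF or myword segmentation, and the wiring is not taken from this page.

```python
# Hypothetical dispatch on the word_engine option; both entries are
# placeholders standing in for the real CRF and myword segmenters.
ENGINES = {
    "crf": lambda text: text.split(),
    "myword": lambda text: text.split(),
}

def segment(text, word_engine="crf"):
    try:
        return ENGINES[word_engine](text)
    except KeyError:
        raise ValueError(f"unknown word_engine: {word_engine!r}")

print(segment("a b c", word_engine="myword"))  # ['a', 'b', 'c']
```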
## Parallel Processing

### Worker Configuration
Configure parallel workers via `PipelineConfig`.
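The configuration snippet here was lost, but the worker model itself can be illustrated generically: the corpus is cut into chunks and each worker segments one chunk. The demo uses threads to stay self-contained; a CPU-bound segmenter would normally use processes instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Generic illustration of chunked parallel segmentation; not library
# code. Each worker call segments one chunk of lines.
def segment_chunk(chunk):
    return [line.split() for line in chunk]

lines = ["a b", "c d e", "f g"] * 4  # 12 lines
chunk_size = 3
chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(segment_chunk, chunks))

tokens = [tok for chunk in results for line in chunk for tok in line]
print(len(tokens))  # 28
```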
### macOS Note
OpenMP requires `libomp` on macOS; on Homebrew systems, `brew install libomp` provides it.

## Performance Optimization
### Batch Size
Larger batches improve throughput by amortizing per-batch overhead.

### Memory Management
For memory-constrained environments, reduce `batch_size` and `chunk_size` and run fewer workers.

### Benchmarks
| Batch Size | Workers | Throughput |
|---|---|---|
| 1,000 | 1 | ~10K rec/s |
| 10,000 | 4 | ~50K rec/s |
| 50,000 | 8 | ~100K rec/s |
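The batch-size effect behind the table can be sketched generically: for a fixed record count, larger batches mean fewer worker dispatches, so less per-dispatch overhead. The `batches` helper below is illustrative, not library code.

```python
# Illustrative helper: group records into fixed-size batches so each
# dispatch to a worker amortizes its per-call overhead.
def batches(records, batch_size):
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

records = list(range(25_000))
print(sum(1 for _ in batches(records, 1_000)))   # 25 dispatches
print(sum(1 for _ in batches(records, 10_000)))  # 3 dispatches
```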
## Integration with Pipeline

## See Also
- Data Pipeline - Pipeline overview
- Building Stage - Next stage
- Cython Guide - Cython internals