text

Quick Reference

Task	Load
Creative writing (voice, dialogue, POV)	`writing.md`
Data processing (CSV, regex, encoding)	`data.md`
Academic/citations (APA, MLA, Chicago)	`academic.md`
Marketing copy (headlines, CTA, email)	`copy.md`
Translation/localization	`localization.md`

Universal Text Rules

Encoding

Always verify encoding first: file -bi document.txt
Normalize line endings: tr -d 'r'
Remove BOM if present: sed -i '1s/^xEFxBBxBF//'

Whitespace

Collapse multiple spaces: sed 's/[[:space:]]+/ /g'
Trim leading/trailing: sed 's/^[[:space:]]*//;s/[[:space:]]*$//'

Common Traps

Smart quotes (" ") break parsers → normalize to "
Em/en dashes (– —) break ASCII → normalize to -
Zero-width chars invisible but break comparisons → strip them
String length ≠ byte length in UTF-8 ("café" = 4 chars, 5 bytes)

Format Detection

# Detect encoding
file -I document.txt

# Detect line endings
cat -A document.txt | head -1
# ^M at end = Windows (CRLF)
# No ^M = Unix (LF)

# Detect delimiter (CSV/TSV)
head -1 file | tr -cd ',;t|' | wc -c

Quick Transformations

Task	Command
Lowercase	`tr '[:upper:]' '[:lower:]'`
Remove punctuation	`tr -d '[:punct:]'`
Count words	`wc -w`
Count unique lines	`sort -u \| wc -l`
Find duplicates	`sort \| uniq -d`
Extract emails	`grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}'`
Extract URLs	`grep -oE 'https?://[^[:space:]<>"{}\|\^`[]]+'`

Before Processing Checklist

[ ] Encoding verified (UTF-8?)
[ ] Line endings normalized
[ ] Delimiter identified (for structured text)
[ ] Target format/style defined
[ ] Edge cases considered (empty, Unicode, special chars)

Installation

Quick Reference

Universal Text Rules

Encoding

Whitespace

Common Traps

Format Detection

Quick Transformations

Before Processing Checklist