Architecture — Menu-Whisper
This document describes the high-level architecture and module organization for Menu-Whisper, a macOS offline speech-to-text application.
Overview
Menu-Whisper follows a modular architecture with clear separation of concerns between UI, audio processing, speech recognition, text injection, and system integration components.
System Architecture
┌─────────────────────────────────────────────────────────┐
│ App Layer │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ MenuBarExtra │ │ HUD Panel │ │ Preferences │ │
│ │ (SwiftUI) │ │ (SwiftUI) │ │ (SwiftUI) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Core Modules │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Audio │ │ STT │ │ Injection │ │
│ │ AVAudioEngine │ │ whisper.cpp │ │ Clipboard │ │
│ │ RMS/Peak │ │ Core ML │ │ Typing │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Models │ │ Permissions │ │ Settings │ │
│ │ Management │ │ Microphone │ │ UserDefaults│ │
│ │ Downloads │ │ Accessibility │ │ JSON Export │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ System Integration │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Global Hotkeys │ │ Secure Input │ │ Utils │ │
│ │ Carbon │ │ Detection │ │ Helpers │ │
│ │ RegisterHotKey │ │ CGEvent API │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
Module Descriptions
App Layer
- MenuBarExtra: SwiftUI-based menu bar interface using MenuBarExtra on macOS 13+ (see the sketch after this list)
- HUD Panel: Non-activating NSPanel for "Listening" and "Processing" states
- Preferences: Settings window with model management, hotkey configuration, etc.
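A minimal sketch of how the menu bar entry point could be wired with the MenuBarExtra scene; the AppState type, its members, and the menu items are illustrative assumptions, not code from the repository.

```swift
import SwiftUI
import AppKit

// Hypothetical entry point showing the MenuBarExtra scene (macOS 13+).
// AppState and its members are illustrative, not taken from the codebase.
@main
struct MenuWhisperApp: App {
    @StateObject private var appState = AppState()

    var body: some Scene {
        MenuBarExtra("Menu-Whisper", systemImage: "mic.fill") {
            Button(appState.isListening ? "Stop Dictation" : "Start Dictation") {
                appState.toggleDictation()
            }
            Divider()
            Button("Quit Menu-Whisper") {
                NSApplication.shared.terminate(nil)
            }
        }
    }
}

final class AppState: ObservableObject {
    @Published var isListening = false
    func toggleDictation() { isListening.toggle() }
}
```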
Core Modules
Core/Audio
Purpose: Audio capture and real-time processing
- AVAudioEngine integration for microphone input
- Real-time RMS/peak computation for visual feedback
- Audio format conversion (16kHz mono PCM for STT)
- Dictation time limits and session management
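The capture path could look roughly like this: a tap on the input node feeds an RMS level to the HUD and resampled 16 kHz mono frames to the STT engine. AudioCapture, onLevel, and onBuffer are hypothetical names; the real module likely differs in buffering details.

```swift
import AVFoundation
import Accelerate

// Sketch of the capture path: tap the input node, compute RMS for the HUD
// level meter, and convert to 16 kHz mono Float32 for the STT engine.
final class AudioCapture {
    private let engine = AVAudioEngine()
    var onLevel: ((Float) -> Void)?              // RMS level for visual feedback
    var onBuffer: ((AVAudioPCMBuffer) -> Void)?  // 16 kHz mono frames for STT

    func start() throws {
        let input = engine.inputNode
        let inputFormat = input.outputFormat(forBus: 0)
        let sttFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                      sampleRate: 16_000,
                                      channels: 1,
                                      interleaved: false)!
        let converter = AVAudioConverter(from: inputFormat, to: sttFormat)!

        input.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { [weak self] buffer, _ in
            guard let self, let samples = buffer.floatChannelData?[0] else { return }

            // RMS over the buffer drives the "Listening" HUD animation.
            var rms: Float = 0
            vDSP_rmsqv(samples, 1, &rms, vDSP_Length(buffer.frameLength))
            self.onLevel?(rms)

            // Resample to the 16 kHz mono format whisper.cpp expects.
            let ratio = sttFormat.sampleRate / inputFormat.sampleRate
            let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio)
            guard let out = AVAudioPCMBuffer(pcmFormat: sttFormat, frameCapacity: capacity) else { return }
            var consumed = false
            _ = converter.convert(to: out, error: nil) { _, status in
                if consumed { status.pointee = .noDataNow; return nil }
                consumed = true
                status.pointee = .haveData
                return buffer
            }
            self.onBuffer?(out)
        }
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```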
Core/STT
Purpose: Speech-to-text processing with multiple backends
- WhisperCPP: Primary backend using whisper.cpp with Metal acceleration
- CoreML: Future backend for Core ML models (Phase 6)
- STTEngine protocol for backend abstraction
- Language detection and text normalization
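One plausible shape for the backend abstraction, assuming an async transcription API; the exact protocol requirements in the codebase may differ.

```swift
import Foundation

// Hedged sketch of the STTEngine abstraction; method names are assumptions.
protocol STTEngine {
    /// Load model weights from disk (lazily, before the first transcription).
    func loadModel(at url: URL) throws
    /// Transcribe 16 kHz mono PCM samples; nil language means auto-detect.
    func transcribe(samples: [Float], language: String?) async throws -> String
    /// Release model memory when idle.
    func unload()
}

// Backends (whisper.cpp today, Core ML in Phase 6) conform to the protocol,
// so the rest of the app stays agnostic to the inference engine in use.
final class WhisperCPPEngine: STTEngine {
    func loadModel(at url: URL) throws { /* whisper_init_from_file(...) */ }
    func transcribe(samples: [Float], language: String?) async throws -> String {
        /* run whisper_full(...) over the samples and collect segments */
        return ""
    }
    func unload() { /* whisper_free(...) */ }
}
```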
Core/Models
Purpose: Model catalog, downloads, and management
- Curated model catalog (JSON-based)
- Download management with progress tracking
- SHA256 verification and integrity checks
- Local storage in ~/Library/Application Support/TellMe/Models
- Model selection and metadata management
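For the integrity check, a streaming SHA256 comparison along these lines would work; verifyModel is an illustrative name, with the expected digest coming from the JSON catalog entry.

```swift
import Foundation
import CryptoKit

// Stream the downloaded file through SHA256 and compare against the
// checksum from the model catalog. Hashing in chunks bounds memory use
// even for multi-gigabyte model files.
func verifyModel(at fileURL: URL, expectedSHA256: String) throws -> Bool {
    let handle = try FileHandle(forReadingFrom: fileURL)
    defer { try? handle.close() }

    var hasher = SHA256()
    while let chunk = try handle.read(upToCount: 1 << 20), !chunk.isEmpty {
        hasher.update(data: chunk)  // hash in 1 MiB chunks
    }
    let digest = hasher.finalize().map { String(format: "%02x", $0) }.joined()
    return digest == expectedSHA256.lowercased()
}
```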
Core/Injection
Purpose: Text insertion into focused applications
- Clipboard-based insertion (preferred method)
- Character-by-character typing fallback
- Secure Input detection and handling
- Cross-application compatibility layer
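The clipboard path could be sketched as follows: back up the pasteboard, write the transcription, synthesize Cmd+V with CGEvent, then restore. The 0.3 s restore delay and the single-type backup are simplifying assumptions.

```swift
import AppKit
import Carbon.HIToolbox  // for kVK_ANSI_V

// Sketch of clipboard-based insertion into the focused application.
func insertViaClipboard(_ text: String) {
    let pasteboard = NSPasteboard.general
    let saved = pasteboard.string(forType: .string)  // naive single-type backup

    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    // Synthesize Cmd+V into the frontmost application.
    let source = CGEventSource(stateID: .combinedSessionState)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: CGKeyCode(kVK_ANSI_V), keyDown: true)
    keyDown?.flags = .maskCommand
    let keyUp = CGEvent(keyboardEventSource: source, virtualKey: CGKeyCode(kVK_ANSI_V), keyDown: false)
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)

    // Restore the previous clipboard contents once the paste has landed.
    DispatchQueue.main.asyncAfter(deadline: .now() + 0.3) {
        pasteboard.clearContents()
        if let saved { pasteboard.setString(saved, forType: .string) }
    }
}
```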
Core/Permissions
Purpose: System permission management and onboarding
- Microphone access (AVCaptureDevice authorization)
- Accessibility permissions for text injection
- Input Monitoring permissions for global hotkeys
- Permission status checking and guidance flows
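A sketch of the status probes the onboarding flow relies on. On macOS the microphone check goes through AVCaptureDevice (AVAudioSession is not available on macOS), and Accessibility through AXIsProcessTrustedWithOptions.

```swift
import AVFoundation
import ApplicationServices

enum PermissionStatus { case granted, denied, undetermined }

func microphoneStatus() -> PermissionStatus {
    switch AVCaptureDevice.authorizationStatus(for: .audio) {
    case .authorized:    return .granted
    case .notDetermined: return .undetermined
    default:             return .denied
    }
}

func accessibilityGranted(prompting: Bool = false) -> Bool {
    // Passing the prompt option shows the system dialog that deep-links to
    // System Settings > Privacy & Security > Accessibility.
    let options = [kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: prompting] as CFDictionary
    return AXIsProcessTrustedWithOptions(options)
}

// Input Monitoring (for the global hotkey) can be probed analogously via
// IOHIDCheckAccess(kIOHIDRequestTypeListenEvent) from IOKit.
```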
Core/Settings
Purpose: User preferences and configuration persistence
- UserDefaults-based storage
- JSON export/import functionality
- Settings validation and migration
- Configuration change notifications
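A minimal sketch of persistence plus JSON export/import; the Settings fields and the defaults key are placeholders for the real schema.

```swift
import Foundation

// Hypothetical settings schema; the actual fields live in Core/Settings.
struct Settings: Codable {
    var hotkeyMode: String = "pushToTalk"
    var selectedModel: String = "small"
    var language: String? = nil
    var insertionMethod: String = "clipboard"
}

enum SettingsStore {
    private static let key = "tellme.settings"

    static func save(_ settings: Settings) throws {
        let data = try JSONEncoder().encode(settings)
        UserDefaults.standard.set(data, forKey: key)
    }

    static func load() -> Settings {
        guard let data = UserDefaults.standard.data(forKey: key),
              let settings = try? JSONDecoder().decode(Settings.self, from: data)
        else { return Settings() }  // fall back to defaults on first run / bad data
        return settings
    }

    static func export(to url: URL) throws {
        let encoder = JSONEncoder()
        encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
        try encoder.encode(load()).write(to: url)
    }

    static func `import`(from url: URL) throws {
        let settings = try JSONDecoder().decode(Settings.self, from: Data(contentsOf: url))
        try save(settings)  // validation/migration would hook in here
    }
}
```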
System Integration
Global Hotkeys
- Carbon framework integration (RegisterEventHotKey)
- Push-to-talk and toggle modes
- Hotkey conflict detection and user guidance
- Cross-application hotkey handling
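Registration could look like this, assuming an arbitrary example binding (Cmd+Shift+Space); real code would persist the user's choice and detect conflicts.

```swift
import Carbon

// Sketch of RegisterEventHotKey usage; error handling is omitted.
final class HotkeyManager {
    private var hotKeyRef: EventHotKeyRef?

    func register() {
        // Install a handler for hot-key-pressed events on the application target.
        var eventType = EventTypeSpec(eventClass: OSType(kEventClassKeyboard),
                                      eventKind: UInt32(kEventHotKeyPressed))
        InstallEventHandler(GetApplicationEventTarget(), { _, _, _ in
            // Dispatch into the app's state machine (Idle -> Listening).
            NotificationCenter.default.post(name: .init("HotkeyPressed"), object: nil)
            return noErr
        }, 1, &eventType, nil, nil)

        // Example binding: Cmd+Shift+Space.
        let hotKeyID = EventHotKeyID(signature: OSType(0x544C_4D45), id: 1)  // 'TLME'
        RegisterEventHotKey(UInt32(kVK_Space),
                            UInt32(cmdKey | shiftKey),
                            hotKeyID,
                            GetApplicationEventTarget(),
                            0,
                            &hotKeyRef)
    }

    func unregister() {
        if let ref = hotKeyRef { UnregisterEventHotKey(ref) }
    }
}
```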
Secure Input Detection
- IsSecureEventInputEnabled() monitoring
- Safe fallback behavior (clipboard-only)
- User notification for secure contexts
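A simple polling probe, assuming a one-second timer is acceptable; the real implementation may check on different triggers. When a password field or other secure context owns the keyboard, typing simulation is disabled and insertion falls back to clipboard-only.

```swift
import Foundation
import Carbon

// Poll IsSecureEventInputEnabled() and report changes to the caller.
final class SecureInputMonitor {
    private var timer: Timer?
    var onChange: ((Bool) -> Void)?

    func start() {
        timer = Timer.scheduledTimer(withTimeInterval: 1.0, repeats: true) { [weak self] _ in
            self?.onChange?(IsSecureEventInputEnabled())
        }
    }

    func stop() { timer?.invalidate() }
}
```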
Utils
- Shared utilities and helper functions
- Logging infrastructure (opt-in local logs)
- Error handling and user feedback
Data Flow
Main Operational Flow
User Hotkey → Audio Capture → STT Processing → Text Injection
▲ │ │ │
│ ▼ ▼ ▼
Hotkey Mgr Audio Buffer Model Engine Injection Mgr
│ RMS/Peak whisper.cpp Clipboard/Type
│ │ │ │
▼ ▼ ▼ ▼
HUD UI Visual Feedback Processing UI Target App
State Management
The application follows a finite state machine pattern:
- Idle: Waiting for user input
- Listening: Capturing audio with visual feedback
- Processing: Running STT inference
- Injecting: Inserting text into target application
- Error: Handling and displaying errors
Finite State Machine
┌─────────────┐
│ Idle │◄─────────────┐
└─────────────┘ │
│ │
│ Hotkey Press │ Success/Error
▼ │
┌─────────────┐ │
│ Listening │ │
└─────────────┘ │
│ │
│ Stop/Timeout │
▼ │
┌─────────────┐ │
│ Processing │ │
└─────────────┘ │
│ │
│ STT Complete │
▼ │
┌─────────────┐ │
│ Injecting │──────────────┘
└─────────────┘
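The same machine expressed as a Swift enum with explicit transitions; the type and event names are illustrative.

```swift
// Sketch of the state machine from the diagram above. Invalid
// (state, event) pairs are ignored rather than crashing.
enum DictationState {
    case idle, listening, processing, injecting
    case error(String)
}

struct DictationStateMachine {
    private(set) var state: DictationState = .idle

    enum Event {
        case hotkeyPressed, stopOrTimeout, sttCompleted, finished, acknowledged
        case failed(String)
    }

    mutating func handle(_ event: Event) {
        switch (state, event) {
        case (.idle, .hotkeyPressed):      state = .listening
        case (.listening, .stopOrTimeout): state = .processing
        case (.processing, .sttCompleted): state = .injecting
        case (.injecting, .finished):      state = .idle
        case (_, .failed(let message)):    state = .error(message)
        case (.error, .acknowledged):      state = .idle
        default: break  // ignore events that are invalid in the current state
        }
    }
}
```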
Technology Stack
Core Technologies
- Swift 5.9+: Primary development language
- SwiftUI: User interface framework
- AppKit: macOS-specific UI components (NSStatusItem, NSPanel)
- AVFoundation: Audio capture and processing
- Carbon: Global hotkey registration
External Dependencies
- whisper.cpp: C/C++ speech recognition engine with Metal support
- Swift Package Manager: Dependency management and build system
Platform Integration
- UserDefaults: Settings persistence
- NSPasteboard: Clipboard operations
- CGEvent: Low-level input simulation
- URLSession: Model downloads
Build System
The project uses Swift Package Manager with modular targets:
TellMe/
├── Package.swift # SPM configuration
├── Sources/
│ ├── App/ # Main application target
│ ├── CoreAudio/ # Audio processing module
│ ├── CoreSTT/ # Speech-to-text engines
│ ├── CoreModels/ # Model management
│ ├── CoreInjection/ # Text insertion
│ ├── CorePermissions/ # System permissions
│ ├── CoreSettings/ # User preferences
│ └── CoreUtils/ # Shared utilities
├── Resources/ # Assets, localizations
└── Tests/ # Unit and integration tests
Security Considerations
Privacy
- All audio processing occurs locally
- No telemetry or data collection
- Optional local logging with user consent
System Security
- Respects Secure Input contexts
- Requires explicit user permission grants
- Code signing and notarization for distribution
Input Safety
- Validates all user inputs
- Safe handling of special characters in typing mode
- Proper escaping for different keyboard layouts
Performance Characteristics
Target Metrics
- Latency: <4s additional processing time for 10s audio (M1 + small model)
- Memory: ~1.5-2.5GB with small model
- Model Loading: Lazy loading with warm cache
- UI Responsiveness: Non-blocking background processing
Optimization Strategies
- Metal acceleration for STT inference
- Efficient audio buffering and streaming
- Model reuse across dictation sessions
- Configurable threading for CPU-intensive operations
Future Extensibility
The modular architecture supports future enhancements:
- Additional STT backends (Core ML, cloud services)
- Voice Activity Detection (VAD)
- Advanced audio preprocessing
- Custom insertion rules per application
- Plugin architecture for text processing
This architecture provides a solid foundation for the MVP while maintaining flexibility for future feature additions and platform evolution.