Initial commit

This commit is contained in:
Felipe M 2025-09-18 19:56:06 +02:00
commit 1db16227b2
Signed by: fmartingr
GPG key ID: CCFBC5637D4000A8
31 changed files with 2175 additions and 0 deletions

Docs/ARCHITECTURE.md Normal file

@@ -0,0 +1,243 @@
# Architecture — Menu-Whisper
This document describes the high-level architecture and module organization for Menu-Whisper, a macOS offline speech-to-text application.
## Overview
Menu-Whisper follows a modular architecture with clear separation of concerns between UI, audio processing, speech recognition, text injection, and system integration components.
## System Architecture
```
┌─────────────────────────────────────────────────────────┐
│ App Layer │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ MenuBarExtra │ │ HUD Panel │ │ Preferences │ │
│ │ (SwiftUI) │ │ (SwiftUI) │ │ (SwiftUI) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Core Modules │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Audio │ │ STT │ │ Injection │ │
│ │ AVAudioEngine │ │ whisper.cpp │ │ Clipboard │ │
│ │ RMS/Peak │ │ Core ML │ │ Typing │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Models │ │ Permissions │ │ Settings │ │
│ │ Management │ │ Microphone │ │ UserDefaults│ │
│ │ Downloads │ │ Accessibility │ │ JSON Export │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ System Integration │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Global Hotkeys │ │ Secure Input │ │ Utils │ │
│ │ Carbon │ │ Detection │ │ Helpers │ │
│ │ RegisterHotKey │ │ CGEvent API │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
```
## Module Descriptions
### App Layer
- **MenuBarExtra**: SwiftUI-based menu bar interface using `MenuBarExtra` for macOS 13+
- **HUD Panel**: Non-activating NSPanel for "Listening" and "Processing" states
- **Preferences**: Settings window for model management, hotkey configuration, and related preferences
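A minimal sketch of the menu bar entry point using `MenuBarExtra`; the `MenuWhisperApp`/`AppState` names and the menu items are illustrative assumptions, not the final UI:
```swift
import SwiftUI
import AppKit

// Hypothetical app state; the real App Layer will expose richer state.
final class AppState: ObservableObject {
    @Published var isListening = false
    func toggleDictation() { isListening.toggle() }
}

@main
struct MenuWhisperApp: App {
    @StateObject private var appState = AppState()

    var body: some Scene {
        // MenuBarExtra requires macOS 13+, matching the App Layer description.
        MenuBarExtra("Menu-Whisper", systemImage: appState.isListening ? "mic.fill" : "mic") {
            Button(appState.isListening ? "Stop Dictation" : "Start Dictation") {
                appState.toggleDictation()
            }
            Divider()
            Button("Quit Menu-Whisper") { NSApp.terminate(nil) }
        }
    }
}
```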
### Core Modules
#### Core/Audio
**Purpose**: Audio capture and real-time processing
- AVAudioEngine integration for microphone input
- Real-time RMS/peak computation for visual feedback
- Audio format conversion (16kHz mono PCM for STT)
- Dictation time limits and session management
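A sketch of the capture path under these constraints, assuming a hypothetical `AudioCapture` type and callback names; only the 16 kHz mono PCM target and the RMS feedback come from this document:
```swift
import AVFoundation

final class AudioCapture {
    private let engine = AVAudioEngine()

    /// RMS level for the HUD meter.
    var onLevel: ((Float) -> Void)?
    /// 16 kHz mono Float32 samples for the STT engine.
    var onSamples: (([Float]) -> Void)?

    func start() throws {
        let input = engine.inputNode
        let inFormat = input.outputFormat(forBus: 0)
        guard let outFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                            sampleRate: 16_000, channels: 1,
                                            interleaved: false),
              let converter = AVAudioConverter(from: inFormat, to: outFormat) else {
            throw NSError(domain: "MenuWhisper.Audio", code: -1, userInfo: nil)
        }

        input.installTap(onBus: 0, bufferSize: 4096, format: inFormat) { [weak self] buffer, _ in
            guard let self, let ch = buffer.floatChannelData?[0] else { return }
            let n = Int(buffer.frameLength)

            // Real-time RMS for visual feedback.
            var sum: Float = 0
            for i in 0..<n { sum += ch[i] * ch[i] }
            self.onLevel?(n > 0 ? (sum / Float(n)).squareRoot() : 0)

            // Convert to 16 kHz mono for the STT engine.
            let ratio = outFormat.sampleRate / inFormat.sampleRate
            guard let out = AVAudioPCMBuffer(pcmFormat: outFormat,
                                             frameCapacity: AVAudioFrameCount(Double(n) * ratio) + 1) else { return }
            var fed = false
            _ = converter.convert(to: out, error: nil) { _, status in
                if fed { status.pointee = .noDataNow; return nil }
                fed = true
                status.pointee = .haveData
                return buffer
            }
            if let mono = out.floatChannelData?[0] {
                self.onSamples?(Array(UnsafeBufferPointer(start: mono, count: Int(out.frameLength))))
            }
        }
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```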
#### Core/STT
**Purpose**: Speech-to-text processing with multiple backends
- **WhisperCPP**: Primary backend using whisper.cpp with Metal acceleration
- **CoreML**: Future backend for Core ML models (Phase 6)
- `STTEngine` protocol for backend abstraction
- Language detection and text normalization
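A possible shape for the `STTEngine` abstraction; field and method names are assumptions, while the backend split (whisper.cpp now, Core ML in Phase 6) comes from this document:
```swift
import Foundation

struct TranscriptionOptions {
    var language: String?            // nil = auto-detect
    var normalizeText: Bool = true
}

struct TranscriptionResult {
    var text: String
    var detectedLanguage: String?
}

protocol STTEngine: AnyObject {
    /// Loads model weights (lazily, reused across dictation sessions).
    func loadModel(at url: URL) throws
    /// Transcribes 16 kHz mono Float32 PCM samples off the main thread.
    func transcribe(samples: [Float], options: TranscriptionOptions) async throws -> TranscriptionResult
    /// Frees model memory, e.g. after a long idle period.
    func unload()
}

// WhisperCPPEngine (wrapping whisper.cpp with Metal) conforms today;
// a CoreMLEngine can conform later without touching callers.
```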
#### Core/Models
**Purpose**: Model catalog, downloads, and management
- Curated model catalog (JSON-based)
- Download management with progress tracking
- SHA256 verification and integrity checks
- Local storage in `~/Library/Application Support/MenuWhisper/Models`
- Model selection and metadata management
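A sketch of the download-and-verify step; the `ModelInfo` fields and `installModel` name are assumptions, while the storage path matches the one above:
```swift
import Foundation
import CryptoKit

struct ModelInfo: Decodable {
    let name: String
    let downloadURL: URL
    let sha256: String               // expected digest from the curated catalog
}

enum ModelError: Error { case checksumMismatch }

func installModel(_ model: ModelInfo) async throws -> URL {
    // Download to a temporary file first.
    let (tempURL, _) = try await URLSession.shared.download(from: model.downloadURL)

    // Verify SHA256 before accepting the file. (For multi-GB models a
    // streaming hash over file chunks is preferable to loading it all at once.)
    let data = try Data(contentsOf: tempURL)
    let digest = SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
    guard digest == model.sha256.lowercased() else { throw ModelError.checksumMismatch }

    // Move into ~/Library/Application Support/MenuWhisper/Models.
    let support = try FileManager.default.url(for: .applicationSupportDirectory,
                                              in: .userDomainMask,
                                              appropriateFor: nil, create: true)
    let modelsDir = support.appendingPathComponent("MenuWhisper/Models", isDirectory: true)
    try FileManager.default.createDirectory(at: modelsDir, withIntermediateDirectories: true)
    let destination = modelsDir.appendingPathComponent(model.downloadURL.lastPathComponent)
    try? FileManager.default.removeItem(at: destination)   // replace an older copy if present
    try FileManager.default.moveItem(at: tempURL, to: destination)
    return destination
}
```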
#### Core/Injection
**Purpose**: Text insertion into focused applications
- Clipboard-based insertion (preferred method)
- Character-by-character typing fallback
- Secure Input detection and handling
- Cross-application compatibility layer
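A sketch of the preferred clipboard path: place the transcript on the pasteboard, synthesize Cmd+V, then restore the previous contents. The paste shortcut, restore delay, and function name are assumptions:
```swift
import AppKit
import Carbon.HIToolbox

func injectViaClipboard(_ text: String) {
    let pasteboard = NSPasteboard.general
    let previous = pasteboard.string(forType: .string)   // best-effort restore

    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    // Synthesize Cmd+V into the focused application (requires Accessibility).
    let source = CGEventSource(stateID: .combinedSessionState)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: CGKeyCode(kVK_ANSI_V), keyDown: true)
    let keyUp = CGEvent(keyboardEventSource: source, virtualKey: CGKeyCode(kVK_ANSI_V), keyDown: false)
    keyDown?.flags = .maskCommand
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)

    // Restore the user's pasteboard after the target app has read it.
    DispatchQueue.main.asyncAfter(deadline: .now() + 0.3) {
        if let previous {
            pasteboard.clearContents()
            pasteboard.setString(previous, forType: .string)
        }
    }
}
```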
#### Core/Permissions
**Purpose**: System permission management and onboarding
- Microphone access (AVCaptureDevice authorization)
- Accessibility permissions for text injection
- Input Monitoring permissions for global hotkeys
- Permission status checking and guidance flows
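A sketch of the permission checks behind the onboarding flow; the `Permissions` wrapper is an assumption, while the underlying calls are standard macOS APIs:
```swift
import AVFoundation
import ApplicationServices

enum Permissions {
    /// Microphone: prompts the user on first request.
    static func requestMicrophone() async -> Bool {
        await AVCaptureDevice.requestAccess(for: .audio)
    }

    /// Accessibility: required to post the synthetic keystrokes used for injection.
    static func checkAccessibility(promptIfNeeded: Bool) -> Bool {
        let options = [kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: promptIfNeeded] as CFDictionary
        return AXIsProcessTrustedWithOptions(options)
    }

    // Input Monitoring status (if an event tap is used for hotkeys) can be
    // probed separately via IOKit's IOHIDCheckAccess.
}
```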
#### Core/Settings
**Purpose**: User preferences and configuration persistence
- UserDefaults-based storage
- JSON export/import functionality
- Settings validation and migration
- Configuration change notifications
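A sketch of the settings store: a `Codable` struct persisted in UserDefaults with JSON export/import. Field names and the notification name are assumptions:
```swift
import Foundation

struct Settings: Codable {
    var hotkeyMode: String = "push-to-talk"
    var selectedModel: String = "small"
    var maxDictationSeconds: Int = 60
}

final class SettingsStore {
    private let key = "MenuWhisper.Settings"
    private let defaults = UserDefaults.standard

    var current: Settings {
        get {
            guard let data = defaults.data(forKey: key),
                  let settings = try? JSONDecoder().decode(Settings.self, from: data) else {
                return Settings()          // fall back to defaults
            }
            return settings
        }
        set {
            if let data = try? JSONEncoder().encode(newValue) {
                defaults.set(data, forKey: key)
                NotificationCenter.default.post(name: .init("MenuWhisper.settingsChanged"), object: nil)
            }
        }
    }

    /// JSON export for backup or sharing.
    func export(to url: URL) throws {
        let encoder = JSONEncoder()
        encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
        try encoder.encode(current).write(to: url)
    }

    func `import`(from url: URL) throws {
        current = try JSONDecoder().decode(Settings.self, from: Data(contentsOf: url))
    }
}
```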
### System Integration
#### Global Hotkeys
- Carbon framework integration (`RegisterEventHotKey`)
- Push-to-talk and toggle modes
- Hotkey conflict detection and user guidance
- Cross-application hotkey handling
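A sketch of Carbon-based registration with `RegisterEventHotKey`; the Cmd+Shift+Space binding and the `HotkeyManager` shape are assumptions:
```swift
import Carbon

final class HotkeyManager {
    private var hotKeyRef: EventHotKeyRef?
    private var eventHandler: EventHandlerRef?
    var onHotkey: (() -> Void)?

    func register() {
        // Receive hot-key-pressed events (kEventHotKeyReleased can be added for push-to-talk).
        var eventType = EventTypeSpec(eventClass: OSType(kEventClassKeyboard),
                                      eventKind: UInt32(kEventHotKeyPressed))
        _ = InstallEventHandler(GetApplicationEventTarget(), { _, _, userData in
            guard let userData else { return noErr }
            let manager = Unmanaged<HotkeyManager>.fromOpaque(userData).takeUnretainedValue()
            manager.onHotkey?()
            return noErr
        }, 1, &eventType, Unmanaged.passUnretained(self).toOpaque(), &eventHandler)

        // Example binding: Cmd+Shift+Space.
        let hotKeyID = EventHotKeyID(signature: OSType(0x4D57_4850), id: 1)   // 'MWHP'
        _ = RegisterEventHotKey(UInt32(kVK_Space), UInt32(cmdKey | shiftKey),
                                hotKeyID, GetApplicationEventTarget(), 0, &hotKeyRef)
    }

    func unregister() {
        if let hotKeyRef { UnregisterEventHotKey(hotKeyRef) }
        if let eventHandler { RemoveEventHandler(eventHandler) }
    }
}
```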
#### Secure Input Detection
- `IsSecureEventInputEnabled()` monitoring
- Safe fallback behavior (clipboard-only)
- User notification for secure contexts
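A sketch of the monitoring loop; the polling interval and callback names are assumptions:
```swift
import Foundation
import Carbon

final class SecureInputMonitor {
    private var timer: Timer?
    private(set) var isSecureInputActive = false
    var onChange: ((Bool) -> Void)?

    func start() {
        timer = Timer.scheduledTimer(withTimeInterval: 1.0, repeats: true) { [weak self] _ in
            guard let self else { return }
            let active = IsSecureEventInputEnabled()
            if active != self.isSecureInputActive {
                self.isSecureInputActive = active
                // e.g. disable typing mode, show a hint in the HUD
                self.onChange?(active)
            }
        }
    }

    func stop() {
        timer?.invalidate()
        timer = nil
    }
}
```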
#### Utils
- Shared utilities and helper functions
- Logging infrastructure (opt-in local logs)
- Error handling and user feedback
## Data Flow
### Main Operational Flow
```
User Hotkey → Audio Capture → STT Processing → Text Injection
▲ │ │ │
│ ▼ ▼ ▼
Hotkey Mgr Audio Buffer Model Engine Injection Mgr
│ RMS/Peak whisper.cpp Clipboard/Type
│ │ │ │
▼ ▼ ▼ ▼
HUD UI Visual Feedback Processing UI Target App
```
### State Management
The application follows a finite state machine pattern:
- **Idle**: Waiting for user input
- **Listening**: Capturing audio with visual feedback
- **Processing**: Running STT inference
- **Injecting**: Inserting text into target application
- **Error**: Handling and displaying errors
## Finite State Machine
```
┌─────────────┐
│ Idle │◄─────────────┐
└─────────────┘ │
│ │
│ Hotkey Press │ Success/Error
▼ │
┌─────────────┐ │
│ Listening │ │
└─────────────┘ │
│ │
│ Stop/Timeout │
▼ │
┌─────────────┐ │
│ Processing │ │
└─────────────┘ │
│ │
│ STT Complete │
▼ │
┌─────────────┐ │
│ Injecting │──────────────┘
└─────────────┘
```
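A minimal enum-based sketch of this state machine; the `DictationController` name and its method names are assumptions, while the states and transitions follow the diagram above:
```swift
import Foundation

enum DictationState: Equatable {
    case idle
    case listening
    case processing
    case injecting
    case error(String)
}

final class DictationController {
    private(set) var state: DictationState = .idle

    func handleHotkey() {
        switch state {
        case .idle:       state = .listening      // start capture
        case .listening:  state = .processing     // stop / timeout: run STT
        default:          break                   // ignore while busy
        }
    }

    func transcriptionFinished(_ result: Result<String, Error>) {
        switch result {
        case .success:             state = .injecting
        case .failure(let error):  state = .error(error.localizedDescription)
        }
    }

    func injectionCompleted() { state = .idle }   // success returns to Idle
    func dismissError()       { state = .idle }   // error returns to Idle
}
```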
## Technology Stack
### Core Technologies
- **Swift 5.9+**: Primary development language
- **SwiftUI**: User interface framework
- **AppKit**: macOS-specific UI components (NSStatusItem, NSPanel)
- **AVFoundation**: Audio capture and processing
- **Carbon**: Global hotkey registration
### External Dependencies
- **whisper.cpp**: C/C++ speech recognition engine with Metal support
- **Swift Package Manager**: Dependency management and build system
### Platform Integration
- **UserDefaults**: Settings persistence
- **NSPasteboard**: Clipboard operations
- **CGEvent**: Low-level input simulation
- **URLSession**: Model downloads
## Build System
The project uses Swift Package Manager with modular targets:
```
MenuWhisper/
├── Package.swift # SPM configuration
├── Sources/
│ ├── App/ # Main application target
│ ├── CoreAudio/ # Audio processing module
│ ├── CoreSTT/ # Speech-to-text engines
│ ├── CoreModels/ # Model management
│ ├── CoreInjection/ # Text insertion
│ ├── CorePermissions/ # System permissions
│ ├── CoreSettings/ # User preferences
│ └── CoreUtils/ # Shared utilities
├── Resources/ # Assets, localizations
└── Tests/ # Unit and integration tests
```
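A sketch of the corresponding `Package.swift`; the inter-module dependencies and the platform floor are assumptions beyond the directory layout above, and the whisper.cpp binding target is omitted:
```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MenuWhisper",
    platforms: [.macOS(.v13)],
    targets: [
        .executableTarget(
            name: "App",
            dependencies: ["CoreAudio", "CoreSTT", "CoreModels", "CoreInjection",
                           "CorePermissions", "CoreSettings", "CoreUtils"],
            path: "Sources/App"
        ),
        .target(name: "CoreAudio", dependencies: ["CoreUtils"], path: "Sources/CoreAudio"),
        .target(name: "CoreSTT", dependencies: ["CoreUtils"], path: "Sources/CoreSTT"),
        .target(name: "CoreModels", dependencies: ["CoreUtils"], path: "Sources/CoreModels"),
        .target(name: "CoreInjection", dependencies: ["CoreUtils"], path: "Sources/CoreInjection"),
        .target(name: "CorePermissions", path: "Sources/CorePermissions"),
        .target(name: "CoreSettings", dependencies: ["CoreUtils"], path: "Sources/CoreSettings"),
        .target(name: "CoreUtils", path: "Sources/CoreUtils"),
        .testTarget(name: "MenuWhisperTests",
                    dependencies: ["CoreSTT", "CoreModels"],
                    path: "Tests")
    ]
)
```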
## Security Considerations
### Privacy
- All audio processing occurs locally
- No telemetry or data collection
- Optional local logging with user consent
### System Security
- Respects Secure Input contexts
- Requires explicit user permission grants
- Code signing and notarization for distribution
### Input Safety
- Validates all user inputs
- Safe handling of special characters in typing mode
- Proper escaping for different keyboard layouts
## Performance Characteristics
### Target Metrics
- **Latency**: <4s additional processing time for 10s audio (M1 + small model)
- **Memory**: ~1.5-2.5GB with small model
- **Model Loading**: Lazy loading with warm cache
- **UI Responsiveness**: Non-blocking background processing
### Optimization Strategies
- Metal acceleration for STT inference
- Efficient audio buffering and streaming
- Model reuse across dictation sessions
- Configurable threading for CPU-intensive operations
## Future Extensibility
The modular architecture supports future enhancements:
- Additional STT backends (Core ML, cloud services)
- Voice Activity Detection (VAD)
- Advanced audio preprocessing
- Custom insertion rules per application
- Plugin architecture for text processing
This architecture provides a solid foundation for the MVP while maintaining flexibility for future feature additions and platform evolution.