Initial commit

This commit is contained in:
Felipe M 2025-09-18 19:56:06 +02:00
commit 1db16227b2
Signed by: fmartingr
GPG key ID: CCFBC5637D4000A8
31 changed files with 2175 additions and 0 deletions

Docs/ARCHITECTURE.md Normal file

@@ -0,0 +1,243 @@
# Architecture — Menu-Whisper
This document describes the high-level architecture and module organization for Menu-Whisper, a macOS offline speech-to-text application.
## Overview
Menu-Whisper follows a modular architecture with clear separation of concerns between UI, audio processing, speech recognition, text injection, and system integration components.
## System Architecture
```
┌─────────────────────────────────────────────────────────┐
│ App Layer │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ MenuBarExtra │ │ HUD Panel │ │ Preferences │ │
│ │ (SwiftUI) │ │ (SwiftUI) │ │ (SwiftUI) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Core Modules │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Audio │ │ STT │ │ Injection │ │
│ │ AVAudioEngine │ │ whisper.cpp │ │ Clipboard │ │
│ │ RMS/Peak │ │ Core ML │ │ Typing │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Models │ │ Permissions │ │ Settings │ │
│ │ Management │ │ Microphone │ │ UserDefaults│ │
│ │ Downloads │ │ Accessibility │ │ JSON Export │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ System Integration │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Global Hotkeys │ │ Secure Input │ │ Utils │ │
│ │ Carbon │ │ Detection │ │ Helpers │ │
│ │ RegisterHotKey │ │ CGEvent API │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
```
## Module Descriptions
### App Layer
- **MenuBarExtra**: SwiftUI-based menu bar interface using `MenuBarExtra` for macOS 13+
- **HUD Panel**: Non-activating NSPanel for "Listening" and "Processing" states
- **Preferences**: Settings window for model management, hotkey configuration, and related preferences
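A minimal sketch of the menu bar entry point using `MenuBarExtra`; the `MenuWhisperApp`/`AppState` names and the menu items are illustrative assumptions, not the final UI:
```swift
import SwiftUI
import AppKit

// Hypothetical app state; the real App Layer will expose richer state.
final class AppState: ObservableObject {
    @Published var isListening = false
    func toggleDictation() { isListening.toggle() }
}

@main
struct MenuWhisperApp: App {
    @StateObject private var appState = AppState()

    var body: some Scene {
        // MenuBarExtra requires macOS 13+, matching the App Layer description.
        MenuBarExtra("Menu-Whisper", systemImage: appState.isListening ? "mic.fill" : "mic") {
            Button(appState.isListening ? "Stop Dictation" : "Start Dictation") {
                appState.toggleDictation()
            }
            Divider()
            Button("Quit Menu-Whisper") { NSApp.terminate(nil) }
        }
    }
}
```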
### Core Modules
#### Core/Audio
**Purpose**: Audio capture and real-time processing
- AVAudioEngine integration for microphone input
- Real-time RMS/peak computation for visual feedback
- Audio format conversion (16kHz mono PCM for STT)
- Dictation time limits and session management
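A sketch of the capture path under these constraints, assuming a hypothetical `AudioCapture` type and callback names; only the 16 kHz mono PCM target and the RMS feedback come from this document:
```swift
import AVFoundation

final class AudioCapture {
    private let engine = AVAudioEngine()

    /// RMS level for the HUD meter.
    var onLevel: ((Float) -> Void)?
    /// 16 kHz mono Float32 samples for the STT engine.
    var onSamples: (([Float]) -> Void)?

    func start() throws {
        let input = engine.inputNode
        let inFormat = input.outputFormat(forBus: 0)
        guard let outFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                            sampleRate: 16_000, channels: 1,
                                            interleaved: false),
              let converter = AVAudioConverter(from: inFormat, to: outFormat) else {
            throw NSError(domain: "MenuWhisper.Audio", code: -1, userInfo: nil)
        }

        input.installTap(onBus: 0, bufferSize: 4096, format: inFormat) { [weak self] buffer, _ in
            guard let self, let ch = buffer.floatChannelData?[0] else { return }
            let n = Int(buffer.frameLength)

            // Real-time RMS for visual feedback.
            var sum: Float = 0
            for i in 0..<n { sum += ch[i] * ch[i] }
            self.onLevel?(n > 0 ? (sum / Float(n)).squareRoot() : 0)

            // Convert to 16 kHz mono for the STT engine.
            let ratio = outFormat.sampleRate / inFormat.sampleRate
            guard let out = AVAudioPCMBuffer(pcmFormat: outFormat,
                                             frameCapacity: AVAudioFrameCount(Double(n) * ratio) + 1) else { return }
            var fed = false
            _ = converter.convert(to: out, error: nil) { _, status in
                if fed { status.pointee = .noDataNow; return nil }
                fed = true
                status.pointee = .haveData
                return buffer
            }
            if let mono = out.floatChannelData?[0] {
                self.onSamples?(Array(UnsafeBufferPointer(start: mono, count: Int(out.frameLength))))
            }
        }
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```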
#### Core/STT
**Purpose**: Speech-to-text processing with multiple backends
- **WhisperCPP**: Primary backend using whisper.cpp with Metal acceleration
- **CoreML**: Future backend for Core ML models (Phase 6)
- `STTEngine` protocol for backend abstraction
- Language detection and text normalization
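A possible shape for the `STTEngine` abstraction; field and method names are assumptions, while the backend split (whisper.cpp now, Core ML in Phase 6) comes from this document:
```swift
import Foundation

struct TranscriptionOptions {
    var language: String?            // nil = auto-detect
    var normalizeText: Bool = true
}

struct TranscriptionResult {
    var text: String
    var detectedLanguage: String?
}

protocol STTEngine: AnyObject {
    /// Loads model weights (lazily, reused across dictation sessions).
    func loadModel(at url: URL) throws
    /// Transcribes 16 kHz mono Float32 PCM samples off the main thread.
    func transcribe(samples: [Float], options: TranscriptionOptions) async throws -> TranscriptionResult
    /// Frees model memory, e.g. after a long idle period.
    func unload()
}

// WhisperCPPEngine (wrapping whisper.cpp with Metal) conforms today;
// a CoreMLEngine can conform later without touching callers.
```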
#### Core/Models
**Purpose**: Model catalog, downloads, and management
- Curated model catalog (JSON-based)
- Download management with progress tracking
- SHA256 verification and integrity checks
- Local storage in `~/Library/Application Support/MenuWhisper/Models`
- Model selection and metadata management
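A sketch of the download-and-verify step; the `ModelInfo` fields and `installModel` name are assumptions, while the storage path matches the one above:
```swift
import Foundation
import CryptoKit

struct ModelInfo: Decodable {
    let name: String
    let downloadURL: URL
    let sha256: String               // expected digest from the curated catalog
}

enum ModelError: Error { case checksumMismatch }

func installModel(_ model: ModelInfo) async throws -> URL {
    // Download to a temporary file first.
    let (tempURL, _) = try await URLSession.shared.download(from: model.downloadURL)

    // Verify SHA256 before accepting the file. (For multi-GB models a
    // streaming hash over file chunks is preferable to loading it all at once.)
    let data = try Data(contentsOf: tempURL)
    let digest = SHA256.hash(data: data).map { String(format: "%02x", $0) }.joined()
    guard digest == model.sha256.lowercased() else { throw ModelError.checksumMismatch }

    // Move into ~/Library/Application Support/MenuWhisper/Models.
    let support = try FileManager.default.url(for: .applicationSupportDirectory,
                                              in: .userDomainMask,
                                              appropriateFor: nil, create: true)
    let modelsDir = support.appendingPathComponent("MenuWhisper/Models", isDirectory: true)
    try FileManager.default.createDirectory(at: modelsDir, withIntermediateDirectories: true)
    let destination = modelsDir.appendingPathComponent(model.downloadURL.lastPathComponent)
    try? FileManager.default.removeItem(at: destination)   // replace an older copy if present
    try FileManager.default.moveItem(at: tempURL, to: destination)
    return destination
}
```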
#### Core/Injection
**Purpose**: Text insertion into focused applications
- Clipboard-based insertion (preferred method)
- Character-by-character typing fallback
- Secure Input detection and handling
- Cross-application compatibility layer
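A sketch of the preferred clipboard path: place the transcript on the pasteboard, synthesize Cmd+V, then restore the previous contents. The paste shortcut, restore delay, and function name are assumptions:
```swift
import AppKit
import Carbon.HIToolbox

func injectViaClipboard(_ text: String) {
    let pasteboard = NSPasteboard.general
    let previous = pasteboard.string(forType: .string)   // best-effort restore

    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    // Synthesize Cmd+V into the focused application (requires Accessibility).
    let source = CGEventSource(stateID: .combinedSessionState)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: CGKeyCode(kVK_ANSI_V), keyDown: true)
    let keyUp = CGEvent(keyboardEventSource: source, virtualKey: CGKeyCode(kVK_ANSI_V), keyDown: false)
    keyDown?.flags = .maskCommand
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)

    // Restore the user's pasteboard after the target app has read it.
    DispatchQueue.main.asyncAfter(deadline: .now() + 0.3) {
        if let previous {
            pasteboard.clearContents()
            pasteboard.setString(previous, forType: .string)
        }
    }
}
```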
#### Core/Permissions
**Purpose**: System permission management and onboarding
- Microphone access (AVCaptureDevice authorization)
- Accessibility permissions for text injection
- Input Monitoring permissions for global hotkeys
- Permission status checking and guidance flows
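A sketch of the permission checks behind the onboarding flow; the `Permissions` wrapper is an assumption, while the underlying calls are standard macOS APIs:
```swift
import AVFoundation
import ApplicationServices

enum Permissions {
    /// Microphone: prompts the user on first request.
    static func requestMicrophone() async -> Bool {
        await AVCaptureDevice.requestAccess(for: .audio)
    }

    /// Accessibility: required to post the synthetic keystrokes used for injection.
    static func checkAccessibility(promptIfNeeded: Bool) -> Bool {
        let options = [kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: promptIfNeeded] as CFDictionary
        return AXIsProcessTrustedWithOptions(options)
    }

    // Input Monitoring status (if an event tap is used for hotkeys) can be
    // probed separately via IOKit's IOHIDCheckAccess.
}
```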
#### Core/Settings
**Purpose**: User preferences and configuration persistence
- UserDefaults-based storage
- JSON export/import functionality
- Settings validation and migration
- Configuration change notifications
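A sketch of the settings store: a `Codable` struct persisted in UserDefaults with JSON export/import. Field names and the notification name are assumptions:
```swift
import Foundation

struct Settings: Codable {
    var hotkeyMode: String = "push-to-talk"
    var selectedModel: String = "small"
    var maxDictationSeconds: Int = 60
}

final class SettingsStore {
    private let key = "MenuWhisper.Settings"
    private let defaults = UserDefaults.standard

    var current: Settings {
        get {
            guard let data = defaults.data(forKey: key),
                  let settings = try? JSONDecoder().decode(Settings.self, from: data) else {
                return Settings()          // fall back to defaults
            }
            return settings
        }
        set {
            if let data = try? JSONEncoder().encode(newValue) {
                defaults.set(data, forKey: key)
                NotificationCenter.default.post(name: .init("MenuWhisper.settingsChanged"), object: nil)
            }
        }
    }

    /// JSON export for backup or sharing.
    func export(to url: URL) throws {
        let encoder = JSONEncoder()
        encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
        try encoder.encode(current).write(to: url)
    }

    func `import`(from url: URL) throws {
        current = try JSONDecoder().decode(Settings.self, from: Data(contentsOf: url))
    }
}
```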
### System Integration
#### Global Hotkeys
- Carbon framework integration (`RegisterEventHotKey`)
- Push-to-talk and toggle modes
- Hotkey conflict detection and user guidance
- Cross-application hotkey handling
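A sketch of Carbon-based registration with `RegisterEventHotKey`; the Cmd+Shift+Space binding and the `HotkeyManager` shape are assumptions:
```swift
import Carbon

final class HotkeyManager {
    private var hotKeyRef: EventHotKeyRef?
    private var eventHandler: EventHandlerRef?
    var onHotkey: (() -> Void)?

    func register() {
        // Receive hot-key-pressed events (kEventHotKeyReleased can be added for push-to-talk).
        var eventType = EventTypeSpec(eventClass: OSType(kEventClassKeyboard),
                                      eventKind: UInt32(kEventHotKeyPressed))
        _ = InstallEventHandler(GetApplicationEventTarget(), { _, _, userData in
            guard let userData else { return noErr }
            let manager = Unmanaged<HotkeyManager>.fromOpaque(userData).takeUnretainedValue()
            manager.onHotkey?()
            return noErr
        }, 1, &eventType, Unmanaged.passUnretained(self).toOpaque(), &eventHandler)

        // Example binding: Cmd+Shift+Space.
        let hotKeyID = EventHotKeyID(signature: OSType(0x4D57_4850), id: 1)   // 'MWHP'
        _ = RegisterEventHotKey(UInt32(kVK_Space), UInt32(cmdKey | shiftKey),
                                hotKeyID, GetApplicationEventTarget(), 0, &hotKeyRef)
    }

    func unregister() {
        if let hotKeyRef { UnregisterEventHotKey(hotKeyRef) }
        if let eventHandler { RemoveEventHandler(eventHandler) }
    }
}
```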
#### Secure Input Detection
- `IsSecureEventInputEnabled()` monitoring
- Safe fallback behavior (clipboard-only)
- User notification for secure contexts
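A sketch of the monitoring loop; the polling interval and callback names are assumptions:
```swift
import Foundation
import Carbon

final class SecureInputMonitor {
    private var timer: Timer?
    private(set) var isSecureInputActive = false
    var onChange: ((Bool) -> Void)?

    func start() {
        timer = Timer.scheduledTimer(withTimeInterval: 1.0, repeats: true) { [weak self] _ in
            guard let self else { return }
            let active = IsSecureEventInputEnabled()
            if active != self.isSecureInputActive {
                self.isSecureInputActive = active
                // e.g. disable typing mode, show a hint in the HUD
                self.onChange?(active)
            }
        }
    }

    func stop() {
        timer?.invalidate()
        timer = nil
    }
}
```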
#### Utils
- Shared utilities and helper functions
- Logging infrastructure (opt-in local logs)
- Error handling and user feedback
## Data Flow
### Main Operational Flow
```
User Hotkey → Audio Capture → STT Processing → Text Injection
▲ │ │ │
│ ▼ ▼ ▼
Hotkey Mgr Audio Buffer Model Engine Injection Mgr
│ RMS/Peak whisper.cpp Clipboard/Type
│ │ │ │
▼ ▼ ▼ ▼
HUD UI Visual Feedback Processing UI Target App
```
### State Management
The application follows a finite state machine pattern:
- **Idle**: Waiting for user input
- **Listening**: Capturing audio with visual feedback
- **Processing**: Running STT inference
- **Injecting**: Inserting text into target application
- **Error**: Handling and displaying errors
## Finite State Machine
```
┌─────────────┐
│ Idle │◄─────────────┐
└─────────────┘ │
│ │
│ Hotkey Press │ Success/Error
▼ │
┌─────────────┐ │
│ Listening │ │
└─────────────┘ │
│ │
│ Stop/Timeout │
▼ │
┌─────────────┐ │
│ Processing │ │
└─────────────┘ │
│ │
│ STT Complete │
▼ │
┌─────────────┐ │
│ Injecting │──────────────┘
└─────────────┘
```
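A minimal enum-based sketch of this state machine; the `DictationController` name and its method names are assumptions, while the states and transitions follow the diagram above:
```swift
import Foundation

enum DictationState: Equatable {
    case idle
    case listening
    case processing
    case injecting
    case error(String)
}

final class DictationController {
    private(set) var state: DictationState = .idle

    func handleHotkey() {
        switch state {
        case .idle:       state = .listening      // start capture
        case .listening:  state = .processing     // stop / timeout: run STT
        default:          break                   // ignore while busy
        }
    }

    func transcriptionFinished(_ result: Result<String, Error>) {
        switch result {
        case .success:             state = .injecting
        case .failure(let error):  state = .error(error.localizedDescription)
        }
    }

    func injectionCompleted() { state = .idle }   // success returns to Idle
    func dismissError()       { state = .idle }   // error returns to Idle
}
```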
## Technology Stack
### Core Technologies
- **Swift 5.9+**: Primary development language
- **SwiftUI**: User interface framework
- **AppKit**: macOS-specific UI components (NSStatusItem, NSPanel)
- **AVFoundation**: Audio capture and processing
- **Carbon**: Global hotkey registration
### External Dependencies
- **whisper.cpp**: C/C++ speech recognition engine with Metal support
- **Swift Package Manager**: Dependency management and build system
### Platform Integration
- **UserDefaults**: Settings persistence
- **NSPasteboard**: Clipboard operations
- **CGEvent**: Low-level input simulation
- **URLSession**: Model downloads
## Build System
The project uses Swift Package Manager with modular targets:
```
MenuWhisper/
├── Package.swift # SPM configuration
├── Sources/
│ ├── App/ # Main application target
│ ├── CoreAudio/ # Audio processing module
│ ├── CoreSTT/ # Speech-to-text engines
│ ├── CoreModels/ # Model management
│ ├── CoreInjection/ # Text insertion
│ ├── CorePermissions/ # System permissions
│ ├── CoreSettings/ # User preferences
│ └── CoreUtils/ # Shared utilities
├── Resources/ # Assets, localizations
└── Tests/ # Unit and integration tests
```
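A sketch of the corresponding `Package.swift`; the inter-module dependencies and the platform floor are assumptions beyond the directory layout above, and the whisper.cpp binding target is omitted:
```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MenuWhisper",
    platforms: [.macOS(.v13)],
    targets: [
        .executableTarget(
            name: "App",
            dependencies: ["CoreAudio", "CoreSTT", "CoreModels", "CoreInjection",
                           "CorePermissions", "CoreSettings", "CoreUtils"],
            path: "Sources/App"
        ),
        .target(name: "CoreAudio", dependencies: ["CoreUtils"], path: "Sources/CoreAudio"),
        .target(name: "CoreSTT", dependencies: ["CoreUtils"], path: "Sources/CoreSTT"),
        .target(name: "CoreModels", dependencies: ["CoreUtils"], path: "Sources/CoreModels"),
        .target(name: "CoreInjection", dependencies: ["CoreUtils"], path: "Sources/CoreInjection"),
        .target(name: "CorePermissions", path: "Sources/CorePermissions"),
        .target(name: "CoreSettings", dependencies: ["CoreUtils"], path: "Sources/CoreSettings"),
        .target(name: "CoreUtils", path: "Sources/CoreUtils"),
        .testTarget(name: "MenuWhisperTests",
                    dependencies: ["CoreSTT", "CoreModels"],
                    path: "Tests")
    ]
)
```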
## Security Considerations
### Privacy
- All audio processing occurs locally
- No telemetry or data collection
- Optional local logging with user consent
### System Security
- Respects Secure Input contexts
- Requires explicit user permission grants
- Code signing and notarization for distribution
### Input Safety
- Validates all user inputs
- Safe handling of special characters in typing mode
- Proper escaping for different keyboard layouts
## Performance Characteristics
### Target Metrics
- **Latency**: <4s additional processing time for 10s audio (M1 + small model)
- **Memory**: ~1.5-2.5GB with small model
- **Model Loading**: Lazy loading with warm cache
- **UI Responsiveness**: Non-blocking background processing
### Optimization Strategies
- Metal acceleration for STT inference
- Efficient audio buffering and streaming
- Model reuse across dictation sessions
- Configurable threading for CPU-intensive operations
## Future Extensibility
The modular architecture supports future enhancements:
- Additional STT backends (Core ML, cloud services)
- Voice Activity Detection (VAD)
- Advanced audio preprocessing
- Custom insertion rules per application
- Plugin architecture for text processing
This architecture provides a solid foundation for the MVP while maintaining flexibility for future feature additions and platform evolution.