tellme/Docs/ARCHITECTURE.md at 5663f3c3de7746fb94817468d2e83f716b631014

2025-09-18 19:56:06 +02:00

This document describes the high-level architecture and module organization for Menu-Whisper, a macOS offline speech-to-text application.

Overview

Menu-Whisper follows a modular architecture with clear separation of concerns between UI, audio processing, speech recognition, text injection, and system integration components.

System Architecture

┌─────────────────────────────────────────────────────────┐
│                       App Layer                         │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│  │   MenuBarExtra  │ │    HUD Panel    │ │ Preferences │ │
│  │   (SwiftUI)     │ │   (SwiftUI)     │ │  (SwiftUI)  │ │
│  └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────┐
│                      Core Modules                       │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│  │     Audio       │ │      STT        │ │  Injection  │ │
│  │  AVAudioEngine  │ │  whisper.cpp    │ │  Clipboard  │ │
│  │   RMS/Peak      │ │   Core ML       │ │   Typing    │ │
│  └─────────────────┘ └─────────────────┘ └─────────────┘ │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│  │     Models      │ │  Permissions    │ │  Settings   │ │
│  │   Management    │ │   Microphone    │ │ UserDefaults│ │
│  │   Downloads     │ │  Accessibility  │ │ JSON Export │ │
│  └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────┐
│                  System Integration                     │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│  │ Global Hotkeys  │ │  Secure Input   │ │   Utils     │ │
│  │    Carbon       │ │   Detection     │ │  Helpers    │ │
│  │  RegisterHotKey │ │   CGEvent API   │ │             │ │
│  └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘

Module Descriptions

App Layer

MenuBarExtra: SwiftUI-based menu bar interface using MenuBarExtra for macOS 13+
HUD Panel: Non-activating NSPanel for "Listening" and "Processing" states
Preferences: Settings window with model management, hotkey configuration, etc.

Core Modules

Core/Audio

Purpose: Audio capture and real-time processing

AVAudioEngine integration for microphone input
Real-time RMS/peak computation for visual feedback
Audio format conversion (16kHz mono PCM for STT)
Dictation time limits and session management
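
The level metering described above can be sketched in a few lines. This is a minimal illustration, assuming the samples arrive as mono `Float` values in the range -1.0...1.0; the names `AudioLevel`, `measureLevel`, and `decibels` are hypothetical and not taken from the project.

```swift
import Foundation

/// Level readings for one buffer of mono PCM samples in the range -1.0...1.0.
/// (Hypothetical type, for illustration only.)
struct AudioLevel: Equatable {
    let rms: Float   // root mean square, a rough proxy for loudness
    let peak: Float  // largest absolute sample value in the buffer
}

/// Computes RMS and peak for a buffer; an empty buffer is reported as silence.
func measureLevel(_ samples: [Float]) -> AudioLevel {
    guard !samples.isEmpty else { return AudioLevel(rms: 0, peak: 0) }
    var sumOfSquares: Float = 0
    var peak: Float = 0
    for sample in samples {
        sumOfSquares += sample * sample
        peak = max(peak, abs(sample))
    }
    let rms = (sumOfSquares / Float(samples.count)).squareRoot()
    return AudioLevel(rms: rms, peak: peak)
}

/// Converts a linear level to dBFS, clamped to a floor so that silence
/// does not produce negative infinity.
func decibels(_ level: Float, floor: Float = -80) -> Float {
    guard level > 0 else { return floor }
    return max(floor, 20 * log10(level))
}
```

A HUD meter would typically call `measureLevel` once per captured buffer and map the dBFS value to a bar height; the -80 dB floor is an arbitrary choice for this sketch.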

Core/STT

Purpose: Speech-to-text processing with multiple backends

WhisperCPP: Primary backend using whisper.cpp with Metal acceleration
CoreML: Future backend for Core ML models (Phase 6)
STTEngine protocol for backend abstraction
Language detection and text normalization
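
One possible shape for the backend abstraction is sketched below. The document names the `STTEngine` protocol but does not list its members, so every requirement shown here, along with `Transcription`, `STTError`, and the `EchoEngine` stand-in, is an assumption made for illustration.

```swift
import Foundation

/// Result of one transcription pass. Field names are illustrative.
struct Transcription: Equatable {
    let text: String
    let language: String?   // detected language code, if the backend reports one
}

enum STTError: Error {
    case modelNotLoaded
}

/// Hypothetical requirements for the backend abstraction; the real protocol may differ.
protocol STTEngine {
    var isModelLoaded: Bool { get }
    mutating func loadModel(at url: URL) throws
    /// Expects 16 kHz mono PCM samples, matching the Audio module's output format.
    func transcribe(samples: [Float], languageHint: String?) throws -> Transcription
}

/// Stand-in backend for tests and UI work. It performs no inference and only
/// reports how much audio it was given.
struct EchoEngine: STTEngine {
    private(set) var isModelLoaded = false

    mutating func loadModel(at url: URL) throws {
        // A real backend would open the model file here.
        isModelLoaded = true
    }

    func transcribe(samples: [Float], languageHint: String?) throws -> Transcription {
        guard isModelLoaded else { throw STTError.modelNotLoaded }
        let seconds = Double(samples.count) / 16_000
        return Transcription(text: "[\(seconds) s of audio]", language: languageHint)
    }
}
```

Keeping callers dependent only on the protocol is what would let a Core ML backend be added later without touching the audio or injection code.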

Core/Models

Purpose: Model catalog, downloads, and management

Curated model catalog (JSON-based)
Download management with progress tracking
SHA256 verification and integrity checks
Local storage in ~/Library/Application Support/MenuWhisper/Models
Model selection and metadata management
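
As a rough sketch of how a JSON catalog entry might be decoded and checked, consider the following. The schema (`name`, `sizeBytes`, `sha256`, `downloadURL`) and all function names are assumptions; the project's actual catalog format is not shown in this document. Computing the digest itself is left out; on macOS it could come from CryptoKit's `SHA256`.

```swift
import Foundation

/// One catalog entry. All field names are assumptions, not the project's schema.
struct ModelEntry: Codable, Equatable {
    let name: String
    let sizeBytes: Int
    let sha256: String
    let downloadURL: URL
}

/// Decodes a catalog (a JSON array of entries) from raw data.
func loadCatalog(from data: Data) throws -> [ModelEntry] {
    try JSONDecoder().decode([ModelEntry].self, from: data)
}

/// Builds the on-disk location for a model under the storage directory named above.
func localURL(for entry: ModelEntry, home: URL) -> URL {
    home
        .appendingPathComponent("Library/Application Support/MenuWhisper/Models", isDirectory: true)
        .appendingPathComponent(entry.name)
}

/// Compares a computed hex digest with the catalog value, ignoring letter case.
func digestMatches(_ computedHex: String, entry: ModelEntry) -> Bool {
    computedHex.lowercased() == entry.sha256.lowercased()
}
```

A download manager would call `digestMatches` after the transfer finishes and delete the file on a mismatch rather than leave a corrupt model in place.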

Core/Injection

Purpose: Text insertion into focused applications

Clipboard-based insertion (preferred method)
Character-by-character typing fallback
Secure Input detection and handling
Cross-application compatibility layer

Core/Permissions

Purpose: System permission management and onboarding

Microphone access (AVCaptureDevice authorization; AVAudioSession is not available to native macOS apps)
Accessibility permissions for text injection
Input Monitoring permissions for global hotkeys
Permission status checking and guidance flows
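
The guidance flow can be modeled as pure logic that is kept separate from the system calls. In the sketch below, the `Permission` and `PermissionStatus` enums and the `nextOnboardingStep` function are hypothetical; in a real app the statuses would be filled in from system APIs such as `AVCaptureDevice.authorizationStatus(for: .audio)` and `AXIsProcessTrusted()`.

```swift
/// Permissions listed above, in the order an onboarding flow might request them.
/// (The ordering is an assumption.)
enum Permission: CaseIterable {
    case microphone, accessibility, inputMonitoring
}

enum PermissionStatus {
    case granted, denied, notDetermined
}

/// Returns the first permission that still needs attention, or nil when
/// everything has been granted. Missing entries count as not determined.
func nextOnboardingStep(statuses: [Permission: PermissionStatus]) -> Permission? {
    Permission.allCases.first(where: { permission in
        statuses[permission, default: .notDetermined] != .granted
    })
}
```

Separating the decision from the queries makes the onboarding order testable without granting or revoking real permissions.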

Core/Settings

Purpose: User preferences and configuration persistence

UserDefaults-based storage
JSON export/import functionality
Settings validation and migration
Configuration change notifications
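
JSON export and import with validation could look roughly like this. The `AppSettings` fields, their defaults, and the validation rules are all assumptions for the sake of the example; only the use of `Codable` with `JSONEncoder` and `JSONDecoder` reflects standard Foundation behavior.

```swift
import Foundation

/// Illustrative settings payload; keys and default values are assumptions.
struct AppSettings: Codable, Equatable {
    var schemaVersion = 1
    var hotkeyMode = "pushToTalk"
    var selectedModel: String? = nil
    var maxDictationSeconds = 600
}

enum SettingsError: Error, Equatable {
    case unsupportedVersion(Int)
    case invalidValue(String)
}

/// Serializes settings to stable, human-readable JSON.
func exportSettings(_ settings: AppSettings) throws -> Data {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
    return try encoder.encode(settings)
}

/// Decodes settings and rejects payloads that fail basic validation.
func importSettings(from data: Data) throws -> AppSettings {
    let settings = try JSONDecoder().decode(AppSettings.self, from: data)
    guard settings.schemaVersion == 1 else {
        // A migration step for older versions would go here.
        throw SettingsError.unsupportedVersion(settings.schemaVersion)
    }
    guard settings.maxDictationSeconds > 0 else {
        throw SettingsError.invalidValue("maxDictationSeconds")
    }
    return settings
}
```

The same encoded `Data` could also be stored under a single UserDefaults key, which keeps persistence and export on one code path.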

System Integration

Global Hotkeys

Carbon framework integration (RegisterEventHotKey)
Push-to-talk and toggle modes
Hotkey conflict detection and user guidance
Cross-application hotkey handling
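
The difference between push-to-talk and toggle modes comes down to how key-down and key-up events map to start and stop commands. The sketch below shows that mapping only; it does not register a hotkey, and `HotkeyMode`, `KeyEvent`, `DictationCommand`, and `HotkeyInterpreter` are hypothetical names.

```swift
enum HotkeyMode { case pushToTalk, toggle }
enum KeyEvent { case pressed, released }
enum DictationCommand: Equatable { case start, stop }

/// Translates raw hotkey events into dictation commands for the chosen mode.
struct HotkeyInterpreter {
    let mode: HotkeyMode
    private(set) var isActive = false

    init(mode: HotkeyMode) { self.mode = mode }

    mutating func handle(_ event: KeyEvent) -> DictationCommand? {
        switch (mode, event) {
        case (.pushToTalk, .pressed):
            guard !isActive else { return nil }  // ignore repeated presses
            isActive = true
            return .start
        case (.pushToTalk, .released):
            guard isActive else { return nil }
            isActive = false
            return .stop
        case (.toggle, .pressed):
            isActive.toggle()
            return isActive ? .start : .stop
        case (.toggle, .released):
            return nil  // releases carry no meaning in toggle mode
        }
    }
}
```

In this arrangement the Carbon event handler would only forward pressed and released events, leaving the mode logic independent of the registration API.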

Secure Input Detection

IsSecureEventInputEnabled() monitoring
Safe fallback behavior (clipboard-only)
User notification for secure contexts
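
A minimal sketch of how Secure Input detection might feed the choice of injection method follows. `IsSecureEventInputEnabled()` is the Carbon call named above; `InjectionPlan`, `isSecureInputActive`, and `choosePlan` are hypothetical, and the conditional import exists only so the decision logic also compiles where Carbon is unavailable.

```swift
#if canImport(Carbon)
import Carbon
#endif

/// Ways the transcribed text could reach the target application.
enum InjectionPlan: Equatable {
    case paste          // put text on the pasteboard and simulate the paste shortcut
    case typing         // synthesize key events character by character
    case clipboardOnly  // leave text on the pasteboard and send no synthetic events
}

/// Reports whether some process currently has Secure Input enabled.
func isSecureInputActive() -> Bool {
    #if canImport(Carbon)
    return IsSecureEventInputEnabled()
    #else
    return false  // assumption: platforms without Carbon have no Secure Input
    #endif
}

/// Picks an injection method; Secure Input always forces the clipboard-only fallback.
func choosePlan(secureInput: Bool, prefersPaste: Bool = true) -> InjectionPlan {
    if secureInput { return .clipboardOnly }
    return prefersPaste ? .paste : .typing
}
```

A caller would evaluate `choosePlan(secureInput: isSecureInputActive())` immediately before injecting, since Secure Input can change whenever focus moves to a password field.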

Utils

Shared utilities and helper functions
Logging infrastructure (opt-in local logs)
Error handling and user feedback

Data Flow

Main Operational Flow

User Hotkey → Audio Capture → STT Processing → Text Injection
     ▲              │              │              │
     │              ▼              ▼              ▼
 Hotkey Mgr    Audio Buffer   Model Engine   Injection Mgr
     │          RMS/Peak      whisper.cpp    Clipboard/Type
     │              │              │              │
     ▼              ▼              ▼              ▼
   HUD UI      Visual Feedback  Processing UI  Target App

State Management

The application follows a finite state machine pattern:

Idle: Waiting for user input
Listening: Capturing audio with visual feedback
Processing: Running STT inference
Injecting: Inserting text into target application
Error: Handling and displaying errors

Finite State Machine

    ┌─────────────┐
    │    Idle     │◄─────────────┐
    └─────────────┘              │
           │                     │
           │ Hotkey Press        │ Success/Error
           ▼                     │
    ┌─────────────┐              │
    │  Listening  │              │
    └─────────────┘              │
           │                     │
           │ Stop/Timeout        │
           ▼                     │
    ┌─────────────┐              │
    │ Processing  │              │
    └─────────────┘              │
           │                     │
           │ STT Complete        │
           ▼                     │
    ┌─────────────┐              │
    │  Injecting  │──────────────┘
    └─────────────┘
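
The diagram above can be expressed as a transition function. This is a sketch under stated assumptions: the event names are hypothetical, and because the diagram does not draw the Error state, the rule that Error returns to Idle once dismissed is a guess at the intended behavior.

```swift
enum DictationState: Equatable {
    case idle, listening, processing, injecting
    case error(String)
}

/// Hypothetical events; the labels mirror the arrows in the diagram.
enum DictationEvent {
    case hotkeyPressed
    case stopRequested      // key release, toggle press, or time limit
    case sttCompleted
    case injectionFinished
    case failed(String)
    case errorDismissed     // assumption: errors are acknowledged before returning to Idle
}

/// Returns the next state, or nil when the event is not valid in the current state.
func nextState(from state: DictationState, on event: DictationEvent) -> DictationState? {
    switch (state, event) {
    case (.idle, .hotkeyPressed):
        return .listening
    case (.listening, .stopRequested):
        return .processing
    case (.processing, .sttCompleted):
        return .injecting
    case (.injecting, .injectionFinished):
        return .idle
    case (.listening, .failed(let message)),
         (.processing, .failed(let message)),
         (.injecting, .failed(let message)):
        return .error(message)
    case (.error, .errorDismissed):
        return .idle
    default:
        return nil
    }
}
```

Returning nil for invalid pairs, rather than silently staying in place, makes stray events such as a second hotkey press during Processing easy to log and ignore.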

Technology Stack

Core Technologies

Swift 5.9+: Primary development language
SwiftUI: User interface framework
AppKit: macOS-specific UI components (NSStatusItem, NSPanel)
AVFoundation: Audio capture and processing
Carbon: Global hotkey registration

External Dependencies

whisper.cpp: C/C++ speech recognition engine with Metal support
Swift Package Manager: Dependency management and build system

Platform Integration

UserDefaults: Settings persistence
NSPasteboard: Clipboard operations
CGEvent: Low-level input simulation
URLSession: Model downloads

Build System

The project uses Swift Package Manager with modular targets:

MenuWhisper/
├── Package.swift                    # SPM configuration
├── Sources/
│   ├── App/                        # Main application target
│   ├── CoreAudio/                  # Audio processing module
│   ├── CoreSTT/                    # Speech-to-text engines
│   ├── CoreModels/                 # Model management
│   ├── CoreInjection/              # Text insertion
│   ├── CorePermissions/            # System permissions
│   ├── CoreSettings/               # User preferences
│   └── CoreUtils/                  # Shared utilities
├── Resources/                      # Assets, localizations
└── Tests/                         # Unit and integration tests
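
A manifest matching this layout might look like the sketch below. Target names follow the directory names shown above; the dependency edges between modules, the tools version, and the platform floor are assumptions, and the whisper.cpp dependency and test targets are omitted because their details are not given in this document.

```swift
// swift-tools-version:5.9
import PackageDescription

// Sketch only: dependency edges between targets are guesses, not the project's manifest.
let package = Package(
    name: "MenuWhisper",
    platforms: [.macOS(.v13)],  // MenuBarExtra requires macOS 13 or later
    targets: [
        .target(name: "CoreUtils"),
        .target(name: "CoreSettings", dependencies: ["CoreUtils"]),
        .target(name: "CorePermissions", dependencies: ["CoreUtils"]),
        .target(name: "CoreAudio", dependencies: ["CoreUtils"]),
        .target(name: "CoreModels", dependencies: ["CoreUtils"]),
        .target(name: "CoreSTT", dependencies: ["CoreModels", "CoreUtils"]),
        .target(name: "CoreInjection", dependencies: ["CoreUtils"]),
        .executableTarget(
            name: "App",
            dependencies: [
                "CoreAudio", "CoreSTT", "CoreModels", "CoreInjection",
                "CorePermissions", "CoreSettings", "CoreUtils",
            ]
        ),
    ]
)
```

Because each directory under `Sources/` shares its target's name, no explicit `path:` arguments are needed.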

Security Considerations

Privacy

All audio processing occurs locally
No telemetry or data collection
Optional local logging with user consent

System Security

Respects Secure Input contexts
Requires explicit user permission grants
Code signing and notarization for distribution

Input Safety

Validates all user inputs
Safe handling of special characters in typing mode
Proper escaping for different keyboard layouts

Performance Characteristics

Target Metrics

Latency: under 4 s of processing after recording ends for 10 s of audio (M1, small model)
Memory: approximately 1.5-2.5 GB with the small model loaded
Model Loading: Lazy loading with warm cache
UI Responsiveness: Non-blocking background processing

Optimization Strategies

Metal acceleration for STT inference
Efficient audio buffering and streaming
Model reuse across dictation sessions
Configurable threading for CPU-intensive operations

Future Extensibility

The modular architecture supports future enhancements:

Additional STT backends (Core ML, cloud services)
Voice Activity Detection (VAD)
Advanced audio preprocessing
Custom insertion rules per application
Plugin architecture for text processing

This architecture provides a solid foundation for the MVP while maintaining flexibility for future feature additions and platform evolution.

Architecture — Menu-Whisper