tellme/Docs/ARCHITECTURE.md
2025-09-18 19:56:06 +02:00

Architecture — Menu-Whisper

This document describes the high-level architecture and module organization for Menu-Whisper, a macOS offline speech-to-text application.

Overview

Menu-Whisper follows a modular architecture with clear separation of concerns between UI, audio processing, speech recognition, text injection, and system integration components.

System Architecture

┌─────────────────────────────────────────────────────────┐
│                       App Layer                         │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│  │   MenuBarExtra  │ │    HUD Panel    │ │ Preferences │ │
│  │   (SwiftUI)     │ │   (SwiftUI)     │ │  (SwiftUI)  │ │
│  └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────┐
│                      Core Modules                       │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│  │     Audio       │ │      STT        │ │  Injection  │ │
│  │  AVAudioEngine  │ │  whisper.cpp    │ │  Clipboard  │ │
│  │   RMS/Peak      │ │   Core ML       │ │   Typing    │ │
│  └─────────────────┘ └─────────────────┘ └─────────────┘ │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│  │     Models      │ │  Permissions    │ │  Settings   │ │
│  │   Management    │ │   Microphone    │ │ UserDefaults│ │
│  │   Downloads     │ │  Accessibility  │ │ JSON Export │ │
│  └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────┐
│                  System Integration                     │
│  ┌─────────────────────┐ ┌─────────────┐ ┌─────────────┐ │
│  │   Global Hotkeys    │ │Secure Input │ │   Utils     │ │
│  │       Carbon        │ │  Detection  │ │  Helpers    │ │
│  │ RegisterEventHotKey │ │ CGEvent API │ │             │ │
│  └─────────────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘

Module Descriptions

App Layer

  • MenuBarExtra: SwiftUI-based menu bar interface using MenuBarExtra for macOS 13+
  • HUD Panel: Non-activating NSPanel for "Listening" and "Processing" states
  • Preferences: Settings window with model management, hotkey configuration, etc.

Core Modules

Core/Audio

Purpose: Audio capture and real-time processing

  • AVAudioEngine integration for microphone input
  • Real-time RMS/peak computation for visual feedback
  • Audio format conversion (16kHz mono PCM for STT)
  • Dictation time limits and session management
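The RMS/peak computation that drives the visual feedback can be sketched as a pure function over one buffer of mono samples. This is a minimal illustration, not the project's implementation: the `AudioLevels` type and `measureLevels` name are hypothetical, and the samples are assumed to be floats in the range -1.0...1.0 as they would arrive from an AVAudioEngine input tap (the tap itself is not shown).

```swift
/// Level measurements for one buffer of mono samples (hypothetical type).
struct AudioLevels: Equatable {
    let rms: Float   // root mean square, a proxy for perceived loudness
    let peak: Float  // largest absolute sample value in the buffer
}

/// Computes RMS and peak for a buffer. An empty buffer yields zeros.
func measureLevels(_ samples: [Float]) -> AudioLevels {
    guard !samples.isEmpty else { return AudioLevels(rms: 0, peak: 0) }

    var sumOfSquares: Float = 0
    var peak: Float = 0
    for sample in samples {
        sumOfSquares += sample * sample
        peak = max(peak, abs(sample))
    }

    let rms = (sumOfSquares / Float(samples.count)).squareRoot()
    return AudioLevels(rms: rms, peak: peak)
}
```

Keeping the measurement free of AVFoundation types makes it easy to unit test and cheap enough to call once per captured buffer.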

Core/STT

Purpose: Speech-to-text processing with multiple backends

  • WhisperCPP: Primary backend using whisper.cpp with Metal acceleration
  • CoreML: Future backend for Core ML models (Phase 6)
  • STTEngine protocol for backend abstraction
  • Language detection and text normalization
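One way the STTEngine abstraction could look is sketched below. Only the protocol name comes from this document; the method signature, the `STTError` type, the `StubEngine` stand-in, and the `runDictation` helper are assumptions made for illustration. Whitespace trimming stands in for whatever text normalization the real module performs.

```swift
import Foundation

/// Errors a backend might report (hypothetical).
enum STTError: Error {
    case emptyAudio
}

/// Backend abstraction: callers depend on this, not on whisper.cpp or Core ML.
protocol STTEngine {
    var identifier: String { get }
    /// `samples` are assumed to be 16 kHz mono PCM floats; `language` nil means auto-detect.
    func transcribe(samples: [Float], language: String?) throws -> String
}

/// A stand-in backend that returns canned text, useful for tests and UI work.
struct StubEngine: STTEngine {
    let identifier = "stub"
    let cannedText: String

    func transcribe(samples: [Float], language: String?) throws -> String {
        guard !samples.isEmpty else { throw STTError.emptyAudio }
        return cannedText
    }
}

/// Runs any engine, then applies a trivial normalization step.
func runDictation(using engine: STTEngine, samples: [Float]) throws -> String {
    let raw = try engine.transcribe(samples: samples, language: nil)
    return raw.trimmingCharacters(in: .whitespacesAndNewlines)
}
```

With this shape, adding the planned Core ML backend would mean adding another conforming type, leaving call sites unchanged.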

Core/Models

Purpose: Model catalog, downloads, and management

  • Curated model catalog (JSON-based)
  • Download management with progress tracking
  • SHA256 verification and integrity checks
  • Local storage in ~/Library/Application Support/MenuWhisper/Models
  • Model selection and metadata management
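The SHA256 verification step can be sketched with CryptoKit. The function names and the idea that the expected digest arrives as a hex string from the catalog are assumptions; the document does not specify the catalog schema.

```swift
import CryptoKit
import Foundation

/// Returns the lowercase hex SHA-256 digest of `data`.
func sha256Hex(of data: Data) -> String {
    SHA256.hash(data: data)
        .map { String(format: "%02x", $0) }
        .joined()
}

/// Compares a downloaded model against the digest listed in the catalog.
/// The comparison ignores the case of the expected hex string.
func verifyModel(data: Data, expectedSHA256: String) -> Bool {
    sha256Hex(of: data) == expectedSHA256.lowercased()
}
```

For model files that are hundreds of megabytes, loading everything into a single `Data` value is wasteful; a production version would more likely feed chunks into a `SHA256()` instance with `update(data:)` and call `finalize()` at the end.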

Core/Injection

Purpose: Text insertion into focused applications

  • Clipboard-based insertion (preferred method)
  • Character-by-character typing fallback
  • Secure Input detection and handling
  • Cross-application compatibility layer

Core/Permissions

Purpose: System permission management and onboarding

  • Microphone access (AVCaptureDevice authorization, since AVAudioSession is unavailable to native macOS apps)
  • Accessibility permissions for text injection
  • Input Monitoring permission where hotkey capture relies on event taps (Carbon RegisterEventHotKey itself does not require it)
  • Permission status checking and guidance flows

Core/Settings

Purpose: User preferences and configuration persistence

  • UserDefaults-based storage
  • JSON export/import functionality
  • Settings validation and migration
  • Configuration change notifications
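JSON export/import with basic validation could be built on `Codable`, as in the sketch below. Every field name, default value, and the `SettingsError` type are hypothetical; the document does not list the actual settings keys. Writing the decoded value back to UserDefaults is left out.

```swift
import Foundation

/// Hypothetical settings payload; field names are illustrative only.
struct AppSettings: Codable, Equatable {
    var schemaVersion: Int = 1
    var selectedModel: String = "small"
    var hotkeyMode: String = "pushToTalk"
    var maxDictationSeconds: Int = 600
}

enum SettingsError: Error {
    case invalidValue(String)
}

/// Serializes settings as stable, human-readable JSON.
func exportSettings(_ settings: AppSettings) throws -> Data {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
    return try encoder.encode(settings)
}

/// Decodes settings and rejects values that fail validation.
func importSettings(from data: Data) throws -> AppSettings {
    let decoded = try JSONDecoder().decode(AppSettings.self, from: data)
    guard decoded.maxDictationSeconds > 0 else {
        throw SettingsError.invalidValue("maxDictationSeconds")
    }
    return decoded
}
```

A `schemaVersion` field like the one assumed here gives a migration step something to branch on when older exports are imported.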

System Integration

Global Hotkeys

  • Carbon framework integration (RegisterEventHotKey)
  • Push-to-talk and toggle modes
  • Hotkey conflict detection and user guidance
  • Cross-application hotkey handling
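The difference between push-to-talk and toggle modes comes down to how key-down and key-up events map to actions. The sketch below shows that mapping as a pure function; all type and case names are hypothetical, and it assumes the hotkey layer reports both press and release events (Carbon delivers these as `kEventHotKeyPressed` and `kEventHotKeyReleased`). The registration code itself is not shown.

```swift
enum HotkeyMode { case pushToTalk, toggle }
enum HotkeyEvent { case pressed, released }
enum HotkeyAction: Equatable { case startListening, stopListening, ignore }

/// Maps a hotkey event to an action, given the mode and the current capture state.
func action(for event: HotkeyEvent, mode: HotkeyMode, isListening: Bool) -> HotkeyAction {
    switch (mode, event) {
    case (.pushToTalk, .pressed):
        // Hold to talk: start on key-down unless already capturing.
        return isListening ? .ignore : .startListening
    case (.pushToTalk, .released):
        // Releasing the key ends the dictation.
        return isListening ? .stopListening : .ignore
    case (.toggle, .pressed):
        // Each press flips between capturing and not capturing.
        return isListening ? .stopListening : .startListening
    case (.toggle, .released):
        // Key-up carries no meaning in toggle mode.
        return .ignore
    }
}
```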

Secure Input Detection

  • IsSecureEventInputEnabled() monitoring
  • Safe fallback behavior (clipboard-only)
  • User notification for secure contexts
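The clipboard-only fallback can be expressed as a small decision function. This is a sketch under assumptions: the enum names are hypothetical, and the Secure Input state is passed in as a plain Boolean so the logic stays testable. In the app, that value would presumably come from Carbon's `IsSecureEventInputEnabled()`, checked just before injection.

```swift
/// How the transcript reaches the target application (hypothetical names).
enum InjectionStrategy: Equatable {
    case pasting        // copy to the pasteboard, then synthesize a paste
    case typing         // synthesize key events character by character
    case clipboardOnly  // copy only; the user pastes manually
}

enum InjectionPreference { case pasting, typing }

/// Chooses a strategy. Secure Input always wins, because synthesized
/// keystrokes should not be sent into password fields and similar contexts.
func chooseStrategy(secureInputActive: Bool,
                    preference: InjectionPreference) -> InjectionStrategy {
    if secureInputActive {
        return .clipboardOnly
    }
    switch preference {
    case .pasting: return .pasting
    case .typing: return .typing
    }
}
```

When the result is `.clipboardOnly`, the caller would also show the user notification described above so the transcript is not silently dropped.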

Utils

  • Shared utilities and helper functions
  • Logging infrastructure (opt-in local logs)
  • Error handling and user feedback

Data Flow

Main Operational Flow

User Hotkey → Audio Capture → STT Processing → Text Injection
     ▲              │              │              │
     │              ▼              ▼              ▼
 Hotkey Mgr    Audio Buffer   Model Engine   Injection Mgr
     │          RMS/Peak      whisper.cpp    Clipboard/Type
     │              │              │              │
     ▼              ▼              ▼              ▼
   HUD UI      Visual Feedback  Processing UI  Target App

State Management

The application follows a finite state machine pattern:

  • Idle: Waiting for user input
  • Listening: Capturing audio with visual feedback
  • Processing: Running STT inference
  • Injecting: Inserting text into target application
  • Error: Handling and displaying errors

Finite State Machine

    ┌─────────────┐
    │    Idle     │◄─────────────┐
    └─────────────┘              │
           │                     │
           │ Hotkey Press        │ Success/Error
           ▼                     │
    ┌─────────────┐              │
    │  Listening  │              │
    └─────────────┘              │
           │                     │
           │ Stop/Timeout        │
           ▼                     │
    ┌─────────────┐              │
    │ Processing  │              │
    └─────────────┘              │
           │                     │
           │ STT Complete        │
           ▼                     │
    ┌─────────────┐              │
    │  Injecting  │──────────────┘
    └─────────────┘
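The state machine above can be sketched as an enum plus a transition function that returns nil for transitions the diagram does not allow. The type, case, and event names are hypothetical. One assumption goes beyond the diagram: failures in any active state move to an explicit Error state, which returns to Idle once dismissed, whereas the diagram only shows a combined Success/Error edge back to Idle.

```swift
enum DictationState: Equatable {
    case idle, listening, processing, injecting
    case error(String)
}

enum DictationEvent {
    case hotkeyPressed      // Idle -> Listening
    case stopOrTimeout      // Listening -> Processing
    case sttComplete        // Processing -> Injecting
    case injectionFinished  // Injecting -> Idle
    case failed(String)     // any active state -> Error (assumption)
    case errorDismissed     // Error -> Idle (assumption)
}

/// Returns the next state, or nil if the event is not valid in `state`.
func nextState(from state: DictationState, on event: DictationEvent) -> DictationState? {
    switch (state, event) {
    case (.idle, .hotkeyPressed):
        return .listening
    case (.listening, .stopOrTimeout):
        return .processing
    case (.processing, .sttComplete):
        return .injecting
    case (.injecting, .injectionFinished):
        return .idle
    case (.listening, .failed(let message)),
         (.processing, .failed(let message)),
         (.injecting, .failed(let message)):
        return .error(message)
    case (.error, .errorDismissed):
        return .idle
    default:
        return nil
    }
}
```

Returning nil instead of silently staying in place lets the caller log unexpected events, such as a hotkey press arriving while inference is still running.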

Technology Stack

Core Technologies

  • Swift 5.9+: Primary development language
  • SwiftUI: User interface framework
  • AppKit: macOS-specific UI components (NSStatusItem, NSPanel)
  • AVFoundation: Audio capture and processing
  • Carbon: Global hotkey registration

External Dependencies

  • whisper.cpp: C/C++ speech recognition engine with Metal support
  • Swift Package Manager: Dependency management and build system

Platform Integration

  • UserDefaults: Settings persistence
  • NSPasteboard: Clipboard operations
  • CGEvent: Low-level input simulation
  • URLSession: Model downloads

Build System

The project uses Swift Package Manager with modular targets:

MenuWhisper/
├── Package.swift                    # SPM configuration
├── Sources/
│   ├── App/                        # Main application target
│   ├── CoreAudio/                  # Audio processing module
│   ├── CoreSTT/                    # Speech-to-text engines
│   ├── CoreModels/                 # Model management
│   ├── CoreInjection/              # Text insertion
│   ├── CorePermissions/            # System permissions
│   ├── CoreSettings/               # User preferences
│   └── CoreUtils/                  # Shared utilities
├── Resources/                      # Assets, localizations
└── Tests/                         # Unit and integration tests
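A manifest matching this layout might look roughly like the sketch below. The target names mirror the directory tree and the macOS 13 minimum follows from the MenuBarExtra requirement; the dependency edges between targets are guesses, and the whisper.cpp dependency, resources, and test targets are omitted because the document does not say how they are wired in.

```swift
// swift-tools-version:5.9
import PackageDescription

// Hypothetical manifest: dependency edges between modules are assumptions.
let package = Package(
    name: "MenuWhisper",
    platforms: [.macOS(.v13)],  // MenuBarExtra requires macOS 13+
    targets: [
        // Leaf module with no internal dependencies.
        .target(name: "CoreUtils"),

        // Core modules, each in Sources/<TargetName>.
        .target(name: "CoreSettings", dependencies: ["CoreUtils"]),
        .target(name: "CorePermissions", dependencies: ["CoreUtils"]),
        .target(name: "CoreAudio", dependencies: ["CoreUtils"]),
        .target(name: "CoreModels", dependencies: ["CoreUtils"]),
        .target(name: "CoreSTT", dependencies: ["CoreModels", "CoreUtils"]),
        .target(name: "CoreInjection", dependencies: ["CoreUtils"]),

        // The app target ties the modules together.
        .executableTarget(
            name: "App",
            dependencies: [
                "CoreAudio", "CoreSTT", "CoreModels", "CoreInjection",
                "CorePermissions", "CoreSettings", "CoreUtils",
            ]
        ),
    ]
)
```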

Security Considerations

Privacy

  • All audio processing occurs locally
  • No telemetry or data collection
  • Optional local logging with user consent

System Security

  • Respects Secure Input contexts
  • Requires explicit user permission grants
  • Code signing and notarization for distribution

Input Safety

  • Validates all user inputs
  • Safe handling of special characters in typing mode
  • Proper escaping for different keyboard layouts

Performance Characteristics

Target Metrics

  • Latency: <4s additional processing time for 10s audio (M1 + small model)
  • Memory: ~1.5-2.5GB with small model
  • Model Loading: Lazy loading with warm cache
  • UI Responsiveness: Non-blocking background processing

Optimization Strategies

  • Metal acceleration for STT inference
  • Efficient audio buffering and streaming
  • Model reuse across dictation sessions
  • Configurable threading for CPU-intensive operations

Future Extensibility

The modular architecture supports future enhancements:

  • Additional STT backends (Core ML, cloud services)
  • Voice Activity Detection (VAD)
  • Advanced audio preprocessing
  • Custom insertion rules per application
  • Plugin architecture for text processing

This architecture provides a solid foundation for the MVP while maintaining flexibility for future feature additions and platform evolution.