Technical Definition — “Menu-Whisper” (macOS, Swift, Offline STT)

0) Owner Decisions (Locked)

  • Platform: Apple Silicon only (M1/M2/M3), macOS 13+.
  • STT backends: Start with whisper.cpp (Metal) for simplicity; add Core ML backend later.
  • Models: Do not auto-download. On first run, user chooses & downloads a model.
  • VAD: Post-MVP.
  • Insertion behavior: Configurable; direct insertion is default (no preview).
  • Default hotkey: ⌘⇧V (user-configurable).
  • Punctuation: Let the model handle punctuation automatically (no spoken commands).
  • Privacy/Connectivity: 100% local at runtime; model downloads only when the user explicitly requests. No telemetry.
  • Distribution: .app/.dmg (signed + notarized), outside the Mac App Store initially.
  • UI languages: ES/EN.
  • Low-power mode: Still allow downloads if the user starts them.
  • License: MIT.
  • Per-dictation limit: 10 minutes by default (configurable).

1) Goal

A menu bar app for macOS that performs offline speech-to-text using Whisper-family models and inserts the transcribed text into whichever app currently has focus. Shows a minimal HUD while listening and processing. No internet required during normal operation.


2) MVP Scope

  • Persistent menu bar item (NSStatusItem / MenuBarExtra).
  • Global hotkey (push-to-talk and toggle modes).
  • HUD (centered NSPanel + SwiftUI):
    • “Listening” with audio-level animation (RMS/peak).
    • “Processing” with a spinner/animation.
  • Offline STT with whisper.cpp (GGUF models; Metal acceleration on Apple Silicon).
  • Model Manager: curated list, manual download with progress + SHA256 check, user selection.
  • Text injection:
    • Preferred: Clipboard + ⌘V paste.
    • Fallback: simulated typing via CGEvent.
    • If Secure Input is active, do not inject; show notice and keep text on clipboard.
  • Preferences: hotkey & mode, model & language, insertion method, HUD styling, sounds, dictation limit.
  • Permissions onboarding: Microphone, Accessibility, Input Monitoring.

3) Functional Requirements

3.1 Capture

  • Prompt for permissions on first use.
  • Global hotkey (default ⌘⇧V).
  • Push-to-talk: start on key down, stop on key up.
  • Toggle: press to start, press again to stop.
  • Per-dictation limit (default 10 min, range 10 s–30 min).

3.2 HUD / UX

  • Non-activating, centered NSPanel (~320×160), no focus stealing.
  • Listening: bar-style audio visualization driven by live RMS/peak.
  • Processing: spinner + “Transcribing…” label.
  • Esc to cancel.
  • Optional start/stop sounds (user-toggleable).
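
A minimal sketch of the non-activating HUD panel described above, assuming the SwiftUI HUD content is hosted inside it; type and function names are illustrative:

import AppKit
import SwiftUI

// Sketch only: the HUD must never take key/main status, so focus is never stolen.
final class HUDPanel: NSPanel {
    override var canBecomeKey: Bool { false }
    override var canBecomeMain: Bool { false }
}

func makeHUDPanel<Content: View>(rootView: Content) -> HUDPanel {
    let panel = HUDPanel(contentRect: NSRect(x: 0, y: 0, width: 320, height: 160),
                         styleMask: [.nonactivatingPanel, .borderless],
                         backing: .buffered,
                         defer: false)
    panel.isFloatingPanel = true
    panel.level = .floating
    panel.isOpaque = false
    panel.backgroundColor = .clear
    panel.hidesOnDeactivate = false
    panel.contentView = NSHostingView(rootView: rootView)  // bars while listening, spinner while processing
    panel.center()
    return panel
}

Because the panel never becomes key, Esc-to-cancel would be handled elsewhere (e.g., an NSEvent monitor) rather than by the panel itself.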

3.3 STT

  • Backend A (MVP): whisper.cpp with GGUF and Metal.
  • Language: auto-detect or forced (persisted).
  • Basic text normalization; punctuation from the model.
  • UTF-8 output; standard replacements (quotes, dashes, etc.).
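
A sketch of the basic normalization step; the replacement table below is a placeholder, since the model itself supplies punctuation:

import Foundation

// Sketch: trim whitespace and apply standard character replacements to the transcript.
func normalizeTranscript(_ raw: String) -> String {
    var text = raw.trimmingCharacters(in: .whitespacesAndNewlines)
    let replacements: [(String, String)] = [
        ("...", "…"),   // ellipsis
        (" ,", ","),    // stray space before punctuation
        (" .", "."),
        (" ?", "?")
    ]
    for (pattern, replacement) in replacements {
        text = text.replacingOccurrences(of: pattern, with: replacement)
    }
    return text
}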

3.4 Injection

  • Preferred method: NSPasteboard + CGEvent to send ⌘V.
  • Fallback: CGEventCreateKeyboardEvent (character-by-character), respecting active keyboard layout.
  • Secure Input: detect with IsSecureEventInputEnabled(); if enabled, do not inject. Show a non-intrusive notice and leave the text on the clipboard.
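
A minimal sketch of this insertion path (the function name is a placeholder): the text always lands on the pasteboard, Secure Input is checked before any synthetic keystroke, and ⌘V is sent via CGEvent:

import AppKit
import Carbon.HIToolbox

// Sketch only: paste-based insertion with the Secure Input guard described above.
func insertTranscript(_ text: String) {
    // The text goes to the clipboard in every case, so the user can always paste manually.
    let pasteboard = NSPasteboard.general
    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    // Never synthesize keystrokes while Secure Input is active (password fields, 2FA prompts).
    guard !IsSecureEventInputEnabled() else {
        // ...show the non-intrusive "text left on clipboard" notice here...
        return
    }

    // Synthesize ⌘V in the frontmost app.
    let source = CGEventSource(stateID: .combinedSessionState)
    let vKey = CGKeyCode(kVK_ANSI_V)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: true)
    let keyUp = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: false)
    keyDown?.flags = .maskCommand
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)
}

The typing fallback would post per-character events instead (or use CGEvent's keyboardSetUnicodeString, which sidesteps layout mapping); restoring the previous pasteboard contents afterwards is a reasonable refinement.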

3.5 Preferences

  • General: hotkey + mode (push/toggle), sounds, HUD options.
  • Models: catalog, download, select active model, language, local storage path.
  • Insertion: direct vs preview (preview off by default), paste vs type.
  • Advanced: limits, performance knobs (threads/batch), local logs opt-in.

4) Non-Functional Requirements

  • Offline execution after models are installed.
  • Latency target (M1 + “small” model): < 4 s for 10 s of audio.
  • Memory target: ~1.5–2.5 GB with “small”.
  • Privacy: audio and text never leave the device.
  • Accessibility: sufficient contrast; VoiceOver labels; focus never stolen by HUD.

5) Architecture (High-Level)

  • App (SwiftUI) with AppKit bridges for NSStatusItem and NSPanel.
  • Shortcut Manager (Carbon RegisterEventHotKey or HotKey/MASShortcut); a registration sketch follows this list.
  • Audio: AVAudioEngine (downsample to 16 kHz mono, 16-bit PCM).
  • STT Engine:
    • whisper.cpp (C/C++ via SPM/CMake) with Metal.
    • Core ML backend (e.g., WhisperKit / custom) in a later phase.
  • Model Manager: curated catalog, downloads (progress + SHA256), selection, caching.
  • Text Injection: pasteboard + CGEvent; typing fallback; Secure Input detection.
  • Permissions Manager: guided flows to System Settings panes.
  • Settings: UserDefaults + JSON export/import.
  • Packaging: .app + .dmg (signed & notarized).
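
A sketch of the Shortcut Manager's Carbon path for the default ⌘⇧V; the handler wiring and four-char signature are placeholders, and wrapper libraries such as HotKey/MASShortcut encapsulate the same calls:

import Carbon.HIToolbox

// Sketch: registers the global hot key; key-down drives both push-to-talk and toggle,
// and a second handler for kEventHotKeyReleased would cover the push-to-talk key-up.
final class HotkeyManager {
    static var onPressed: (() -> Void)?   // assigned by the app; placeholder wiring
    private var hotKeyRef: EventHotKeyRef?

    func register() {
        var pressedSpec = EventTypeSpec(eventClass: OSType(kEventClassKeyboard),
                                        eventKind: UInt32(kEventHotKeyPressed))
        // C callback: must not capture context, so it forwards to the static closure.
        _ = InstallEventHandler(GetApplicationEventTarget(), { (_, _, _) -> OSStatus in
            HotkeyManager.onPressed?()
            return noErr
        }, 1, &pressedSpec, nil, nil)

        let hotKeyID = EventHotKeyID(signature: OSType(0x4D57_4850), id: 1)  // arbitrary signature
        _ = RegisterEventHotKey(UInt32(kVK_ANSI_V),          // default key: V
                                UInt32(cmdKey | shiftKey),   // default modifiers: ⌘⇧
                                hotKeyID,
                                GetApplicationEventTarget(),
                                0,
                                &hotKeyRef)
    }
}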

6) Main Flow

  1. User presses global hotkey.
  2. Check permissions; guide if missing.
  3. Show HUD → Listening; start capture.
  4. Stop (key up/toggle/timeout).
  5. HUD → Processing; run STT in background.
  6. On result → (optional preview) → insert (paste) or fallback (type). If Secure Input, do not inject; keep in clipboard + show notice.
  7. Close HUD → Idle.

7) Finite State Machine (FSM)

  • Idle → (Hotkey) → Listening
  • Listening → (Stop/Timeout) → Processing
  • Processing → (Done) → Injecting
  • Injecting → (Done) → Idle
  • Any → (Error) → Error (modal) → Idle
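
A compact sketch of this FSM as a pure transition function; event names and the error payload are illustrative:

// Sketch of the states above; unexpected events leave the state unchanged.
enum DictationState { case idle, listening, processing, injecting, error(String) }
enum DictationEvent { case hotkey, stopped, timedOut, transcriptReady, injected, dismissed, failed(String) }

func transition(_ state: DictationState, on event: DictationEvent) -> DictationState {
    switch (state, event) {
    case (.idle, .hotkey):                                 return .listening
    case (.listening, .stopped), (.listening, .timedOut):  return .processing
    case (.processing, .transcriptReady):                  return .injecting
    case (.injecting, .injected):                          return .idle
    case (.error, .dismissed):                             return .idle     // error modal dismissed
    case (_, .failed(let message)):                        return .error(message)
    default:                                               return state
    }
}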

8) Model Management (Manual Downloads)

Goal: Offer a clear list of free Whisper-family models (names, sizes, languages, recommended backend) with one-click downloads. No automatic downloads.

8.1 OpenAI Whisper (official weights)

  • Families: tiny, base, small, medium, large-v2, large-v3 (multilingual; some .en variants).
  • Usable with whisper.cpp via GGUF (community conversions widely available).

8.2 Whisper for whisper.cpp (converted GGUF)

  • Community-maintained conversions for whisper.cpp (GGUF), optimized for CPU/GPU Metal on macOS.

8.3 Faster-Whisper (CTranslate2)

  • Optimized variants (tiny/base/small/medium/large-v2/large-v3). Useful if a CT2-based or Core-ML-assisted backend is added later.

8.4 Distil-Whisper (distilled)

  • Distilled models (e.g., distil-large-v2/v3/v3.5, distil-small.en), significantly smaller/faster with near-large accuracy.

UI must show: model file size, languages, license, RAM estimate, and a warning if a large model is selected on lower-memory machines.

Optional JSON format for catalog entries (example entry below, used by the app’s first-run picker):

{
  "name": "whisper-small",
  "family": "OpenAI-Whisper",
  "format": "gguf",
  "size_mb": 466,
  "languages": ["multilingual"],
  "recommended_backend": "whisper.cpp",
  "quality_tier": "small",
  "license": "MIT",
  "sha256": "…",
  "download_url": "…",
  "notes": "Good balance of speed/accuracy on M1/M2."
}
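
A sketch of a Codable mirror for these entries plus the SHA256 check (CryptoKit), assuming downloads land in a local file URL before the model is activated:

import Foundation
import CryptoKit

// Sketch: decodes catalog entries like the JSON above and verifies a downloaded file.
struct ModelCatalogEntry: Codable {
    let name: String
    let family: String
    let format: String
    let sizeMB: Int
    let languages: [String]
    let recommendedBackend: String
    let qualityTier: String
    let license: String
    let sha256: String
    let downloadURL: URL
    let notes: String?

    enum CodingKeys: String, CodingKey {
        case name, family, format, languages, license, sha256, notes
        case sizeMB = "size_mb"
        case recommendedBackend = "recommended_backend"
        case qualityTier = "quality_tier"
        case downloadURL = "download_url"
    }

    // Compares the catalog hash against the downloaded file before the model is selectable.
    func matchesChecksum(ofFileAt fileURL: URL) throws -> Bool {
        let data = try Data(contentsOf: fileURL, options: .mappedIfSafe)  // mapped to avoid loading GBs eagerly
        let digest = SHA256.hash(data: data)
        let hex = digest.map { String(format: "%02x", $0) }.joined()
        return hex == sha256.lowercased()
    }
}

URLSession's downloadTask plus its progress reporting would cover the download-with-progress requirement; the verified file is then moved into the local model store.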

9) Security & Permissions

  • Info.plist: NSMicrophoneUsageDescription.
  • Accessibility & Input Monitoring: required for CGEvent; provide clear step-by-step guidance and deep-links.
  • Secure Input: check IsSecureEventInputEnabled(); never attempt to bypass. Provide help text to identify apps that enable it (password fields, 2FA prompts, etc.).
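
A sketch of the checks behind these guided flows; the System Settings deep-link string is an assumption observed on recent macOS versions and may change:

import AppKit
import AVFoundation
import ApplicationServices

// Sketch of the three onboarding checks: microphone, Accessibility, Input Monitoring.
enum PermissionsManager {
    static func requestMicrophone(_ completion: @escaping (Bool) -> Void) {
        AVCaptureDevice.requestAccess(for: .audio, completionHandler: completion)
    }

    static func isAccessibilityTrusted(promptIfNeeded: Bool) -> Bool {
        let options = [kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: promptIfNeeded] as CFDictionary
        return AXIsProcessTrustedWithOptions(options)
    }

    static func openInputMonitoringPane() {
        // Deep link into System Settings; the scheme is undocumented, so fall back to plain instructions if it breaks.
        if let url = URL(string: "x-apple.systempreferences:com.apple.preference.security?Privacy_ListenEvent") {
            _ = NSWorkspace.shared.open(url)
        }
    }
}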

10) Performance

  • Lazy-load and reuse model (warm cache).
  • Real-time downsampling to 16 kHz mono; chunked streaming into backend.
  • Configurable threads; prefer Metal path on Apple Silicon.
  • “Fast path” tweaks for short clips (<15 s).
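
A sketch of the capture and downsampling path, assuming Float32 output buffers (swap to .pcmFormatInt16 for the 16-bit PCM variant named in the architecture section); names are illustrative:

import AVFoundation

// Sketch: tap the input node, convert each buffer to 16 kHz mono, and stream chunks onward.
final class AudioCapture {
    private let engine = AVAudioEngine()
    private let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                             sampleRate: 16_000,
                                             channels: 1,
                                             interleaved: false)!

    func start(onChunk: @escaping (AVAudioPCMBuffer) -> Void) throws {
        let input = engine.inputNode
        let inputFormat = input.outputFormat(forBus: 0)
        let target = targetFormat
        guard let converter = AVAudioConverter(from: inputFormat, to: target) else { return }

        input.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { buffer, _ in
            let ratio = target.sampleRate / inputFormat.sampleRate
            let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 1
            guard let out = AVAudioPCMBuffer(pcmFormat: target, frameCapacity: capacity) else { return }
            var consumed = false
            _ = converter.convert(to: out, error: nil) { _, status in
                if consumed {
                    status.pointee = .noDataNow
                    return nil
                }
                consumed = true
                status.pointee = .haveData
                return buffer
            }
            onChunk(out)   // 16 kHz mono chunk for the STT backend and the HUD RMS meter
        }
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}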

11) Logging & Privacy

  • No remote telemetry.
  • Local logs opt-in (timings, errors only). Never store audio/text unless user explicitly enables a debug flag.
  • “Wipe local data” button (models remain unless the user removes them).

12) Internationalization

  • UI in Spanish and English (Localizable.strings).
  • STT multilingual; language auto or forced per user preference.

13) Testing (Minimum)

  • macOS 13/14/15 on M1/M2/M3.
  • Injection works in Safari, Chrome, Notes, VS Code, Terminal, iTerm2, Mail.
  • Secure Input: correctly detected; no injection; clipboard + notice.
  • Meet latency target with small model on M1.
  • Model download & selection flows (simulate network errors).

14) Phased Plan (AI-Deliverables)

Phase 0 — Scaffolding (MVP-0)

Goal: Base project + menubar. Deliverables:

  • SwiftUI app with MenuBarExtra, microphone icon, “Idle” state.
  • ARCHITECTURE.md describing modules (Audio/STT/Injection/Models/Permissions/Settings).
  • Build scripts and signing/notarization templates.

DoD: Compiles; menu bar item visible; SPM structure ready.

Phase 1 — Hotkey + HUD + Audio (MVP-1)

Goal: Listening UX without real STT. Deliverables:

  • Global hotkey (default ⌘⇧V) with push and toggle.
  • NSPanel HUD (Listening/Processing) + real RMS bars from AVAudioEngine.
  • Per-dictation limit (default 10 min).

DoD: Live meter responds to mic; correct state transitions.

Phase 2 — STT via whisper.cpp (MVP-2)

Goal: Real offline transcription. Deliverables:

  • whisper.cpp module (C/C++), background inference with Metal.
  • Model Manager (curated list, download with SHA256, selection).
  • Language auto/forced; basic normalization.

DoD: 10-second clip → coherent ES/EN text offline; meets timing targets.

Phase 3 — Robust Insertion (MVP-3)

Goal: Reliable insertion into focused app. Deliverables:

  • Paste (clipboard + ⌘V) and typing fallback.
  • Secure Input detection; safe behavior (no injection, clipboard + notice).

DoD: Works across target apps; correct Secure Input handling.

Phase 4 — Preferences + UX Polish (MVP-4)

Goal: Complete options & stability. Deliverables:

  • Full Preferences (hotkey, modes, model, language, insertion, HUD, sounds).
  • Optional preview dialog (off by default).
  • Config export/import (JSON).

DoD: All settings persist and are honored.

Phase 5 — Distribution (MVP-5)

Goal: Installable package. Deliverables:

  • Error handling; permission prompts & help (incl. Secure Input troubleshooting).
  • .dmg (signed + notarized) and install guide.
  • USER_GUIDE.md + TROUBLESHOOTING.md.

DoD: Clean install on test machines; distribution checklist passed.

Phase 6 — Core ML Backend (Post-MVP)

Goal: Second backend. Deliverables:

  • Core ML integration (e.g., WhisperKit or custom conversion).
  • Backend selector (whisper.cpp/Core ML) in Preferences; local benchmarks table.

DoD: Feature parity and stability; documented pros/cons.

15) Mini-Prompts for the Builder AI (per Phase)

  • P0: “Create macOS 13+ SwiftUI menubar app (MenuBarExtra), microphone icon, SPM layout with modules in ARCHITECTURE.md.”
  • P1: “Add global hotkey (push & toggle) with RegisterEventHotKey; NSPanel HUD with RMS bars from AVAudioEngine; 10-minute dictation limit.”
  • P2: “Integrate whisper.cpp (Metal); add Model Manager (curated list, SHA256-verified downloads, selection); language auto/forced; transcribe WAV 16 kHz mono.”
  • P3: “Implement insertion: pasteboard+⌘V and CGEvent typing fallback; detect IsSecureEventInputEnabled() and avoid injection.”
  • P4: “Implement full Preferences, optional preview, JSON export/import; UX polish and messages.”
  • P5: “Signing + notarization; produce .dmg; write USER_GUIDE and TROUBLESHOOTING (with Secure Input section).”
  • P6: “Add Core ML backend (WhisperKit/custom), backend selector, and local benchmarks.”

16) Suggested Repo Layout

MenuWhisper/
  Sources/
    App/                 # SwiftUI + AppKit bridges
    Core/
      Audio/             # AVAudioEngine capture + meters
      STT/
        WhisperCPP/      # C/C++ wrapper + Metal path
        CoreML/          # post-MVP
      Models/            # catalog, downloads, hashes
      Injection/         # clipboard, CGEvent typing, secure input checks
      Permissions/
      Settings/
      Utils/
  Resources/             # icons, sounds, localizations
  Docs/                  # ARCHITECTURE.md, USER_GUIDE.md, TROUBLESHOOTING.md
  Scripts/               # build, sign, notarize
  Tests/                 # unit + integration

17) Risks & Mitigations

  • Hotkey collision (⌘⇧V) with “Paste and Match Style” in some apps → make it discoverable & easily rebindable; warn on conflict.
  • Secure Input blocks injection → inform the user, keep text on clipboard, provide help to identify the app enabling it.
  • RAM/latency with large models → recommend small/base by default; show RAM/latency hints in the model picker.
  • Keyboard layouts → prefer paste; if typing, map using the active layout.

18) Global MVP Definition of Done

  • A 30–90 s dictation yields accurate ES/EN text offline and inserts correctly in common apps.
  • Secure Input is correctly detected and handled.
  • Model download/selection is robust and user-driven.
  • Shippable .dmg (signed + notarized) and clear docs included.