Initial commit

2025-09-18 19:56:06 +02:00 · 2025-09-18 19:56:06 +02:00 · 1db16227b2
commit 1db16227b2
31 changed files with 2175 additions and 0 deletions
--- a/TECHSPEC.md
+++ b/TECHSPEC.md
@ -0,0 +1,335 @@
+# Technical Definition — “Menu-Whisper” (macOS, Swift, Offline STT)
+
+## 0) Owner Decisions (Locked)
+- **Platform:** Apple Silicon only (M1/M2/M3), macOS 13+.
+- **STT backends:** Start with **whisper.cpp (Metal)** for simplicity; add **Core ML** backend later.
+- **Models:** Do **not** auto-download. On first run, user **chooses & downloads** a model.
+- **VAD:** Post-MVP.
+- **Insertion behavior:** Configurable; **direct insertion** is default (no preview).
+- **Default hotkey:** **⌘⇧V** (user-configurable).
+- **Punctuation:** Let the model handle punctuation automatically (no spoken commands).
+- **Privacy/Connectivity:** 100% local at runtime; model downloads only when the user explicitly requests. **No telemetry**.
+- **Distribution:** **.app/.dmg** (signed + notarized), outside the Mac App Store initially.
+- **UI languages:** **ES/EN**.
+- **Low-power mode:** Still allow downloads if the user starts them.
+- **License:** **MIT**.
+- **Per-dictation limit:** **10 minutes** by default (configurable).
+
+---
+
+## 1) Goal
+A **menu bar** app for macOS that performs **offline speech-to-text** using Whisper-family models and **inserts the transcribed text** into whichever app currently has focus. Shows a minimal **HUD** while listening and processing. No internet required during normal operation.
+
+---
+
+## 2) MVP Scope
+- Persistent **menu bar** item (NSStatusItem / `MenuBarExtra`).
+- **Global hotkey** (push-to-talk and toggle modes).
+- **HUD** (centered NSPanel + SwiftUI):
+  - “Listening” with audio-level animation (RMS/peak).
+  - “Processing” with a spinner/animation.
+- **Offline STT** with **whisper.cpp** (GGUF models; Metal acceleration on Apple Silicon).
+- **Model Manager**: curated list, manual download with progress + SHA256 check, user selection.
+- **Text injection**:
+  - Preferred: **Clipboard + ⌘V** paste.
+  - Fallback: **simulated typing** via CGEvent.
+  - If **Secure Input** is active, **do not inject**; show notice and keep text on clipboard.
+- **Preferences**: hotkey & mode, model & language, insertion method, HUD styling, sounds, dictation limit.
+- **Permissions onboarding**: Microphone, Accessibility, Input Monitoring.
+
+---
+
+## 3) Functional Requirements
+
+### 3.1 Capture
+- Prompt for permissions on first use.
+- Global hotkey (default ⌘⇧V).
+- **Push-to-talk**: start on key down, stop on key up.
+- **Toggle**: press to start, press again to stop.
+- Per-dictation limit (default 10 min, range 10 s–30 min).
+
+### 3.2 HUD / UX
+- Non-activating, centered **NSPanel** (~320×160), no focus stealing.
+- **Listening**: bar-style audio visualization driven by live RMS/peak.
+- **Processing**: spinner + “Transcribing…” label.
+- **Esc** to cancel.
+- Optional start/stop sounds (user-toggleable).
+
+### 3.3 STT
+- Backend A (MVP): **whisper.cpp** with **GGUF** and **Metal**.
+- Language: auto-detect or forced (persisted).
+- Basic text normalization; punctuation from the model.
+- UTF-8 output; standard replacements (quotes, dashes, etc.).
+
+### 3.4 Injection
+- Preferred method: **NSPasteboard** + **CGEvent** to send ⌘V.
+- Fallback: **CGEventCreateKeyboardEvent** (character-by-character), respecting active keyboard layout.
+- **Secure Input**: detect with `IsSecureEventInputEnabled()`; if enabled, **do not inject**. Show a non-intrusive notice and leave the text on the clipboard.
+
+### 3.5 Preferences
+- **General:** hotkey + mode (push/toggle), sounds, HUD options.
+- **Models:** catalog, download, select active model, language, local storage path.
+- **Insertion:** direct vs preview (preview **off** by default), paste vs type.
+- **Advanced:** limits, performance knobs (threads/batch), **local** logs opt-in.
+
+---
+
+## 4) Non-Functional Requirements
+- **Offline** execution after models are installed.
+- **Latency target** (M1 + “small” model): < 4 s for 10 s of audio.
+- **Memory target:** ~1.5–2.5 GB with “small”.
+- **Privacy:** audio and text never leave the device.
+- **Accessibility:** sufficient contrast; VoiceOver labels; focus never stolen by HUD.
+
+---
+
+## 5) Architecture (High-Level)
+- **App (SwiftUI)** with AppKit bridges for NSStatusItem and NSPanel.
+- **Shortcut Manager** (Carbon `RegisterEventHotKey` or HotKey/MASShortcut).
+- **Audio**: AVAudioEngine (downsample to 16 kHz mono, 16-bit PCM).
+- **STT Engine**:
+  - **whisper.cpp** (C/C++ via SPM/CMake) with Metal.
+  - **Core ML backend** (e.g., WhisperKit / custom) in a later phase.
+- **Model Manager**: curated catalog, downloads (progress + SHA256), selection, caching.
+- **Text Injection**: pasteboard + CGEvent; typing fallback; Secure Input detection.
+- **Permissions Manager**: guided flows to System Settings panes.
+- **Settings**: UserDefaults + JSON export/import.
+- **Packaging**: .app + .dmg (signed & notarized).
+
+---
+
+## 6) Main Flow
+1. User presses global hotkey.
+2. Check permissions; guide if missing.
+3. Show HUD → **Listening**; start capture.
+4. Stop (key up/toggle/timeout).
+5. HUD → **Processing**; run STT in background.
+6. On result → (optional preview) → **insert** (paste) or **fallback** (type). If Secure Input, **do not inject**; keep in clipboard + show notice.
+7. Close HUD → **Idle**.
+
+---
+
+## 7) Finite State Machine (FSM)
+- **Idle** → (Hotkey) → **Listening**
+- **Listening** → (Stop/Timeout) → **Processing**
+- **Processing** → (Done) → **Injecting**
+- **Injecting** → (Done) → **Idle**
+- Any → (Error) → **ErrorModal** → **Idle**
+
+---
+
+## 8) Model Management (Manual Downloads)
+**Goal:** Offer a clear list of **free** Whisper-family models (names, sizes, languages, recommended backend) with one-click downloads. No automatic downloads.
+
+### 8.1 OpenAI Whisper (official weights)
+- Families: **tiny**, **base**, **small**, **medium**, **large-v2**, **large-v3** (multilingual; some `.en` variants).
+- Usable with **whisper.cpp** via **GGUF** (community conversions widely available).
+
+### 8.2 Whisper for whisper.cpp (converted GGUF)
+- Community-maintained conversions for whisper.cpp (GGUF), optimized for CPU/GPU Metal on macOS.
+
+### 8.3 Faster-Whisper (CTranslate2)
+- Optimized variants (tiny/base/small/medium/large-v2/large-v3). Useful if a CT2-based or Core-ML-assisted backend is added later.
+
+### 8.4 Distil-Whisper (distilled)
+- Distilled models (e.g., **distil-large-v2/v3/v3.5**, **distil-small.en**), significantly smaller/faster with near-large accuracy.
+
+> **UI must show:** model file size, languages, license, **RAM estimate**, and a warning if a large model is selected on lower-memory machines.
+
+**Optional JSON Schema for catalog entries (for the app’s first-run picker):**
+
+```json
+{
+  "name": "whisper-small",
+  "family": "OpenAI-Whisper",
+  "format": "gguf",
+  "size_mb": 466,
+  "languages": ["multilingual"],
+  "recommended_backend": "whisper.cpp",
+  "quality_tier": "small",
+  "license": "MIT",
+  "sha256": "…",
+  "download_url": "…",
+  "notes": "Good balance of speed/accuracy on M1/M2."
+}
+```
+
+---
+
+## 9) Security & Permissions
+
+* **Info.plist:** `NSMicrophoneUsageDescription`.
+* **Accessibility & Input Monitoring:** required for CGEvent; provide clear step-by-step guidance and deep-links.
+* **Secure Input:** check `IsSecureEventInputEnabled()`; **never** attempt to bypass. Provide help text to identify apps that enable it (password fields, 2FA prompts, etc.).
+
+---
+
+## 10) Performance
+
+* Lazy-load and reuse model (warm cache).
+* Real-time downsampling to 16 kHz mono; chunked streaming into backend.
+* Configurable threads; prefer **Metal** path on Apple Silicon.
+* “Fast path” tweaks for short clips (<15 s).
+
+---
+
+## 11) Logging & Privacy
+
+* **No remote telemetry.**
+* Local logs **opt-in** (timings, errors only). Never store audio/text unless user explicitly enables a debug flag.
+* “Wipe local data” button (models remain unless the user removes them).
+
+---
+
+## 12) Internationalization
+
+* UI in **Spanish** and **English** (Localizable.strings).
+* STT multilingual; language auto or forced per user preference.
+
+---
+
+## 13) Testing (Minimum)
+
+* macOS 13/14/15 on M1/M2/M3.
+* Injection works in Safari, Chrome, Notes, VS Code, Terminal, iTerm2, Mail.
+* **Secure Input**: correctly detected; no injection; clipboard + notice.
+* Meet latency target with **small** model on M1.
+* Model download & selection flows (simulate network errors).
+
+---
+
+## 14) Phased Plan (AI-Deliverables)
+
+### Phase 0 — Scaffolding (MVP-0)
+
+**Goal:** Base project + menubar.
+**Deliverables:**
+
+* SwiftUI app with `MenuBarExtra`, microphone icon, “Idle” state.
+* `ARCHITECTURE.md` describing modules (Audio/STT/Injection/Models/Permissions/Settings).
+* Build scripts and signing/notarization templates.
+  **DoD:** Compiles; menu bar item visible; SPM structure ready.
+
+---
+
+### Phase 1 — Hotkey + HUD + Audio (MVP-1)
+
+**Goal:** Listening UX without real STT.
+**Deliverables:**
+
+* Global hotkey (default ⌘⇧V) with **push** and **toggle**.
+* NSPanel HUD (Listening/Processing) + **real** RMS bars from AVAudioEngine.
+* Per-dictation limit (default 10 min).
+  **DoD:** Live meter responds to mic; correct state transitions.
+
+---
+
+### Phase 2 — STT via whisper.cpp (MVP-2)
+
+**Goal:** Real offline transcription.
+**Deliverables:**
+
+* **whisper.cpp** module (C/C++), background inference with **Metal**.
+* **Model Manager** (curated list, download with SHA256, selection).
+* Language auto/forced; basic normalization.
+  **DoD:** 10-second clip → coherent ES/EN text offline; meets timing targets.
+
+---
+
+### Phase 3 — Robust Insertion (MVP-3)
+
+**Goal:** Reliable insertion into focused app.
+**Deliverables:**
+
+* Paste (clipboard + ⌘V) and typing fallback.
+* **Secure Input** detection; safe behavior (no injection, clipboard + notice).
+  **DoD:** Works across target apps; correct Secure Input handling.
+
+---
+
+### Phase 4 — Preferences + UX Polish (MVP-4)
+
+**Goal:** Complete options & stability.
+**Deliverables:**
+
+* Full Preferences (hotkey, modes, model, language, insertion, HUD, sounds).
+* Optional preview dialog (off by default).
+* Config export/import (JSON).
+  **DoD:** All settings persist and are honored.
+
+---
+
+### Phase 5 — Distribution (MVP-5)
+
+**Goal:** Installable package.
+**Deliverables:**
+
+* Error handling; permission prompts & help (incl. Secure Input troubleshooting).
+* **.dmg** (signed + notarized) and install guide.
+* **USER\_GUIDE.md** + **TROUBLESHOOTING.md**.
+  **DoD:** Clean install on test machines; distribution checklist passed.
+
+---
+
+### Phase 6 — Core ML Backend (Post-MVP)
+
+**Goal:** Second backend.
+**Deliverables:**
+
+* **Core ML** integration (e.g., WhisperKit or custom conversion).
+* Backend selector (whisper.cpp/Core ML) in Preferences; local benchmarks table.
+  **DoD:** Feature parity and stability; documented pros/cons.
+
+---
+
+## 15) Mini-Prompts for the Builder AI (per Phase)
+
+* **P0:** “Create macOS 13+ SwiftUI menubar app (`MenuBarExtra`), microphone icon, SPM layout with modules in `ARCHITECTURE.md`.”
+* **P1:** “Add global hotkey (push & toggle) with `RegisterEventHotKey`; NSPanel HUD with RMS bars from AVAudioEngine; 10-minute dictation limit.”
+* **P2:** “Integrate **whisper.cpp** (Metal); add Model Manager (curated list, SHA256-verified downloads, selection); language auto/forced; transcribe WAV 16 kHz mono.”
+* **P3:** “Implement insertion: pasteboard+⌘V and CGEvent typing fallback; detect `IsSecureEventInputEnabled()` and avoid injection.”
+* **P4:** “Implement full Preferences, optional preview, JSON export/import; UX polish and messages.”
+* **P5:** “Signing + notarization; produce .dmg; write USER\_GUIDE and TROUBLESHOOTING (with Secure Input section).”
+* **P6:** “Add Core ML backend (WhisperKit/custom), backend selector, and local benchmarks.”
+
+---
+
+## 16) Suggested Repo Layout
+
+```
+MenuWhisper/
+  Sources/
+    App/                 # SwiftUI + AppKit bridges
+    Core/
+      Audio/             # AVAudioEngine capture + meters
+      STT/
+        WhisperCPP/      # C/C++ wrapper + Metal path
+        CoreML/          # post-MVP
+      Models/            # catalog, downloads, hashes
+      Injection/         # clipboard, CGEvent typing, secure input checks
+      Permissions/
+      Settings/
+      Utils/
+  Resources/             # icons, sounds, localizations
+  Docs/                  # ARCHITECTURE.md, USER_GUIDE.md, TROUBLESHOOTING.md
+  Scripts/               # build, sign, notarize
+  Tests/                 # unit + integration
+```
+
+---
+
+## 17) Risks & Mitigations
+
+* **Hotkey collision (⌘⇧V)** with “Paste and Match Style” in some apps → make it discoverable & easily rebindable; warn on conflict.
+* **Secure Input** blocks injection → inform the user, keep text on clipboard, provide help to identify the app enabling it.
+* **RAM/latency** with large models → recommend **small/base** by default; show RAM/latency hints in the model picker.
+* **Keyboard layouts** → prefer paste; if typing, map using the active layout.
+
+---
+
+## 18) Global MVP Definition of Done
+
+* A 30–90 s dictation yields accurate ES/EN text **offline** and inserts correctly in common apps.
+* Secure Input is correctly detected and handled.
+* Model download/selection is robust and user-driven.
+* Shippable **.dmg** (signed + notarized) and clear docs included.