Initial commit

This commit is contained in:
Felipe M 2025-09-18 19:56:06 +02:00
commit 1db16227b2
Signed by: fmartingr
GPG key ID: CCFBC5637D4000A8
31 changed files with 2175 additions and 0 deletions

335
TECHSPEC.md Normal file
View file

@ -0,0 +1,335 @@
# Technical Definition — “Menu-Whisper” (macOS, Swift, Offline STT)
## 0) Owner Decisions (Locked)
- **Platform:** Apple Silicon only (M1/M2/M3), macOS 13+.
- **STT backends:** Start with **whisper.cpp (Metal)** for simplicity; add **Core ML** backend later.
- **Models:** Do **not** auto-download. On first run, user **chooses & downloads** a model.
- **VAD:** Post-MVP.
- **Insertion behavior:** Configurable; **direct insertion** is default (no preview).
- **Default hotkey:** **⌘⇧V** (user-configurable).
- **Punctuation:** Let the model handle punctuation automatically (no spoken commands).
- **Privacy/Connectivity:** 100% local at runtime; model downloads only when the user explicitly requests. **No telemetry**.
- **Distribution:** **.app/.dmg** (signed + notarized), outside the Mac App Store initially.
- **UI languages:** **ES/EN**.
- **Low-power mode:** Still allow downloads if the user starts them.
- **License:** **MIT**.
- **Per-dictation limit:** **10 minutes** by default (configurable).
---
## 1) Goal
A **menu bar** app for macOS that performs **offline speech-to-text** using Whisper-family models and **inserts the transcribed text** into whichever app currently has focus. Shows a minimal **HUD** while listening and processing. No internet required during normal operation.
---
## 2) MVP Scope
- Persistent **menu bar** item (NSStatusItem / `MenuBarExtra`).
- **Global hotkey** (push-to-talk and toggle modes).
- **HUD** (centered NSPanel + SwiftUI):
- “Listening” with audio-level animation (RMS/peak).
- “Processing” with a spinner/animation.
- **Offline STT** with **whisper.cpp** (GGUF models; Metal acceleration on Apple Silicon).
- **Model Manager**: curated list, manual download with progress + SHA256 check, user selection.
- **Text injection**:
- Preferred: **Clipboard + ⌘V** paste.
- Fallback: **simulated typing** via CGEvent.
- If **Secure Input** is active, **do not inject**; show notice and keep text on clipboard.
- **Preferences**: hotkey & mode, model & language, insertion method, HUD styling, sounds, dictation limit.
- **Permissions onboarding**: Microphone, Accessibility, Input Monitoring.
---
## 3) Functional Requirements
### 3.1 Capture
- Prompt for permissions on first use.
- Global hotkey (default ⌘⇧V).
- **Push-to-talk**: start on key down, stop on key up.
- **Toggle**: press to start, press again to stop.
- Per-dictation limit (default 10 min, range 10 s30 min).
### 3.2 HUD / UX
- Non-activating, centered **NSPanel** (~320×160), no focus stealing.
- **Listening**: bar-style audio visualization driven by live RMS/peak.
- **Processing**: spinner + “Transcribing…” label.
- **Esc** to cancel.
- Optional start/stop sounds (user-toggleable).
### 3.3 STT
- Backend A (MVP): **whisper.cpp** with **GGUF** and **Metal**.
- Language: auto-detect or forced (persisted).
- Basic text normalization; punctuation from the model.
- UTF-8 output; standard replacements (quotes, dashes, etc.).
### 3.4 Injection
- Preferred method: **NSPasteboard** + **CGEvent** to send ⌘V.
- Fallback: **CGEventCreateKeyboardEvent** (character-by-character), respecting active keyboard layout.
- **Secure Input**: detect with `IsSecureEventInputEnabled()`; if enabled, **do not inject**. Show a non-intrusive notice and leave the text on the clipboard.
### 3.5 Preferences
- **General:** hotkey + mode (push/toggle), sounds, HUD options.
- **Models:** catalog, download, select active model, language, local storage path.
- **Insertion:** direct vs preview (preview **off** by default), paste vs type.
- **Advanced:** limits, performance knobs (threads/batch), **local** logs opt-in.
---
## 4) Non-Functional Requirements
- **Offline** execution after models are installed.
- **Latency target** (M1 + “small” model): < 4 s for 10 s of audio.
- **Memory target:** ~1.52.5 GB with “small”.
- **Privacy:** audio and text never leave the device.
- **Accessibility:** sufficient contrast; VoiceOver labels; focus never stolen by HUD.
---
## 5) Architecture (High-Level)
- **App (SwiftUI)** with AppKit bridges for NSStatusItem and NSPanel.
- **Shortcut Manager** (Carbon `RegisterEventHotKey` or HotKey/MASShortcut).
- **Audio**: AVAudioEngine (downsample to 16 kHz mono, 16-bit PCM).
- **STT Engine**:
- **whisper.cpp** (C/C++ via SPM/CMake) with Metal.
- **Core ML backend** (e.g., WhisperKit / custom) in a later phase.
- **Model Manager**: curated catalog, downloads (progress + SHA256), selection, caching.
- **Text Injection**: pasteboard + CGEvent; typing fallback; Secure Input detection.
- **Permissions Manager**: guided flows to System Settings panes.
- **Settings**: UserDefaults + JSON export/import.
- **Packaging**: .app + .dmg (signed & notarized).
---
## 6) Main Flow
1. User presses global hotkey.
2. Check permissions; guide if missing.
3. Show HUD → **Listening**; start capture.
4. Stop (key up/toggle/timeout).
5. HUD → **Processing**; run STT in background.
6. On result → (optional preview) → **insert** (paste) or **fallback** (type). If Secure Input, **do not inject**; keep in clipboard + show notice.
7. Close HUD → **Idle**.
---
## 7) Finite State Machine (FSM)
- **Idle** → (Hotkey) → **Listening**
- **Listening** → (Stop/Timeout) → **Processing**
- **Processing** → (Done) → **Injecting**
- **Injecting** → (Done) → **Idle**
- Any → (Error) → **ErrorModal** → **Idle**
---
## 8) Model Management (Manual Downloads)
**Goal:** Offer a clear list of **free** Whisper-family models (names, sizes, languages, recommended backend) with one-click downloads. No automatic downloads.
### 8.1 OpenAI Whisper (official weights)
- Families: **tiny**, **base**, **small**, **medium**, **large-v2**, **large-v3** (multilingual; some `.en` variants).
- Usable with **whisper.cpp** via **GGUF** (community conversions widely available).
### 8.2 Whisper for whisper.cpp (converted GGUF)
- Community-maintained conversions for whisper.cpp (GGUF), optimized for CPU/GPU Metal on macOS.
### 8.3 Faster-Whisper (CTranslate2)
- Optimized variants (tiny/base/small/medium/large-v2/large-v3). Useful if a CT2-based or Core-ML-assisted backend is added later.
### 8.4 Distil-Whisper (distilled)
- Distilled models (e.g., **distil-large-v2/v3/v3.5**, **distil-small.en**), significantly smaller/faster with near-large accuracy.
> **UI must show:** model file size, languages, license, **RAM estimate**, and a warning if a large model is selected on lower-memory machines.
**Optional JSON Schema for catalog entries (for the apps first-run picker):**
```json
{
"name": "whisper-small",
"family": "OpenAI-Whisper",
"format": "gguf",
"size_mb": 466,
"languages": ["multilingual"],
"recommended_backend": "whisper.cpp",
"quality_tier": "small",
"license": "MIT",
"sha256": "…",
"download_url": "…",
"notes": "Good balance of speed/accuracy on M1/M2."
}
```
---
## 9) Security & Permissions
* **Info.plist:** `NSMicrophoneUsageDescription`.
* **Accessibility & Input Monitoring:** required for CGEvent; provide clear step-by-step guidance and deep-links.
* **Secure Input:** check `IsSecureEventInputEnabled()`; **never** attempt to bypass. Provide help text to identify apps that enable it (password fields, 2FA prompts, etc.).
---
## 10) Performance
* Lazy-load and reuse model (warm cache).
* Real-time downsampling to 16 kHz mono; chunked streaming into backend.
* Configurable threads; prefer **Metal** path on Apple Silicon.
* “Fast path” tweaks for short clips (<15 s).
---
## 11) Logging & Privacy
* **No remote telemetry.**
* Local logs **opt-in** (timings, errors only). Never store audio/text unless user explicitly enables a debug flag.
* “Wipe local data” button (models remain unless the user removes them).
---
## 12) Internationalization
* UI in **Spanish** and **English** (Localizable.strings).
* STT multilingual; language auto or forced per user preference.
---
## 13) Testing (Minimum)
* macOS 13/14/15 on M1/M2/M3.
* Injection works in Safari, Chrome, Notes, VS Code, Terminal, iTerm2, Mail.
* **Secure Input**: correctly detected; no injection; clipboard + notice.
* Meet latency target with **small** model on M1.
* Model download & selection flows (simulate network errors).
---
## 14) Phased Plan (AI-Deliverables)
### Phase 0 — Scaffolding (MVP-0)
**Goal:** Base project + menubar.
**Deliverables:**
* SwiftUI app with `MenuBarExtra`, microphone icon, “Idle” state.
* `ARCHITECTURE.md` describing modules (Audio/STT/Injection/Models/Permissions/Settings).
* Build scripts and signing/notarization templates.
**DoD:** Compiles; menu bar item visible; SPM structure ready.
---
### Phase 1 — Hotkey + HUD + Audio (MVP-1)
**Goal:** Listening UX without real STT.
**Deliverables:**
* Global hotkey (default ⌘⇧V) with **push** and **toggle**.
* NSPanel HUD (Listening/Processing) + **real** RMS bars from AVAudioEngine.
* Per-dictation limit (default 10 min).
**DoD:** Live meter responds to mic; correct state transitions.
---
### Phase 2 — STT via whisper.cpp (MVP-2)
**Goal:** Real offline transcription.
**Deliverables:**
* **whisper.cpp** module (C/C++), background inference with **Metal**.
* **Model Manager** (curated list, download with SHA256, selection).
* Language auto/forced; basic normalization.
**DoD:** 10-second clip → coherent ES/EN text offline; meets timing targets.
---
### Phase 3 — Robust Insertion (MVP-3)
**Goal:** Reliable insertion into focused app.
**Deliverables:**
* Paste (clipboard + ⌘V) and typing fallback.
* **Secure Input** detection; safe behavior (no injection, clipboard + notice).
**DoD:** Works across target apps; correct Secure Input handling.
---
### Phase 4 — Preferences + UX Polish (MVP-4)
**Goal:** Complete options & stability.
**Deliverables:**
* Full Preferences (hotkey, modes, model, language, insertion, HUD, sounds).
* Optional preview dialog (off by default).
* Config export/import (JSON).
**DoD:** All settings persist and are honored.
---
### Phase 5 — Distribution (MVP-5)
**Goal:** Installable package.
**Deliverables:**
* Error handling; permission prompts & help (incl. Secure Input troubleshooting).
* **.dmg** (signed + notarized) and install guide.
* **USER\_GUIDE.md** + **TROUBLESHOOTING.md**.
**DoD:** Clean install on test machines; distribution checklist passed.
---
### Phase 6 — Core ML Backend (Post-MVP)
**Goal:** Second backend.
**Deliverables:**
* **Core ML** integration (e.g., WhisperKit or custom conversion).
* Backend selector (whisper.cpp/Core ML) in Preferences; local benchmarks table.
**DoD:** Feature parity and stability; documented pros/cons.
---
## 15) Mini-Prompts for the Builder AI (per Phase)
* **P0:** “Create macOS 13+ SwiftUI menubar app (`MenuBarExtra`), microphone icon, SPM layout with modules in `ARCHITECTURE.md`.”
* **P1:** “Add global hotkey (push & toggle) with `RegisterEventHotKey`; NSPanel HUD with RMS bars from AVAudioEngine; 10-minute dictation limit.”
* **P2:** “Integrate **whisper.cpp** (Metal); add Model Manager (curated list, SHA256-verified downloads, selection); language auto/forced; transcribe WAV 16 kHz mono.”
* **P3:** “Implement insertion: pasteboard+⌘V and CGEvent typing fallback; detect `IsSecureEventInputEnabled()` and avoid injection.”
* **P4:** “Implement full Preferences, optional preview, JSON export/import; UX polish and messages.”
* **P5:** “Signing + notarization; produce .dmg; write USER\_GUIDE and TROUBLESHOOTING (with Secure Input section).”
* **P6:** “Add Core ML backend (WhisperKit/custom), backend selector, and local benchmarks.”
---
## 16) Suggested Repo Layout
```
MenuWhisper/
Sources/
App/ # SwiftUI + AppKit bridges
Core/
Audio/ # AVAudioEngine capture + meters
STT/
WhisperCPP/ # C/C++ wrapper + Metal path
CoreML/ # post-MVP
Models/ # catalog, downloads, hashes
Injection/ # clipboard, CGEvent typing, secure input checks
Permissions/
Settings/
Utils/
Resources/ # icons, sounds, localizations
Docs/ # ARCHITECTURE.md, USER_GUIDE.md, TROUBLESHOOTING.md
Scripts/ # build, sign, notarize
Tests/ # unit + integration
```
---
## 17) Risks & Mitigations
* **Hotkey collision (⌘⇧V)** with “Paste and Match Style” in some apps → make it discoverable & easily rebindable; warn on conflict.
* **Secure Input** blocks injection → inform the user, keep text on clipboard, provide help to identify the app enabling it.
* **RAM/latency** with large models → recommend **small/base** by default; show RAM/latency hints in the model picker.
* **Keyboard layouts** → prefer paste; if typing, map using the active layout.
---
## 18) Global MVP Definition of Done
* A 3090 s dictation yields accurate ES/EN text **offline** and inserts correctly in common apps.
* Secure Input is correctly detected and handled.
* Model download/selection is robust and user-driven.
* Shippable **.dmg** (signed + notarized) and clear docs included.