tellme/TECHSPEC.md
2025-09-18 19:56:06 +02:00

335 lines
13 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Technical Definition — “Menu-Whisper” (macOS, Swift, Offline STT)
## 0) Owner Decisions (Locked)
- **Platform:** Apple Silicon only (M1/M2/M3), macOS 13+.
- **STT backends:** Start with **whisper.cpp (Metal)** for simplicity; add **Core ML** backend later.
- **Models:** Do **not** auto-download. On first run, user **chooses & downloads** a model.
- **VAD:** Post-MVP.
- **Insertion behavior:** Configurable; **direct insertion** is default (no preview).
- **Default hotkey:** **⌘⇧V** (user-configurable).
- **Punctuation:** Let the model handle punctuation automatically (no spoken commands).
- **Privacy/Connectivity:** 100% local at runtime; model downloads only when the user explicitly requests. **No telemetry**.
- **Distribution:** **.app/.dmg** (signed + notarized), outside the Mac App Store initially.
- **UI languages:** **ES/EN**.
- **Low-power mode:** Still allow downloads if the user starts them.
- **License:** **MIT**.
- **Per-dictation limit:** **10 minutes** by default (configurable).
---
## 1) Goal
A **menu bar** app for macOS that performs **offline speech-to-text** using Whisper-family models and **inserts the transcribed text** into whichever app currently has focus. Shows a minimal **HUD** while listening and processing. No internet required during normal operation.
---
## 2) MVP Scope
- Persistent **menu bar** item (NSStatusItem / `MenuBarExtra`).
- **Global hotkey** (push-to-talk and toggle modes).
- **HUD** (centered NSPanel + SwiftUI):
- “Listening” with audio-level animation (RMS/peak).
- “Processing” with a spinner/animation.
- **Offline STT** with **whisper.cpp** (GGUF models; Metal acceleration on Apple Silicon).
- **Model Manager**: curated list, manual download with progress + SHA256 check, user selection.
- **Text injection**:
- Preferred: **Clipboard + ⌘V** paste.
- Fallback: **simulated typing** via CGEvent.
- If **Secure Input** is active, **do not inject**; show notice and keep text on clipboard.
- **Preferences**: hotkey & mode, model & language, insertion method, HUD styling, sounds, dictation limit.
- **Permissions onboarding**: Microphone, Accessibility, Input Monitoring.
---
## 3) Functional Requirements
### 3.1 Capture
- Prompt for permissions on first use.
- Global hotkey (default ⌘⇧V).
- **Push-to-talk**: start on key down, stop on key up.
- **Toggle**: press to start, press again to stop.
- Per-dictation limit (default 10 min, range 10 s30 min).
### 3.2 HUD / UX
- Non-activating, centered **NSPanel** (~320×160), no focus stealing.
- **Listening**: bar-style audio visualization driven by live RMS/peak.
- **Processing**: spinner + “Transcribing…” label.
- **Esc** to cancel.
- Optional start/stop sounds (user-toggleable).
### 3.3 STT
- Backend A (MVP): **whisper.cpp** with **GGUF** and **Metal**.
- Language: auto-detect or forced (persisted).
- Basic text normalization; punctuation from the model.
- UTF-8 output; standard replacements (quotes, dashes, etc.).
### 3.4 Injection
- Preferred method: **NSPasteboard** + **CGEvent** to send ⌘V.
- Fallback: **CGEventCreateKeyboardEvent** (character-by-character), respecting active keyboard layout.
- **Secure Input**: detect with `IsSecureEventInputEnabled()`; if enabled, **do not inject**. Show a non-intrusive notice and leave the text on the clipboard.
### 3.5 Preferences
- **General:** hotkey + mode (push/toggle), sounds, HUD options.
- **Models:** catalog, download, select active model, language, local storage path.
- **Insertion:** direct vs preview (preview **off** by default), paste vs type.
- **Advanced:** limits, performance knobs (threads/batch), **local** logs opt-in.
---
## 4) Non-Functional Requirements
- **Offline** execution after models are installed.
- **Latency target** (M1 + “small” model): < 4 s for 10 s of audio.
- **Memory target:** ~1.52.5 GB with small”.
- **Privacy:** audio and text never leave the device.
- **Accessibility:** sufficient contrast; VoiceOver labels; focus never stolen by HUD.
---
## 5) Architecture (High-Level)
- **App (SwiftUI)** with AppKit bridges for NSStatusItem and NSPanel.
- **Shortcut Manager** (Carbon `RegisterEventHotKey` or HotKey/MASShortcut).
- **Audio**: AVAudioEngine (downsample to 16 kHz mono, 16-bit PCM).
- **STT Engine**:
- **whisper.cpp** (C/C++ via SPM/CMake) with Metal.
- **Core ML backend** (e.g., WhisperKit / custom) in a later phase.
- **Model Manager**: curated catalog, downloads (progress + SHA256), selection, caching.
- **Text Injection**: pasteboard + CGEvent; typing fallback; Secure Input detection.
- **Permissions Manager**: guided flows to System Settings panes.
- **Settings**: UserDefaults + JSON export/import.
- **Packaging**: .app + .dmg (signed & notarized).
---
## 6) Main Flow
1. User presses global hotkey.
2. Check permissions; guide if missing.
3. Show HUD **Listening**; start capture.
4. Stop (key up/toggle/timeout).
5. HUD **Processing**; run STT in background.
6. On result (optional preview) **insert** (paste) or **fallback** (type). If Secure Input, **do not inject**; keep in clipboard + show notice.
7. Close HUD **Idle**.
---
## 7) Finite State Machine (FSM)
- **Idle** (Hotkey) **Listening**
- **Listening** (Stop/Timeout) **Processing**
- **Processing** (Done) **Injecting**
- **Injecting** (Done) **Idle**
- Any (Error) **ErrorModal** **Idle**
---
## 8) Model Management (Manual Downloads)
**Goal:** Offer a clear list of **free** Whisper-family models (names, sizes, languages, recommended backend) with one-click downloads. No automatic downloads.
### 8.1 OpenAI Whisper (official weights)
- Families: **tiny**, **base**, **small**, **medium**, **large-v2**, **large-v3** (multilingual; some `.en` variants).
- Usable with **whisper.cpp** via **GGUF** (community conversions widely available).
### 8.2 Whisper for whisper.cpp (converted GGUF)
- Community-maintained conversions for whisper.cpp (GGUF), optimized for CPU/GPU Metal on macOS.
### 8.3 Faster-Whisper (CTranslate2)
- Optimized variants (tiny/base/small/medium/large-v2/large-v3). Useful if a CT2-based or Core-ML-assisted backend is added later.
### 8.4 Distil-Whisper (distilled)
- Distilled models (e.g., **distil-large-v2/v3/v3.5**, **distil-small.en**), significantly smaller/faster with near-large accuracy.
> **UI must show:** model file size, languages, license, **RAM estimate**, and a warning if a large model is selected on lower-memory machines.
**Optional JSON Schema for catalog entries (for the apps first-run picker):**
```json
{
"name": "whisper-small",
"family": "OpenAI-Whisper",
"format": "gguf",
"size_mb": 466,
"languages": ["multilingual"],
"recommended_backend": "whisper.cpp",
"quality_tier": "small",
"license": "MIT",
"sha256": "…",
"download_url": "…",
"notes": "Good balance of speed/accuracy on M1/M2."
}
```
---
## 9) Security & Permissions
* **Info.plist:** `NSMicrophoneUsageDescription`.
* **Accessibility & Input Monitoring:** required for CGEvent; provide clear step-by-step guidance and deep-links.
* **Secure Input:** check `IsSecureEventInputEnabled()`; **never** attempt to bypass. Provide help text to identify apps that enable it (password fields, 2FA prompts, etc.).
---
## 10) Performance
* Lazy-load and reuse model (warm cache).
* Real-time downsampling to 16 kHz mono; chunked streaming into backend.
* Configurable threads; prefer **Metal** path on Apple Silicon.
* Fast path tweaks for short clips (<15 s).
---
## 11) Logging & Privacy
* **No remote telemetry.**
* Local logs **opt-in** (timings, errors only). Never store audio/text unless user explicitly enables a debug flag.
* Wipe local data button (models remain unless the user removes them).
---
## 12) Internationalization
* UI in **Spanish** and **English** (Localizable.strings).
* STT multilingual; language auto or forced per user preference.
---
## 13) Testing (Minimum)
* macOS 13/14/15 on M1/M2/M3.
* Injection works in Safari, Chrome, Notes, VS Code, Terminal, iTerm2, Mail.
* **Secure Input**: correctly detected; no injection; clipboard + notice.
* Meet latency target with **small** model on M1.
* Model download & selection flows (simulate network errors).
---
## 14) Phased Plan (AI-Deliverables)
### Phase 0 — Scaffolding (MVP-0)
**Goal:** Base project + menubar.
**Deliverables:**
* SwiftUI app with `MenuBarExtra`, microphone icon, Idle state.
* `ARCHITECTURE.md` describing modules (Audio/STT/Injection/Models/Permissions/Settings).
* Build scripts and signing/notarization templates.
**DoD:** Compiles; menu bar item visible; SPM structure ready.
---
### Phase 1 — Hotkey + HUD + Audio (MVP-1)
**Goal:** Listening UX without real STT.
**Deliverables:**
* Global hotkey (default ⌘⇧V) with **push** and **toggle**.
* NSPanel HUD (Listening/Processing) + **real** RMS bars from AVAudioEngine.
* Per-dictation limit (default 10 min).
**DoD:** Live meter responds to mic; correct state transitions.
---
### Phase 2 — STT via whisper.cpp (MVP-2)
**Goal:** Real offline transcription.
**Deliverables:**
* **whisper.cpp** module (C/C++), background inference with **Metal**.
* **Model Manager** (curated list, download with SHA256, selection).
* Language auto/forced; basic normalization.
**DoD:** 10-second clip coherent ES/EN text offline; meets timing targets.
---
### Phase 3 — Robust Insertion (MVP-3)
**Goal:** Reliable insertion into focused app.
**Deliverables:**
* Paste (clipboard + V) and typing fallback.
* **Secure Input** detection; safe behavior (no injection, clipboard + notice).
**DoD:** Works across target apps; correct Secure Input handling.
---
### Phase 4 — Preferences + UX Polish (MVP-4)
**Goal:** Complete options & stability.
**Deliverables:**
* Full Preferences (hotkey, modes, model, language, insertion, HUD, sounds).
* Optional preview dialog (off by default).
* Config export/import (JSON).
**DoD:** All settings persist and are honored.
---
### Phase 5 — Distribution (MVP-5)
**Goal:** Installable package.
**Deliverables:**
* Error handling; permission prompts & help (incl. Secure Input troubleshooting).
* **.dmg** (signed + notarized) and install guide.
* **USER\_GUIDE.md** + **TROUBLESHOOTING.md**.
**DoD:** Clean install on test machines; distribution checklist passed.
---
### Phase 6 — Core ML Backend (Post-MVP)
**Goal:** Second backend.
**Deliverables:**
* **Core ML** integration (e.g., WhisperKit or custom conversion).
* Backend selector (whisper.cpp/Core ML) in Preferences; local benchmarks table.
**DoD:** Feature parity and stability; documented pros/cons.
---
## 15) Mini-Prompts for the Builder AI (per Phase)
* **P0:** Create macOS 13+ SwiftUI menubar app (`MenuBarExtra`), microphone icon, SPM layout with modules in `ARCHITECTURE.md`.”
* **P1:** Add global hotkey (push & toggle) with `RegisterEventHotKey`; NSPanel HUD with RMS bars from AVAudioEngine; 10-minute dictation limit.”
* **P2:** Integrate **whisper.cpp** (Metal); add Model Manager (curated list, SHA256-verified downloads, selection); language auto/forced; transcribe WAV 16 kHz mono.”
* **P3:** Implement insertion: pasteboard+⌘V and CGEvent typing fallback; detect `IsSecureEventInputEnabled()` and avoid injection.”
* **P4:** Implement full Preferences, optional preview, JSON export/import; UX polish and messages.”
* **P5:** Signing + notarization; produce .dmg; write USER\_GUIDE and TROUBLESHOOTING (with Secure Input section).”
* **P6:** Add Core ML backend (WhisperKit/custom), backend selector, and local benchmarks.”
---
## 16) Suggested Repo Layout
```
MenuWhisper/
Sources/
App/ # SwiftUI + AppKit bridges
Core/
Audio/ # AVAudioEngine capture + meters
STT/
WhisperCPP/ # C/C++ wrapper + Metal path
CoreML/ # post-MVP
Models/ # catalog, downloads, hashes
Injection/ # clipboard, CGEvent typing, secure input checks
Permissions/
Settings/
Utils/
Resources/ # icons, sounds, localizations
Docs/ # ARCHITECTURE.md, USER_GUIDE.md, TROUBLESHOOTING.md
Scripts/ # build, sign, notarize
Tests/ # unit + integration
```
---
## 17) Risks & Mitigations
* **Hotkey collision (⌘⇧V)** with Paste and Match Style in some apps make it discoverable & easily rebindable; warn on conflict.
* **Secure Input** blocks injection inform the user, keep text on clipboard, provide help to identify the app enabling it.
* **RAM/latency** with large models recommend **small/base** by default; show RAM/latency hints in the model picker.
* **Keyboard layouts** prefer paste; if typing, map using the active layout.
---
## 18) Global MVP Definition of Done
* A 3090 s dictation yields accurate ES/EN text **offline** and inserts correctly in common apps.
* Secure Input is correctly detected and handled.
* Model download/selection is robust and user-driven.
* Shippable **.dmg** (signed + notarized) and clear docs included.