Initial commit
This commit is contained in:
commit
1db16227b2
31 changed files with 2175 additions and 0 deletions
335
TECHSPEC.md
Normal file
335
TECHSPEC.md
Normal file
|
|
@ -0,0 +1,335 @@
|
|||
# Technical Definition — “Menu-Whisper” (macOS, Swift, Offline STT)
|
||||
|
||||
## 0) Owner Decisions (Locked)
|
||||
- **Platform:** Apple Silicon only (M1/M2/M3), macOS 13+.
|
||||
- **STT backends:** Start with **whisper.cpp (Metal)** for simplicity; add **Core ML** backend later.
|
||||
- **Models:** Do **not** auto-download. On first run, user **chooses & downloads** a model.
|
||||
- **VAD:** Post-MVP.
|
||||
- **Insertion behavior:** Configurable; **direct insertion** is default (no preview).
|
||||
- **Default hotkey:** **⌘⇧V** (user-configurable).
|
||||
- **Punctuation:** Let the model handle punctuation automatically (no spoken commands).
|
||||
- **Privacy/Connectivity:** 100% local at runtime; model downloads only when the user explicitly requests. **No telemetry**.
|
||||
- **Distribution:** **.app/.dmg** (signed + notarized), outside the Mac App Store initially.
|
||||
- **UI languages:** **ES/EN**.
|
||||
- **Low-power mode:** Still allow downloads if the user starts them.
|
||||
- **License:** **MIT**.
|
||||
- **Per-dictation limit:** **10 minutes** by default (configurable).
|
||||
|
||||
---
|
||||
|
||||
## 1) Goal
|
||||
A **menu bar** app for macOS that performs **offline speech-to-text** using Whisper-family models and **inserts the transcribed text** into whichever app currently has focus. Shows a minimal **HUD** while listening and processing. No internet required during normal operation.
|
||||
|
||||
---
|
||||
|
||||
## 2) MVP Scope
|
||||
- Persistent **menu bar** item (NSStatusItem / `MenuBarExtra`).
|
||||
- **Global hotkey** (push-to-talk and toggle modes).
|
||||
- **HUD** (centered NSPanel + SwiftUI):
|
||||
- “Listening” with audio-level animation (RMS/peak).
|
||||
- “Processing” with a spinner/animation.
|
||||
- **Offline STT** with **whisper.cpp** (GGUF models; Metal acceleration on Apple Silicon).
|
||||
- **Model Manager**: curated list, manual download with progress + SHA256 check, user selection.
|
||||
- **Text injection**:
|
||||
- Preferred: **Clipboard + ⌘V** paste.
|
||||
- Fallback: **simulated typing** via CGEvent.
|
||||
- If **Secure Input** is active, **do not inject**; show notice and keep text on clipboard.
|
||||
- **Preferences**: hotkey & mode, model & language, insertion method, HUD styling, sounds, dictation limit.
|
||||
- **Permissions onboarding**: Microphone, Accessibility, Input Monitoring.
|
||||
|
||||
---
|
||||
|
||||
## 3) Functional Requirements
|
||||
|
||||
### 3.1 Capture
|
||||
- Prompt for permissions on first use.
|
||||
- Global hotkey (default ⌘⇧V).
|
||||
- **Push-to-talk**: start on key down, stop on key up.
|
||||
- **Toggle**: press to start, press again to stop.
|
||||
- Per-dictation limit (default 10 min, range 10 s–30 min).
|
||||
|
||||
### 3.2 HUD / UX
|
||||
- Non-activating, centered **NSPanel** (~320×160), no focus stealing.
|
||||
- **Listening**: bar-style audio visualization driven by live RMS/peak.
|
||||
- **Processing**: spinner + “Transcribing…” label.
|
||||
- **Esc** to cancel.
|
||||
- Optional start/stop sounds (user-toggleable).
|
||||
|
||||
### 3.3 STT
|
||||
- Backend A (MVP): **whisper.cpp** with **GGUF** and **Metal**.
|
||||
- Language: auto-detect or forced (persisted).
|
||||
- Basic text normalization; punctuation from the model.
|
||||
- UTF-8 output; standard replacements (quotes, dashes, etc.).
|
||||
|
||||
### 3.4 Injection
|
||||
- Preferred method: **NSPasteboard** + **CGEvent** to send ⌘V.
|
||||
- Fallback: **CGEventCreateKeyboardEvent** (character-by-character), respecting active keyboard layout.
|
||||
- **Secure Input**: detect with `IsSecureEventInputEnabled()`; if enabled, **do not inject**. Show a non-intrusive notice and leave the text on the clipboard.
|
||||
|
||||
### 3.5 Preferences
|
||||
- **General:** hotkey + mode (push/toggle), sounds, HUD options.
|
||||
- **Models:** catalog, download, select active model, language, local storage path.
|
||||
- **Insertion:** direct vs preview (preview **off** by default), paste vs type.
|
||||
- **Advanced:** limits, performance knobs (threads/batch), **local** logs opt-in.
|
||||
|
||||
---
|
||||
|
||||
## 4) Non-Functional Requirements
|
||||
- **Offline** execution after models are installed.
|
||||
- **Latency target** (M1 + “small” model): < 4 s for 10 s of audio.
|
||||
- **Memory target:** ~1.5–2.5 GB with “small”.
|
||||
- **Privacy:** audio and text never leave the device.
|
||||
- **Accessibility:** sufficient contrast; VoiceOver labels; focus never stolen by HUD.
|
||||
|
||||
---
|
||||
|
||||
## 5) Architecture (High-Level)
|
||||
- **App (SwiftUI)** with AppKit bridges for NSStatusItem and NSPanel.
|
||||
- **Shortcut Manager** (Carbon `RegisterEventHotKey` or HotKey/MASShortcut).
|
||||
- **Audio**: AVAudioEngine (downsample to 16 kHz mono, 16-bit PCM).
|
||||
- **STT Engine**:
|
||||
- **whisper.cpp** (C/C++ via SPM/CMake) with Metal.
|
||||
- **Core ML backend** (e.g., WhisperKit / custom) in a later phase.
|
||||
- **Model Manager**: curated catalog, downloads (progress + SHA256), selection, caching.
|
||||
- **Text Injection**: pasteboard + CGEvent; typing fallback; Secure Input detection.
|
||||
- **Permissions Manager**: guided flows to System Settings panes.
|
||||
- **Settings**: UserDefaults + JSON export/import.
|
||||
- **Packaging**: .app + .dmg (signed & notarized).
|
||||
|
||||
---
|
||||
|
||||
## 6) Main Flow
|
||||
1. User presses global hotkey.
|
||||
2. Check permissions; guide if missing.
|
||||
3. Show HUD → **Listening**; start capture.
|
||||
4. Stop (key up/toggle/timeout).
|
||||
5. HUD → **Processing**; run STT in background.
|
||||
6. On result → (optional preview) → **insert** (paste) or **fallback** (type). If Secure Input, **do not inject**; keep in clipboard + show notice.
|
||||
7. Close HUD → **Idle**.
|
||||
|
||||
---
|
||||
|
||||
## 7) Finite State Machine (FSM)
|
||||
- **Idle** → (Hotkey) → **Listening**
|
||||
- **Listening** → (Stop/Timeout) → **Processing**
|
||||
- **Processing** → (Done) → **Injecting**
|
||||
- **Injecting** → (Done) → **Idle**
|
||||
- Any → (Error) → **ErrorModal** → **Idle**
|
||||
|
||||
---
|
||||
|
||||
## 8) Model Management (Manual Downloads)
|
||||
**Goal:** Offer a clear list of **free** Whisper-family models (names, sizes, languages, recommended backend) with one-click downloads. No automatic downloads.
|
||||
|
||||
### 8.1 OpenAI Whisper (official weights)
|
||||
- Families: **tiny**, **base**, **small**, **medium**, **large-v2**, **large-v3** (multilingual; some `.en` variants).
|
||||
- Usable with **whisper.cpp** via **GGUF** (community conversions widely available).
|
||||
|
||||
### 8.2 Whisper for whisper.cpp (converted GGUF)
|
||||
- Community-maintained conversions for whisper.cpp (GGUF), optimized for CPU/GPU Metal on macOS.
|
||||
|
||||
### 8.3 Faster-Whisper (CTranslate2)
|
||||
- Optimized variants (tiny/base/small/medium/large-v2/large-v3). Useful if a CT2-based or Core-ML-assisted backend is added later.
|
||||
|
||||
### 8.4 Distil-Whisper (distilled)
|
||||
- Distilled models (e.g., **distil-large-v2/v3/v3.5**, **distil-small.en**), significantly smaller/faster with near-large accuracy.
|
||||
|
||||
> **UI must show:** model file size, languages, license, **RAM estimate**, and a warning if a large model is selected on lower-memory machines.
|
||||
|
||||
**Optional JSON Schema for catalog entries (for the app’s first-run picker):**
|
||||
|
||||
```json
|
||||
{
|
||||
"name": "whisper-small",
|
||||
"family": "OpenAI-Whisper",
|
||||
"format": "gguf",
|
||||
"size_mb": 466,
|
||||
"languages": ["multilingual"],
|
||||
"recommended_backend": "whisper.cpp",
|
||||
"quality_tier": "small",
|
||||
"license": "MIT",
|
||||
"sha256": "…",
|
||||
"download_url": "…",
|
||||
"notes": "Good balance of speed/accuracy on M1/M2."
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 9) Security & Permissions
|
||||
|
||||
* **Info.plist:** `NSMicrophoneUsageDescription`.
|
||||
* **Accessibility & Input Monitoring:** required for CGEvent; provide clear step-by-step guidance and deep-links.
|
||||
* **Secure Input:** check `IsSecureEventInputEnabled()`; **never** attempt to bypass. Provide help text to identify apps that enable it (password fields, 2FA prompts, etc.).
|
||||
|
||||
---
|
||||
|
||||
## 10) Performance
|
||||
|
||||
* Lazy-load and reuse model (warm cache).
|
||||
* Real-time downsampling to 16 kHz mono; chunked streaming into backend.
|
||||
* Configurable threads; prefer **Metal** path on Apple Silicon.
|
||||
* “Fast path” tweaks for short clips (<15 s).
|
||||
|
||||
---
|
||||
|
||||
## 11) Logging & Privacy
|
||||
|
||||
* **No remote telemetry.**
|
||||
* Local logs **opt-in** (timings, errors only). Never store audio/text unless user explicitly enables a debug flag.
|
||||
* “Wipe local data” button (models remain unless the user removes them).
|
||||
|
||||
---
|
||||
|
||||
## 12) Internationalization
|
||||
|
||||
* UI in **Spanish** and **English** (Localizable.strings).
|
||||
* STT multilingual; language auto or forced per user preference.
|
||||
|
||||
---
|
||||
|
||||
## 13) Testing (Minimum)
|
||||
|
||||
* macOS 13/14/15 on M1/M2/M3.
|
||||
* Injection works in Safari, Chrome, Notes, VS Code, Terminal, iTerm2, Mail.
|
||||
* **Secure Input**: correctly detected; no injection; clipboard + notice.
|
||||
* Meet latency target with **small** model on M1.
|
||||
* Model download & selection flows (simulate network errors).
|
||||
|
||||
---
|
||||
|
||||
## 14) Phased Plan (AI-Deliverables)
|
||||
|
||||
### Phase 0 — Scaffolding (MVP-0)
|
||||
|
||||
**Goal:** Base project + menubar.
|
||||
**Deliverables:**
|
||||
|
||||
* SwiftUI app with `MenuBarExtra`, microphone icon, “Idle” state.
|
||||
* `ARCHITECTURE.md` describing modules (Audio/STT/Injection/Models/Permissions/Settings).
|
||||
* Build scripts and signing/notarization templates.
|
||||
**DoD:** Compiles; menu bar item visible; SPM structure ready.
|
||||
|
||||
---
|
||||
|
||||
### Phase 1 — Hotkey + HUD + Audio (MVP-1)
|
||||
|
||||
**Goal:** Listening UX without real STT.
|
||||
**Deliverables:**
|
||||
|
||||
* Global hotkey (default ⌘⇧V) with **push** and **toggle**.
|
||||
* NSPanel HUD (Listening/Processing) + **real** RMS bars from AVAudioEngine.
|
||||
* Per-dictation limit (default 10 min).
|
||||
**DoD:** Live meter responds to mic; correct state transitions.
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 — STT via whisper.cpp (MVP-2)
|
||||
|
||||
**Goal:** Real offline transcription.
|
||||
**Deliverables:**
|
||||
|
||||
* **whisper.cpp** module (C/C++), background inference with **Metal**.
|
||||
* **Model Manager** (curated list, download with SHA256, selection).
|
||||
* Language auto/forced; basic normalization.
|
||||
**DoD:** 10-second clip → coherent ES/EN text offline; meets timing targets.
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 — Robust Insertion (MVP-3)
|
||||
|
||||
**Goal:** Reliable insertion into focused app.
|
||||
**Deliverables:**
|
||||
|
||||
* Paste (clipboard + ⌘V) and typing fallback.
|
||||
* **Secure Input** detection; safe behavior (no injection, clipboard + notice).
|
||||
**DoD:** Works across target apps; correct Secure Input handling.
|
||||
|
||||
---
|
||||
|
||||
### Phase 4 — Preferences + UX Polish (MVP-4)
|
||||
|
||||
**Goal:** Complete options & stability.
|
||||
**Deliverables:**
|
||||
|
||||
* Full Preferences (hotkey, modes, model, language, insertion, HUD, sounds).
|
||||
* Optional preview dialog (off by default).
|
||||
* Config export/import (JSON).
|
||||
**DoD:** All settings persist and are honored.
|
||||
|
||||
---
|
||||
|
||||
### Phase 5 — Distribution (MVP-5)
|
||||
|
||||
**Goal:** Installable package.
|
||||
**Deliverables:**
|
||||
|
||||
* Error handling; permission prompts & help (incl. Secure Input troubleshooting).
|
||||
* **.dmg** (signed + notarized) and install guide.
|
||||
* **USER\_GUIDE.md** + **TROUBLESHOOTING.md**.
|
||||
**DoD:** Clean install on test machines; distribution checklist passed.
|
||||
|
||||
---
|
||||
|
||||
### Phase 6 — Core ML Backend (Post-MVP)
|
||||
|
||||
**Goal:** Second backend.
|
||||
**Deliverables:**
|
||||
|
||||
* **Core ML** integration (e.g., WhisperKit or custom conversion).
|
||||
* Backend selector (whisper.cpp/Core ML) in Preferences; local benchmarks table.
|
||||
**DoD:** Feature parity and stability; documented pros/cons.
|
||||
|
||||
---
|
||||
|
||||
## 15) Mini-Prompts for the Builder AI (per Phase)
|
||||
|
||||
* **P0:** “Create macOS 13+ SwiftUI menubar app (`MenuBarExtra`), microphone icon, SPM layout with modules in `ARCHITECTURE.md`.”
|
||||
* **P1:** “Add global hotkey (push & toggle) with `RegisterEventHotKey`; NSPanel HUD with RMS bars from AVAudioEngine; 10-minute dictation limit.”
|
||||
* **P2:** “Integrate **whisper.cpp** (Metal); add Model Manager (curated list, SHA256-verified downloads, selection); language auto/forced; transcribe WAV 16 kHz mono.”
|
||||
* **P3:** “Implement insertion: pasteboard+⌘V and CGEvent typing fallback; detect `IsSecureEventInputEnabled()` and avoid injection.”
|
||||
* **P4:** “Implement full Preferences, optional preview, JSON export/import; UX polish and messages.”
|
||||
* **P5:** “Signing + notarization; produce .dmg; write USER\_GUIDE and TROUBLESHOOTING (with Secure Input section).”
|
||||
* **P6:** “Add Core ML backend (WhisperKit/custom), backend selector, and local benchmarks.”
|
||||
|
||||
---
|
||||
|
||||
## 16) Suggested Repo Layout
|
||||
|
||||
```
|
||||
MenuWhisper/
|
||||
Sources/
|
||||
App/ # SwiftUI + AppKit bridges
|
||||
Core/
|
||||
Audio/ # AVAudioEngine capture + meters
|
||||
STT/
|
||||
WhisperCPP/ # C/C++ wrapper + Metal path
|
||||
CoreML/ # post-MVP
|
||||
Models/ # catalog, downloads, hashes
|
||||
Injection/ # clipboard, CGEvent typing, secure input checks
|
||||
Permissions/
|
||||
Settings/
|
||||
Utils/
|
||||
Resources/ # icons, sounds, localizations
|
||||
Docs/ # ARCHITECTURE.md, USER_GUIDE.md, TROUBLESHOOTING.md
|
||||
Scripts/ # build, sign, notarize
|
||||
Tests/ # unit + integration
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 17) Risks & Mitigations
|
||||
|
||||
* **Hotkey collision (⌘⇧V)** with “Paste and Match Style” in some apps → make it discoverable & easily rebindable; warn on conflict.
|
||||
* **Secure Input** blocks injection → inform the user, keep text on clipboard, provide help to identify the app enabling it.
|
||||
* **RAM/latency** with large models → recommend **small/base** by default; show RAM/latency hints in the model picker.
|
||||
* **Keyboard layouts** → prefer paste; if typing, map using the active layout.
|
||||
|
||||
---
|
||||
|
||||
## 18) Global MVP Definition of Done
|
||||
|
||||
* A 30–90 s dictation yields accurate ES/EN text **offline** and inserts correctly in common apps.
|
||||
* Secure Input is correctly detected and handled.
|
||||
* Model download/selection is robust and user-driven.
|
||||
* Shippable **.dmg** (signed + notarized) and clear docs included.
|
||||
Loading…
Add table
Add a link
Reference in a new issue