# TODO — Menu-Whisper (macOS, Swift, Offline STT)

This file tracks the tasks needed to deliver the app in **phases** with clear acceptance checks.
Conventions:
- `[ ]` = to do, `[x]` = done
- **AC** = Acceptance Criteria
- All features must work **offline** after models are installed.

---

## Global / Project-Wide

- [x] Set project license to **MIT** and add `LICENSE` file.
- [x] Add `README.md` with high-level summary, build requirements (Xcode, macOS 13+), Apple Silicon-only note.
- [x] Add `Docs/ARCHITECTURE.md` skeleton (to be filled in Phase 0).
- [x] Create base **localization** scaffolding (`en.lproj`, `es.lproj`) with `Localizable.strings`.
- [x] Add SwiftPM structure with separate targets for `App` and the `Core/*` modules (see the `Package.swift` sketch after this list).
- [x] Prepare optional tooling:
  - [x] SwiftFormat / SwiftLint config (opt-in).
  - [x] GitHub Actions macOS runner for **build-only** CI (optional).
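
A minimal `Package.swift` sketch of the module layout implied by the targets above; the tools version, target names, and dependency edges are assumptions, not the checked-in manifest:

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MenuWhisper",
    platforms: [.macOS(.v13)],
    targets: [
        .executableTarget(
            name: "App",
            dependencies: ["Audio", "STT", "Models", "Injection",
                           "Permissions", "Settings", "Utils"],
            path: "App"
        ),
        .target(name: "Audio", path: "Core/Audio"),
        .target(name: "STT", dependencies: ["Models"], path: "Core/STT"),
        .target(name: "Models", path: "Core/Models"),
        .target(name: "Injection", path: "Core/Injection"),
        .target(name: "Permissions", path: "Core/Permissions"),
        .target(name: "Settings", path: "Core/Settings"),
        .target(name: "Utils", path: "Core/Utils")
    ]
)
```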

---

## Phase 0 — Scaffolding (MVP-0)

**Goal:** Base project + menu bar item; structure and docs.

### Tasks
- [x] Create SwiftUI macOS app (macOS 13+) with `MenuBarExtra` / `NSStatusItem`.
- [x] Add placeholder mic icon (template asset).
- [x] Create module targets:
  - [x] `Core/Audio`
  - [x] `Core/STT` (with `WhisperCPP` and a stub `CoreML` subfolder)
  - [x] `Core/Models`
  - [x] `Core/Injection`
  - [x] `Core/Permissions`
  - [x] `Core/Settings`
  - [x] `Core/Utils`
- [x] Wire a minimal state machine: `Idle` state shown in the menu bar menu.
- [x] Add scripts:
  - [x] `Scripts/build.sh` (SPM/Xcodebuild)
  - [x] `Scripts/notarize.sh` (stub with placeholders for later)
- [x] Write `Docs/ARCHITECTURE.md` (modules, data flow, FSM diagram).
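
A minimal sketch of the Phase 0 shell: a `MenuBarExtra` app exposing the `Idle` state. Type and label names are illustrative, not the shipped code:

```swift
import AppKit
import SwiftUI

enum DictationState {
    case idle, listening, processing
}

@main
struct MenuWhisperApp: App {
    @State private var state: DictationState = .idle

    var body: some Scene {
        MenuBarExtra("Menu-Whisper", systemImage: "mic") {
            Text("State: \(String(describing: state))")  // always "idle" in Phase 0
            Divider()
            Button("Quit") { NSApplication.shared.terminate(nil) }
        }
    }
}
```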

### AC
- [x] Project compiles and shows a **menu bar** icon with a basic menu.
- [x] Repo has clear structure and architecture doc.

---

## Phase 1 — Hotkey + HUD + Audio (MVP-1)

**Goal:** Listening UX without real STT.

### Tasks
- [x] Implement **global hotkey** manager (sketched after this list):
  - [x] Default **⌘⇧V** (configurable later).
  - [x] Support **push-to-talk** (start on key down, stop on key up).
  - [x] Support **toggle** (press to start, press to stop).
- [x] Create **HUD** as non-activating centered `NSPanel`:
  - [x] State **Listening** with **RMS/peak bars** animation (SwiftUI view).
  - [x] State **Processing** with spinner/label.
  - [x] Dismiss/cancel with **Esc**.
- [x] Implement **AVAudioEngine** capture (sketched after this list):
  - [x] Tap on input bus; compute RMS/peak for visualization.
  - [x] Resample path ready for 16 kHz mono PCM (no STT yet).
- [x] Add dictation **time limit** (default **10 min**, configurable later).
- [x] Optional **sounds** for start/stop (toggle in settings later).
- [x] Permissions onboarding:
  - [x] Request **Microphone** permission with Info.plist string.
  - [x] Show guide for **Accessibility** and **Input Monitoring** (no hard gating yet).
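
A minimal sketch of the hotkey manager, assuming `NSEvent` global monitors; these observe (but cannot consume) events, need the permissions covered above, and a production build might prefer Carbon's `RegisterEventHotKey` or a `CGEvent` tap. keyCode 9 is "V" on ANSI layouts (an assumption; map via the active layout in real code):

```swift
import AppKit

final class HotkeyMonitor {
    enum Mode { case pushToTalk, toggle }

    var mode: Mode = .toggle
    var onStart: () -> Void = {}
    var onStop: () -> Void = {}

    private var monitors: [Any] = []
    private var isActive = false

    func install() {
        let down = NSEvent.addGlobalMonitorForEvents(matching: .keyDown) { [weak self] event in
            guard let self, event.keyCode == 9,
                  event.modifierFlags.contains([.command, .shift]) else { return }
            switch self.mode {
            case .pushToTalk:
                if !self.isActive { self.isActive = true; self.onStart() }  // ignore key repeat
            case .toggle:
                self.isActive.toggle()
                if self.isActive { self.onStart() } else { self.onStop() }
            }
        }
        let up = NSEvent.addGlobalMonitorForEvents(matching: .keyUp) { [weak self] event in
            guard let self, self.mode == .pushToTalk,
                  event.keyCode == 9, self.isActive else { return }
            self.isActive = false
            self.onStop()
        }
        monitors = [down, up].compactMap { $0 }
    }
}
```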
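A companion sketch of the capture path: tap the input node, compute RMS for the HUD bars, and convert buffers toward 16 kHz mono PCM. Handler names (`onLevel`, `onPCM`) are assumptions:

```swift
import AVFoundation

final class AudioCapture {
    private let engine = AVAudioEngine()
    var onLevel: (Float) -> Void = { _ in }           // RMS for the Listening bars
    var onPCM: (AVAudioPCMBuffer) -> Void = { _ in }  // 16 kHz mono buffers (STT in Phase 2)

    func start() throws {
        let input = engine.inputNode
        let inputFormat = input.outputFormat(forBus: 0)  // typically Float32 at 44.1/48 kHz
        guard let targetFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                               sampleRate: 16_000,
                                               channels: 1,
                                               interleaved: true),
              let converter = AVAudioConverter(from: inputFormat, to: targetFormat) else {
            throw NSError(domain: "AudioCapture", code: -1)
        }

        input.installTap(onBus: 0, bufferSize: 1024, format: inputFormat) { [weak self] buffer, _ in
            guard let self else { return }

            // RMS over channel 0 drives the HUD visualization.
            if let samples = buffer.floatChannelData?[0] {
                let n = Int(buffer.frameLength)
                var sum: Float = 0
                for i in 0..<n { sum += samples[i] * samples[i] }
                self.onLevel(sqrt(sum / Float(max(n, 1))))
            }

            // Resample toward 16 kHz mono; each tap buffer is fed exactly once.
            let ratio = targetFormat.sampleRate / inputFormat.sampleRate
            let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 64
            guard let out = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity) else { return }
            var consumed = false
            _ = converter.convert(to: out, error: nil) { _, status in
                if consumed { status.pointee = .noDataNow; return nil }
                consumed = true
                status.pointee = .haveData
                return buffer
            }
            self.onPCM(out)
        }

        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```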

### AC
- [x] Hotkey works in both modes (push/toggle) across desktop & full-screen apps.
- [x] HUD appears centered; **Listening** shows live bars; **Processing** shows spinner.
- [x] Cancel (Esc) reliably stops listening and hides HUD.

---

## Phase 2 — STT via whisper.cpp (MVP-2)

**Goal:** Real offline transcription (Apple Silicon + Metal).

### Tasks
- [ ] Add **whisper.cpp** integration:
  - [ ] Vendor/SwiftPM/Wrapper target for C/C++.
  - [ ] Build with **Metal** path enabled on Apple Silicon.
  - [ ] Define `STTEngine` protocol and `WhisperCPPSTTEngine` implementation (protocol sketched after this list).
- [ ] Audio pipeline:
  - [ ] Convert captured audio to **16 kHz mono** 16-bit PCM.
  - [ ] Chunking/streaming into STT worker; end-of-dictation triggers transcription.
- [ ] **Model Manager** (backend + minimal UI):
  - [ ] Bundle a **curated JSON catalog** (name, size, languages, license, URL, SHA256; entry shape sketched after this list).
  - [ ] Download via `URLSession` with progress + resume support.
  - [ ] Validate **SHA256**; store under `~/Library/Application Support/MenuWhisper/Models`.
  - [ ] Allow **select active model**; persist selection.
  - [ ] Language: **auto** or **forced** (persist).
- [ ] Text normalization pass (basic replacements; punctuation from model).
- [ ] Error handling (network failures, disk full, missing model).
- [ ] Performance knobs (threads, GPU toggle if exposed by backend).
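
A sketch of the `STTEngine` seam; the exact shape is an assumption, and the real protocol will grow knobs for language forcing, threads, and streaming:

```swift
import Foundation

protocol STTEngine {
    /// Transcribes 16 kHz mono 16-bit PCM samples; nil language means auto-detect.
    func transcribe(samples: [Int16], language: String?) async throws -> String
}

struct WhisperCPPSTTEngine: STTEngine {
    let modelURL: URL

    func transcribe(samples: [Int16], language: String?) async throws -> String {
        // Calls into the whisper.cpp wrapper (Metal-enabled build) once Phase 2 lands.
        fatalError("not yet wired to whisper.cpp")
    }
}
```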
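And a companion sketch of a catalog entry plus a streaming SHA-256 check, assuming CryptoKit; field names mirror the catalog bullet above and are assumptions:

```swift
import CryptoKit
import Foundation

struct ModelCatalogEntry: Codable {
    let name: String
    let sizeBytes: Int64
    let languages: [String]
    let license: String
    let url: URL
    let sha256: String
}

// Streams the downloaded file through SHA-256 so multi-GB models
// are verified without loading them fully into memory.
func verifySHA256(of fileURL: URL, expected: String) throws -> Bool {
    let handle = try FileHandle(forReadingFrom: fileURL)
    defer { try? handle.close() }
    var hasher = SHA256()
    while let chunk = try handle.read(upToCount: 1 << 20), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    let digest = hasher.finalize().map { String(format: "%02x", $0) }.joined()
    return digest == expected.lowercased()
}
```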

### AC
- [ ] A **10 s** clip produces coherent **ES/EN** text **offline**.
- [ ] Latency: **< 4 s** of added processing time for a 10 s clip on an M1 with the **small** model.
- [ ] Memory: **1.5–2.5 GB** with the small model, with no leaks.
- [ ] Model download: progress UI + SHA256 verification + selection works.

---

## Phase 3 — Robust Text Insertion (MVP-3)

**Goal:** Insert text into focused app safely; handle Secure Input.

### Tasks
- [ ] Implement **Paste** method (sketched after this list, together with the Secure Input guard):
  - [ ] Put text on **NSPasteboard** (general).
  - [ ] Send **⌘V** via CGEvent to the focused app.
- [ ] Implement **Typing** fallback:
  - [ ] Generate per-character **CGEvent**; respect active keyboard layout.
  - [ ] Handle `\n`, `\t`, and common unicode safely.
- [ ] Detect **Secure Input**:
  - [ ] Check `IsSecureEventInputEnabled()` (or another accepted API) before injection.
  - [ ] If enabled: **do not inject**; keep text on clipboard; show non-blocking notice.
- [ ] Add preference for **insertion method** (Paste preferred) + fallback strategy.
- [ ] Add **Permissions** helpers for Accessibility/Input Monitoring (deep links).
- [ ] Compatibility tests: Safari, Chrome, Notes, VS Code, Terminal, iTerm2, Mail.
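
A sketch of the Paste path with the Secure Input guard. `kVK_ANSI_V` assumes an ANSI layout; a real build should resolve the key code from the active layout:

```swift
import AppKit
import Carbon.HIToolbox

func insertViaPaste(_ text: String) {
    let pasteboard = NSPasteboard.general
    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    // Secure Input active (e.g. a password field): keep the text on the
    // clipboard, skip injection, and let the UI raise a non-blocking notice.
    guard !IsSecureEventInputEnabled() else { return }

    let source = CGEventSource(stateID: .combinedSessionState)
    let vKey = CGKeyCode(kVK_ANSI_V)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: true)
    let keyUp = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: false)
    keyDown?.flags = .maskCommand
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)
}
```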

### AC
- [ ] Text reliably appears in the currently focused app via Paste.
- [ ] If Paste is blocked, Typing fallback works (except in Secure Input).
- [ ] When **Secure Input** is active: no injection occurs; clipboard contains the text; user is informed.

---

## Phase 4 — Preferences + UX Polish (MVP-4)

**Goal:** Complete options, localization, and stability.

### Tasks
- [ ] Full **Preferences** window:
  - [ ] Hotkey recorder (change ⌘⇧V if needed).
  - [ ] Mode: Push-to-talk / Toggle.
  - [ ] Model picker: list, **download**, **delete**, **set active**, show size/language/license.
  - [ ] Language: Auto / Forced (dropdown).
  - [ ] Insertion: **Direct** (default) vs **Preview**; Paste vs Typing preference.
  - [ ] HUD: opacity/size; toggles for start/stop sounds.
  - [ ] Dictation limit: editable (default 10 min).
  - [ ] Advanced: threads/batch; **local logs opt-in**.
  - [ ] **Export/Import** settings (JSON; sketched after this list).
- [ ] Implement **Preview** dialog (off by default): shows transcribed text with **Insert** / **Cancel**.
- [ ] Expand **localization** (ES/EN) for all UI strings.
- [ ] Onboarding & help views (permissions, Secure Input explanation).
- [ ] Persist all settings in `UserDefaults`; validate on load; migrate if needed.
- [ ] UX polish: icons, animation timing, keyboard navigation, VoiceOver labels.
- [ ] Optional: internal **timing instrumentation** (guarded by logs opt-in).
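
A sketch of settings export/import, assuming a flat `Codable` model; the field set is illustrative and would track the real Preferences:

```swift
import Foundation

struct Settings: Codable {
    var hotkeyMode: String = "toggle"        // "pushToTalk" | "toggle"
    var language: String = "auto"
    var insertionMethod: String = "paste"    // "paste" | "typing"
    var previewBeforeInsert: Bool = false
    var dictationLimitMinutes: Int = 10
}

func exportSettings(_ settings: Settings, to url: URL) throws {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.prettyPrinted, .sortedKeys]
    try encoder.encode(settings).write(to: url, options: .atomic)
}

func importSettings(from url: URL) throws -> Settings {
    try JSONDecoder().decode(Settings.self, from: Data(contentsOf: url))
}
```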

### AC
- [ ] All preferences persist and take effect without relaunch.
- [ ] Preview (when enabled) allows quick edit & insertion.
- [ ] ES/EN localization passes a manual spot-check.

---

## Phase 5 — Distribution (MVP-5)

**Goal:** Shippable, signed/notarized .dmg, user docs.

### Tasks
- [ ] Hardened runtime, entitlements, Info.plist:
  - [ ] `NSMicrophoneUsageDescription`
  - [ ] Review for any additional required entitlements.
- [ ] **Code signing** with Developer ID; set team identifiers.
- [ ] **Notarization** using `notarytool`; **staple** on success.
- [ ] Build **.app** and create **.dmg**:
  - [ ] DMG background, /Applications symlink, icon.
- [ ] Write **Docs/USER_GUIDE.md** (first run, downloading models, dictation flow).
- [ ] Write **Docs/TROUBLESHOOTING.md** (permissions, Secure Input, model space/RAM issues).
- [ ] QA matrix:
  - [ ] macOS **13/14/15**, Apple Silicon **M1/M2/M3**.
  - [ ] Target apps list (insertion works).
  - [ ] Offline check (network disabled).
- [ ] Prepare **VERSIONING** notes and changelog (semantic-ish).

### AC
- [ ] Signed & **notarized** .dmg installs cleanly.
- [ ] App functions **entirely offline** post-model download.
- [ ] Guides are complete and reference all common pitfalls.

---

## Phase 6 — Core ML Backend (Post-MVP)

**Goal:** Second STT backend and selector.

### Tasks
- [ ] Evaluate **Core ML** path (e.g., WhisperKit or custom Core ML models).
- [ ] Implement `STTEngineCoreML` conforming to `STTEngine` protocol.
- [ ] Backend **selector** in Preferences; runtime switching (sketched after this list).
- [ ] Ensure **feature parity** (language settings, output normalization).
- [ ] **Benchmarks**: produce local latency/memory table across small/base/medium.
- [ ] Errors & fallbacks (if model missing, surface helpful guidance).
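
A sketch of runtime backend switching behind the `STTEngine` protocol sketched in Phase 2; `CoreMLSTTEngine` is hypothetical until this phase is built:

```swift
import Foundation

enum STTBackend: String, Codable {
    case whisperCPP, coreML
}

// Both cases return the shared STTEngine protocol, so callers stay
// backend-agnostic; CoreMLSTTEngine does not exist yet.
func makeEngine(for backend: STTBackend, modelURL: URL) -> STTEngine {
    switch backend {
    case .whisperCPP: return WhisperCPPSTTEngine(modelURL: modelURL)
    case .coreML:     return CoreMLSTTEngine(modelURL: modelURL)
    }
}
```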

### AC
- [ ] Both backends run on Apple Silicon; user can switch backends.
- [ ] Comparable outputs; documented pros/cons and performance data.

---

## Backlog / Post-MVP Options

- [ ] **VAD (WebRTC)**: auto-stop on silence with thresholds (naive version sketched after this list).
- [ ] **Continuous dictation** with smart segmentation.
- [ ] **Noise suppression** and AGC in the audio pipeline.
- [ ] **Login item** (auto-launch at login).
- [ ] **Sparkle** or custom updater (if distributing outside the App Store).
- [ ] **Settings profiles** (per-language/model presets).
- [ ] **In-app model catalog refresh** (remote JSON update).
- [ ] **Advanced insertion rules** (per-app behavior).
- [ ] **Analytics viewer** for local logs (no telemetry).
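
For the VAD item, a naive RMS-threshold auto-stop could stand in until a real WebRTC VAD lands, driven by the levels the capture tap already produces; the threshold and hangover values below are illustrative:

```swift
import Foundation

final class SilenceDetector {
    private let threshold: Float        // RMS floor treated as "voice"
    private let hangover: TimeInterval  // silence duration before auto-stop
    private var lastVoice = Date()
    private var fired = false
    var onAutoStop: () -> Void = {}

    init(threshold: Float = 0.015, hangover: TimeInterval = 1.5) {
        self.threshold = threshold
        self.hangover = hangover
    }

    /// Feed each RMS level from the audio tap; fires onAutoStop once after
    /// `hangover` seconds below `threshold`.
    func feed(rms: Float) {
        if rms >= threshold {
            lastVoice = Date()
            fired = false
        } else if !fired, Date().timeIntervalSince(lastVoice) > hangover {
            fired = true
            onAutoStop()
        }
    }
}
```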

---