# TODO — Menu-Whisper (macOS, Swift, Offline STT)

This file tracks the tasks needed to deliver the app in **phases** with clear acceptance checks.

Conventions:

- `[ ]` = to do, `[x]` = done
- **AC** = Acceptance Criteria
- All features must work **offline** after models are installed.

---

## Global / Project-Wide

- [x] Set project license to **MIT** and add `LICENSE` file.
- [x] Add `README.md` with high-level summary, build requirements (Xcode, macOS 13+), and Apple Silicon-only note.
- [x] Add `Docs/ARCHITECTURE.md` skeleton (to be filled in Phase 0).
- [x] Create base **localization** scaffolding (`en.lproj`, `es.lproj`) with `Localizable.strings`.
- [x] Add SwiftPM structure with separate targets for the `App` and `Core/*` modules.
- [x] Prepare optional tooling:
  - [x] SwiftFormat / SwiftLint config (opt-in).
  - [x] GitHub Actions macOS runner for **build-only** CI (optional).

---

## Phase 0 — Scaffolding (MVP-0)

**Goal:** Base project + menu bar item; structure and docs.

### Tasks

- [x] Create SwiftUI macOS app (macOS 13+) with `MenuBarExtra` / `NSStatusItem`.
- [x] Add placeholder mic icon (template asset).
- [x] Create module targets:
  - [x] `Core/Audio`
  - [x] `Core/STT` (with subfolders `WhisperCPP` and a `CoreML` stub)
  - [x] `Core/Models`
  - [x] `Core/Injection`
  - [x] `Core/Permissions`
  - [x] `Core/Settings`
  - [x] `Core/Utils`
- [x] Wire a minimal state machine: `Idle` state shown in the menu bar menu.
- [x] Add scripts:
  - [x] `Scripts/build.sh` (SPM/xcodebuild)
  - [x] `Scripts/notarize.sh` (stub with placeholders for later)
- [x] Write `Docs/ARCHITECTURE.md` (modules, data flow, FSM diagram).

### AC

- [x] Project compiles and shows a **menu bar** icon with a basic menu.
- [x] Repo has a clear structure and an architecture doc.

---

## Phase 1 — Hotkey + HUD + Audio (MVP-1)

**Goal:** Listening UX without real STT.

### Tasks

- [x] Implement **global hotkey** manager:
  - [x] Default **⌘⇧V** (configurable later).
  - [x] Support **push-to-talk** (start on key down, stop on key up).
  - [x] Support **toggle** (press to start, press to stop).
- [x] Create **HUD** as a non-activating centered `NSPanel`:
  - [x] **Listening** state with **RMS/peak bars** animation (SwiftUI view).
  - [x] **Processing** state with spinner/label.
  - [x] Dismiss/cancel with **Esc**.
- [x] Implement **AVAudioEngine** capture (see the capture sketch at the end of this phase):
  - [x] Tap on the input bus; compute RMS/peak for visualization.
  - [x] Resample path ready for 16 kHz mono PCM (no STT yet).
- [x] Add dictation **time limit** (default **10 min**, configurable later).
- [x] Optional **sounds** for start/stop (toggle in settings later).
- [x] Permissions onboarding:
  - [x] Request **Microphone** permission with Info.plist string.
  - [x] Show guide for **Accessibility** and **Input Monitoring** (no hard gating yet).

### AC

- [x] Hotkey works in both modes (push/toggle) across desktop & full-screen apps.
- [x] HUD appears centered; **Listening** shows live bars; **Processing** shows a spinner.
- [x] Cancel (Esc) reliably stops listening and hides the HUD.
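The capture path above boils down to an `AVAudioEngine` input tap plus a per-buffer RMS value that drives the HUD bars. Below is a minimal sketch of that idea; the `AudioLevelMeter` name is illustrative rather than the project's actual `Core/Audio` API, and the real module also prepares the 16 kHz mono resample path used in Phase 2.

```swift
import Foundation
import AVFoundation
import Combine

/// Minimal level meter for the Phase 1 HUD: taps the input bus and publishes
/// one RMS value per buffer. Illustrative sketch, not the shipped code.
final class AudioLevelMeter: ObservableObject {
    private let engine = AVAudioEngine()
    @Published var rms: Float = 0

    func start() throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)   // native mic format

        input.installTap(onBus: 0, bufferSize: 1024, format: format) { [weak self] buffer, _ in
            guard let channel = buffer.floatChannelData?[0] else { return }
            let n = Int(buffer.frameLength)
            // RMS over the buffer; this is what the "Listening" bars animate.
            var sum: Float = 0
            for i in 0..<n { sum += channel[i] * channel[i] }
            let value = (sum / Float(max(n, 1))).squareRoot()
            DispatchQueue.main.async { self?.rms = value }
        }
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```

In production the hand-rolled loop could be replaced by `vDSP_rmsqv` from Accelerate, and the same tap is where the 16 kHz mono conversion for STT would hang off.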
---

## Phase 2 — STT via whisper.cpp (MVP-2)

**Goal:** Real offline transcription (Apple Silicon + Metal).

### Tasks

- [x] Add **whisper.cpp** integration:
  - [x] Vendor/SwiftPM/wrapper target for C/C++ (via SwiftWhisper).
  - [x] Build with the **Metal** path enabled on Apple Silicon.
- [x] Define the `STTEngine` protocol and a `WhisperCPPSTTEngine` implementation (see the protocol sketch at the end of this phase).
- [x] Audio pipeline:
  - [x] Convert captured audio to **16 kHz mono** 16-bit PCM.
  - [x] Chunking/streaming into the STT worker; end of dictation triggers transcription.
- [x] **Model Manager** (backend + minimal UI):
  - [x] Bundle a **curated JSON catalog** (name, size, languages, license, URL, SHA256).
  - [x] Download via `URLSession` with progress + resume support.
  - [x] Validate **SHA256**; store under `~/Library/Application Support/TellMe/Models`.
  - [x] Allow **selecting the active model**; persist the selection.
- [x] Language: **auto** or **forced** (persisted).
- [x] Text normalization pass (basic replacements; punctuation from the model).
- [x] Error handling (network failures, disk full, missing model).
- [x] Performance knobs (threads, GPU toggle if exposed by the backend).

### AC

- [x] A **10 s** clip produces coherent **ES/EN** text **offline**.
- [x] Latency target: **< 4 s** additional for a 10 s clip on M1 with the **small** model.
- [x] Memory: ~**1.5–2.5 GB** with the small model, without leaks.
- [x] Model download: progress UI + SHA256 verification + selection works.

**Current Status:** Phase 2 **COMPLETE**.

**What works:**

- Real whisper.cpp integration (SwiftWhisper with Metal)
- STT transcription (verified offline ES/EN, ~2.2 s for 10 s of audio)
- Model Manager with 3 curated models (tiny/base/small)
- Real model downloads (verified the 142 MB whisper-base download works)
- Preferences window with model management UI
- `NSStatusItem` menu bar with model status
- Hotkey protection (shows an alert if no model is loaded)
- Proper model path handling (`~/Library/Application Support/TellMe/Models`)

**User Experience:**

1. Launch Tell me → menu shows "No model - click Preferences"
2. Open Preferences → see available models and download options
3. Download model → progress tracking, SHA256 verification
4. Select model → loads automatically
5. Press ⌘⇧V → real speech-to-text transcription

No automatic downloads: users must download and select a model first.
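Because Phase 6 plans a second backend behind the same abstraction, here is a minimal sketch of what the `STTEngine` protocol could look like. Only the `STTEngine` and `WhisperCPPSTTEngine` names come from the task list; the async API shape, `STTOptions`, and the error cases are assumptions for illustration, and the real implementation wraps SwiftWhisper rather than returning a stub.

```swift
import Foundation

/// Options passed to any STT backend. Names and fields are illustrative.
struct STTOptions {
    var language: String?          // nil = auto-detect, "es"/"en" = forced
    var threads: Int = 4           // performance knob from Preferences
}

enum STTError: Error {
    case modelNotLoaded
    case audioTooShort
    case backendFailure(String)
}

/// Common surface for whisper.cpp today and a Core ML backend later (Phase 6).
protocol STTEngine {
    /// Loads the model selected in the Model Manager.
    func loadModel(at url: URL) async throws
    /// Transcribes 16 kHz mono Float32 PCM and returns normalized text.
    func transcribe(samples: [Float], options: STTOptions) async throws -> String
    var isModelLoaded: Bool { get }
}

/// Sketch of the whisper.cpp-backed engine; real code drives SwiftWhisper + Metal.
final class WhisperCPPSTTEngine: STTEngine {
    private(set) var isModelLoaded = false

    func loadModel(at url: URL) async throws {
        // Real code: initialize the SwiftWhisper context with the model file.
        guard FileManager.default.fileExists(atPath: url.path) else {
            throw STTError.backendFailure("model missing at \(url.path)")
        }
        isModelLoaded = true
    }

    func transcribe(samples: [Float], options: STTOptions) async throws -> String {
        guard isModelLoaded else { throw STTError.modelNotLoaded }
        guard samples.count > 8_000 else { throw STTError.audioTooShort }  // < 0.5 s at 16 kHz
        // Real code: feed samples to whisper.cpp and join the returned segments.
        return ""
    }
}
```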
---

## Phase 3 — Robust Text Insertion (MVP-3)

**Goal:** Insert text into the focused app safely; handle Secure Input.

### Tasks

- [x] Implement **Paste** method:
  - [x] Put text on the general **NSPasteboard**.
  - [x] Send **⌘V** via CGEvent to the focused app.
- [x] Implement **Typing** fallback:
  - [x] Generate per-character **CGEvent**s; respect the active keyboard layout.
  - [x] Handle `\n`, `\t`, and common Unicode safely.
- [x] Detect **Secure Input** (see the injection sketch at the end of this phase):
  - [x] Check `IsSecureEventInputEnabled()` (or another accepted API) before injection.
  - [x] If enabled: **do not inject**; keep the text on the clipboard; show a non-blocking notice.
- [x] Add preference for the **insertion method** (Paste preferred) + fallback strategy.
- [x] Add **Permissions** helpers for Accessibility/Input Monitoring (deep links).
- [x] Compatibility tests: Safari, Chrome, Notes, VS Code, Terminal, iTerm2, Mail.

### AC

- [x] Text reliably appears in the currently focused app via Paste.
- [x] If Paste is blocked, the Typing fallback works (except under Secure Input).
- [x] When **Secure Input** is active: no injection occurs; the clipboard contains the text; the user is informed.

**Current Status:** Phase 3 **COMPLETE**.

**What works:**

- Full text injection with the paste method (NSPasteboard + ⌘V)
- Typing fallback with per-character CGEvents and Unicode support
- Secure Input detection using `IsSecureEventInputEnabled()`
- Permission checking for Accessibility and Input Monitoring
- Fallback strategy when the primary method fails
- Deep links to System Settings for permission setup
- Error handling with appropriate user feedback

**User Experience:**

1. Text injection attempts the paste method first (faster)
2. Falls back to typing if paste fails
3. Blocks injection if Secure Input is active (copies to clipboard instead)
4. Guides users to grant required permissions
5. Respects the keyboard layout for character input
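A minimal sketch of the paste path plus the Secure Input guard described above. It uses the real `NSPasteboard`, `CGEvent`, and `IsSecureEventInputEnabled()` APIs, but the function name and control flow are illustrative, and it assumes Accessibility permission has already been granted.

```swift
import AppKit
import Carbon.HIToolbox   // IsSecureEventInputEnabled(), kVK_ANSI_V

/// Sketch of the Phase 3 paste method. Returns false when Secure Input is
/// active so the caller can show the non-blocking "kept on clipboard" notice.
func insertViaPaste(_ text: String) -> Bool {
    // Always place the transcription on the general pasteboard first, so the
    // text survives even when injection is blocked.
    let pasteboard = NSPasteboard.general
    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    // Never synthesize keystrokes while Secure Input is on
    // (e.g. a password field is focused).
    guard !IsSecureEventInputEnabled() else { return false }

    // Synthesize ⌘V aimed at the frontmost app.
    let source = CGEventSource(stateID: .combinedSessionState)
    let vKey = CGKeyCode(kVK_ANSI_V)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: true)
    let keyUp = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: false)
    keyDown?.flags = .maskCommand
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)
    return true
}
```

The Typing fallback would instead post per-character events (for example via `CGEvent`'s `keyboardSetUnicodeString`), which is what keeps it independent of the active keyboard layout.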
---

## Phase 4 — Preferences + UX Polish (MVP-4)

**Goal:** Complete options, localization, and stability.

### Tasks

- [x] Full **Preferences** window:
  - [x] Hotkey recorder (change ⌘⇧V if needed).
  - [x] Mode: push-to-talk / toggle.
  - [x] Model picker: list, **download**, **delete**, **set active**; show size/language/license.
  - [x] Language: Auto / Forced (dropdown).
  - [x] Insertion: **Direct** (default) vs **Preview**; Paste vs Typing preference.
  - [x] HUD: opacity/size, show/hide and sound toggles.
  - [x] Dictation limit: editable (default 10 min).
  - [x] Advanced: threads/batch; **local logs opt-in**.
  - [x] **Export/Import** settings (JSON).
- [x] Implement **Preview** dialog (off by default): shows transcribed text with **Insert** / **Cancel**.
- [x] Expand **localization** (ES/EN) to all UI strings.
- [x] Onboarding & help views (permissions, Secure Input explanation).
- [x] Persist all settings in `UserDefaults`; validate on load; migrate if needed.
- [x] UX polish: icons, animation timing, keyboard navigation, VoiceOver labels.
- [x] Optional: internal **timing instrumentation** (guarded by the logs opt-in).

### AC

- [x] All preferences persist and take effect without a relaunch.
- [x] Preview (when enabled) allows quick edit & insertion.
- [x] ES/EN localization passes a manual spot-check.

**Current Status:** Phase 4 **COMPLETE**.

**What works:**

- Comprehensive preferences window with 6 tabs (General, Models, Text Insertion, Interface, Advanced, Permissions)
- Hotkey recorder with visual feedback and customization
- Push-to-talk and toggle mode selection
- Language selection with 10+ language options
- HUD customization (opacity, size, sound toggles)
- Text insertion preferences (paste vs. type, direct vs. preview)
- Advanced settings (processing threads, logging, settings reset)
- Settings export/import in JSON format
- Preview dialog for editing transcriptions before insertion
- Complete localization for English and Spanish
- Onboarding flow with step-by-step setup
- Comprehensive help system with troubleshooting guides
- Accessibility improvements with VoiceOver support
- All settings persist in `UserDefaults` and take effect immediately

**User Experience:**

1. Full-featured preferences with an intuitive tabbed interface
2. Customizable hotkeys with a visual recorder
3. Comprehensive help and onboarding for new users
4. Preview transcriptions before insertion
5. Export/import settings for backup and sharing
6. Complete offline experience with multi-language support

---

## Phase 5 — Distribution (MVP-5)

**Goal:** Shippable, signed/notarized .dmg; user docs.

### Tasks

- [ ] Hardened runtime, entitlements, Info.plist:
  - [ ] `NSMicrophoneUsageDescription`
  - [ ] Review for any additional required entitlements.
- [ ] **Code signing** with Developer ID; set team identifiers.
- [ ] **Notarization** using `notarytool`; **staple** on success.
- [ ] Build the **.app** and create the **.dmg**:
  - [ ] DMG background, /Applications symlink, icon.
- [ ] Write **Docs/USER_GUIDE.md** (first run, downloading models, dictation flow).
- [ ] Write **Docs/TROUBLESHOOTING.md** (permissions, Secure Input, model space/RAM issues).
- [ ] QA matrix:
  - [ ] macOS **13/14/15**, Apple Silicon **M1/M2/M3**.
  - [ ] Target apps list (insertion works).
  - [ ] Offline check (network disabled).
- [ ] Prepare **VERSIONING** notes and changelog (semantic-ish).

### AC

- [ ] Signed & **notarized** .dmg installs cleanly.
- [ ] App functions **entirely offline** after model download.
- [ ] Guides are complete and cover all common pitfalls.

---

## Phase 6 — Core ML Backend (Post-MVP)

**Goal:** Second STT backend and a backend selector.

### Tasks

- [ ] Evaluate the **Core ML** path (e.g., WhisperKit or custom Core ML models).
- [ ] Implement `STTEngineCoreML` conforming to the `STTEngine` protocol.
- [ ] Backend **selector** in Preferences; runtime switching.
- [ ] Ensure **feature parity** (language settings, output normalization).
- [ ] **Benchmarks**: produce a local latency/memory table across small/base/medium.
- [ ] Errors & fallbacks (if a model is missing, surface helpful guidance).

### AC

- [ ] Both backends run on Apple Silicon; the user can switch backends.
- [ ] Comparable outputs; documented pros/cons and performance data.

---

## Backlog / Post-MVP Options

- [ ] **VAD (WebRTC)**: auto-stop on silence with thresholds (rough sketch below).
- [ ] **Continuous dictation** with smart segmentation.
- [ ] **Noise suppression** and AGC in the audio pipeline.
- [ ] **Login item** (auto-launch at login).
- [ ] **Sparkle** or a custom updater (if desired outside the App Store).
- [ ] **Settings profiles** (per-language/model presets).
- [ ] **In-app model catalog refresh** (remote JSON update).
- [ ] **Advanced insertion rules** (per-app behavior).
- [ ] **Analytics viewer** for local logs (no telemetry).

---
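The WebRTC VAD item would need a native wrapper. As a rough placeholder for the auto-stop-on-silence idea, an RMS threshold with a hangover window can sit on top of the per-buffer RMS the HUD already computes; the `SilenceDetector` name and all constants below are assumptions, not decisions.

```swift
import Foundation

/// Placeholder for the backlog's auto-stop-on-silence idea: treat sustained
/// low RMS as silence and stop dictation after a hangover interval.
final class SilenceDetector {
    private let silenceThreshold: Float   // RMS below this counts as silence
    private let hangover: TimeInterval    // silence must last this long before auto-stop
    private var silenceStart: Date?

    init(silenceThreshold: Float = 0.01, hangover: TimeInterval = 1.5) {
        self.silenceThreshold = silenceThreshold
        self.hangover = hangover
    }

    /// Call once per audio buffer; returns true when dictation should auto-stop.
    func shouldStop(rms: Float, now: Date = Date()) -> Bool {
        if rms >= silenceThreshold {
            silenceStart = nil            // speech detected; reset the window
            return false
        }
        if silenceStart == nil { silenceStart = now }
        return now.timeIntervalSince(silenceStart!) >= hangover
    }
}
```

A real WebRTC VAD would replace the plain threshold with per-frame speech probabilities, but the hangover logic would stay the same.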