Technical Definition — “Menu-Whisper” (macOS, Swift, Offline STT)

0) Owner Decisions (Locked)

  • Platform: Apple Silicon only (M1/M2/M3), macOS 13+.
  • STT backends: Start with whisper.cpp (Metal) for simplicity; add Core ML backend later.
  • Models: Do not auto-download. On first run, user chooses & downloads a model.
  • VAD: Post-MVP.
  • Insertion behavior: Configurable; direct insertion is default (no preview).
  • Default hotkey: ⌘⇧V (user-configurable).
  • Punctuation: Let the model handle punctuation automatically (no spoken commands).
  • Privacy/Connectivity: 100% local at runtime; model downloads only when the user explicitly requests. No telemetry.
  • Distribution: .app/.dmg (signed + notarized), outside the Mac App Store initially.
  • UI languages: ES/EN.
  • Low-power mode: Still allow downloads if the user starts them.
  • License: MIT.
  • Per-dictation limit: 10 minutes by default (configurable).

1) Goal

A menu bar app for macOS that performs offline speech-to-text using Whisper-family models and inserts the transcribed text into whichever app currently has focus. Shows a minimal HUD while listening and processing. No internet required during normal operation.


2) MVP Scope

  • Persistent menu bar item (NSStatusItem / MenuBarExtra).
  • Global hotkey (push-to-talk and toggle modes).
  • HUD (centered NSPanel + SwiftUI):
    • “Listening” with audio-level animation (RMS/peak).
    • “Processing” with a spinner/animation.
  • Offline STT with whisper.cpp (GGUF models; Metal acceleration on Apple Silicon).
  • Model Manager: curated list, manual download with progress + SHA256 check, user selection.
  • Text injection:
    • Preferred: Clipboard + ⌘V paste.
    • Fallback: simulated typing via CGEvent.
    • If Secure Input is active, do not inject; show notice and keep text on clipboard.
  • Preferences: hotkey & mode, model & language, insertion method, HUD styling, sounds, dictation limit.
  • Permissions onboarding: Microphone, Accessibility, Input Monitoring.

3) Functional Requirements

3.1 Capture

  • Prompt for permissions on first use.
  • Global hotkey (default ⌘⇧V).
  • Push-to-talk: start on key down, stop on key up.
  • Toggle: press to start, press again to stop.
  • Per-dictation limit (default 10 min, range 10 s–30 min).

3.2 HUD / UX

  • Non-activating, centered NSPanel (~320×160), no focus stealing.
  • Listening: bar-style audio visualization driven by live RMS/peak.
  • Processing: spinner + “Transcribing…” label.
  • Esc to cancel.
  • Optional start/stop sounds (user-toggleable).
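
A minimal sketch of the non-activating HUD panel described above, assuming the SwiftUI HUD content is hosted inside it; type and function names are illustrative:

import AppKit
import SwiftUI

// Sketch only: the HUD must never take key/main status, so focus is never stolen.
final class HUDPanel: NSPanel {
    override var canBecomeKey: Bool { false }
    override var canBecomeMain: Bool { false }
}

func makeHUDPanel<Content: View>(rootView: Content) -> HUDPanel {
    let panel = HUDPanel(contentRect: NSRect(x: 0, y: 0, width: 320, height: 160),
                         styleMask: [.nonactivatingPanel, .borderless],
                         backing: .buffered,
                         defer: false)
    panel.isFloatingPanel = true
    panel.level = .floating
    panel.isOpaque = false
    panel.backgroundColor = .clear
    panel.hidesOnDeactivate = false
    panel.contentView = NSHostingView(rootView: rootView)  // bars while listening, spinner while processing
    panel.center()
    return panel
}

Because the panel never becomes key, Esc-to-cancel would be handled elsewhere (e.g., an NSEvent monitor) rather than by the panel itself.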

3.3 STT

  • Backend A (MVP): whisper.cpp with GGUF and Metal.
  • Language: auto-detect or forced (persisted).
  • Basic text normalization; punctuation from the model.
  • UTF-8 output; standard replacements (quotes, dashes, etc.).
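
A sketch of the basic normalization step; the replacement table below is a placeholder, since the model itself supplies punctuation:

import Foundation

// Sketch: trim whitespace and apply standard character replacements to the transcript.
func normalizeTranscript(_ raw: String) -> String {
    var text = raw.trimmingCharacters(in: .whitespacesAndNewlines)
    let replacements: [(String, String)] = [
        ("...", "…"),   // ellipsis
        (" ,", ","),    // stray space before punctuation
        (" .", "."),
        (" ?", "?")
    ]
    for (pattern, replacement) in replacements {
        text = text.replacingOccurrences(of: pattern, with: replacement)
    }
    return text
}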

3.4 Injection

  • Preferred method: NSPasteboard + CGEvent to send ⌘V.
  • Fallback: CGEventCreateKeyboardEvent (character-by-character), respecting active keyboard layout.
  • Secure Input: detect with IsSecureEventInputEnabled(); if enabled, do not inject. Show a non-intrusive notice and leave the text on the clipboard.
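
A minimal sketch of this insertion path (the function name is a placeholder): the text always lands on the pasteboard, Secure Input is checked before any synthetic keystroke, and ⌘V is sent via CGEvent:

import AppKit
import Carbon.HIToolbox

// Sketch only: paste-based insertion with the Secure Input guard described above.
func insertTranscript(_ text: String) {
    // The text goes to the clipboard in every case, so the user can always paste manually.
    let pasteboard = NSPasteboard.general
    pasteboard.clearContents()
    pasteboard.setString(text, forType: .string)

    // Never synthesize keystrokes while Secure Input is active (password fields, 2FA prompts).
    guard !IsSecureEventInputEnabled() else {
        // ...show the non-intrusive "text left on clipboard" notice here...
        return
    }

    // Synthesize ⌘V in the frontmost app.
    let source = CGEventSource(stateID: .combinedSessionState)
    let vKey = CGKeyCode(kVK_ANSI_V)
    let keyDown = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: true)
    let keyUp = CGEvent(keyboardEventSource: source, virtualKey: vKey, keyDown: false)
    keyDown?.flags = .maskCommand
    keyUp?.flags = .maskCommand
    keyDown?.post(tap: .cghidEventTap)
    keyUp?.post(tap: .cghidEventTap)
}

The typing fallback would post per-character events instead (or use CGEvent's keyboardSetUnicodeString, which sidesteps layout mapping); restoring the previous pasteboard contents afterwards is a reasonable refinement.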

3.5 Preferences

  • General: hotkey + mode (push/toggle), sounds, HUD options.
  • Models: catalog, download, select active model, language, local storage path.
  • Insertion: direct vs preview (preview off by default), paste vs type.
  • Advanced: limits, performance knobs (threads/batch), local logs opt-in.

4) Non-Functional Requirements

  • Offline execution after models are installed.
  • Latency target (M1 + “small” model): < 4 s for 10 s of audio.
  • Memory target: ~1.5–2.5 GB with “small”.
  • Privacy: audio and text never leave the device.
  • Accessibility: sufficient contrast; VoiceOver labels; focus never stolen by HUD.

5) Architecture (High-Level)

  • App (SwiftUI) with AppKit bridges for NSStatusItem and NSPanel.
  • Shortcut Manager (Carbon RegisterEventHotKey or HotKey/MASShortcut); a registration sketch follows this list.
  • Audio: AVAudioEngine (downsample to 16 kHz mono, 16-bit PCM).
  • STT Engine:
    • whisper.cpp (C/C++ via SPM/CMake) with Metal.
    • Core ML backend (e.g., WhisperKit / custom) in a later phase.
  • Model Manager: curated catalog, downloads (progress + SHA256), selection, caching.
  • Text Injection: pasteboard + CGEvent; typing fallback; Secure Input detection.
  • Permissions Manager: guided flows to System Settings panes.
  • Settings: UserDefaults + JSON export/import.
  • Packaging: .app + .dmg (signed & notarized).
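
A sketch of the Shortcut Manager's Carbon path for the default ⌘⇧V; the handler wiring and four-char signature are placeholders, and wrapper libraries such as HotKey/MASShortcut encapsulate the same calls:

import Carbon.HIToolbox

// Sketch: registers the global hot key; key-down drives both push-to-talk and toggle,
// and a second handler for kEventHotKeyReleased would cover the push-to-talk key-up.
final class HotkeyManager {
    static var onPressed: (() -> Void)?   // assigned by the app; placeholder wiring
    private var hotKeyRef: EventHotKeyRef?

    func register() {
        var pressedSpec = EventTypeSpec(eventClass: OSType(kEventClassKeyboard),
                                        eventKind: UInt32(kEventHotKeyPressed))
        // C callback: must not capture context, so it forwards to the static closure.
        _ = InstallEventHandler(GetApplicationEventTarget(), { (_, _, _) -> OSStatus in
            HotkeyManager.onPressed?()
            return noErr
        }, 1, &pressedSpec, nil, nil)

        let hotKeyID = EventHotKeyID(signature: OSType(0x4D57_4850), id: 1)  // arbitrary signature
        _ = RegisterEventHotKey(UInt32(kVK_ANSI_V),          // default key: V
                                UInt32(cmdKey | shiftKey),   // default modifiers: ⌘⇧
                                hotKeyID,
                                GetApplicationEventTarget(),
                                0,
                                &hotKeyRef)
    }
}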

6) Main Flow

  1. User presses global hotkey.
  2. Check permissions; guide if missing.
  3. Show HUD → Listening; start capture.
  4. Stop (key up/toggle/timeout).
  5. HUD → Processing; run STT in background.
  6. On result → (optional preview) → insert (paste) or fallback (type). If Secure Input, do not inject; keep in clipboard + show notice.
  7. Close HUD → Idle.

7) Finite State Machine (FSM)

  • Idle → (Hotkey) → Listening
  • Listening → (Stop/Timeout) → Processing
  • Processing → (Done) → Injecting
  • Injecting → (Done) → Idle
  • Any → (Error) → Error (modal) → Idle
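
A compact sketch of this FSM as a pure transition function; event names and the error payload are illustrative:

// Sketch of the states above; unexpected events leave the state unchanged.
enum DictationState { case idle, listening, processing, injecting, error(String) }
enum DictationEvent { case hotkey, stopped, timedOut, transcriptReady, injected, dismissed, failed(String) }

func transition(_ state: DictationState, on event: DictationEvent) -> DictationState {
    switch (state, event) {
    case (.idle, .hotkey):                                 return .listening
    case (.listening, .stopped), (.listening, .timedOut):  return .processing
    case (.processing, .transcriptReady):                  return .injecting
    case (.injecting, .injected):                          return .idle
    case (.error, .dismissed):                             return .idle     // error modal dismissed
    case (_, .failed(let message)):                        return .error(message)
    default:                                               return state
    }
}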

8) Model Management (Manual Downloads)

Goal: Offer a clear list of free Whisper-family models (names, sizes, languages, recommended backend) with one-click downloads. No automatic downloads.

8.1 OpenAI Whisper (official weights)

  • Families: tiny, base, small, medium, large-v2, large-v3 (multilingual; some .en variants).
  • Usable with whisper.cpp via GGUF (community conversions widely available).

8.2 Whisper for whisper.cpp (converted GGUF)

  • Community-maintained conversions for whisper.cpp (GGUF), optimized for CPU/GPU Metal on macOS.

8.3 Faster-Whisper (CTranslate2)

  • Optimized variants (tiny/base/small/medium/large-v2/large-v3). Useful if a CT2-based or Core-ML-assisted backend is added later.

8.4 Distil-Whisper (distilled)

  • Distilled models (e.g., distil-large-v2/v3/v3.5, distil-small.en), significantly smaller/faster with near-large accuracy.

UI must show: model file size, languages, license, RAM estimate, and a warning if a large model is selected on lower-memory machines.

Optional JSON format for catalog entries (example entry below, used by the app’s first-run picker):

{
  "name": "whisper-small",
  "family": "OpenAI-Whisper",
  "format": "gguf",
  "size_mb": 466,
  "languages": ["multilingual"],
  "recommended_backend": "whisper.cpp",
  "quality_tier": "small",
  "license": "MIT",
  "sha256": "…",
  "download_url": "…",
  "notes": "Good balance of speed/accuracy on M1/M2."
}
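
A sketch of a Codable mirror for these entries plus the SHA256 check (CryptoKit), assuming downloads land in a local file URL before the model is activated:

import Foundation
import CryptoKit

// Sketch: decodes catalog entries like the JSON above and verifies a downloaded file.
struct ModelCatalogEntry: Codable {
    let name: String
    let family: String
    let format: String
    let sizeMB: Int
    let languages: [String]
    let recommendedBackend: String
    let qualityTier: String
    let license: String
    let sha256: String
    let downloadURL: URL
    let notes: String?

    enum CodingKeys: String, CodingKey {
        case name, family, format, languages, license, sha256, notes
        case sizeMB = "size_mb"
        case recommendedBackend = "recommended_backend"
        case qualityTier = "quality_tier"
        case downloadURL = "download_url"
    }

    // Compares the catalog hash against the downloaded file before the model is selectable.
    func matchesChecksum(ofFileAt fileURL: URL) throws -> Bool {
        let data = try Data(contentsOf: fileURL, options: .mappedIfSafe)  // mapped to avoid loading GBs eagerly
        let digest = SHA256.hash(data: data)
        let hex = digest.map { String(format: "%02x", $0) }.joined()
        return hex == sha256.lowercased()
    }
}

URLSession's downloadTask plus its progress reporting would cover the download-with-progress requirement; the verified file is then moved into the local model store.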

9) Security & Permissions

  • Info.plist: NSMicrophoneUsageDescription.
  • Accessibility & Input Monitoring: required for CGEvent; provide clear step-by-step guidance and deep-links.
  • Secure Input: check IsSecureEventInputEnabled(); never attempt to bypass. Provide help text to identify apps that enable it (password fields, 2FA prompts, etc.).
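
A sketch of the checks behind these guided flows; the System Settings deep-link string is an assumption observed on recent macOS versions and may change:

import AppKit
import AVFoundation
import ApplicationServices

// Sketch of the three onboarding checks: microphone, Accessibility, Input Monitoring.
enum PermissionsManager {
    static func requestMicrophone(_ completion: @escaping (Bool) -> Void) {
        AVCaptureDevice.requestAccess(for: .audio, completionHandler: completion)
    }

    static func isAccessibilityTrusted(promptIfNeeded: Bool) -> Bool {
        let options = [kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String: promptIfNeeded] as CFDictionary
        return AXIsProcessTrustedWithOptions(options)
    }

    static func openInputMonitoringPane() {
        // Deep link into System Settings; the scheme is undocumented, so fall back to plain instructions if it breaks.
        if let url = URL(string: "x-apple.systempreferences:com.apple.preference.security?Privacy_ListenEvent") {
            _ = NSWorkspace.shared.open(url)
        }
    }
}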

10) Performance

  • Lazy-load and reuse model (warm cache).
  • Real-time downsampling to 16 kHz mono; chunked streaming into backend.
  • Configurable threads; prefer Metal path on Apple Silicon.
  • “Fast path” tweaks for short clips (<15 s).
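
A sketch of the capture and downsampling path, assuming Float32 output buffers (swap to .pcmFormatInt16 for the 16-bit PCM variant named in the architecture section); names are illustrative:

import AVFoundation

// Sketch: tap the input node, convert each buffer to 16 kHz mono, and stream chunks onward.
final class AudioCapture {
    private let engine = AVAudioEngine()
    private let targetFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                             sampleRate: 16_000,
                                             channels: 1,
                                             interleaved: false)!

    func start(onChunk: @escaping (AVAudioPCMBuffer) -> Void) throws {
        let input = engine.inputNode
        let inputFormat = input.outputFormat(forBus: 0)
        let target = targetFormat
        guard let converter = AVAudioConverter(from: inputFormat, to: target) else { return }

        input.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { buffer, _ in
            let ratio = target.sampleRate / inputFormat.sampleRate
            let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 1
            guard let out = AVAudioPCMBuffer(pcmFormat: target, frameCapacity: capacity) else { return }
            var consumed = false
            _ = converter.convert(to: out, error: nil) { _, status in
                if consumed {
                    status.pointee = .noDataNow
                    return nil
                }
                consumed = true
                status.pointee = .haveData
                return buffer
            }
            onChunk(out)   // 16 kHz mono chunk for the STT backend and the HUD RMS meter
        }
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}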

11) Logging & Privacy

  • No remote telemetry.
  • Local logs opt-in (timings, errors only). Never store audio/text unless user explicitly enables a debug flag.
  • “Wipe local data” button (models remain unless the user removes them).

12) Internationalization

  • UI in Spanish and English (Localizable.strings).
  • STT multilingual; language auto or forced per user preference.

13) Testing (Minimum)

  • macOS 13/14/15 on M1/M2/M3.
  • Injection works in Safari, Chrome, Notes, VS Code, Terminal, iTerm2, Mail.
  • Secure Input: correctly detected; no injection; clipboard + notice.
  • Meet latency target with small model on M1.
  • Model download & selection flows (simulate network errors).

14) Phased Plan (AI-Deliverables)

Phase 0 — Scaffolding (MVP-0)

Goal: Base project + menubar. Deliverables:

  • SwiftUI app with MenuBarExtra, microphone icon, “Idle” state.
  • ARCHITECTURE.md describing modules (Audio/STT/Injection/Models/Permissions/Settings).
  • Build scripts and signing/notarization templates.

DoD: Compiles; menu bar item visible; SPM structure ready.

Phase 1 — Hotkey + HUD + Audio (MVP-1)

Goal: Listening UX without real STT. Deliverables:

  • Global hotkey (default ⌘⇧V) with push and toggle.
  • NSPanel HUD (Listening/Processing) + real RMS bars from AVAudioEngine.
  • Per-dictation limit (default 10 min).

DoD: Live meter responds to mic; correct state transitions.

Phase 2 — STT via whisper.cpp (MVP-2)

Goal: Real offline transcription. Deliverables:

  • whisper.cpp module (C/C++), background inference with Metal.
  • Model Manager (curated list, download with SHA256, selection).
  • Language auto/forced; basic normalization.

DoD: 10-second clip → coherent ES/EN text offline; meets timing targets.

Phase 3 — Robust Insertion (MVP-3)

Goal: Reliable insertion into focused app. Deliverables:

  • Paste (clipboard + ⌘V) and typing fallback.
  • Secure Input detection; safe behavior (no injection, clipboard + notice).

DoD: Works across target apps; correct Secure Input handling.

Phase 4 — Preferences + UX Polish (MVP-4)

Goal: Complete options & stability. Deliverables:

  • Full Preferences (hotkey, modes, model, language, insertion, HUD, sounds).
  • Optional preview dialog (off by default).
  • Config export/import (JSON).

DoD: All settings persist and are honored.

Phase 5 — Distribution (MVP-5)

Goal: Installable package. Deliverables:

  • Error handling; permission prompts & help (incl. Secure Input troubleshooting).
  • .dmg (signed + notarized) and install guide.
  • USER_GUIDE.md + TROUBLESHOOTING.md.

DoD: Clean install on test machines; distribution checklist passed.

Phase 6 — Core ML Backend (Post-MVP)

Goal: Second backend. Deliverables:

  • Core ML integration (e.g., WhisperKit or custom conversion).
  • Backend selector (whisper.cpp/Core ML) in Preferences; local benchmarks table.

DoD: Feature parity and stability; documented pros/cons.

15) Mini-Prompts for the Builder AI (per Phase)

  • P0: “Create macOS 13+ SwiftUI menubar app (MenuBarExtra), microphone icon, SPM layout with modules in ARCHITECTURE.md.”
  • P1: “Add global hotkey (push & toggle) with RegisterEventHotKey; NSPanel HUD with RMS bars from AVAudioEngine; 10-minute dictation limit.”
  • P2: “Integrate whisper.cpp (Metal); add Model Manager (curated list, SHA256-verified downloads, selection); language auto/forced; transcribe WAV 16 kHz mono.”
  • P3: “Implement insertion: pasteboard+⌘V and CGEvent typing fallback; detect IsSecureEventInputEnabled() and avoid injection.”
  • P4: “Implement full Preferences, optional preview, JSON export/import; UX polish and messages.”
  • P5: “Signing + notarization; produce .dmg; write USER_GUIDE and TROUBLESHOOTING (with Secure Input section).”
  • P6: “Add Core ML backend (WhisperKit/custom), backend selector, and local benchmarks.”

16) Suggested Repo Layout

MenuWhisper/
  Sources/
    App/                 # SwiftUI + AppKit bridges
    Core/
      Audio/             # AVAudioEngine capture + meters
      STT/
        WhisperCPP/      # C/C++ wrapper + Metal path
        CoreML/          # post-MVP
      Models/            # catalog, downloads, hashes
      Injection/         # clipboard, CGEvent typing, secure input checks
      Permissions/
      Settings/
      Utils/
  Resources/             # icons, sounds, localizations
  Docs/                  # ARCHITECTURE.md, USER_GUIDE.md, TROUBLESHOOTING.md
  Scripts/               # build, sign, notarize
  Tests/                 # unit + integration

17) Risks & Mitigations

  • Hotkey collision (⌘⇧V) with “Paste and Match Style” in some apps → make it discoverable & easily rebindable; warn on conflict.
  • Secure Input blocks injection → inform the user, keep text on clipboard, provide help to identify the app enabling it.
  • RAM/latency with large models → recommend small/base by default; show RAM/latency hints in the model picker.
  • Keyboard layouts → prefer paste; if typing, map using the active layout.

18) Global MVP Definition of Done

  • A 30–90 s dictation yields accurate ES/EN text offline and inserts correctly in common apps.
  • Secure Input is correctly detected and handled.
  • Model download/selection is robust and user-driven.
  • Shippable .dmg (signed + notarized) and clear docs included.