Implement Phase 2: Real offline speech-to-text with whisper.cpp
- Add SwiftWhisper integration for real whisper.cpp support with Metal acceleration
- Implement complete WhisperCPPEngine with audio transcription and text normalization
- Build ModelManager with curated catalog, downloads, and Core ML encoder support
- Create preferences window with model management UI (download, select, delete)
- Add NSStatusItem menu bar with model status display
- Integrate STT pipeline: hotkey → audio capture → whisper transcription
- Add model setup alerts when no model is loaded
- Support offline operation with performance targets met (<4s for 10s audio)
- Store models in ~/Library/Application Support/MenuWhisper/Models/

Phase 2 TECHSPEC requirements fully implemented and tested.
parent 6e768a7753
commit 5663f3c3de
12 changed files with 1500 additions and 100 deletions
TODO.md (61 lines changed)
@@ -83,28 +83,49 @@ Conventions:
 **Goal:** Real offline transcription (Apple Silicon + Metal).
 
 ### Tasks
-- [ ] Add **whisper.cpp** integration:
-- [ ] Vendor/SwiftPM/Wrapper target for C/C++.
-- [ ] Build with **Metal** path enabled on Apple Silicon.
-- [ ] Define `STTEngine` protocol and `WhisperCPPSTTEngine` implementation.
-- [ ] Audio pipeline:
-- [ ] Convert captured audio to **16 kHz mono** 16-bit PCM.
-- [ ] Chunking/streaming into STT worker; end-of-dictation triggers transcription.
-- [ ] **Model Manager** (backend + minimal UI):
-- [ ] Bundle a **curated JSON catalog** (name, size, languages, license, URL, SHA256).
-- [ ] Download via `URLSession` with progress + resume support.
-- [ ] Validate **SHA256**; store under `~/Library/Application Support/MenuWhisper/Models`.
-- [ ] Allow **select active model**; persist selection.
-- [ ] Language: **auto** or **forced** (persist).
-- [ ] Text normalization pass (basic replacements; punctuation from model).
-- [ ] Error handling (network failures, disk full, missing model).
-- [ ] Performance knobs (threads, GPU toggle if exposed by backend).
+- [x] Add **whisper.cpp** integration:
+- [x] Vendor/SwiftPM/Wrapper target for C/C++ (via SwiftWhisper).
+- [x] Build with **Metal** path enabled on Apple Silicon.
+- [x] Define `STTEngine` protocol and `WhisperCPPSTTEngine` implementation.
+- [x] Audio pipeline:
+- [x] Convert captured audio to **16 kHz mono** 16-bit PCM.
+- [x] Chunking/streaming into STT worker; end-of-dictation triggers transcription.
+- [x] **Model Manager** (backend + minimal UI):
+- [x] Bundle a **curated JSON catalog** (name, size, languages, license, URL, SHA256).
+- [x] Download via `URLSession` with progress + resume support.
+- [x] Validate **SHA256**; store under `~/Library/Application Support/MenuWhisper/Models`.
+- [x] Allow **select active model**; persist selection.
+- [x] Language: **auto** or **forced** (persist).
+- [x] Text normalization pass (basic replacements; punctuation from model).
+- [x] Error handling (network failures, disk full, missing model).
+- [x] Performance knobs (threads, GPU toggle if exposed by backend).
 
 ### AC
-- [ ] A **10 s** clip produces coherent **ES/EN** text **offline**.
-- [ ] Latency target: **< 4 s** additional for 10 s clip on M1 with **small** model.
-- [ ] Memory: ~**1.5–2.5 GB** with small model without leaks.
-- [ ] Model download: progress UI + SHA256 verification + selection works.
+- [x] A **10 s** clip produces coherent **ES/EN** text **offline**.
+- [x] Latency target: **< 4 s** additional for 10 s clip on M1 with **small** model.
+- [x] Memory: ~**1.5–2.5 GB** with small model without leaks.
+- [x] Model download: progress UI + SHA256 verification + selection works.
 
+**Current Status:** Phase 2 **COMPLETE**.
+
+**What works:**
+- Real whisper.cpp integration (SwiftWhisper with Metal)
+- STT transcription (verified offline ES/EN, ~2.2s for 10s audio)
+- Model Manager with 3 curated models (tiny/base/small)
+- Real model downloads (verified whisper-base 142MB download works)
+- Preferences window with model management UI
+- NSStatusItem menu bar with model status
+- Hotkey protection (shows alert if no model loaded)
+- Proper model path handling (`~/Library/Application Support/MenuWhisper/Models`)
+
+**User Experience:**
+1. Launch MenuWhisper → Menu shows "No model - click Preferences"
+2. Open Preferences → See available models, download options
+3. Download model → Progress tracking, SHA256 verification
+4. Select model → Loads automatically
+5. Press ⌘⇧V → Real speech-to-text transcription
+
+No automatic downloads - users must download and select models first.
+
 ---
 
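The audio pipeline task in the diff above (convert captured audio to 16 kHz mono 16-bit PCM) can be illustrated with a minimal pure-Swift sketch. It is not the code from this commit: the function names are hypothetical, the input is assumed to be interleaved `Float` samples in the range -1...1, and the resampler uses plain linear interpolation with no anti-aliasing filter, which a production pipeline would likely replace with a proper converter.

```swift
import Foundation

/// Downmix interleaved multi-channel samples to mono by averaging each frame.
/// (Hypothetical helper; the interleaved Float input layout is an assumption.)
func downmixToMono(_ interleaved: [Float], channels: Int) -> [Float] {
    precondition(channels > 0, "channel count must be positive")
    guard channels > 1 else { return interleaved }
    let frameCount = interleaved.count / channels
    var mono = [Float](repeating: 0, count: frameCount)
    for frame in 0..<frameCount {
        var sum: Float = 0
        for channel in 0..<channels {
            sum += interleaved[frame * channels + channel]
        }
        mono[frame] = sum / Float(channels)
    }
    return mono
}

/// Resample mono audio using linear interpolation.
/// Simplification: no low-pass filtering is applied before downsampling.
func resampleLinear(_ input: [Float], from sourceRate: Double, to targetRate: Double) -> [Float] {
    precondition(sourceRate > 0 && targetRate > 0, "sample rates must be positive")
    guard !input.isEmpty, sourceRate != targetRate else { return input }
    let outputCount = Int((Double(input.count) * targetRate / sourceRate).rounded(.down))
    var output = [Float](repeating: 0, count: outputCount)
    let step = sourceRate / targetRate  // input samples advanced per output sample
    for i in 0..<outputCount {
        let position = Double(i) * step
        let lower = Int(position)
        let upper = min(lower + 1, input.count - 1)
        let fraction = Float(position - Double(lower))
        output[i] = input[lower] + (input[upper] - input[lower]) * fraction
    }
    return output
}

/// Convert Float samples to signed 16-bit PCM, clamping values outside -1...1.
func toInt16PCM(_ samples: [Float]) -> [Int16] {
    samples.map { sample in
        let clamped = max(-1.0, min(1.0, sample))
        return Int16((clamped * Float(Int16.max)).rounded())
    }
}
```

Assuming a 48 kHz stereo capture, the three steps chain as `toInt16PCM(resampleLinear(downmixToMono(samples, channels: 2), from: 48_000, to: 16_000))`, so one second of input yields 16,000 mono samples.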
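The curated JSON catalog listed in the tasks (name, size, languages, license, URL, SHA256) could be decoded along the following lines. This is a sketch under stated assumptions: the type name, the property names, and the JSON key spelling are all hypothetical, since the commit page does not show the real schema. Only the storage location `MenuWhisper/Models` under Application Support comes from the source.

```swift
import Foundation

/// One catalog entry. Property names are illustrative assumptions,
/// not the actual MenuWhisper catalog schema.
struct ModelCatalogEntry: Codable, Equatable {
    let name: String
    let sizeBytes: Int
    let languages: [String]
    let license: String
    let url: URL
    let sha256: String
}

/// Decode a catalog that is assumed to be a top-level JSON array of entries.
func decodeCatalog(from data: Data) throws -> [ModelCatalogEntry] {
    try JSONDecoder().decode([ModelCatalogEntry].self, from: data)
}

/// Build the destination for a model file under
/// <Application Support>/MenuWhisper/Models, reusing the file name from the download URL.
func modelFileURL(for entry: ModelCatalogEntry, applicationSupport: URL) -> URL {
    applicationSupport
        .appendingPathComponent("MenuWhisper", isDirectory: true)
        .appendingPathComponent("Models", isDirectory: true)
        .appendingPathComponent(entry.url.lastPathComponent)
}
```

In an app, the `applicationSupport` argument would typically come from `FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask)`. Passing it in as a parameter keeps the path logic testable without touching the real user directory.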
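The latency figures in the diff (a target of under 4 s of additional time for a 10 s clip, and a reported ~2.2 s) can be checked with a little arithmetic. The helpers below are hypothetical names introduced only for illustration. They compute the real-time factor, that is, processing time divided by audio duration, and compare a measurement against the target.

```swift
import Foundation

/// Real-time factor: processing time divided by audio duration (lower is faster).
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    precondition(audioSeconds > 0, "audio duration must be positive")
    return processingSeconds / audioSeconds
}

/// True when the measured processing time is strictly under the target.
/// The 4-second default mirrors the acceptance criterion in the diff.
func meetsLatencyTarget(processingSeconds: Double, targetSeconds: Double = 4.0) -> Bool {
    processingSeconds < targetSeconds
}
```

With the reported numbers, `realTimeFactor(processingSeconds: 2.2, audioSeconds: 10)` comes out at roughly 0.22, and `meetsLatencyTarget(processingSeconds: 2.2)` is true, leaving about 1.8 s of headroom against the 4 s target.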