Development purpose

I want to develop a virtual carer (VC) that chats with elderly people in the image and voice of their family members. I have spent three months developing the main server, which hosts an app that collects family members' basic information so the VC can create meaningful conversations. The client app is a React app that uses WebRTC to stream a generated avatar talking head while capturing the user's speech and facial expressions to send to the server. After seeing Meta AI and Llama, I think I am re-inventing the wheel. Please read my brief design. The design was developed with assistance from Claude, ChatGPT, DeepSeek and Grok; however, none of them told me that what I am doing could be done more easily and cheaply with existing technologies.

The system consists of a tablet (patient interface), a local server (24/7 monitoring service), and a main server that manages all admin-related tasks and hosts the database for backups.

In the architecture view:

[ VC Tablet (carried by patient) ]
 ├── Face Cam → Emotion + Identity
 ├── Mic/Speaker → Voice Input + Output
 ├── BLE Wearables → Heartbeat, SpO2, Movement tracking (optional)
 ├── Avatar UI → Electron + React or Unity WebGL
 ├── Memory → SQLite + AI reasoning + intent creation + event summaries
 ├── Alerts → TTS prompts + n8n timers
 ├── Push Memory → backup to the Main Server every 5 min (Wi-Fi)
 └── Status → reports to the Local Server via n8n

As a tablet, its operating hours are limited by its battery; hence these tasks are active only while the patient engages with it directly in face-to-face interaction. The Local Server, on the other hand, is better placed to run these tasks 24/7.
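As a sketch of the tablet's 5-minute memory push, assuming an events table in the SQLite memory store and a hypothetical /api/memory/backup endpoint on the Main Server (both names are illustrative, not fixed by this design):

```python
import sqlite3
import time

import requests

# Hypothetical Main Server backup endpoint; the real URL is not fixed by this design.
MAIN_SERVER = "https://main-server.example.com/api/memory/backup"

def push_memory(db_path: str = "memory.sqlite") -> None:
    """Send any events not yet backed up to the Main Server, then mark them."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, ts, summary FROM events WHERE backed_up = 0"
    ).fetchall()
    if rows:
        payload = [{"id": r[0], "ts": r[1], "summary": r[2]} for r in rows]
        resp = requests.post(MAIN_SERVER, json=payload, timeout=10)
        if resp.ok:
            conn.executemany(
                "UPDATE events SET backed_up = 1 WHERE id = ?",
                [(r[0],) for r in rows],
            )
            conn.commit()
    conn.close()

while True:
    push_memory()
    time.sleep(300)  # push every 5 minutes, as in the architecture above
```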

[ Local Server ]
 ├── TV interface → Electron Avatar UI, i.e. the talking head (only when the TV is on)
 ├── Face Cam → Emotion + Identity (constantly active, using an ONVIF camera to take images)
 ├── Mic/Speaker → Voice Input + Output (constantly active, using the ONVIF camera's 2-way audio)
 ├── BLE Wearables → Heartbeat, SpO2, Movement tracking (constantly active)
 ├── 24/7 watchdog → falls, distress, remote queries (constantly active, using the ONVIF camera)
 ├── Memory → SQLite + AI reasoning
 ├── Alerts → TTS prompts + n8n timers
 ├── Push Memory → backup to the Main Server every 5 min (Wi-Fi)
 └── Tablet coordination → communicates with the tablets via n8n and acknowledges their status; for example, it suspends face detection and emotion analysis while a tablet is active, avoiding duplicated effort (see the sketch below)
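The tablet/Local Server handshake could look like the following sketch, assuming an n8n webhook workflow on the Local Server; the webhook path and device name are illustrative:

```python
import requests

# Hypothetical n8n webhook URL on the Local Server (n8n listens on port 5678 by default).
N8N_WEBHOOK = "http://local-server.lan:5678/webhook/tablet-status"

def report_tablet_status(active: bool) -> None:
    # The tablet tells the Local Server whether it is engaged with the patient,
    # so the n8n workflow can suspend the server's face detection and emotion
    # analysis while the tablet is active, avoiding duplicated effort.
    requests.post(
        N8N_WEBHOOK,
        json={"device": "vc-tablet-01", "active": active},
        timeout=5,
    )

report_tablet_status(active=True)   # patient starts a face-to-face session
report_tablet_status(active=False)  # session ends; the Local Server resumes monitoring
```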

[ Main Server - Cloud based ]
└── Provide and host a dashboard for the admin:

  1. Showing the system's operational parameters: the number of patients, carers, family members, active local servers/tablets, etc. These visual displays give third-party human assistance the information needed to mobilise physical help to a patient on demand if all layers of contact fail.

└── Provide and host the mobile web app for human carers or family members

  1. To register the faces, voices and info of patients, carers, doctors and family members.
  2. To view the log of the patient under care.
  3. To add the patient's medical history (using voice, images, PDF, text).
  4. To add events from the patient's family history (using voice, images, PDF, text).
  5. To add a third-party 24-hour emergency human-care contact.

└── Provide an API for third parties to access.
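A minimal sketch of what one such third-party endpoint could look like, assuming FastAPI and a simple API-key check; the route, header and payload fields are illustrative only:

```python
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

API_KEYS = {"demo-key-123"}  # illustrative; real keys would live in the Main Server database

@app.get("/api/v1/patients/{patient_id}/status")
def patient_status(patient_id: str, x_api_key: str = Header(...)):
    # Third parties must present a valid API key before reading patient status.
    if x_api_key not in API_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")
    # Illustrative placeholder payload; real data would come from the database.
    return {"patient_id": patient_id, "last_seen": "2024-01-01T10:00:00Z", "alerts": []}
```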

The Main Server will have access to a GPU-powered server:

a. to generate voice models and video clips of carers and family members, which are sent to the Local Server and Tablet to help build the avatar. This process runs offline and can be done in batches.

b. to generate intents and summarise events from the patient's medical history and family-history events. This process also runs offline and can be done in batches.

Avatar generation using the HeyGen or D-ID API

┌─ Patient Speaks ──┐     ┌─ Conversation Service ─┐     ┌─ Avatar Service ──┐
│ 1. Voice Input    │───▶│ 2. STT → LLM → TTS     │───▶│ 3. Generate Video │
│ WebRTC Audio, STT │     │    Generate Response   │     │ HeyGen/D-ID API   │
└───────────────────┘     └────────────────────────┘     └───────────────────┘
         ▲                                                        │
         │        ┌─ WebRTC Bridge ─────────────────────┐         │
         └────────│ 4. Stream Avatar Video              │◀────────┘
                  │    Real-time WebRTC Delivery        │
                  └─────────────────────────────────────┘
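As an illustration of step 3, a sketch of a D-ID request, assuming D-ID's POST /talks REST endpoint; verify the exact auth scheme and request fields against the current D-ID docs before relying on them:

```python
import requests

D_ID_TALKS = "https://api.d-id.com/talks"
API_KEY = "..."  # D-ID API key (left elided; the auth scheme varies by account type)

def generate_avatar_clip(text: str, source_url: str) -> str:
    """Ask D-ID to render a talking-head clip of the given face speaking the text."""
    resp = requests.post(
        D_ID_TALKS,
        headers={"Authorization": f"Basic {API_KEY}"},
        json={
            "source_url": source_url,                   # photo of the family member
            "script": {"type": "text", "input": text},  # what the avatar should say
        },
        timeout=30,
    )
    resp.raise_for_status()
    # D-ID returns a talk id; poll GET /talks/{id} until the result video URL is ready.
    return resp.json()["id"]
```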

XTTS V2

With XTTS v2, the text should be segmented into short sentences, each ending with a full stop. For example, "This morning, I went to school. At late afternoon, I got home and called you." will be segmented as:

  1. This morning, I went to school.
  2. At late afternoon, I got home and called you.

By breaking the text into smaller sentences, the first sentence can be streamed after roughly 1.5 s while the second sentence is still being generated.
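A minimal sketch of this approach, assuming the Coqui TTS package and its XTTS v2 model; the regex splitter, reference voice clip and output file names are illustrative:

```python
import re

from TTS.api import TTS  # Coqui TTS, which ships the XTTS v2 model

def split_sentences(text: str) -> list[str]:
    # Split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

text = "This morning, I went to school. At late afternoon, I got home and called you."
for i, sentence in enumerate(split_sentences(text), start=1):
    # Each short sentence is synthesised on its own, so the first clip can be
    # streamed to the client while the later ones are still being generated.
    tts.tts_to_file(
        text=sentence,
        speaker_wav="family_member_voice.wav",  # reference clip for voice cloning
        language="en",
        file_path=f"segment_{i}.wav",
    )
```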

Run Docker using the same model cache, to avoid re-downloading:

  1. Create the host cache dir and run the container (the first run will download the models into the host dir):

     mkdir -p ~/avatar_models
     docker run -it --rm \
       -v ~/avatar_models:/root/.cache/tts_models \
       avatar-xtts:latest \
       bash -c 'ls -la /root/.cache/tts_models || echo empty; python -c "import os; print(\"cwd\", os.getcwd())"'

  2. After the container runs and downloads the models, inspect the host dir to confirm the files:

     ls -la ~/avatar_models

  3. Run another container mounting the same dir; it should reuse the cache (no re-download):

     docker run -it --rm \
       -v ~/avatar_models:/root/.cache/tts_models \
       avatar-xtts:latest \
       bash -c 'ls -la /root/.cache/tts_models | head -n 40; python -c "print(\"models present\")"'

  4. If you prefer a named volume:

     docker volume create avatar_models
     docker run -d --name avatar-xtts \
       -p 5001:5001 \
       -v avatar_models:/root/.cache/tts_models \
       avatar-xtts:latest

  5. Permissions fix (if needed). If you see permission errors on the host dir, change its ownership to match the UID the container runs as. For example, to make it owned by your current user:

     sudo chown -R $(id -u):$(id -g) ~/avatar_models

     or make it root-owned if the container runs as root:

     sudo chown -R 0:0 ~/avatar_models

  6. Smoke test:

     curl -s -X GET http://localhost:5001/health | jq -C '.'

     curl -s -X GET "http://localhost:5001/synthesize_stream?text=hello+world&language=en" \
       --output /tmp/stream_test.mp3
     echo "saved /tmp/stream_test.mp3 size:" $(stat -f%z /tmp/stream_test.mp3)   # stat -f%z is BSD/macOS; use stat -c%s on Linux

     curl -s -X POST http://localhost:5001/synthesize \
       -H 'Content-Type: application/json' \
       -d '{"text": "quick test", "language": "en"}' \
       --output /tmp/synth_test.wav
     echo "saved /tmp/synth_test.wav size:" $(stat -f%z /tmp/synth_test.wav)