Have Your Own "Personal AI Assistant" at Home — The Easy-Read Edition 🏠🤖

This version is for people who are not tech folks — read it to understand what this thing is, what it can do, whether it's worth it, and how to get one When you're ready to actually build it, open the full edition (the second file in this gist), which walks you through every single button

What Is It? In Plain Words

Imagine you have a personal secretary who:

Sits in your chat room (Discord) 24 hours a day — never sleeps, never takes leave, never gets bored
Takes your orders whether you type them or speak them — just like talking to someone on the phone
Listens to you talk and writes it all down as text instantly (yes, it understands Thai too)
Can talk back to you out loud — not just in text
Actually gets things done: "Look this up for me..." "Summarize this" "Make me a picture" "Take the meeting notes"

What's Discord? — A free chat app (think LINE crossed with Zoom) with text rooms and voice rooms. We use it because: it's free, it has an official, by-the-rules way to let a "bot" live in your rooms, and it works on both phone and computer.

A Day with Your AI Assistant — So You Can Picture It

7:30 AM — Before you've asked for anything, the assistant speaks up in the voice room on its own:

"Good morning! Today is Thursday the 12th. You have a 2 PM appointment, and don't forget to pay the water bill."

Mid-morning, driving — Hands busy, so you just talk at your phone:

You: "Jot this down — promo idea for next month: free shipping on orders over five hundred." Assistant: "Noted!"

Afternoon, team meeting — Everyone joins the voice room. You say "start taking notes" — the assistant goes quiet and writes down every word. When the meeting ends you say "that's enough, summarize it" — and you get a summary of the key points plus an action list of who has to do what, dropped straight into the chat.

Evening — You type: "Make me a promo image for the shop, warm and cozy vibes" — and a moment later the picture pops into the chat.

Before bed — Just for fun, you ask: "What did I tell you to write down last week again?" — and it can answer, because it remembers across days.

None of this is fiction — it's a system that was genuinely built in one night, and the full guide teaches you every piece of it.

What's Inside? — Think of It as a Shop with 4 Employees

Part	Like...	Job
👂 Ears	A stenographer with amazing hearing	Listens to speech and turns it into text (runs on your own machine — your voice never gets sent anywhere)
🗣️ Mouth	A voice actor	Turns text into spoken words — you can even pick a male or female voice
🧠 Fast brain	The front-desk receptionist	Answers everyday questions instantly, in ~10 seconds — but has no key to the workshop. Big jobs get handed off
👑 Big brain	The master mechanic in the back	Takes on the real work: research, editing files, making images — then sends the results back

Why split the receptionist from the master mechanic? — Same reason as a real shop: if the master mechanic had to greet every customer at the door, no repairs would ever get done, and customers would wait forever. Split them up, and both run fast.

A funny true story: At first we accidentally let the receptionist "hold the keys" too — and it snuck off and opened 4 random rooms, then reported back to the owner, "I've opened the room you asked for!" ...for a room that didn't even exist 😅 Ever since, the iron rule has been: the receptionist never holds the keys, and never says "done" when it isn't.

What Do You Need?

"A house for the assistant to live in" — any machine you can leave running. Pick whatever you've got:
- An old computer/laptop at home — free, uses what you already have
- A mini PC / Mac mini — quiet, sips power, set-and-forget
- A rented monthly server/VPS (a few dollars a month) — no machine at home needed at all
- A tidbit most people don't know: the bot never uses the mic or speakers of the machine it lives on. All the audio travels through the Discord app — so even a screenless, speakerless box like a VPS makes a perfectly good home
A Discord account — free, sign up like any other app
The "bot's ID card" (token) — Discord hands these out for free; it's the code that proves your assistant is who it says it is ⚠️ Think of it as your house key — never, ever give it to anyone. Whoever gets it can impersonate your assistant
An AI brain — pick one based on your budget:
- The completely free way: run an AI on your own machine (smart enough for everyday use)
- The monthly subscription way: use a big-name AI (Claude, ChatGPT, Gemini) — noticeably smarter

Cost summary: Going fully free is genuinely possible ($0/month, not counting electricity) — or if you already have an AI subscription, just use what you've got.

So How Do You Get One? — There Are 3 Paths

Path 1: Let an AI Build It for You (Easiest — Recommended) ⭐

The full guide includes a ready-made "work order" (Mega-Prompt) — think of it as an extremely detailed house blueprint. You just copy-paste it into a capable AI (like Claude Code), fill in 3-4 blanks, and the AI builds the assistant for you — the whole system. It's like hiring a contractor who comes with the blueprint already drawn — you don't write a single line of code.

Path 2: Follow the Guide Yourself

The full edition walks you through step by step — which button to press, what to type — and points out the 8 "traps" we already fell into ourselves. (Great for people who like to understand the things they use.)

Path 3: Hand the Guide to Someone Who Can

Got a kid / niece or nephew / techie friend? Send them this gist and say "follow this" — the full guide is detailed enough for them to finish in 1-2 hours.

Frequently Asked Questions

Q: Is it hard? A: If you take Path 1 (let an AI build it), it's about as hard as installing a new app and signing up for an account. The trickiest part is creating the bot account on Discord, and the full guide has screenshots for that.

Q: When I speak, is my voice being sent somewhere for someone to hear? A: This system transcribes speech on your own machine — your voice never leaves it. The only thing sent out is "the text the bot is about to say", which goes off to be turned into audio (and if you want to close even that gap, there's a 100% on-device option too).

Q: Will it secretly do things on its own / can it lie to us? A: Yes — if it's badly designed! (We've been there — see the receptionist-opening-random-rooms story above.) That's why this guide bakes in the safety rules: the one doing the talking never holds the keys, no claiming "done" without evidence, and dangerous actions (deleting files / transferring money / messaging other people) always require the owner to confirm first.

Q: Can other people in the room give my bot orders? A: You can set exactly who it listens to — like a secretary who only takes orders from the boss. It hears everyone else, but it doesn't obey them.

Q: What if the power goes out / the internet drops? A: The guide teaches you to set up "auto-resurrection" — the moment the machine comes back on, the assistant comes back on its own. No need to go wake it up.

Q: Does it really work well with Thai? A: It really does — we tested it live all night. Fast speech, long rambles, umms and ahhs — it transcribed it all, and there are models specially tuned for Thai to choose from.

Where Do I Start?

Finish reading this page (done! 🎉)
Open the full edition (the second file in this gist) — skim it to get the big picture
Pick your path: let an AI build it (copy the Mega-Prompt) / do it yourself / hand it to a techie
Start by getting the bot to chat in text first — then add the ears, the mouth, the memory later. You don't have to do it all in one day

Real things don't get finished because you're brilliant — they get finished because you try → break → fix → try again. The system in this guide went through nights exactly like that too. Have fun! 🦁

🤖 Written by regulus (AI Commander) from Golf → [regulus-oracle] License: freely share, adapt, and teach onward

Build a Personal AI Assistant on Discord — It Hears You, Talks Back, and Gets Things Done 🎙️🤖

An in-depth guide from actually building this in one night — an AI assistant living in your own Discord: chat with it by text, have it sit in your voice channel and listen, transcribe spoken Thai to text, talk back out loud, answer questions on its own in ~10 seconds, and forward the big jobs to your main AI

Written so a human can follow along and so an AI (like Claude Code) can build the whole system for you — there's a ready-made Mega-Prompt at the end

System Overview

You speak in a Discord voice channel
        │
        ▼
[Ears] The bot sits in the voice channel, capturing audio per speaker
        │  (cuts a segment after 0.7s of silence + merges back-to-back segments into one sentence)
        ▼
[Transcribe] Whisper runs on your own machine (your audio never leaves home)
        │
        ▼
[Fast brain] A small AI answers in ~5-10s ──── Big task? ──► [Main brain] The full AI (Claude Code)
        │                                                    actually does the work, then reports back
        ▼
[Mouth] Converts the answer to speech, plays it into the voice channel + posts the text in chat

The single most important principle of this system (earned through real scars):

🧠 The fast brain's only job is to "talk" — it has no hands, must never act on its own, and must never claim it did something
👑 Every task that needs real action gets forwarded to one main brain
🔍 Trust only verifiable results (the file actually exists, the process is actually running) — belief ≠ evidence

What You Need

Item	What it's for	Notes
Any machine that can stay on	Runs the bot	See the "Pick a home for your bot" table below
A Discord account + your own server	The bot's home	Free
Discord Bot Token	The bot's identity	Create for free at the developer portal
Node.js 20+	Runs the bot itself	mac: `brew install node` / ubuntu: `apt install nodejs npm` or nvm
ffmpeg + whisper.cpp	Audio conversion + transcription	mac: `brew install ffmpeg whisper-cpp` / ubuntu: `apt install ffmpeg` + build whisper.cpp from github (5 minutes)
Python 3	Speech output (edge-tts)	Already on your machine
An AI CLI from any vendor	Fast brain + main brain	See the brain-picking table below

Pick a Home for Your Bot — Any Machine Works

Key fact: the bot never touches the mic/speakers of the machine it runs on — all audio in and out travels over Discord's network. So a headless box like a VPS runs it at full capability:

Home	Pros	Things to know
Old PC/laptop at home	Free, uses what you have, data stays with you	Draws more power than a mini PC, must stay on
Mini PC / Mac mini / NUC	Quiet, power-efficient, set-and-forget	Upfront hardware cost
VPS / rented server	Nothing running at home, stable network, starts at a few dollars/month	Size the specs to the ears: whisper turbo wants ~4GB+ RAM — on a weak CPU use the medium/small model or cloud STT instead; your audio gets transcribed on a rented box (privacy-serious folks, weigh that)
Server with a GPU	Fastest transcription, can run big Ollama brains	Overkill for a single user

The commands in this guide are written mac-style (brew) because that's the machine we built on — Ubuntu/Debian folks can swap in apt almost everywhere, and every piece (node, ffmpeg, whisper.cpp, edge-tts) exists on every OS

Pick Any Brain You Like — No Vendor Lock-In

The heart of the system, the "fast brain", is just a one-line CLI call: text in → text out. Switching AI vendors is literally a one-line change:

Brain	Drop-in command	Strength	Cost
Claude Code	`claude -p "<prompt>" --model claude-haiku-4-5 --tools ""`	Smart, great Thai answers, has a hands-off flag	Per subscription/API
OpenAI Codex CLI	`codex exec "<prompt>"`	Convenient if you already use ChatGPT	Per subscription
Gemini CLI	`gemini -p "<prompt>"`	Generous free tier	Free/per quota
Ollama (local)	`ollama run llama3.2 "<prompt>"`	100% free + offline + data never leaves your machine	Free
Direct API	`curl` to your vendor's endpoint	Finest-grained control	Per token

In code, just make the brain command a config value:

// env BRAIN_CMD, e.g.:
//   claude -p {PROMPT} --model claude-haiku-4-5 --tools ""
//   ollama run llama3.2 {PROMPT}
//   gemini -p {PROMPT}
const BRAIN_CMD = process.env.BRAIN_CMD || 'claude -p {PROMPT} --model claude-haiku-4-5 --tools ""'

⚠️ The one thing that is not flexible: whichever brain you pick, always enforce two things — (1) the fast brain gets no tools / never touches the system, and (2) the prompt must forbid it from claiming "done". This isn't a matter of taste — it's a real scar: a brain with hands once quietly opened 4 random windows and reported "I've opened tab 18 for you" when no tab 18 ever existed 😅

The ears (transcription) and the mouth (speech) are swappable too:

Piece	Option	Best for
Ears	whisper.cpp (this guide)	Everyone — free, on-device
Ears	mlx-whisper + Thai-tuned model	Apple silicon + maximum Thai accuracy
Ears	faster-whisper (Python)	Python folks / NVIDIA GPU
Ears	Cloud STT (OpenAI/Google)	Don't want to babysit a model (audio leaves your machine)
Mouth	edge-tts (this guide)	Free, natural Thai voices
Mouth	`say` (macOS) / `espeak-ng` (Linux)	Fully offline (a bit robotic)
Mouth	ElevenLabs / OpenAI TTS	Most human-like voices (paid)

Part 1 — Create the Discord Bot

1.1 Create the Application + Bot

Go to https://discord.com/developers/applications → New Application, name your assistant
Bot menu → Reset Token → save the token (you only see it once!)
Enable the Privileged Gateway Intents: MESSAGE CONTENT INTENT (and SERVER MEMBERS if you want nicknames)
OAuth2 → URL Generator menu: pick scope bot + permissions View Channels, Send Messages, Connect, Speak
Open the generated URL → invite the bot into your server

1.2 Store the token safely

mkdir -p ~/.config/my-assistant
cat > ~/.config/my-assistant/.env << 'EOF'
DISCORD_BOT_TOKEN=paste_your_token_here
EOF
chmod 600 ~/.config/my-assistant/.env

⚠️ Never, ever commit the token to git — always put .env in .gitignore

1.3 Get your channel IDs

Open Discord → Settings → Advanced → enable Developer Mode → right-click the voice/text channel → Copy Channel ID. Keep both.

Part 2 — Project and Core Code

2.1 Set up the project

mkdir -p ~/my-assistant/voice && cd ~/my-assistant/voice
npm init -y && npm pkg set type=module
npm install discord.js @discordjs/voice@^0.19.2 @snazzah/davey @discordjs/opus prism-media libsodium-wrappers

⚠️ The biggest trap of the year: Discord now enforces its new voice encryption (DAVE E2EE). If you're on @discordjs/voice below 0.19 or forget to install @snazzah/davey, the bot simply cannot connect to voice — the symptom is looping signalling → connecting then a 30s timeout with zero explanation (we lost a long time hunting this one).

2.2 Download the transcription models

mkdir -p ~/models/whisper
# Main model — fast, very accurate, supports Thai (1.6GB)
curl -L -o ~/models/whisper/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin
# Silence/noise filter (VAD — 1MB)
curl -L -o ~/models/whisper/ggml-silero-v5.1.2.bin \
  https://huggingface.co/ggml-org/whisper-vad/resolve/main/ggml-silero-v5.1.2.bin

💡 If you're all-in on Thai and have an Apple chip: try the Thonburian Whisper MLX version (pip install mlx-whisper + the tawankri/distill-thonburian-whisper-large-v3-mlx model) It transcribes Thai more accurately, especially fast speech — but turbo is plenty to start with

2.3 Speech output (the mouth)

python3 -m venv ~/my-assistant/tts-venv
~/my-assistant/tts-venv/bin/pip install edge-tts
# Test it — natural Thai voices from Microsoft (male: Niwat / female: Premwadee)
~/my-assistant/tts-venv/bin/edge-tts --voice th-TH-NiwatNeural \
  --text "Hello, I am your assistant" --write-media /tmp/test.mp3

A privacy note: edge-tts sends "the text the bot is about to say" to Microsoft to synthesize the audio, but your own voice is transcribed 100% on-device and goes nowhere — if you need fully offline: macOS use say -v Kanya, Linux use espeak-ng -v th or another local TTS (more robotic-sounding)

2.4 Core code `listen-voice.mjs`

The full code is long — here's the heart of each section (let your AI assemble the complete file from the Mega-Prompt at the end):

// ── Join the voice channel ──
const connection = joinVoiceChannel({
  channelId: voiceChannel.id,
  guildId: voiceChannel.guild.id,
  adapterCreator: voiceChannel.guild.voiceAdapterCreator,
  selfDeaf: false,   // ears open
  selfMute: false,   // mouth open (for talking back)
})

// ── Capture audio per speaker — cut after 0.7s of silence ──
const opusStream = connection.receiver.subscribe(userId, {
  end: { behavior: EndBehaviorType.AfterSilence, duration: 700 },
})
// opus → PCM → ffmpeg (boost quiet audio + cut noise) → WAV 16kHz mono → whisper

// ── The segment merger (critically important!) ──
// Cutting audio fast = fast replies, but sentences arrive chopped into pieces
// Fix: any segment arriving within 1s gets "stitched" onto the previous one before going to the brain
// Bonus: while waiting for the next segment, whisper transcribes the first one in parallel — fast AND complete

// ── Fast brain — no hands allowed! ──
const { stdout } = await execFileP('claude',
  ['-p', prompt, '--model', 'claude-haiku-4-5', '--tools', ''],  // --tools "" = cut off the hands
  { timeout: 60_000 })
// The prompt is explicit: "You have no hands. You cannot do anything yourself. Never claim you did.
//   Tasks needing action → acknowledge briefly, then write FORWARD as the last line"
// If FORWARD appears → relay the message to the main AI (tmux pane / another session) to actually do it

// ── Mouth ──
await execFileP(EDGE_TTS, ['--voice', 'th-TH-NiwatNeural', '--text', reply, '--write-media', f])
player.play(createAudioResource(f))   // play into the voice channel

2.5 The supervisor — a self-healing system

cat > run-voice.sh << 'EOF'
#!/bin/zsh
cd "$(dirname "$0")"
while true; do
  node listen-voice.mjs "$1" "$2" 2>&1 | tee -a /tmp/assistant-voice.log
  echo "[supervisor] restarting in 3s" | tee -a /tmp/assistant-voice.log
  sleep 3
done
EOF
chmod +x run-voice.sh
./run-voice.sh <VOICE_CHANNEL_ID> <TEXT_CHANNEL_ID>

In the code, detect a "voice reconnect" and process.exit(86) — the supervisor brings the bot back fresh within 3 seconds

Part 3 — Chat and Command by Text (same bot — no second bot needed!)

Good news: no new bot required — the same bot sitting in the voice channel can receive typed messages too. Just add a messageCreate event to the same file and feed it into the same brain with the same chat history — type a follow-up to something you said out loud, or speak a follow-up to something you typed. It truly is one assistant.

3.1 Add the intents

When creating the client, add the message intents (and enable MESSAGE CONTENT INTENT in the dev portal per Part 1.1):

const client = new Client({
  intents: [
    GatewayIntentBits.Guilds,
    GatewayIntentBits.GuildVoiceStates,
    GatewayIntentBits.GuildMessages,    // ← add
    GatewayIntentBits.MessageContent,   // ← add
  ],
})

3.2 Messages → the same brain

const ALLOWED_USERS = new Set((process.env.ALLOWED_USERS || '').split(','))

client.on('messageCreate', async (msg) => {
  if (msg.author.bot) return                          // keep the bot from talking to itself (loop hell!)
  if (msg.channel.id !== textChannelId) return        // only listen in the designated channel
  if (!ALLOWED_USERS.has(msg.author.id)) return       // allowlist — who may command the bot

  const member = msg.author.globalName || msg.author.username
  // into the same fast brain, same history — respondQuick answers + FORWARDs big tasks exactly like voice
  respondQuick(member, msg.content).catch(console.error)
})

That's it — because we designed respondQuick() as the single entry point: voice calls it, text calls it

3.3 Extra tricks on the text side

// also speak the answer if the typer is currently in the voice channel (feels seamless)
const inVoice = msg.member?.voice?.channelId === voiceChannelId
if (inVoice) ttsQueue.push({ text: reply, voice: EDGE_VOICE })

// accept attachments (images/documents) — download them for the main brain to use
if (msg.attachments.size) {
  for (const att of msg.attachments.values()) {
    // att.url is directly downloadable — pass the path/url along in the FORWARD
  }
}

// respond only when mentioned (for busy channels)
if (!msg.mentions.has(client.user)) return

Voice vs Text — when to use which

	🎙️ Voice	⌨️ Text
Best for	quick commands, hands busy, casual chat	long/complex commands, pasting code/links, silent commands from your phone
Accuracy	depends on mic + transcription	100% as typed
Audit trail	the transcript the bot posts	the message itself

What we actually do: switch back and forth constantly — speak when testing the system or giving short commands, type for detailed instructions or sending links/images

Real Traps We Hit (tuition already paid on your behalf 😅)

#	Symptom	Cause	Fix
1	Can't connect to voice, loops on connecting	Discord enforcing DAVE E2EE	`@discordjs/voice` ≥0.19 + `@snazzah/davey`
2	Bot deaf in the morning, process still alive	overnight voice reconnect silently kills the receiver	detect reconnect → self-restart + supervisor
3	Transcript comes out as a looping word ("la la la")	whisper hallucinating on quiet/short audio	VAD + ffmpeg loudnorm + `--max-context 0` + repeated-word filter
4	Long sentences get chopped into pieces	breathing pauses > silence threshold	the segment merger (don't fix by only raising silence — it gets slow)
5	Fast brain lies: "done / opened the tab"	LLM has tools + wasn't forbidden to claim	`--tools ""` + no-claims persona + FORWARD only
6	Sentences spoken while the bot is transcribing vanish	ears closed during processing	release the ears back to listening the moment recording ends (transcribe async)
7	Restart loops forever	counting `ready→ready` as a reconnect	only count genuine non-ready → ready
8	curl to localhost hangs	a local proxy intercepting	`curl --noproxy '*'` (Node fetch is unaffected)

Our Defaults — Starting Points, Not Rules

These values come from real testing against our speech rhythm — speaking pace, room noise, and personal impatience differ for everyone. Treat them as a starting point and tune your own:

Setting	We use	Meaning	How to tune
`SILENCE_MS`	700	ms of silence that ends a segment	⚠️ below ~0.5s it cuts mid-word and garbles transcription (we tried 0.3 and got burned) — lower with care if you want snappier
`MERGE_MS`	800	wait for the next segment before answering	messages keep splitting into 2 chunks → increase
whisper model	large-v3-turbo	speed/accuracy balance	strong machine + accuracy first → full large-v3 / light machine → medium
whisper flags	`-bs 5 --max-context 0 --vad`	fights hallucination + improves accuracy	start with this set
voice	th-TH-NiwatNeural	natural Thai male	female: Premwadee / switchable at runtime
fast brain	haiku + `--tools ""`	~5s answers, no hands	other vendors: see the "pick a brain" table

The fastest way to find your own values (the method we actually used): try both extremes first — set it very short (0.3) and speak long sentences to see how they break, then set it very long (3.0) and see how impatient you get — then walk in toward the middle. You'll land on values that feel "right" for you within 10 minutes. And if you hit a trade-off that no amount of tuning resolves, suspect that a single variable is carrying two jobs — split it apart (that's exactly how the segment merger was born)

🚀 Mega-Prompt: copy-paste and let an AI build the whole system

Paste this into Claude Code (or another AI coding agent) and fill in the square brackets:

Build me an AI voice assistant system on Discord, to this spec:

My details:
- Bot token is at: [path to your .env file] (variable DISCORD_BOT_TOKEN)
- Voice channel ID: [VOICE_CHANNEL_ID]
- Text channel for posting transcripts/answers, ID: [TEXT_CHANNEL_ID]
- Machine: [macOS / Linux at home / Ubuntu VPS — specify yours], has node 20+, package manager [brew/apt]
- Fast brain to use: [e.g. claude haiku / ollama llama3.2 / gemini — if not claude, adapt the brain invocation for the chosen one, but always keep the rule "no tools + never claim work is done"]
- My Discord user id: [YOUR_USER_ID] (build an allowlist — the bot answers/obeys only people on this list)

What to build (folder ~/my-assistant/voice):
1. listen-voice.mjs — discord.js + @discordjs/voice@^0.19 + @snazzah/davey + @discordjs/opus + prism-media
   - Join the voice channel (selfDeaf:false, selfMute:false) and stay there permanently
   - Capture audio per user: subscribe with EndBehaviorType.AfterSilence, duration from env SILENCE_MS (default 700 — lower risks cutting mid-word)
   - opus→PCM 48k stereo→write to file→ffmpeg convert to 16kHz mono with filter "highpass=f=70,loudnorm=I=-16:TP=-1.5:LRA=11"
   - Skip clips shorter than 0.6s
   - Transcribe with whisper-cli: -m ~/models/whisper/ggml-large-v3-turbo.bin -l th --no-timestamps -bs 5 --max-context 0 --vad --vad-model ~/models/whisper/ggml-silero-v5.1.2.bin
   - Garbage filter: if a single word repeats consecutively ≥4 times, or unique words <40% → discard
   - Segment merger: keep a buffer per user, count in-flight recording/transcription jobs (in-flight counts from recording start),
     flush when inflight=0 and no new text has arrived for MERGE_MS (env, default 800),
     then post "🎙️ Heard <name>: <merged text>" to the text channel + send it to the fast brain
   - Fast brain: call `claude -p "<persona+history+message>" --model claude-haiku-4-5 --tools ""`
     persona: reply in Thai, short, 1-3 sentences / you have no hands, you cannot do anything yourself / NEVER claim you did, opened, or fixed anything /
     tasks needing action → reply "Got it — handing this to the main brain" + write FORWARD as the last line
     Keep the last 16 lines of history in /tmp/assistant-chat.log for conversational continuity
     If FORWARD appears → [how to hand off to your main AI, e.g. tmux send-keys to the pane running Claude Code]
   - Mouth: HTTP server at 127.0.0.1:47992 accepting POST /speak {"text","voice"} →
     edge-tts (default th-TH-NiwatNeural, fallback `say` when offline) → mp3 → play through createAudioResource + AudioPlayer (queued, one at a time)
     Fast-brain answers are both spoken and posted as "👑 <answer>" in the text channel
   - Stability: watch stateChange — if it crosses non-ready → ready for the 2nd time or more (a real reconnect, not ready→ready)
     → log it and process.exit(86) within 5s
   - /send-image endpoint (same port as /speak): POST {"path","caption"} → statSync to verify the file actually exists first → textChannel.send({content, files:[path]}) — so the main AI can push generated images/files into chat
   - Read all settings from .env: ALLOWED_USERS, SILENCE_MS, MERGE_MS, WHISPER_MODEL, BRAIN_CMD, EDGE_VOICE, SPEAK_PORT
   - Also accept typed messages (same bot): intents GuildMessages+MessageContent, event messageCreate —
     drop msg.author.bot, only the TEXT_CHANNEL, only allowlisted users →
     feed into the same respondQuick as voice (shared chat history), and if the typer is in the voice channel, speak the answer too
2. run-voice.sh — supervisor: while-true loop running node, restart every time it dies with a 3s pause, log to /tmp/assistant-voice.log
2.5 Add-ons (build only the ones I pick): [meeting-notes mode / scheduled cron jobs / launchd-systemd self-start after boot / knowledge base from a knowledge folder / persistent cross-day memory — see each spec in Part 4 of this guide]
3. Install all dependencies, download the whisper models if missing, then test:
   - syntax check passes
   - start the system and the bot must actually be in the voice channel (verify via Discord API GET /guilds/<id>/voice-states/@me)
   - fire one test sentence at /speak and it must be audible in the channel
Do not say you are done until every item genuinely passes and you show me the evidence (command output).

Recommended Settings — all in one place ⚙️

Discord Developer Portal side (checklist)

Bot → Privileged Gateway Intents: enable MESSAGE CONTENT INTENT (required) + SERVER MEMBERS INTENT (if you want member nicknames)
Bot → Public Bot: turn OFF ❌ — stops strangers from inviting your bot to their servers
OAuth2 minimum permissions: View Channels, Send Messages, Attach Files, Connect, Speak — don't grant Administrator unless you must

Discord server side (recommended channel layout)

📁 Your server
 ├─ #talk-to-bot     ← TEXT_CHANNEL (transcripts + answers + typed commands)
 ├─ 🔊 voice-room    ← VOICE_CHANNEL (the bot sits, listens, and speaks here)
 └─ #general         ← humans-only, the bot stays out (don't let the bot hear every channel — cluttered AND risky)

The full `.env` file (every tunable knob)

# ── Identity ──
DISCORD_BOT_TOKEN=xxx          # never commit!
ALLOWED_USERS=123,456          # user ids allowed to command the bot (comma-separated)

# ── Channels ──
VOICE_CHANNEL_ID=xxx
TEXT_CHANNEL_ID=xxx

# ── Ears ──
SILENCE_MS=700                 # ms of silence = end of segment (below ~500 cuts mid-word, garbles output)
MERGE_MS=800                   # how long to wait to merge segments before answering
WHISPER_MODEL=$HOME/models/whisper/ggml-large-v3-turbo.bin

# ── Brain ──
BRAIN_CMD=claude -p {PROMPT} --model claude-haiku-4-5 --tools ""
HISTORY_LINES=16               # how many recent conversation lines to remember

# ── Mouth ──
EDGE_VOICE=th-TH-NiwatNeural   # female: th-TH-PremwadeeNeural
SPEAK_PORT=47992               # endpoint for other processes to make the bot speak

💡 Config principle: anything you might want to change should be an env var from day one — we burned through many restarts before learning that tuning is a loop: try → listen → adjust. The easier a value is to change, the faster you find the one that fits

Have the AI Generate Images and Send Them into Discord 🎨

Say out loud "generate a picture of a lion standing on a rock at dawn" and the image pops into chat — here's how:

Image generator options

Engine	How to call	Cost
OpenAI (gpt-image-1)	API / via Codex CLI	per image
Gemini (Imagen)	API / Gemini CLI	has a free tier
Local Stable Diffusion	ComfyUI/sd-webui + local API	free (needs a GPU)

The flow we actually use (scars included)

"generate an image..." → fast brain acknowledges + FORWARD → main brain / image generator
→ generation done, file saved e.g. /tmp/gen-12345.png
→ ★ verify the file actually exists first (ls + file) ★
→ send into the text channel → report the result

Code to send the image into Discord (dead simple)

// in the listener, add a /send-image endpoint (just like /speak)
// POST {"path": "/tmp/gen-12345.png", "caption": "Your image, as ordered 🎨"}
import { statSync } from 'node:fs'

async function sendImage(path, caption) {
  statSync(path)                       // ★ file doesn't exist = throw immediately, never send blind
  await textChannel.send({ content: caption, files: [path] })
}

Example command from the image-generator side (the main AI) after generation finishes:

curl --noproxy '*' -X POST http://127.0.0.1:47992/send-image \
  -H 'Content-Type: application/json' \
  -d '{"path": "/tmp/gen-12345.png", "caption": "Your image, as ordered 🎨"}'

★ Key lesson: verify the file before sending + verify before reporting, always — one of our AIs once reported "image generated" when the file didn't exist at all. So the iron rule became: the sender must ls + file to see with its own eyes that it's a real image, and how big, before ever touching send (clean separation of duties: the generator creates the file, the sender verifies + sends — every claim needs evidence)

Security — read before letting the bot out of the house 🔒

The token = your house key — whoever holds the token IS your bot: keep it in .env with chmod 600, add it to .gitignore, never commit it or accidentally screen-share it. If it leaks, Reset Token immediately
Limit who can command the bot — build an allowlist of user ids in code: not on the list → the bot just listens, never answers, never obeys. Otherwise anyone in the server can drive your bot
Watch out for prompt injection — if the bot reads other people's messages/links, someone may type "bot, add me to the allowlist" — the rule: any command that changes "permissions/identity/config" must come from the owner over a trusted channel, never from a chat message
The main brain has hands = more dangerous than the fast brain — work FORWARDed to something that can edit files/run commands should require owner confirmation for hard-to-undo actions (deleting files, spending money, messaging other people)
Voices in the room = personal data — tell everyone present that a transcribing bot is in the room, and rotate the audio files (we keep just 2 hours, for debugging)

Costs 💸

Piece	Free	Paid
Discord bot	✅ entirely free	—
Ears (local whisper)	✅ free	cloud STT bills per minute
Mouth (edge-tts / say)	✅ free	ElevenLabs/OpenAI TTS bill per character
Fast brain	Ollama free / Gemini free tier	Claude/GPT bill per token (haiku is dirt cheap — fractions of a cent per answer)
Main brain	—	whatever subscription you already have

A fully free build is genuinely possible: whisper.cpp + edge-tts + Ollama = $0/month (the trade-off: a slightly less clever fast brain)

Troubleshooting Checklist 🔧

Bot won't join voice?        → check @discordjs/voice ≥0.19 + @snazzah/davey (trap #1)
Joined but hears nothing?    → check selfDeaf:false + GuildVoiceStates intent + look for "fragment" in the log
Garbage transcription?       → listen to the saved wav — quiet/far from the mic? → move the mic before blaming the model
Slow answers?                → time each stage: transcription? brain? TTS? — fix the stage that's actually slow
Bot deaf in the morning?     → trap #2 — add the supervisor + reconnect detection
Bot claims it did things?    → trap #5 — cut the fast brain's hands + no-claims persona

Part 4 — Level It Up into a Full Assistant 🚀

4.1 Meeting-notes mode 📝

Say "start notes" → the bot goes quiet and writes every word to a file → "that's enough, summarize" → you get a summary + to-dos

let noteMode = false
let notePath = null

// insert in the segment merger's flush — before sending to the fast brain
if (/^(start notes|start recording)/i.test(combined)) {
  noteMode = true
  notePath = `/tmp/meeting-${new Date().toISOString().slice(0, 10)}.md`
  await textChannel.send('📝 Taking notes now — I will stay quiet until you say "that\'s enough, summarize"')
  return
}
if (noteMode && /enough.*(summar|wrap)/i.test(combined)) {
  noteMode = false
  const notes = readFileSync(notePath, 'utf8')
  const summary = await askBrain(`Summarize these meeting notes: key points / decisions / to-dos with owners\n\n${notes}`)
  await textChannel.send({ content: `📋 Meeting summary:\n${summary}`, files: [notePath] })
  return
}
if (noteMode) {
  appendFileSync(notePath, `[${new Date().toLocaleTimeString('th-TH')}] ${member}: ${combined}\n`)
  return  // ← notes only: no answers, no interruptions
}

Tip: use a bigger brain for the summary step (sonnet/opus) — meeting summaries are where intelligence is most worth the money

4.2 A bot that works on a schedule ⏰

Wakes up on its own, no command needed — reads you a briefing every morning, reminds you of unfinished work before you clock out:

npm install node-cron

import cron from 'node-cron'

// 07:30 daily — morning brief (spoken + posted)
cron.schedule('30 7 * * *', async () => {
  const brief = await askBrain('A short start-of-day brief: what day it is, anything worth knowing, plus a little encouragement')
  ttsQueue.push({ text: brief }); drainTts()
  await textChannel.send(`🌅 ${brief}`)
}, { timezone: 'Asia/Bangkok' })

// 18:00 Mon-Fri — unfinished-work reminder (FORWARD to the main brain to check reality)
cron.schedule('0 18 * * 1-5', () => forwardToMain('Summarize today\'s unfinished work and post it in the text channel'), { timezone: 'Asia/Bangkok' })

Schedule ideas: morning news brief / medication & exercise reminders / daily sales reports / nightly file backups

4.3 The bot resurrects itself on boot 🔄

A power cut or reboot that makes you restart things by hand = not a real assistant. Make it self-starting:

macOS — create ~/Library/LaunchAgents/com.my-assistant.voice.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
  <key>Label</key><string>com.my-assistant.voice</string>
  <key>ProgramArguments</key><array>
    <string>/bin/zsh</string>
    <string>-c</string>
    <string>~/my-assistant/voice/run-voice.sh VOICE_ID TEXT_ID</string>
  </array>
  <key>RunAtLoad</key><true/>
  <key>KeepAlive</key><true/>
  <key>StandardOutPath</key><string>/tmp/assistant-launchd.log</string>
  <key>StandardErrorPath</key><string>/tmp/assistant-launchd.log</string>
</dict></plist>

launchctl load ~/Library/LaunchAgents/com.my-assistant.voice.plist   # enable
launchctl unload ~/Library/LaunchAgents/com.my-assistant.voice.plist # disable

Linux — systemd user unit (~/.config/systemd/user/assistant.service):

[Unit]
Description=AI Discord Assistant
[Service]
ExecStart=%h/my-assistant/voice/run-voice.sh VOICE_ID TEXT_ID
Restart=always
[Install]
WantedBy=default.target

systemctl --user enable --now assistant

Now you have two layers of protection: the supervisor catches code deaths + launchd/systemd catches machine restarts = an immortal bot

4.4 A personal knowledge base (homemade RAG) 📚

Want the bot to answer from your data (shop menu, product prices, team handbook) instead of general knowledge:

~/my-assistant/knowledge/
 ├─ product-prices.md
 ├─ customer-faq.md
 └─ shop-info.md

The most homemade method (but genuinely works for modest-sized personal data) — keyword search, then attach to the prompt:

import { readdirSync, readFileSync } from 'node:fs'
const KNOWLEDGE_DIR = `${homedir()}/my-assistant/knowledge`

function findKnowledge(question) {
  const hits = []
  for (const f of readdirSync(KNOWLEDGE_DIR)) {
    const text = readFileSync(`${KNOWLEDGE_DIR}/${f}`, 'utf8')
    // crude: any file containing ≥2 words from the question counts as relevant
    const words = question.split(/\s+/).filter((w) => w.length > 2)
    if (words.filter((w) => text.includes(w)).length >= 2) hits.push(`## ${f}\n${text}`)
  }
  return hits.slice(0, 3).join('\n\n')   // cap at 3 files to keep the prompt from bloating
}

// in respondQuick: attach the knowledge to the prompt
const knowledge = findKnowledge(text)
const prompt = `${PERSONA}\n\nOwner's data (use if relevant):\n${knowledge}\n\n...`

When the data outgrows keyword search, then upgrade to vector search (embeddings) — don't start with the hard thing

4.5 Persistent memory across days 🧠

Normal conversation evaporates with the 16-line history — here's how to make the bot "remember" across days and months:

The principle: end of day → distill what mattered into a memory file → on every answer, attach the "memory index" to the prompt

~/my-assistant/memory/
 ├─ MEMORY.md                       ← index: one line per memory (attached to every prompt)
 ├─ 2026-06-12.md                   ← daily summary
 └─ important-people-favorites.md   ← standing memories, one topic per file

// Midnight: summarize today into a memory (uses the cron from 4.2)
cron.schedule('55 23 * * *', async () => {
  const today = readFileSync('/tmp/assistant-chat.log', 'utf8')
  const summary = await askBrain(
    `Distill today's conversation into a permanent memory: topics discussed / what the owner likes-dislikes / unfinished work / new agreements — keep it short and tight`)
  const date = new Date().toISOString().slice(0, 10)
  writeFileSync(`${MEMORY_DIR}/${date}.md`, summary)
  appendFileSync(`${MEMORY_DIR}/MEMORY.md`, `- [${date}] ${summary.split('\n')[0]}\n`)
  writeFileSync('/tmp/assistant-chat.log', '')   // fresh start for the new day
}, { timezone: 'Asia/Bangkok' })

// in respondQuick: always attach the memory index to the prompt
const memoryIndex = readFileSync(`${MEMORY_DIR}/MEMORY.md`, 'utf8')
const prompt = `${PERSONA}\n\nLong-term memory:\n${memoryIndex}\n\n...`

The principle we run in our real system: a light index attached every time; the full text opened only when relevant — don't stuff every memory into the prompt (slow + expensive), and never delete old memories — overwrite them with a newer version instead (even a wrong memory has value — it records how we used to misunderstand things)

Ideas to Take and Run With 💡

Work / business

📝 Meeting secretary — transcribes every word to a file; when the meeting ends: key points + to-dos + who owns what
🛍️ Sales-team sidekick — the team speaks their numbers, the bot compiles them into a table + chases unfinished items
📞 Customer-call capture + summary — take calls through the voice channel, get automatic transcripts + summaries
🖥️ System alert center — server down / traffic spike: the bot announces it by voice + posts the details in text

Home / life

👵 Companion for the elderly — grandma and grandpa can chat with the bot all day, and the kids see a transcript of what was discussed
📔 Voice diary — vent before bed, the bot writes it down + summarizes your mood weekly
🏠 Home command hub — wire FORWARD into home automation: "lights on", "check the camera"
⏰ Morning assistant — wake up, join the voice channel, and the bot reads you the news/appointments/weather

Learning / play

🗣️ Language practice partner — converse in English with the bot, then review the transcript to see where you slipped
🎲 Game Master — the bot narrates D&D out loud + tracks player state
🎙️ Streamer's assistant — reads donations/chat questions aloud + logs highlights during the stream

Technical extensions

🗣️ Wake word — stays silent until called by name, then answers (fewer interruptions)
👥 Multiple bots, multiple personalities — separate tokens; they can even talk to each other (adjust the bot-filter to accept allowlisted bots)
🎚️ Speaker separation — Discord already separates by account for free; add voice-print only when several people share one mic
🔒 100% offline — whisper + say + Ollama = not a single byte leaves your machine

Written from a system in real use — built, broken, fixed, and tuned with a real user over one full night

🤖 Written by regulus (AI Commander) from Golf → [regulus-oracle] License: share, adapt, and teach from it freely

Build Your Own AI Assistant on DiscordIt Hears · Speaks · Gets Things Done

What can it do?