AI & games · May 16, 2026 · 9 min read

Palermo.ai: when the Game Master is deterministic and the actors are models

A weekend POC turned into a real split: an Express rules engine for the Greek party game Palermo, plus LLM players that chat, vote, and take night actions. Here is how it works and what I would ship next.

Palermo is the Greek cousin of Mafia or Werewolf: hidden roles, loud day phases, quiet night phases, and a table that turns on incomplete information. A friend is shipping a Palermo Android app and wanted a credible path toward AI seats at the table. I built palermo.ai as a focused proof of concept: prove the boring half (correct rules and phase flow) so the interesting half (multi-agent behavior) has something trustworthy to lean on.

This post is about the architecture I ended up with, why “just call GPT” is not the whole story, and the next steps if the experiment graduates from demo to product.

TL;DR

Deterministic Game Master in Node: phases, votes, night actions, death, silence, logs. No model is allowed to change the rules.
LLM players receive a JSON snapshot of what their character knows, answer in structured json_object form, and the server maps that into the same actions a human triggers in the UI (public chat, whispers, killer channel, votes, night targets).
Two-call strategy: a single “orchestrator” pass batches many AI lines with human-like delays, plus a per-player fallback path when the batch is empty or votes still need a nudge after discussion matures.
Social texture: Greek-first prompts, per-name typing personas (Greek letters vs greeklish, different lengths and habits), and explicit vote-phase gating so bots do not race to a first vote before the table has talked.
Next: port the pattern into the friend’s Firebase stack, add evaluation harnesses for strategy quality, and decide whether LangGraph earns a place once the state machine grows more branches.

Why split the brain at all

Social deduction dies when the referee is fuzzy. If the model can “interpret” whether someone is dead, or skip a phase because the transcript felt done, you stop testing the game and start debugging vibes.

So the backend owns truth: gameData plus gameState, same as a human-driven session. Models only emit intent. The controller applies those intents the same way it applies HTTP payloads from the React client. That choice made it much easier to add AI seats without forking the game into a second, hidden ruleset.

Provider-wise the code is pragmatic: OpenAI’s SDK with OPENAI_API_KEY, or a DeepSeek-compatible base URL when you prefer that stack. Same json_object response shape either way. The important part is not the logo on the invoice; it is that tool calls are not the loop. The loop is “read state → propose structured action → validate on the server.”
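
To make that loop concrete, here is a minimal sketch rather than the code in the repo. The state shape, buildSnapshot, and runTurn are hypothetical names for illustration; propose stands in for whichever provider call returns the JSON string, and validate is whatever server-side guard you plug in. A fuller snapshot would also tell killers who their teammates are, which I leave out here.

```javascript
// Hypothetical state shape; the real gameData/gameState fields are not shown in this post.
const gameState = {
  phase: "day",
  players: [
    { id: "p1", name: "Maria", role: "killer", alive: true },
    { id: "p2", name: "Nikos", role: "citizen", alive: true },
    { id: "p3", name: "Eleni", role: "killer", alive: false },
  ],
  publicChat: [{ from: "p2", text: "kalimera" }],
};

// Build the view one seat is allowed to see: its own role plus public facts,
// never the hidden roles of other players.
function buildSnapshot(state, playerId) {
  const me = state.players.find((p) => p.id === playerId);
  if (!me) throw new Error(`unknown player ${playerId}`);
  return {
    phase: state.phase,
    you: { id: me.id, name: me.name, role: me.role, alive: me.alive },
    players: state.players.map((p) => ({ id: p.id, name: p.name, alive: p.alive })),
    publicChat: state.publicChat.slice(-20), // keep the prompt small
  };
}

// One turn of the loop: read state -> propose structured action -> validate on the server.
// `propose` is a stand-in for the LLM call; `validate` owns every rule decision.
async function runTurn(state, playerId, propose, validate) {
  const snapshot = buildSnapshot(state, playerId);
  const raw = await propose(snapshot);
  let intent;
  try {
    intent = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "invalid_json" };
  }
  return validate(state, playerId, intent);
}
```

The point of the shape is that the model never touches state directly: it sees a filtered copy and hands back a string that the server is free to reject.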

What “playable with AI” actually meant in practice

1. Structured actions, not improvised transcripts

Each automated player gets a system prompt built from static role text in game_seed.json plus a long set of table-manners rules in Greek: stay in group-chat voice, avoid duplicate greetings, do not leak private roles into public chat, defend yourself when the vote tally turns against you, and so on.

The model returns JSON whose action field is one of chat_public, vote, night_action, or one of the related channels (whispers, the killer channel). The server then runs the same guards humans hit: phase checks, alive and silence flags, duplicate suppression, and “is this vote allowed yet?” logic.
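
A rough sketch of what those guards can look like follows. The action names come from the paragraph above; everything else (validateIntent, the phase names "day", "vote", and "night", the silenced flag, the reason strings) is an assumption for illustration. Role checks for night actions and the vote-timing rule are deliberately left out.

```javascript
// Phases in which public chat is accepted (assumed names, not the real ones).
const CHAT_PHASES = new Set(["day", "vote"]);

// Returns { ok: true, action } or { ok: false, reason }. Never mutates state.
function validateIntent(state, playerId, intent) {
  const me = state.players.find((p) => p.id === playerId);
  if (!me || !me.alive) return { ok: false, reason: "not_alive" };
  if (!intent || typeof intent.action !== "string") {
    return { ok: false, reason: "malformed" };
  }

  switch (intent.action) {
    case "chat_public": {
      if (!CHAT_PHASES.has(state.phase)) return { ok: false, reason: "wrong_phase" };
      if (me.silenced) return { ok: false, reason: "silenced" };
      const text = String(intent.text ?? "").trim();
      if (!text) return { ok: false, reason: "empty_text" };
      // Duplicate suppression: reject a verbatim repeat of this player's last line.
      const mine = state.publicChat.filter((m) => m.from === playerId);
      const last = mine[mine.length - 1];
      if (last && last.text === text) return { ok: false, reason: "duplicate" };
      return { ok: true, action: { type: "chat_public", from: playerId, text } };
    }
    case "vote":
    case "night_action": {
      const requiredPhase = intent.action === "vote" ? "vote" : "night";
      if (state.phase !== requiredPhase) return { ok: false, reason: "wrong_phase" };
      const target = state.players.find((p) => p.id === intent.target);
      if (!target || !target.alive) return { ok: false, reason: "bad_target" };
      return { ok: true, action: { type: intent.action, from: playerId, target: target.id } };
    }
    default:
      return { ok: false, reason: "unknown_action" };
  }
}
```

Because the function only returns a verdict, the same check can sit in front of HTTP payloads from the client and JSON from a model.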

2. Orchestrator batching for table flow

Calling the API once per player per tick works, but it produces uncanny simultaneity or weird ordering. The orchestrator path asks for an array of turns with delay_ms, so the server can space lines to feel like typing and keep villains from all pivoting to the same vote in one beat. When the orchestrator returns nothing useful, the code falls back to shuffled per-player calls so the session does not stall.
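
The spacing step is simple enough to sketch. I am assuming here that each delay_ms is relative to the previous line in the batch; scheduleTurns and the clamp values are made up for the example.

```javascript
// Turn an orchestrator batch into absolute fire times.
// Assumption: delay_ms is relative to the previous turn, not to the batch start.
function scheduleTurns(turns, now, { minGapMs = 800, maxDelayMs = 15000 } = {}) {
  let cursor = now;
  return turns.map((turn) => {
    // Models sometimes omit or garble numbers, so fall back to the minimum gap.
    const requested = Number.isFinite(turn.delay_ms) ? turn.delay_ms : minGapMs;
    // Clamp so two lines never land in the same instant and none waits forever.
    const delay = Math.min(Math.max(requested, minGapMs), maxDelayMs);
    cursor += delay;
    return { ...turn, fireAt: cursor };
  });
}
```

Each scheduled entry can then be handed to setTimeout with fireAt - Date.now() as the delay, and the action should be validated again when the timer fires, because the phase may have moved on in the meantime.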

That split lives in processAIActions in palermo-backend/src/services/aiAgent.js: try orchestrator first, repair votes if discussion has matured, otherwise sample a small batch of legacy per-player calls.
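
The control flow reads roughly like the sketch below. It is my paraphrase of the order of operations, not a copy of processAIActions: the dependency names (callOrchestrator, callPlayer, needsVoteRepair), the batch size, and the returned path labels are all hypothetical.

```javascript
// Fisher-Yates shuffle on a copy; `rng` is injectable so tests can be deterministic.
function shuffled(items, rng = Math.random) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

async function processAIActionsSketch(deps) {
  const { aiPlayers, callOrchestrator, callPlayer, needsVoteRepair, batchSize = 2, rng } = deps;

  // 1. Orchestrator first. A provider error is treated like an empty batch.
  let turns;
  try {
    turns = (await callOrchestrator()) || [];
  } catch {
    turns = [];
  }

  // 2. Vote repair: nudge players who still owe a vote after discussion matured.
  const pending = aiPlayers.filter(needsVoteRepair);
  if (pending.length > 0) {
    const repairs = await Promise.all(pending.map((p) => callPlayer(p, "vote_repair")));
    return { path: "vote_repair", turns: turns.concat(repairs.filter(Boolean)) };
  }

  // 3. A useful batch is enough on its own.
  if (turns.length > 0) return { path: "orchestrator", turns };

  // 4. Otherwise sample a small shuffled batch of per-player calls.
  const sample = shuffled(aiPlayers, rng).slice(0, batchSize);
  const fallback = await Promise.all(sample.map((p) => callPlayer(p, "fallback")));
  return { path: "fallback", turns: fallback.filter(Boolean) };
}
```

Sampling only a few players in the fallback keeps cost bounded and avoids the whole table speaking at once.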

3. Vote discussion maturity

The failure mode I cared about most was instant bandwagon votes. Borrowing from tabletop instinct, the server tracks how much public chat happened after the vote phase began and enforces thresholds before an AI’s first vote locks in. After that, vote changes remain fair game when the dialogue justifies it.
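
As an illustration of the gate, assume public chat messages carry a timestamp and the state records when the vote phase began. The thresholds below (four messages from at least two speakers) are placeholders, and discussionMatured and isVoteAllowed are hypothetical names.

```javascript
// Has the table talked enough since the vote phase opened?
function discussionMatured(state, { minMessages = 4, minSpeakers = 2 } = {}) {
  const sinceVoteStart = state.publicChat.filter((m) => m.at >= state.votePhaseStartedAt);
  const speakers = new Set(sinceVoteStart.map((m) => m.from));
  return sinceVoteStart.length >= minMessages && speakers.size >= minSpeakers;
}

// First votes wait for a matured discussion; later vote changes are always allowed.
function isVoteAllowed(state, playerId, thresholds) {
  if (state.phase !== "vote") return false;
  const hasVotedBefore = state.votes[playerId] !== undefined;
  return hasVotedBefore || discussionMatured(state, thresholds);
}
```

Requiring more than one speaker matters: without it, a single chatty bot could satisfy the message count on its own and unlock everyone's vote.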

That is a tiny piece of code compared to the prompt mass around it, but it is the sort of rule that shapes whether humans feel like they are playing with bots or against a slot machine.

4. Idle table rescue

There is a background poll (aiIdleWatcher) that notices when public chat goes quiet for too long while AI players are still alive. It triggers another orchestrator pass with an idle_public reason so the table does not freeze waiting for a human to carry the energy alone. Cooldowns keep it from spamming.
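
The watcher splits naturally into a pure decision and a timer around it. This is a sketch under assumptions: the field names on table, the 45 second idle window, the 60 second cooldown, and the 5 second poll are invented for the example; only the idle_public reason comes from the real code.

```javascript
// Pure decision: is it time to wake the table up?
function shouldTriggerIdlePass(now, table, { idleMs = 45000, cooldownMs = 60000 } = {}) {
  if (table.aliveAiCount === 0) return false; // nobody left to speak
  if (now - table.lastPublicChatAt < idleMs) return false; // chat is still fresh
  const lastTrigger = table.lastIdleTriggerAt ?? -Infinity;
  if (now - lastTrigger < cooldownMs) return false; // cooldown prevents spam
  return true;
}

// Thin timer wrapper; returns the interval handle so the caller can clearInterval it.
function startIdleWatcher(getTable, runOrchestrator, pollMs = 5000) {
  return setInterval(() => {
    const table = getTable();
    const now = Date.now();
    if (shouldTriggerIdlePass(now, table)) {
      table.lastIdleTriggerAt = now;
      runOrchestrator({ reason: "idle_public" });
    }
  }, pollMs);
}
```

Keeping the decision free of timers makes the cooldown behaviour easy to test with plain numbers.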

The surprising design cost: chat realism, not API plumbing

Once the state machine worked, most of the iteration went into voice: eight typing-style buckets, stable per player name, mixing formal Greek, messy punctuation, short telegraphic greeklish, and outright English-heavy lines so players do not feel like one LLM is wearing nine hats badly.
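
“Stable per player name” just means the bucket is a pure function of the name. One way to get that, sketched here with an FNV-1a-style hash over code points, is below; the hash choice and the eight bucket labels are illustrative stand-ins, not the personas in the repo.

```javascript
// Eight illustrative typing-style buckets (labels are placeholders).
const PERSONAS = [
  "formal_greek",
  "casual_greek",
  "messy_punctuation",
  "telegraphic_greeklish",
  "long_greeklish",
  "english_heavy",
  "emoji_light",
  "all_lowercase",
];

// FNV-1a-style hash over Unicode code points; returns an unsigned 32-bit integer.
function hashName(name) {
  let h = 2166136261;
  for (const ch of name.normalize("NFC")) {
    h ^= ch.codePointAt(0);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// Same name always maps to the same persona, across sessions and restarts.
function personaFor(name) {
  return PERSONAS[hashName(name) % PERSONAS.length];
}
```

Normalizing to NFC first means a Greek name typed with composed or decomposed accents lands in the same bucket.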

The prompts also nag the model about independence: one loud accusation should not flip every bot vote in the same HTTP response. That is part craft, part social-engineering of the weights.

Honest limits of the POC

In-memory sessions and a single-node Express server. Fine for demos, not for durable matchmaking.
No formal eval suite yet for win-rate fairness, collusion, or toxicity. Heuristics in prompts are not a substitute for logged playouts.
Language and culture are Greek-first. Porting the feel to another locale is not a translate call; it is new role copy and new typing personas.

Next steps

Product path with my friend: lift the GM + structured-action boundary into their Firebase Cloud Functions backend so Android clients keep owning UX while the server remains authoritative. Keep the prompts as data, not scattered string literals, so non-engineers can tune role voice.

Engineering path for me: if the graph of phases keeps growing, I will likely borrow LangGraph from my other multi-agent experiments in the same workspace rather than hand-rolling more orchestration forever. The palermo POC justified the pattern; LangGraph would formalize checkpointing and branching when we add variants or experiments.

Quality path: record anonymized transcripts, score outcomes against simple rubrics (did silenced players stay silent in public, did doctors act in a medically coherent way, did killers avoid impossible claims), and run regression tests when models or temperatures change.
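
The simplest of those rubrics can be written as a pure function over a logged transcript. The event format below (type, player) is an assumption about what such a log might contain, and scoreSilenceRule is a hypothetical name; none of this exists in the POC yet.

```javascript
// Rubric: a silenced player must not appear in public chat until unsilenced.
// Assumed event types: "silenced", "unsilenced", "chat_public".
function scoreSilenceRule(transcript) {
  const silenced = new Set();
  const violations = [];
  for (const event of transcript) {
    if (event.type === "silenced") {
      silenced.add(event.player);
    } else if (event.type === "unsilenced") {
      silenced.delete(event.player);
    } else if (event.type === "chat_public" && silenced.has(event.player)) {
      violations.push(event);
    }
  }
  return { passed: violations.length === 0, violations };
}
```

Run over a folder of recorded games, a check like this turns “the prompt tells it to stay quiet” into a number that can regress when the model or temperature changes.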

Closing

Palermo started as a favour and became a useful reminder: multi-agent UX is mostly pacing, permissions, and enforcement, not raw model IQ. If you are building something similar, start from an honest rules engine, then let models improvise only where human players already improvise.

If you want to compare notes on social deduction + LLMs or on the Firebase port, use the contact links on nickouv.com. The palermo.ai repo stays private while we figure out product boundaries with the Android release, but I am happy to talk architecture.