Building a Multi-Model Intelligent Gateway: Ciyuano's Technical Architecture Revealed

2026/06/04·2 min read·41 views

Overview

The core of Ciyuano is a smart model gateway. This article walks you through its architecture design.

System Architecture

User Request → Authentication & Billing → Model Router → [DeepSeek/GLM/Qwen Channel] → Streaming Response

Core Modules

1. Channel Manager

Each upstream model provider is abstracted as a "channel":

typescript

interface Channel {

id: number;

provider: string; // deepseek | zhipu | aliyun

modelId: string; // deepseek-v4 | glm-5

baseUrl: string; // upstream API address

apiKey: string; // upstream API Key

weight: number; // load balancing weight

isEnabled: boolean; // whether enabled

healthStatus: string; // healthy | degraded | down

}

2. Smart Router

Routing decision flow:

Model parsing: auto → all channel candidates; specified model → match channels

Health filtering: exclude channels with isEnabled=false or healthStatus=down

Weighted selection: weighted random by weight, higher weight → higher probability of selection

Failover: request fails → auto-mark → retry next channel

python

def select_channel(model, channels):

candidates = [c for c in channels

if c.is_enabled and c.model_id == model]

if not candidates:

raise NoChannelAvailable()

total_weight = sum(c.weight for c in candidates)

r = random.uniform(0, total_weight)

cumulative = 0

for c in candidates:

cumulative += c.weight

if r <= cumulative:

return c

3. Health Check

The backend regularly sends lightweight probe requests to each channel, marking three states: healthy / degraded / down.

4. Protocol Conversion

API formats from upstream vendors vary widely, uniformly convert to OpenAI format output:

Upstream → Unified Intermediate Representation → OpenAI format response

5. Real-time Billing

python

cost = (

prompt_tokens * channel.prompt_price_per_1k / 1000 +

completion_tokens * channel.completion_price_per_1k / 1000

)

key.balance -= cost

key.total_cost += cost

Key Design Decisions

Why use SQLite instead of PostgreSQL?

Simple deployment: zero configuration, one file does it all

Sufficient performance: SQLite's read/write performance is more than enough for medium-scale API services

LiteFS / Turso and other solutions can easily scale to multi-node

Why not do model fine-tuning?

Our positioning is a "gateway" rather than a "model factory". Focus on channels and routing.

Why not introduce a caching layer?

Same input in AI conversations may not produce the same output (temperature, random seed). Blind caching may undermine generation diversity.

Performance Metrics

The additional latency of Ciyuano is typically 50-150ms, mainly from authentication, billing, and routing decisions.

When streaming, the time to first token (TTFT) is almost unaffected because we use a transparent proxy rather than buffered forwarding.

Summary

A good API middle station technically needs to do three things: compatibility, routing, billing. Doing these three well is the greatest value to developers.

Building an Intelligent Customer Service Robot Using AI API: A Complete Practical Solution

Intelligent customer service robots are one of the most common application scenarios of AI APIs. This article will guide you from scratch to build an intelligent customer service system that supports multi-turn dialogue, knowledge base retrieval, and streaming output.

Dev Practice

Build a DeepSeek chatbot using Streamlit and Ciyuano

Build a complete AI chat web application with Streamlit in 30 minutes, supporting streaming output, conversation memory, and multi-model switching. Entirely pure Python, zero frontend code.