Scaling CryptPad with multiple nodes

Zunaied

I am currently running a self-hosted CryptPad instance using Docker Compose, and I’m exploring the possibility of scaling the deployment horizontally by adding multiple nodes for improved availability and performance.

To test this, I’ve configured multiple CryptPad nodes with a shared storage backend using s3fs, mounting both the data and onlyoffice directories to a common S3 bucket. However, I’ve encountered issues with user synchronization—when a user registers on one node, logging in from another node results in an “invalid username or password” error. Unfortunately, I wasn’t able to find any official documentation on scaling CryptPad in a multi-node setup. Could anyone please clarify:

Is there any official or community-supported documentation on horizontally scaling CryptPad?
What is the recommended method for syncing user and document data between nodes in a multi-instance setup?
Are shared storage solutions like S3 (via s3fs) supported for this use case, or are there known limitations?
Is node-level load balancing possible for CryptPad in production, and if so, what are the best practices?

Any guidance would be greatly appreciated as I work toward building a scalable and robust CryptPad deployment.

Zibaglop

It seems even the CryptPad team does not know how to do that.
They are currently doing what looks like a daily backup with manual failover:
"We have a replicate server. So if a server goes down with all the data of CryptPad, we have the data from the day before, that is ready on a 2nd server. So we are able to restore the service on a new server."
source

Until the application is designed to run concurrently on multiple servers, the most you can do safely is some kind of automatic failover.

1 main server
1 failover server

File replication between the two (with "unison", for example).
CryptPad app started only on the main server.

A failover/loadbalancing service, monitoring and forwarding web traffic to the main server.
When the failover service detects that the main server has gone down, it switches the web traffic to the failover server.
The tricky part is to never have the application running on both servers at the same time, to avoid data corruption from background tasks, like cleaning expired data.
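
As a rough illustration of the switching logic described above, here is a minimal Python sketch. The class name, the "main"/"failover" labels and the threshold of consecutive failed checks are all assumptions for the example, not anything CryptPad or a particular load balancer provides. It deliberately never switches back on its own, so that the app cannot end up serving from both servers.

```python
from dataclasses import dataclass


@dataclass
class FailoverRouter:
    """Decides which server receives web traffic (hypothetical helper)."""

    failure_threshold: int = 3  # assumed: failed checks in a row before switching
    active: str = "main"
    consecutive_failures: int = 0

    def record_check(self, main_healthy: bool) -> str:
        """Feed one health-check result for the main server; return the target."""
        if self.active == "failover":
            # No automatic failback: going back to main is a manual step, done
            # after the app is stopped on the failover server and the data has
            # been replicated back.
            return self.active
        if main_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "failover"
        return self.active
```

A real setup would feed `record_check` from periodic HTTP health checks and use the returned value to update the proxy's upstream.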

So some other automation mechanism should be in place to start the app only on the active server, and stop it on the other server.
Taking and refreshing a lock on an external (HA) database could help with this.
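
To make the lock idea concrete, here is a minimal sketch of a lease (a lock with an expiry time) in Python. The `LeaseStore` class is a hypothetical in-memory stand-in for a single row in the external database. In a real deployment the check-and-write inside `try_acquire` would have to be one atomic operation on the database side (a conditional update, for instance), which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str        # name of the server that owns the lock
    expires_at: float  # time after which the lock may be taken over


class LeaseStore:
    """In-memory stand-in for a lock row in an HA database (assumption)."""

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None

    def try_acquire(self, holder: str, now: float, ttl: float) -> bool:
        """Take the lease if it is free or expired, or refresh it if we own it."""
        lease = self._lease
        if lease is None or lease.expires_at <= now or lease.holder == holder:
            self._lease = Lease(holder, now + ttl)
            return True
        return False
```

Each server would call `try_acquire` on a timer shorter than `ttl`, starting the app when the call succeeds and stopping it as soon as a refresh fails.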

If an application does not have built-in high availability, then it's not production ready.

CryptPad could gain high availability by storing all data in a MongoDB cluster (with files in GridFS), with some kind of database locking mechanism to coordinate background tasks between app servers.

The main challenge would be websocket servers.
All users of a channelid (those collaborating on the same document) must be able to exchange messages through the same websocket server, without transiting via the database, for performance reasons.
"ZeroMQ" could be a solution for that.
Or some kind of internal websocket channel-balancing.
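
One possible shape for such channel-balancing, purely as a sketch: map each channelid to a websocket server with rendezvous (highest-random-weight) hashing, so every front end picks the same server for a given document without any shared state. The function name and server names below are made up for the example, and nothing like this exists in CryptPad as far as this thread establishes.

```python
import hashlib


def pick_server(channel_id: str, servers: list[str]) -> str:
    """Return the websocket server responsible for channel_id.

    Rendezvous hashing: score every (server, channel) pair and keep the
    highest score. Removing a server only moves the channels it owned.
    """
    if not servers:
        raise ValueError("no websocket servers available")

    def score(server: str) -> bytes:
        return hashlib.sha256(f"{server}|{channel_id}".encode("utf-8")).digest()

    return max(servers, key=score)
```

This only answers "which server owns this channel"; moving users who are already connected when a server disappears is a separate problem.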

In any case, until the CryptPad team is interested in going the HA road, we are stuck with wacky failover solutions.