Note: This is not a tutorial you can find just by googling “graphql subscription” or by looking at the Apollo Server documentation. Using GraphQL subscriptions can run into some unexpected difficulties, and I have not found any posts that discuss them in detail (I have only seen simple tutorials and code samples, and none of them solved my problem). So I decided to write one about why GraphQL subscriptions are more difficult and what we can do to improve their performance. I briefly touched on this in a previous post, but not in enough detail.
The post is split into several sections:
- What does a WebSocket server look like?
- Using a background worker to do the expensive work
- What is the difference between a regular broadcast and GraphQL pubsub?
- What can we do to improve GraphQL pubsub performance?
What does a WebSocket Server look like?
WebSocket is really just the connection protocol between browser and server. What we usually mean by “WebSocket server” is a server that keeps live connections with a number of clients and has the ability to receive messages and broadcast messages to all clients (pub/sub).
Basic WSS (WebSocket Server):
A barebones WebSocket server is simple: it just needs to keep a list of connections, and whenever it needs to broadcast a message, it loops through the list and sends the message to each of them.
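A minimal sketch of that loop (plain objects stand in for real WebSocket connections so it runs standalone; the function names are illustrative):

```javascript
// Keep a set of live connections; broadcasting = loop and send.
const clients = new Set();

function addClient(client) {
  clients.add(client);
  // In a real server you would also remove the client on disconnect:
  // client.on('close', () => clients.delete(client));
}

function publishMessage(message) {
  const data = JSON.stringify(message);
  for (const client of clients) {
    client.send(data);
  }
}
```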
Simple enough, right? To receive messages from clients, as in a chat app, you could add an “on message” handler on the connection, like the following:
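A sketch of that handler, chat-app style: every incoming message gets rebroadcast to every connected client (the `on`/`send` method names mirror the `ws` package, but any object with that shape works here):

```javascript
const clients = new Set();

function onConnection(socket) {
  clients.add(socket);
  // Chat-style echo: rebroadcast each message to everyone, sender included.
  socket.on('message', (raw) => {
    for (const client of clients) {
      client.send(raw);
    }
  });
  socket.on('close', () => clients.delete(socket));
}
```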
In reality, the place where you broadcast any message (i.e. where you call the `publishMessage` function) is best kept outside of a request cycle. If you have a lot of clients (say a million), simply looping through them and sending messages would cause the request cycle to lag quite a bit. The following example is a “not ideal” handling of broadcast:
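A sketch of the anti-pattern (a hypothetical “create user” handler; request body parsing is assumed to have happened already):

```javascript
const clients = new Set();

// NOT ideal: broadcasting synchronously inside the request cycle.
function handleCreateUser(req, res) {
  const user = { name: req.body.name }; // pretend we saved this to the DB
  // With a million clients, this loop runs before the response is sent
  // and makes the request lag badly.
  for (const client of clients) {
    client.send(JSON.stringify({ event: 'userCreated', user }));
  }
  res.end(JSON.stringify(user));
}
```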
Using Redis pub/sub to broadcast asynchronously
A common pattern to get out of this is to design the system architecture so that the WebSocket server (which handles broadcasting) and the HTTP server (which handles “create user” in this example) run on different machines.
Note that the “send signal” part should be a very fast request compared to calling the `publishMessage` function, which broadcasts asynchronously. This makes scaling the servers easier (if you get a lot of WS clients, add WSS servers; if you get a lot of GET/POST requests, add regular servers).
So how do we “send the signal to broadcast”? This is where we normally use pub/sub software such as Redis.
Obviously, in reality the clients may be listening to different topics, so we can use Redis channels to separate clients into groups and send messages to specific groups.
Complex broadcast logic in background worker
In some cases the broadcast message is not straightforward; it might need some expensive calculation. In that case, we should NOT put the calculation in the WSS server like the following:
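A sketch of the anti-pattern (names and the “expensive” computation are placeholders):

```javascript
const clients = new Set();

// NOT ideal: the WSS server itself does the expensive calculation
// when a signal arrives, blocking its event loop.
function onRedisSignal(signal) {
  const result = expensiveCalculation(signal);
  for (const client of clients) {
    client.send(JSON.stringify(result));
  }
}

function expensiveCalculation(signal) {
  // placeholder for heavy work (aggregation, DB joins, etc.)
  return { userId: signal.userId, score: signal.userId * 2 };
}
```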
Instead, the work should be delegated to a background worker:
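A sketch of the better split: the worker computes the result and publishes it, while the WSS server only forwards what it receives (function names are illustrative; in production `publishResult` would go through Redis):

```javascript
const clients = new Set();

// Runs in the worker process, far away from the WS connections.
function workerHandle(task, publishResult) {
  const result = { userId: task.userId, score: task.userId * 2 }; // heavy work
  publishResult(result);
}

// Runs in the WSS server: nothing to compute, just fan out.
function onResultFromRedis(result) {
  for (const client of clients) {
    client.send(JSON.stringify(result));
  }
}
```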
Moreover, this pattern has one more important benefit: debouncing tasks to reduce redundant work. In simple terms, if a background job is triggered multiple times within a short period, it only needs to run once. I described this technique in more detail in a previous post.
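The debounce idea can be sketched in a few lines (a generic timer-based version, not the exact implementation from the earlier post):

```javascript
// If the job is triggered many times within `delayMs`, run it only
// once, with the latest arguments.
function debounceJob(job, delayMs) {
  let timer = null;
  let lastArgs = null;
  return (...args) => {
    lastArgs = args;
    if (timer) clearTimeout(timer);
    timer = setTimeout(() => {
      timer = null;
      job(...lastArgs);
    }, delayMs);
  };
}
```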
Summary
An efficient WebSocket server should only manage the connected clients (put them in different groups, assign IDs to them, make clients easy to look up by ID, etc.) and broadcast to clients the messages it receives from Redis. The broadcast trigger and the expensive computation should be delegated to a background worker.
Regular WebSocket broadcast vs GraphQL Subscription
A regular WebSocket broadcast works as depicted in the following figure:
When an event is triggered, the server starts a background job to calculate some result, which is then published to Redis. Redis notifies all of its subscribers (the WebSocket servers), and each server receives the result, loops over its connections (or finds the target connection), and sends the result to the client (browser).
A GraphQL server, however, works differently.
Regardless of which library you use (Apollo Server, Envelop, etc.), the figure above shows that when an event is triggered, there is no background worker to process the result. The main reason is that each client, although subscribing to the same root field, could be asking for different data.
Two clients can both subscribe to the `userUpdated` event yet ask for different fields (query2 asks for extra `posts` data).
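For illustration, the two subscription queries might look like this (the field names are hypothetical):

```graphql
# query1: only wants the user's name
subscription {
  userUpdated {
    id
    name
  }
}

# query2: additionally asks for the user's posts
subscription {
  userUpdated {
    id
    name
    posts {
      title
    }
  }
}
```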
Because of this, the GraphQL server must execute each client’s subscription query when broadcasting the event. With `n` clients, the server needs to process `n` queries. This can have a significant impact on performance, and it is very different from the regular WebSocket server setup, where all clients subscribe to the same data.
The impact is twofold:
- Spending resources in the WS server vs spending resources in a background worker.
- Calculating `n` queries instead of 1.
WS Server vs Background Worker
Obviously, if computation resources were unlimited, none of this performance impact would matter: just throw in more money and the problem is solved. In general, though, scaling workers is much easier than scaling servers. If your server is not powerful enough, scaling it (adding more machines or moving to a more powerful machine) is more expensive than scaling the workers: a worker can be any machine that runs a script, triggered by a broker, while an extra server requires load balancing, reverse proxying, etc. (A broker is a service that lets servers put tasks in, and then sends those tasks to workers.)
Therefore, if possible, the WS server should just be a thin layer of “receive the result from somewhere and send it to the client”, instead of “receive a signal, calculate the result, and send it to the client”.
However, this rule is hard to apply to a GraphQL server. The reason is that when a client subscribes, it also sends a GraphQL query that describes the data it is requesting. If you want a background worker to calculate the result, the worker needs to know that GraphQL query. Getting the query to the worker is quite complicated. Not impossible, but complicated.
You could send the query text as a parameter; that is the easy solution, but the query text is usually pretty long, which could eat up a lot of bandwidth on the broker.
Another option is to store each client’s subscription query in a database, or in Memcached using a hash of the query as the key; the background worker can then fetch the query, run it, and send the result back.
The above two options are both feasible ways to redirect the expensive computation to workers. However, they are pretty difficult to implement in the current GraphQL frameworks.
Let’s take a look at the subscription demo code from the Apollo docs:
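The resolver shape from the Apollo docs looks like the sketch below. To keep it runnable standalone, a tiny in-memory class stands in for the PubSub implementation (graphql-subscriptions or graphql-redis-subscriptions in real code); it only supports the one method we need here:

```javascript
// Minimal in-memory stand-in for the pubsub object in the Apollo demo.
// Note: messages published while nobody is waiting are dropped; a real
// PubSub implementation buffers them.
class TinyPubSub {
  constructor() {
    this.waiters = new Map(); // event name -> pending next() resolvers
  }
  publish(event, payload) {
    for (const resolve of this.waiters.get(event) || []) {
      resolve({ value: payload, done: false });
    }
    this.waiters.set(event, []);
  }
  asyncIterator(events) {
    const waiters = this.waiters;
    const [event] = events;
    return {
      next() {
        return new Promise((resolve) => {
          if (!waiters.has(event)) waiters.set(event, []);
          waiters.get(event).push(resolve);
        });
      },
      [Symbol.asyncIterator]() { return this; },
    };
  }
}

const pubsub = new TinyPubSub();

// The shape from the Apollo Server subscription docs: subscribing to
// postCreated returns an AsyncIterator tied to the 'POST_CREATED' event.
const resolvers = {
  Subscription: {
    postCreated: {
      subscribe: () => pubsub.asyncIterator(['POST_CREATED']),
    },
  },
};
```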
When a client subscribes to the `postCreated` field, the server creates and stores an `AsyncIterator` object. This object is triggered whenever the pubsub framework (in our case, Redis) broadcasts a `POST_CREATED` event. As soon as the object is triggered, the server executes the `resolve` function associated with the `postCreated` field. Using our `userCreated` field as an example:
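A sketch of that `resolve` function (field and payload names are illustrative): it runs once per subscribed client when the event fires, shaping the raw event payload into the data each client's query asked for.

```javascript
const resolvers = {
  Subscription: {
    userCreated: {
      // subscribe is omitted here; it would return an AsyncIterator
      // tied to the 'USER_CREATED' event, as in the Apollo demo.
      subscribe: () => { throw new Error('omitted in this sketch'); },
      // Executed per client, per event: this is the work that multiplies
      // with the number of subscribers.
      resolve: (payload) => ({
        id: payload.id,
        name: `${payload.firstName} ${payload.lastName}`,
      }),
    },
  },
};
```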
The process flow is as follows:
To implement the above two approaches so that we can redirect the expensive calculations to a background worker, we need to modify the process flow to:
Needless to say, this requires a significant amount of work to hack around the existing `AsyncIterator` interface. I have spent time hacking on it, but it is nowhere close to a production-ready, stable solution.
So what now? What can we do to improve GraphQL Subscription?
If we cannot really implement this background worker solution, let’s at least reduce some redundant work.
As you may have already noticed, the WS server stores a record for every client that subscribes to a field (e.g. the `userCreated` field); when the event is triggered, each record calls the associated `resolve` function once and sends the result to its client. These resolver calls are largely redundant (even though their parameters differ), so we can still save a lot of time by not repeating the work.
For example, we could add a global cache layer in front of the data sources called from the resolvers.
Obviously this can cause stale data issues (data not up to date). In theory caching is a generic solution for the entire GraphQL layer, but for the purpose of reducing redundant work in GraphQL subscriptions, we just want all the clients triggered by the same `userCreated` event to group together and call the data source once. It is effectively “instead of caching the value for a certain amount of time, cache it for one tick”.
This can be achieved with DataLoader: instead of “caching for one tick”, it collects all requests within one tick and executes them only once.
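The core batching idea can be sketched in a few lines (a minimal toy, not the real dataloader package, which additionally dedupes and caches keys):

```javascript
// All load(key) calls made in the same tick are collected and resolved
// with ONE call to batchFn(keys).
function makeLoader(batchFn) {
  let pending = null; // keys and resolvers queued during the current tick
  return function load(key) {
    if (!pending) {
      pending = { keys: [], resolvers: [] };
      // Flush on the microtask queue: everything queued this tick batches.
      queueMicrotask(async () => {
        const batch = pending;
        pending = null;
        const results = await batchFn(batch.keys);
        batch.resolvers.forEach((resolve, i) => resolve(results[i]));
      });
    }
    pending.keys.push(key);
    return new Promise((resolve) => pending.resolvers.push(resolve));
  };
}
```

In the subscription scenario, every client's resolver would call `load(userId)`; however many clients the `userCreated` event wakes up, the data source gets hit once per tick.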
Note that this DataLoader needs to be globally available and long-lived (not per request). Because of this, its `cache` option needs to be turned off: a global in-memory cache is dangerous, since it can cause data to get out of sync between servers (server 1’s cache still holds unexpired data while server 2’s has expired and is fetching fresh data).
If you want to know more, I suggest looking back at one of my previous posts.
Summary
In this post, I explained in detail what a regular pub/sub WebSocket server looks like and compared it with GraphQL subscriptions. GraphQL’s “execute the query per client per event” nature can easily hurt server performance, so a large-scale subscription server must be carefully designed and optimized. I gave some examples of what can be done to improve subscription performance.
If you have more questions, feel free to post comments, I will get back to you!
Also, follow me please!