PostGraphile in Production: What to Do and What Not to Do

Han
Mar 26, 2022

PostGraphile is a very performant, one-click GraphQL server. We tried it three years ago, really liked it, and have been using it ever since. However, as our business grew, we ran into a number of pain points that made our API server unbearable, and a number of design flaws were revealed as the data size grew. I compiled a list of things we should have enforced at the beginning, before they became too hard to refactor.

Don’ts

Don’t use the connection-filter plugin without understanding the consequences.

There are several plugins that make a front-end developer's life very easy, such as connection-filter. As a backend developer, it is very tempting to just enable this plugin so that front-end developers don't have to ask the backend for a new API or resolver; they can just apply a filter and get the data. However, since the nature of any application is to stay backward compatible, once you enable this plugin it is almost impossible to disable it in the future. Things soon get out of hand: some intern or junior developer ends up injecting a crazy filter that works fine with little data but, two years later, explodes in our faces. It took quite a while to track that down, due to a bunch of mixed issues which I will cover later. Instead, try adding conditions manually with makeAddPgTableConditionPlugin. It takes some time at the beginning, but trust me, it is worth it: it makes refactoring much easier.

Of course, it is a very useful plugin, and I am definitely not saying don't use it. I just mean that when you do use it, make sure your team members understand the plugin's limitations, and use e2e tests to ensure nothing breaks.

Don’t make expensive computed columns

PostGraphile makes it very easy to add new fields to existing models: you just create a pg function named a specific way ([model]_[field_name]), and it will appear on [model] as a new field [field_name]. But it is crucial to keep that function cheap. Computed columns that run aggregate functions and the like will almost inevitably hit scaling issues. They might be fine now, but once the data size grows, you get paged at night.
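For example, here is a minimal sketch of the naming convention, assuming hypothetical users and posts tables; the first function is cheap, while the second hides an aggregate of exactly the kind that hurts later:

-- Cheap: shows up on the User type as a fullName field.
create function users_full_name(u users) returns text as $$
  select u.first_name || ' ' || u.last_name;
$$ language sql stable;

-- Expensive: shows up as postCount, but runs an aggregate
-- for every user row the query touches.
create function users_post_count(u users) returns int as $$
  select count(*)::int from posts where posts.author_id = u.id;
$$ language sql stable;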

Also, once you have added a computed column, at some point a project manager may ask whether clients can filter on this field, or order by it. So you add a makeAddPgTableOrderByPlugin, and because the pg function has to be evaluated for every record in the table in order to do the ordering, the query very likely gets slower and slower. My company's tables have grown exponentially over the past three years, and we started to notice a lot of these computed columns consuming too many resources and even taking down the entire database.

Don’t make expensive views

I know I just said "the pg function is calculated for every row", but PostgreSQL may be smart enough to inline the function's SQL into the main query, do some rearranging, and speed up the process dramatically. The same kind of optimization also applies to views, and sometimes it works better for views than for pg functions. With the power of views and the @foreignKey smart tag, it is very easy to extend a model with more fields. For example:

type User {
  id: Int
  name: String
}

# This is a view with a fake foreign key comment (a smart tag) on the view,
# so these two models are associated.
type UserInfo {
  userId: Int
  postCount: Int
  avgReview: Float
}
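Behind the scenes, UserInfo can simply be a view plus a smart comment. Here is a minimal sketch, assuming hypothetical users, posts, and reviews tables:

-- Aggregate a few per-user stats into a view.
create view user_info as
select u.id as user_id,
       count(distinct p.id)::int as post_count,
       avg(r.rating)::float as avg_review
from users u
left join posts p on p.author_id = u.id
left join reviews r on r.post_id = p.id
group by u.id;

-- The smart comment tells PostGraphile to treat user_id as a foreign key,
-- so UserInfo shows up as a relation of User.
comment on view user_info is
  E'@foreignKey (user_id) references users (id)';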

But it is also very important to keep the scaling issues of views in mind. We had a tree-structured model in our database: a service goes through a number of changes (which can happen at any time) and ends up with a final price.

To answer questions like "What was the price of every service on 2021-02-01?" or "What was the price of every service two months later?", we used a Postgres recursive view to calculate every service's price at any point in time. Then we used a @foreignKey smart tag to associate the view with the service table, and done: we get the data. However, a year later our service table and the changes table had grown dramatically and this view had become very expensive, so we turned it into a materialized view and refreshed it every hour.
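As a hedged sketch of the shape of such a recursive view (the schema below is hypothetical and much simpler than the real one):

-- Hypothetical schema:
--   services(id, base_price, created_at)
--   service_changes(id, service_id, previous_change_id, new_price, changed_at)
create view service_price_history as
with recursive history as (
  -- Base case: every service starts at its original price.
  select s.id as service_id,
         null::int as change_id,
         s.base_price as price,
         s.created_at as valid_from
  from services s
  union all
  -- Recursive case: apply each change on top of the state it modified.
  select h.service_id,
         c.id,
         c.new_price,
         c.changed_at
  from history h
  join service_changes c
    on c.service_id = h.service_id
   and c.previous_change_id is not distinct from h.change_id
)
select * from history;

-- The smart comment exposes the view as a relation of Service.
comment on view service_price_history is
  E'@foreignKey (service_id) references services (id)';

And here comes my third point: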

Don’t make extremely large materialized views

The recursive view is much larger than the service table and the changes table combined. Each time it refreshes, it takes so much memory that it almost completely freezes the database for a couple of minutes, or even breaks Postgres and forces it into a restart. This became unbearable very quickly, because when Postgres is down, everything is down.
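For context, the pattern looks roughly like this, reusing the hypothetical service_price_history view from the sketch above; note that every refresh redoes the whole computation:

-- Materialize the expensive view and refresh it on a schedule
-- (e.g. hourly, from cron or a background worker).
create materialized view service_price_history_mat as
  select * from service_price_history;

-- There is no incremental refresh in core PostgreSQL: each run redoes all
-- the work. REFRESH ... CONCURRENTLY (which requires a unique index) avoids
-- locking readers of the view, but pays the same recomputation cost.
refresh materialized view service_price_history_mat;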

We were once asked to build a front end that shows all the financial records of each customer, and we decided to put that into a materialized view refreshed every hour. Because the view is built from many tables (invoices, bills, services, users, third-party invoices, etc.) and is so complex (over 500 lines of SQL), the refresh is so expensive that once the view reached 10 million records it was just not possible to do anything else for about 10 minutes.

Summary of Don’ts

Basically, all of these don'ts point to one thing: don't let read requests take too long. This is obvious at first, but easily forgotten when developers are rushed toward a deadline and choose the easy route. (For a better alternative, see the Dos list below.) Even if you do create views or pg functions, really think about scalability, and stress test them before putting them into production.

So the lessons learned here are:

  1. Always think about scalability when building pg functions, views, and materialized views.
  2. Restrict what clients can filter and order by, instead of giving them complete freedom, so that it is easier to optimize in the future.

Dos

Read the official docs

In particular, the sections on improving performance and best practices are quite important and extremely helpful.

Take the time to build triggers correctly

This is a very broad topic, so let me walk through it. As an application starts to grow, the logic behind some GraphQL queries becomes more complex, and clients request very specific features that cannot easily be served by a lookup by ID. Let's say you have a blog website and you are building an admin panel. A starting GraphQL schema might support a field like getPosts. But soon there may be a request for a feature like "give me the list of posts from authors who have received at least ten five-star ratings in the last week". There are two ways to approach it:

  1. Option 1: Spend more time on reads. Building a complex SQL query to retrieve the data is the easy strategy. But there is a significant issue with it: as the data grows, the complex SQL takes longer and longer to execute, and eventually this GraphQL field becomes too slow to complete.
  2. Option 2: Spend more time on writes. A better alternative, in this example, would be to create a column called "have_ten_stars_last_week" on the author table and build triggers that update this value every time an author is given a star. This way the read query is much faster and more scalable; we just need to make sure the field is always up to date.

Compared to option 1, option 2 is much more complex to implement. Assuming the trigger logic fires on mutation, a certain amount of framework needs to be built around it to ensure it fires correctly and consistently. For example, a common way to set it up is like this:
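A sketch of that pattern, assuming a hypothetical authors/stars schema and a hypothetical create_star custom mutation function; the derived column is recomputed inline in each mutation:

-- Hypothetical schema: authors(id, have_ten_stars_last_week boolean)
-- and stars(id, author_id, rating, created_at).
-- Each custom mutation that can affect the stat recomputes it inline.
create function create_star(author_id int, rating int) returns stars as $$
declare
  s stars;
begin
  insert into stars (author_id, rating)
    values (create_star.author_id, create_star.rating)
    returning * into s;

  -- Easy to forget: every other mutation that touches stars
  -- has to repeat this exact update.
  update authors a
     set have_ten_stars_last_week = (
       select count(*) >= 10
       from stars
       where stars.author_id = a.id
         and stars.rating = 5
         and stars.created_at > now() - interval '7 days')
   where a.id = create_star.author_id;

  return s;
end;
$$ language plpgsql volatile;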

This certainly works, but if more mutations that touch this data are added, you have to put the trigger logic everywhere, and it is very easy to forget. So a slightly better way is:
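A sketch of the alternative, assuming the same hypothetical schema: the same bookkeeping expressed as a database-level trigger, so it fires no matter which code path writes to stars:

-- One trigger function, attached to the stars table itself.
create function refresh_have_ten_stars_last_week() returns trigger as $$
declare
  target_author int;
begin
  if tg_op = 'DELETE' then
    target_author := old.author_id;
  else
    target_author := new.author_id;
  end if;

  update authors a
     set have_ten_stars_last_week = (
       select count(*) >= 10
       from stars s
       where s.author_id = a.id
         and s.rating = 5
         and s.created_at > now() - interval '7 days')
   where a.id = target_author;

  return null;  -- the return value of an AFTER trigger is ignored
end;
$$ language plpgsql;

create trigger stars_refresh_author_flag
  after insert or update or delete on stars
  for each row
  execute function refresh_have_ten_stars_last_week();

Now every write path (generated mutations, custom mutations, bulk imports) keeps the column up to date.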

Triggers should also be deferred when bulk updating. Sometimes, to ensure the mutation resolves as soon as possible, the trigger action is pushed to a background worker so that it does not block the request; then you also need some retry mechanism to guarantee that the trigger action eventually completes, so that the column value stays up to date. If we really want to do it right, this might expand into a pub/sub framework combined with Domain-Driven Design to establish a full framework, but that is out of the scope of this blog.
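As one hedged example of deferring this work for bulk writes: a statement-level trigger with a transition table (available in newer PostgreSQL versions) could replace the per-row insert trigger above, so a CSV import of 100,000 stars recomputes each affected author once rather than 100,000 times (the update and delete paths would need equivalents):

-- Fires once per INSERT statement and sees every inserted row through
-- the transition table, so bulk imports recompute each affected author once.
create function refresh_flags_after_bulk_insert() returns trigger as $$
begin
  update authors a
     set have_ten_stars_last_week = (
       select count(*) >= 10
       from stars s
       where s.author_id = a.id
         and s.rating = 5
         and s.created_at > now() - interval '7 days')
   where a.id in (select distinct author_id from new_stars);
  return null;
end;
$$ language plpgsql;

create trigger stars_bulk_refresh
  after insert on stars
  referencing new table as new_stars
  for each statement
  execute function refresh_flags_after_bulk_insert();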

The key here is that when a feature request forces the developer to pick between these two options, always try to use option 2 if possible. It slows down initial development, but it is much more sustainable and scalable.

In PostgreSQL terms: try not to rely on views, which essentially add time to reads; instead, build columns that store good intermediate values to speed up the read query.

Write your own mutations soon

This is a quite opinionated point. PostGraphile can get you pretty far as a one-click solution, but it has its limitations. In the web applications I have encountered, there are always at least some models that fit the CRUD pattern and benefit from the generated query and mutation fields. When the web application is simple, a lot of bad habits are OK; for example, people even tend to fetch a lot of data and calculate summary stats on the front end, which becomes a very bad idea once the data grows. As the application grows, at least three things can happen:

  1. The generic create mutation is no longer feasible for large models, or for tables that have been vertically partitioned. There are also design arguments for avoiding anemic mutations, and even objections to an entirely anemic model built on pure CRUD, which I feel is a bit too strong, but it does make sense to some degree. When these things happen, the developer needs to create extra mutations that mutate specific fields and/or calculate stats in a "post update" trigger, etc.
  2. Business logic (BL) grows, and the dependencies between functions become so tangled that they are almost impossible to manage with pg functions the way PostGraphile suggests. For example, we used pg functions triggered on the creation of one model, but later we added a CSV import feature, and because the pg function was triggered for each creation, it significantly slowed down the database; we had to figure out a way to defer the trigger when bulk updating. This pushes the developer to find alternative ways to implement the mutations: even if they keep the generic create/update/delete format, they still need to implement the core in an object-oriented fashion, with an ORM and the like.
  3. Data size also grows as the web app evolves. Many times, a front-end (client) feature requires some calculated stats from the backend, from something as simple as a blog count to something as complicated as a user's average blog view time per month. In that case, the developer faces the two options I laid out previously: 1) write a view, spending more time on reads, or 2) write trigger logic inside mutations, spending more time on writes. Once the data size is large, option 1 is almost never a good idea, and option 2 requires the developer to start inserting code into the mutations generated by PostGraphile.

A developer is likely to meet at least one of these scenarios in a medium or large web application and be forced to modify the generated mutations. I highly recommend that when this happens, you start adopting an architecture for reimplementing these mutations. (The generated mutations are just generic CRUD operations; it won't be that hard to reimplement them with any ORM.) Something like DDD (Domain-Driven Design) would be a great pattern to start with.

Summary

In summary, because most web applications' read:write ratio is much, much higher than 10:1, it is always a good idea to spend a bit more time on the write request rather than on the read.

  1. Be very careful when creating views; they are just complex SQL that runs at read time to retrieve the data returned to the user.
  2. Try to use a pub/sub framework to handle data dependencies (blog creation event -> trigger -> recalculate blog count).
  3. Start thinking about an architecture style and design patterns for all of your state changes, and write your own mutations early.

