Scaling GraphQL with Postgres - Lessons learned from our database timeout issues

Martin Delobbe

October 3, 2023

Context

Last year was a challenging one at Cycle. We were in the middle of a big transition, dealing with a lot of technical issues while trying to monetize a product that had not yet delivered the vision we pitched.

On the technical side, the main problem was database timeouts, which made the servers unstable and caused them to restart constantly. The only “quick fix” was to increase the health check period and force a restart of all the servers. That was dirty and, more importantly, it did not fix the real problem. Why were we experiencing these timeouts?

It was crucial for the company to resolve this issue, as Cycle is a mission-critical product that some of our users rely on for more than 6 hours a day. Having the servers down meant we were disrupting their processes.

The deployment problem

Every time we deployed, we encountered timeout issues. This occurred specifically when there were many users connected and working on Cycle. We realized that schema desynchronization between the database and the API could be a reason for these errors, as the API was requesting data that did not match the database. However, even when we didn't have any breaking migration to run, we still experienced the same issue. In fact, right after every deployment, we saw a surge in database requests. It was as if we had a large number of new users connecting to the app simultaneously.

Reflecting on this now, I realize how trivial this problem seems. In Cycle, everything is real-time. We use GraphQL subscriptions and plain WebSockets for real-time collaboration. When the frontend loses its connection, it needs to reconnect and relaunch all the observable queries required for workspace configuration/initialization. When our servers restarted after a deployment, the hundreds of connected users would lose their connection and then reconnect almost simultaneously, triggering many queries at the same time and resulting in a kind of self-inflicted DDoS on our servers 🫣.

The solution is simple: instead of relaunching all the queries every time the client reconnects to the server, we check whether the client itself lost its internet connection. If it stayed online, there is no need to relaunch all the queries, since the workspace configuration is still up to date. When do we need to relaunch all the observable queries, then? Only when the client loses its connection to the internet, for instance when you close your laptop and reopen it with Cycle still open. That's it.

A small part of our ApolloLink configuration with the line that solved our reconnection issues.
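Independently of Apollo, the decision described above can be sketched as a small piece of state. This is only an illustration under assumptions: the class and method names are hypothetical, and how the offline signal and the reconnect callback are wired depends on the WebSocket client in use.

```typescript
// Minimal sketch with hypothetical names; not Cycle's actual ApolloLink code.
// It tracks whether the client itself went offline, so that a reconnect
// caused only by a server restart does not trigger a full refetch.
class ReconnectPolicy {
  private wentOffline = false;

  // Call when the client loses its internet connection
  // (in a browser, for example, from the window "offline" event).
  markOffline(): void {
    this.wentOffline = true;
  }

  // Call once the WebSocket is connected again.
  // Returns true only if the client was offline since the last connection,
  // meaning its cached workspace configuration may be stale.
  shouldRefetchOnReconnect(): boolean {
    const stale = this.wentOffline;
    this.wentOffline = false; // reset for the next reconnection
    return stale;
  }
}
```

With something like this in place, a deployment that restarts the servers makes every client reconnect, but only clients that were actually offline relaunch their observable queries.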

The big query problem

Solving this problem wasn’t enough. We still experienced issues when some workspaces had very complex view configurations in Cycle. One thing we know for sure is that Cycle has a very flexible data model. It can model any relational data structure and can be seen as a lightweight relational database. It’s a dream for no-coders, but not so much for developers.

To give you a sense of how intricate the data can be: the values are custom, the properties those values belong to are custom, and the links between those properties and the types of docs are custom. The hierarchy links between those types of docs are also custom. This gives us a very deep graph to query in the context of a view configuration. To meet our performance targets, we optimised all our dataloaders and followed all the GraphQL best practices (have a look here as our API is public), but that was not enough. Querying multi-dimensional group-by configurations with filters on those same levels was leading to database timeouts.

The more our users use Cycle, the more extensively they use its concepts (even more than we do), and the complex view configurations they created caused those timeouts when queried repeatedly. But why? When we took a closer look at the data structure, we saw that the tables responsible for the multi-dimensional group-by logic were missing proper indexes. Adding those simple indexes instantly solved our issues. This led us to review our indexing logic entirely and make drastic changes that significantly improved the performance of the whole app. This was a silly mistake on our part, and I recommend that everyone read the wonderful “Use The Index, Luke” guide by Markus Winand to fully understand the impact of good indexing and how to do it.
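To make the kind of fix concrete, here is a minimal sketch of a helper that produces composite index statements for Postgres. The table and column names are hypothetical, since Cycle's real schema is not shown in this post, and the helper itself is only an illustration, not part of any library.

```typescript
// Hypothetical table and column names, for illustration only.
// The helper interpolates identifiers directly, so it should only ever
// be fed trusted constants, never user input.
function createIndexSql(table: string, columns: string[]): string {
  if (columns.length === 0) throw new Error("at least one column is required");
  const name = `idx_${table}_${columns.join("_")}`;
  // CONCURRENTLY avoids blocking writes while the index is built.
  // Postgres does not allow it inside a transaction block.
  return `CREATE INDEX CONCURRENTLY IF NOT EXISTS ${name} ON ${table} (${columns.join(", ")});`;
}

// Composite indexes covering the columns a group-by view would filter and join on.
const indexStatements: string[] = [
  createIndexSql("doc_attribute", ["doc_id", "attribute_definition_id"]),
  createIndexSql("doc_link", ["parent_id", "child_id"]),
];
```

Column order matters in a composite index: it generally helps most when the leading columns are the ones the query filters or joins on.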

The big updates problem

The last source of database timeouts was found more quickly. Bulk updates are a significant part of Cycle: people use them to change a large amount of data all at once. Doing so on a big data set can quickly overwhelm the servers. Let’s say you want to update the status of 300 documents, and each status change triggers notifications and automations. If multiple people do this at the same time, the servers will crash for sure.

Implementing rate limits and complexity limits is not enough if you want to provide a seamless user experience. That’s when we introduced our queuing system in Cycle. Doing so was a game changer for the way we approached the state management of the API. Every action potentially harmful to the state was batched and added to the queue.

Not only did it drastically improve the health of the state in general, but most queuing systems also let you retry failed jobs and give you access to their inputs. This makes debugging sessions a lot easier and gives you more visibility over incoming data from your integrations.

The batch logic for the queuing system
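As a rough illustration of the idea, here is an in-memory sketch that splits a bulk update into fixed-size batches and retries failed batches a limited number of times. All names are hypothetical, the handler is synchronous for brevity, and a production setup would rely on a persistent queue rather than an array in memory.

```typescript
// Hypothetical job shape for this sketch.
interface Job<T> {
  id: number;
  payload: T[];
  attempts: number;
}

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

class BatchQueue<T> {
  private jobs: Job<T>[] = [];
  private nextId = 1;
  private readonly batchSize: number;
  private readonly maxAttempts: number;
  // Jobs that exhausted their attempts, kept with their input for debugging.
  readonly failed: Job<T>[] = [];

  constructor(batchSize: number, maxAttempts: number) {
    this.batchSize = batchSize;
    this.maxAttempts = maxAttempts;
  }

  // Turn one large mutation into several small jobs; returns the job count.
  enqueue(items: T[]): number {
    const batches = chunk(items, this.batchSize);
    for (const payload of batches) {
      this.jobs.push({ id: this.nextId++, payload, attempts: 0 });
    }
    return batches.length;
  }

  // Process jobs one at a time; a failing job goes back to the end of the
  // queue until maxAttempts is reached. Returns the number of successes.
  drain(handler: (batch: T[]) => void): number {
    let succeeded = 0;
    let job = this.jobs.shift();
    while (job !== undefined) {
      job.attempts += 1;
      try {
        handler(job.payload);
        succeeded += 1;
      } catch {
        if (job.attempts < this.maxAttempts) this.jobs.push(job);
        else this.failed.push(job);
      }
      job = this.jobs.shift();
    }
    return succeeded;
  }
}
```

With a batch size of 50, the 300-document status update from the example above becomes six jobs that are processed one after another instead of hitting the database all at once.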

What we learned from this

Nothing I described above is rocket science. It is common sense, but oftentimes we focus on cutting-edge solutions to solve what seem like very complex problems, while the root cause is actually much simpler.

The general rule of thumb would be: protect your state. By state, I mean the interface between your business logic and the access to third-party APIs or the database. This is the most critical part of your application, and taking great care of it can only improve the user (and dev) experience.

There are a lot of techniques out there to protect it. The ones I described above are only a very small subset, and their impact can depend on the type of application you are building. Here is an exhaustive list of the techniques that have worked at Cycle so far:

- Extensive use of dataloaders to batch and cache the query results
- Pagination of every potentially unlimited list, including all things related to view configuration
- Rework of the indexing logic, specifically for deeply linked data
- Setup of a queuing system to batch and mitigate intensive mutations
- Rate and complexity limits to avoid a single user/app using too many API resources
- Custom subscription service to have complete control of the data we publish
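As one example from this list, rate and complexity limits can be approximated with a token bucket in which each operation consumes tokens in proportion to its estimated cost. This is a generic sketch with hypothetical names and numbers, not a description of how Cycle implements its limits.

```typescript
// Token bucket sketch: one bucket per user or app (hypothetical setup).
// The clock is passed in explicitly so the behaviour is deterministic.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // `cost` is a stand-in for query complexity: heavier operations use more tokens.
  // Returns true if the operation is allowed, false if it should be rejected.
  tryConsume(cost: number, nowMs: number): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (cost > this.tokens) return false;
    this.tokens -= cost;
    return true;
  }
}
```

Passing the current time as an argument is a design choice for this sketch: it keeps the class free of timers and easy to reason about.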

With that mindset and a good observability setup, most API problems can be avoided and/or foreseen. This is how we achieved more than 99.95% uptime this year, and I can only hope this will continue to improve!
