Best Practices for a Stable and Resilient Application

Hakoub Esfahani

Technology Innovator

October 17, 2024 · 5 min

The last thing you want on release day is a flood of user complaints, error logs, and support tickets. In a highly competitive industry, it is extremely easy to frustrate users and lose them to competitors. If you are building something ambitious, this guide walks you through the best practices to launch a product that is not only exciting but also stable, resilient, and ready to scale.

1. Testing, testing, testing

Quality Assurance is not just a department — it is a mindset. Everyone on the team shares responsibility for what ends up in users’ hands. Here is how to build that mindset into your workflow:

Smoke-test your own code. Before passing anything to QA, developers should test it thoroughly. Catch the easy stuff early.
Build a culture of ownership in the team. This cuts down the back-and-forth and the repeated deployments between the developer, QA, and DevOps teams, which improves both the product's stability and the team's efficiency.
Design for unit testing and automation. You cannot write unit tests unless your code is broken into true units, rather than giant functions that try to do everything. A quick way to spot trouble is to look at your function names: if they are vague catch-alls like "Manager" or "Process," you probably do not have a unit. Another red flag is struggling to name a class or function; if it is hard to label, it is tackling too many responsibilities.

Good design hardly adds any time up front, yet it saves you countless hours later. We all want to get the product out as soon as possible, but launching an unstable product will cost you more in the long run. Retrofitting stability after users are already poking around is slow, tedious, and sometimes impossible without lengthy downtime—especially when large volumes of data are involved.
Prioritise the test cases. You can’t test everything in the time available, so the team needs to agree on the user flows that matter most. Once the product is live, telemetry will help refine those priorities.
Test in the right environment. “It works on my machine” does not cut it in the cloud. Invest in DevOps tooling that gives you development, staging, and production setups that are as close to identical as possible.
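
As a minimal sketch of the "true units" point above, the following Python splits an order-total calculation into small, single-purpose functions that can each be tested in isolation. Every name here (LineItem, subtotal, apply_discount, order_total) and the discount rule itself are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    """Hypothetical order line, used only for this illustration."""
    unit_price_cents: int
    quantity: int


def subtotal(items: list[LineItem]) -> int:
    """Sum of price * quantity across all lines, in cents."""
    return sum(item.unit_price_cents * item.quantity for item in items)


def apply_discount(amount_cents: int, percent: int) -> int:
    """Reduce an amount by a whole-number percentage between 0 and 100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return amount_cents - (amount_cents * percent) // 100


def order_total(items: list[LineItem], discount_percent: int = 0) -> int:
    """Compose the two units above; there is no I/O, so it is easy to test."""
    return apply_discount(subtotal(items), discount_percent)
```

Each function has one job and a name that says what it does, so a unit test for it is only a few assertions long. A single "process_order" function that also wrote to the database and sent emails would be much harder to cover.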

If something can go wrong, it will go wrong. Print this phrase on a large poster and hang it at the office. Resilient software starts with disciplined exception handling and retries that use exponential backoff. Whether calling a third‑party API or running a query against your own database, assume failure is inevitable and plan accordingly. One inadequate response can undo a thousand successful calls.
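
The retry advice above can be sketched as a small helper. This is one possible shape, not a definitive implementation: the function name retry_with_backoff, its default values, and the choice of which exceptions count as transient are all assumptions. The random "full jitter" delay is an addition the text does not mention; it is commonly used so that many clients do not retry at the same instant.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    retry_on: tuple = (ConnectionError, TimeoutError),  # assumed "transient" errors
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call operation(), retrying transient failures with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The delay ceiling doubles each attempt (base, 2x, 4x, ...), capped.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random amount between zero and the ceiling.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

Only exceptions listed in retry_on are retried. Anything else, such as a validation error, propagates immediately, because repeating a request that can never succeed only adds load.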

Good resilience also depends on clear, concise logging. Record every failure and key event, but keep “log spam” to a minimum so engineers can read what matters. In microservice architectures, add event correlation: generate a GUID at the entry point of each user request and trace it across every service. Debugging becomes far simpler when all logs share the same correlation ID.
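
One way to attach a correlation ID to every log line in a Python service is sketched below, using only the standard library. The names correlation_id, CorrelationIdFilter, build_logger, and start_request are hypothetical, and how the ID travels between services (for example, in a request header) is left to your framework.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the ID for the request currently being handled; "-" means "no request".
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True  # never drop the record, only annotate it


def build_logger(name: str = "app") -> logging.Logger:
    """Create a logger whose output includes the correlation ID."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s [%(correlation_id)s] %(message)s"))
        handler.addFilter(CorrelationIdFilter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def start_request(incoming_id: Optional[str] = None) -> str:
    """Reuse the ID forwarded by an upstream service, else generate a new GUID."""
    request_id = incoming_id or str(uuid.uuid4())
    correlation_id.set(request_id)
    return request_id
```

The entry-point service calls start_request() with no argument. Downstream services call it with the ID they received, so a search for that one value returns the whole request's trail.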

That tiny warning you decided to overlook can join forces with a dozen others and create a disaster. Document each bug, map its potential impact, and understand how seemingly unrelated issues interact. Think of Reason’s Swiss‑cheese model — multiple layers with small holes that can line up and let failure slip through. Close the gaps before they align.

Frequent, small, staged roll‑outs are one of DevOps's biggest gifts. By releasing to a limited slice of your user base first, you contain the blast radius: if something breaks, only a subset feels it while you watch the metrics and logs like a hawk.
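
Releasing to "a limited slice" is often implemented by hashing a stable user identifier into a bucket. The sketch below assumes string user IDs and a per-feature rollout percentage; the function names and the feature key are hypothetical, and real deployments frequently delegate this to a feature-flag service or to the load balancer instead.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given feature."""
    # Including the feature name means different features pick different users.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return rollout_bucket(user_id, feature) < rollout_percent
```

Because the bucket depends only on the user and the feature, a user who is included at 10% stays included when the rollout grows to 50%, and setting the percentage back to 0 switches the feature off for everyone.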

To make that safe, your application and CI/CD pipeline must support instant rollbacks. That is straightforward for stateless services, but gets tricky once you have a relational database whose schema changes every sprint. The rule of thumb is to treat schema migrations like gold, perform them sparingly, and keep them backwards-compatible so multiple app versions can run side by side. Run every migration through the pipeline, never inside the application at start‑up.
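
To illustrate what "backwards-compatible" can mean in practice, here is a minimal sketch of an additive migration, using SQLite from the Python standard library purely for demonstration. The users table, the display_name column, and the function name are hypothetical; a production pipeline would normally use a dedicated migration tool.

```python
import sqlite3


def migrate_add_display_name(conn: sqlite3.Connection) -> None:
    """Additive step: a new nullable column that older app versions can ignore."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(users)")}
    if "display_name" not in columns:  # idempotent, so the pipeline can re-run it
        conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.commit()
```

The old application version keeps inserting rows without display_name, and the new version starts writing it, so both can run side by side and a rollback of the application needs no schema change. Removing or renaming a column would be deferred to a later release, once no running version depends on it.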

Post‑deployment tests come next. These are targeted scripts that verify critical paths immediately after each dev, staging, and production release. Once they pass in every environment, you can be reasonably confident that your application and infrastructure will behave as expected for your users.

Wait, what about pre-deployment? I’m glad you asked. Run the same health scripts before you deploy. That way, you know you are starting from a clean slate. And whatever you do, back up the database before every release, even if the migration file is empty this time. Better safe than sorry.
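
A health script of the kind described above can be very small. The sketch below assumes that every critical path answers a plain GET with HTTP 200; the function names and any paths you pass in (for example "/health") are hypothetical and need adapting to your own application.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, and similar


def run_smoke_checks(
    base_url: str,
    paths: Iterable[str],
    fetch: Callable[[str], int] = http_status,  # injectable for testing
) -> dict[str, int]:
    """Check each critical path and return the ones that did not answer 200."""
    failures: dict[str, int] = {}
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        if status != 200:
            failures[path] = status
    return failures
```

In a pipeline, you would call run_smoke_checks once before the deployment and once after it, and fail the stage (and trigger the rollback) whenever the returned dictionary is not empty.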

When scalability comes up early, the usual reaction is, “Why worry about that now? We do not even have users yet.” That's a fair point, but do you really want your app to buckle the moment it hits the numbers you have been dreaming of? I guess not.

Treating scalability as “something for later” is risky because most systems show their cracks the instant traffic spikes, and frustrated users leave just as quickly. Design for scale from day one, or at least load‑test early and set realistic expectations for how many users your application can handle.

The first thing to look at is usually the database, especially if it’s relational, as it can quickly become the bottleneck. Here are some best practices to follow:

Query optimisation: Make sure the queries do not request more data than they need, whether it’s the time range or the requested columns. Filter the data with appropriate and index-friendly WHERE clauses.
Proper Indexing: Based on the queries, add indexes to the tables to ensure efficient query handling by the database engine.
Partitioning: Indexing improves read performance but slows down writes, because the database engine must update every affected index on each write. Partitioning, by contrast, can improve both read and write speeds when queries filter on the partition key, since each operation then touches only one smaller partition. Choose a logical partition key early in development, because repartitioning a large table later is expensive.
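
The first two points can be checked by asking the database how it plans to run a query. The sketch below uses SQLite from the Python standard library only because it needs no setup; the orders table, its columns, and the index name are hypothetical, and other engines expose the same idea through their own EXPLAIN commands.

```python
import sqlite3

# Request only the columns needed and filter on index-friendly conditions.
QUERY = (
    "SELECT id, total_cents FROM orders "
    "WHERE customer_id = ? AND created_at >= ?"
)


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple) -> str:
    """Return SQLite's human-readable plan for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan text


def compare_plans() -> tuple[str, str]:
    """Show the plan for QUERY before and after adding a matching index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, "
        "created_at TEXT NOT NULL, total_cents INTEGER NOT NULL)"
    )
    params = (42, "2024-01-01")
    before = query_plan(conn, QUERY, params)
    # Equality column first, range column second, matching the WHERE clause.
    conn.execute(
        "CREATE INDEX idx_orders_customer_created "
        "ON orders (customer_id, created_at)"
    )
    after = query_plan(conn, QUERY, params)
    conn.close()
    return before, after
```

Before the index exists, the plan reports a scan of the whole table. Afterwards it reports a search that uses idx_orders_customer_created. The exact wording of the plan text varies between SQLite versions.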

The next component, depending on the application's architecture, is usually the Web layer. Here are some tips for scaling the web layer:

Rate Limiting and Throttling: Limit the number of requests that can be made to the web services in a given time period. If exceeded, return HTTP Status 429 with a Retry-After header to the client so that it knows how long to wait before retrying the request.
Caching: Use CDNs and/or Redis to cache common and frequent responses. Make sure you have a cache invalidation mechanism in place to serve the latest version of the data.
Stateless Web Services and Horizontal Scaling: Don’t store session data in a service’s memory or on its local disk. Keep state in a shared data store so that when you spin up multiple service instances, they all see the same information and behave identically. That way, you can add or remove instances on demand and scale out smoothly.
Load Balancing: Once the services are stateless, it’s easy to deploy them behind a load balancer to distribute the traffic evenly. This is also needed for high-availability purposes.
Asynchronous processing: Use message queues and non-blocking calls for long-running operations. APIs should not be tied up copying massive data sets from one place to another. Let background workers handle that heavy lifting.
Break the application into services: Use separate executables for distinct responsibilities so no single service tries to do everything. That way, you can quickly spot which component is lagging and scale just that one. But don’t overslice the system. Too many services can turn into an operational headache.
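
The rate-limiting tip in the list above can be sketched with a token bucket, one common algorithm for this job. The class and function names are hypothetical, the sketch is single-threaded and in-memory, and a real deployment behind a load balancer would keep the counters in a shared store so that every instance sees the same limits.

```python
import math
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at a steady rate."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock  # injectable so tests can control time
        self._tokens = float(capacity)
        self._updated = clock()

    def try_acquire(self) -> float:
        """Return 0.0 if the request is allowed, else the seconds to wait."""
        now = self._clock()
        elapsed = now - self._updated
        # Add the tokens earned since the last call, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_second)
        self._updated = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return 0.0
        return (1.0 - self._tokens) / self.refill_per_second


def handle_request(bucket: TokenBucket) -> tuple[int, dict[str, str]]:
    """Return (status, headers): 200 when allowed, 429 with Retry-After if not."""
    wait = bucket.try_acquire()
    if wait == 0.0:
        return 200, {}
    # Retry-After takes whole seconds, so round the wait up.
    return 429, {"Retry-After": str(math.ceil(wait))}
```

A typical setup keeps one bucket per client (for example, per API key), so that one noisy client cannot exhaust the allowance of the others.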

We cannot eliminate production issues and incidents entirely, but we can control how we adapt and respond to them. Tailor and document an incident response plan, and test it every quarter, so that less puzzle‑solving is needed when an incident occurs. A rehearsed plan gives the team a clear path to resolution and a process to follow. Here are some of the things to consider when building an incident response plan:

Clear roles and responsibilities: Explicit, well‑defined roles reduce the amount of guesswork and enable faster incident response and recovery.
Lessons learned: Hold review meetings after each incident and take follow-up measures; this reduces the chances of the same problem recurring.
Transparency and clear communication: Ensure that the factors that led to the incident and the measures being taken to prevent similar issues in the future are clearly communicated to the users.
