Back in the 'SSR: Scalability, Stability, and Reliability for Performance Testing
Dec 11, 2023 · 4 min read

Back in the ‘SSR [the PSSR, that is].

AI to the Rescue - Part 6

This blog post is the sixth in a long series. We recently introduced the concept of Continuous Comprehensive Testing (CCT), but we have yet to discuss in depth what it means. This series of blog posts will provide a deeper understanding of CCT.

It is time to talk about Scalability, Stability, and Reliability.

Some will say, “You already answered that question with the first test,” but that is not entirely true. (What first test? That famous one: 100 simultaneous users for single transactions and 100 concurrent users for the whole system.) While we are not fans of testing on a replica of the production environment, you still have to model what that production environment will look like and how the system will scale. In other words, performance testing should drive how the production environment should look, not vice versa.

To answer the other questions, we modeled our performance test environment to include all the typical production components, such as load balancers and other networking gear. We created clusters for our database servers, and so on. In other words, we mimicked the traditional trappings of a production environment.

One significant difference was our storage subsystem. Because we wanted to optimize for running tests as often as possible, we used a completely different storage system than what was in production, one that let us restore the entire database and file management systems in seconds.

One of the biggest challenges in software testing is always test data, and that is particularly true for performance testing. After creating our baselines, we wanted to avoid recreating them every time we ran another test. Most of the code we wrote was to help us create test data.
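To give a flavor of that kind of code, here is a minimal sketch that bulk-generates synthetic employee records with the Faker library; the library choice, record shape, and volume are illustrative assumptions rather than a description of our actual tooling.

```python
# Sketch: bulk test-data generation with the Faker library.
# The record shape and row count are illustrative assumptions.
import csv
from faker import Faker

fake = Faker()

def write_employees(path: str, count: int = 100_000) -> None:
    """Write synthetic employee records that can be bulk-loaded into a baseline."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["employee_id", "name", "email", "hire_date"])
        for i in range(count):
            writer.writerow([i, fake.name(), fake.email(), fake.date_this_decade()])
```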

Today, thanks to IaaS offerings from providers like AWS, Azure, and GCP, you can define all your infrastructure in code and turn the environment off when you do not need it. You can copy it if you need to. Infrastructure as code is a beautiful thing.
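As a small illustration of that on/off workflow (assuming AWS, the boto3 SDK, and a hypothetical "Environment: perf-test" tag on the instances):

```python
# Sketch: toggle a tagged performance-test environment on AWS with boto3.
# The "Environment: perf-test" tag is a hypothetical naming convention.
import boto3

ec2 = boto3.client("ec2")

def perf_instance_ids() -> list[str]:
    """Return the IDs of all instances tagged as part of the perf environment."""
    response = ec2.describe_instances(
        Filters=[{"Name": "tag:Environment", "Values": ["perf-test"]}]
    )
    return [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]

def stop_environment() -> None:
    """Stop every instance in the perf environment when it is not needed."""
    ids = perf_instance_ids()
    if ids:
        ec2.stop_instances(InstanceIds=ids)

def start_environment() -> None:
    """Bring the perf environment back before the next test run."""
    ids = perf_instance_ids()
    if ids:
        ec2.start_instances(InstanceIds=ids)
```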

Using this environment and the information we learned from our previous tests, we can create a test to see how the system handles scaling. For example, if our earlier results show the smallest environment cannot go beyond 75 users, we test with 75 concurrent users. We then double both the number of users and the resources (usually web servers and app servers) and check whether the behavior at 150 concurrent users matches expectations. We continue this process until we find the scaling bottleneck.
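A rough sketch of that doubling loop follows; run_load_test and scale_environment are hypothetical stand-ins for your load generator and provisioning scripts, and the thresholds are illustrative assumptions.

```python
# Sketch of the doubling loop described above. LoadTestResult, run_load_test,
# and scale_environment are hypothetical stand-ins for your load generator
# and your provisioning scripts.
from dataclasses import dataclass

RESPONSE_TIME_LIMIT_MS = 2000   # assumed acceptable response-time threshold
MAX_USERS = 10_000              # safety cap for the experiment

@dataclass
class LoadTestResult:
    p95_response_time_ms: float
    error_rate: float           # fraction of failed transactions

def run_load_test(concurrent_users: int) -> LoadTestResult:
    """Placeholder: drive the system with the given load and collect metrics."""
    raise NotImplementedError("wire this to your load generator")

def scale_environment(factor: int) -> None:
    """Placeholder: add web/app servers so capacity grows by `factor`."""
    raise NotImplementedError("wire this to your provisioning scripts")

def find_scaling_bottleneck(start_users: int = 75) -> int:
    """Double load and capacity together until behavior stops matching expectations."""
    users = start_users
    while users <= MAX_USERS:
        result = run_load_test(concurrent_users=users)
        if result.p95_response_time_ms > RESPONSE_TIME_LIMIT_MS or result.error_rate > 0:
            return users                # scaling broke down at this load level
        scale_environment(factor=2)     # double the web/app server footprint
        users *= 2                      # and double the load to match
    return MAX_USERS
```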

Let’s assume response time degrades significantly when you test at 1200 concurrent users. An analysis of the issue points to the database, since resource utilization remains moderate in the other system components. Thanks to the quick restore capabilities of our storage subsystem, we can run the test with 1150 concurrent users, then 1100, and eventually 1125. We confirm the issue occurs between 1100 and 1125 concurrent users. We now have a very narrow range that defines the scale of the environment.
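That narrowing step is essentially a binary search over the user count. Here is a sketch that reuses the hypothetical run_load_test hook and response-time limit from the previous example:

```python
def narrow_scaling_threshold(good: int, bad: int, precision: int = 25) -> tuple[int, int]:
    """Binary-search the user count between a passing and a failing load level.

    `good` is a load the system handled, `bad` a load where response time
    degraded. Returns the narrowed (good, bad) range.
    """
    while bad - good > precision:
        midpoint = (good + bad) // 2
        result = run_load_test(concurrent_users=midpoint)
        if result.p95_response_time_ms > RESPONSE_TIME_LIMIT_MS:
            bad = midpoint      # still too slow; the threshold is lower
        else:
            good = midpoint     # handled fine; the threshold is higher
    return good, bad

# e.g., narrow_scaling_threshold(good=1000, bad=1200) might converge on (1100, 1125)
```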

Because we had a lot of production data, we also knew that each concurrent user maps to roughly 10 active users for this application. In other words, the issue will likely appear in production somewhere between 11,000 and 11,250 active users.

I will not dive deeply into the analysis needed to address that scalability issue. Most new offerings today implement horizontal scaling in all their layers, but many products still hit the classic database bottleneck because of limited horizontal scalability at that layer. Many organizations simply scale their databases vertically when they hit this issue. That is a temporary solution, but it can work for a significant period.

While that test answers one dimension, we must see if any hidden issues create stability problems.

Stability: How long can the system provide acceptable response times with no errors, without intervention?

We have a good idea of the response time and scalability thresholds. The next step is to figure out whether there are any stability issues. Using the same environment we built for scalability, we can test for stability. To answer this question, we run a test that generates a lot of activity in the system without breaking the other thresholds.

For example, suppose the response time limit is 75 concurrent users per smallest footprint, and the scalability limit is 1100 concurrent users. In that case, we can run the test respecting those limits for a long time to see if something shows up. We called this test “24 hours” because that is how long it ran. In those 24 hours, we generated the equivalent of months of usage.
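To make the “months of usage” claim concrete, here is a back-of-the-envelope calculation; the throughput numbers are illustrative assumptions, not the original test parameters.

```python
# Back-of-the-envelope sketch: how 24 hours of sustained load can equal
# months of production usage. All numbers here are illustrative assumptions.
test_throughput_tps = 500      # transactions/second sustained during the test
prod_avg_throughput_tps = 5    # average production transactions/second
test_hours = 24

test_transactions = test_throughput_tps * 3600 * test_hours
prod_transactions_per_day = prod_avg_throughput_tps * 3600 * 24

equivalent_days = test_transactions / prod_transactions_per_day
print(f"{equivalent_days:.0f} days of production usage")  # ~100 days, a bit over three months
```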

With that much data, we can see whether the system has hidden issues, like memory leaks, that do not show up until you hit a certain number of transactions. When you run these endurance tests, you focus on measuring success or error rates, usually expressed as a percentage of transactions. Once again, we are trying to critique the product by breaking it.

Your stable system will eventually stop being stable: the point at which the transaction success rate drops below 100% gives you the stability threshold.
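One way to pull that threshold out of an endurance run is to scan the transaction log window by window. Here is a sketch, with a hypothetical log format and an assumed one-hour window:

```python
# Sketch: find the stability threshold from endurance-test transaction logs.
# `transactions` is a hypothetical list of (timestamp, succeeded) pairs
# exported by the load generator; the one-hour window size is an assumption.
from datetime import datetime, timedelta

def stability_threshold(transactions: list[tuple[datetime, bool]],
                        window: timedelta = timedelta(hours=1)) -> datetime | None:
    """Return the start of the first window whose success rate drops below 100%."""
    if not transactions:
        return None
    transactions = sorted(transactions, key=lambda t: t[0])
    window_start = transactions[0][0]
    bucket: list[bool] = []
    for timestamp, succeeded in transactions:
        if timestamp - window_start >= window:
            if bucket and sum(bucket) / len(bucket) < 1.0:
                return window_start        # first window with a failed transaction
            window_start, bucket = timestamp, []
        bucket.append(succeeded)
    if bucket and sum(bucket) / len(bucket) < 1.0:
        return window_start
    return None                            # no instability found in this run
```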

Reliability: How reliable is the system after months of use without intervention?

For the last question, we had a test called “48 hours.” It was similar to the “24-hour” stability test, but the objective was to go beyond the stability threshold and see if the system was still mostly functioning. In the “48 hours” test, we generated more than a year of traffic.

The objective was to find out how reliable the system is. Can the system recover if our success rate goes to 97% after 12 hours? Does it break further until it becomes unusable (success rate of 0%)? Does it keep degrading slowly (down to 90%, then 80%, etc.)?
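A simple way to summarize those outcomes is to classify the tail of the success-rate series; the cutoffs below are illustrative assumptions, not fixed rules.

```python
# Sketch: classify how the system behaves past its stability threshold.
# `hourly_success_rates` is a hypothetical series (0.0-1.0) sampled each hour
# of the 48-hour run; the category cutoffs are assumptions.
def classify_reliability(hourly_success_rates: list[float]) -> str:
    """Roughly categorize the tail behavior of a long reliability run."""
    if not hourly_success_rates:
        return "no data"
    final = hourly_success_rates[-1]
    if final >= 0.99:
        return "recovers"       # success rate returned to near 100%
    if final <= 0.10:
        return "collapses"      # system became effectively unusable
    return "degrades"           # still limping along below its threshold

# Example: classify_reliability([1.0, 0.97, 0.99, 1.0]) -> "recovers"
```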

All these tests guided our operations team and helped them plan for each product they had to support.

Our experience at Ultimate Software taught us to focus on providing business value. We did that by focusing on four key questions that define our PSSR framework. Those questions allowed us to give all stakeholders a comprehensive picture of the product's performance and its limitations.

In the final blog post, we will discuss how the lessons of many years of PSSR testing define what Testaify Performance for Web will look like.

Special note for those who enjoy our content: Please feel free to link to any of our blog posts if you want to refer to any in your materials.

About the Author

Testaify founder and COO Rafael E. Santos is a Stevie Award winner whose decades-long career includes strategic technology and product leadership roles. Rafael's goal for Testaify is to deliver Continuous Comprehensive Testing through Testaify's AI-first platform, which will change testing forever. Before Testaify, Rafael held executive positions at organizations like Ultimate Software and Trimble eBuilder.

Take the Next Step

Join the waitlist to be among the first to know when you can bring Testaify Functional for Web into your testing process. 

Are you interested in the other blogs in this series?