[PMLE-EXAMTOPIC] Comparison of testing patterns - Canary, A/B test
This post covers a specific exam topic for the Google Professional Machine Learning Engineer certificate.
Six Strategies for Application Deployment – The New Stack

- Recreate : Version A is terminated, then version B is rolled out.
- Ramped (= rolling update or incremental) : Version B is slowly rolled out, replacing version A.
- Blue/Green : Version B is released alongside version A, then the traffic is switched to version B.
- Canary : Version B is released to a subset of users, then proceeds to a full rollout.
- A/B testing : Version B is released to a subset of users under specific conditions ⇒ targeted users.
- Shadow : Version B receives real-world traffic alongside version A and doesn't impact the response.
Testing the new app version : Canary, A/B, Shadow
Google Kubernetes Engine (GKE)
- software deployment strategies : recreate, rolling update, and blue/green
- testing strategies : canary, shadow, and A/B
Canary test pattern
Partially roll out the new version of your application to a subset of users
→ Evaluate its performance against a baseline deployment.

Deploy a new version of your application alongside the production version. → Split and route a percentage of traffic from the production version to the canary version, and evaluate the canary's performance.
Recommended evaluation : Compare the canary against an equivalent baseline, not the live production environment.
Partial rollout can follow various partitioning strategies. (If the application has geographically distributed users, you can roll out the new version to a region or a specific location first.)
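The percentage split above can be sketched as a sticky, hash-based traffic split. This is a minimal illustration, not the API of any specific load balancer; `route_canary` and the 5% default are hypothetical names:

```python
import hashlib

def route_canary(user_id: str, canary_percent: int = 5) -> str:
    """Assign each user a stable bucket so the same user always sees the
    same version, while only ~canary_percent of users hit the canary."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

In practice the split happens at the load balancer or service mesh, and the partition key can be a region or location (per the partitioning strategies above) instead of a user ID; hashing keeps assignments deterministic across requests.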
A/B test
Best used to measure the effectiveness of functionality in an application
Release the new version of the application to a subset of users defined by specific conditions (e.g., location, browser version, or user agent)

→ Test a theory or hypothesis
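Unlike a canary's random percentage split, an A/B variant is chosen from request attributes. A minimal sketch, assuming a request dict with hypothetical `country` and `user_agent` fields (the targeting conditions are illustrative):

```python
def ab_variant(request: dict) -> str:
    """Return variant 'B' only for users matching the targeting
    conditions (here: Chrome users in Germany); all others stay on 'A'."""
    targeted = (request.get("country") == "DE"
                and "Chrome" in request.get("user_agent", ""))
    return "B" if targeted else "A"
```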
- Considerations
Complex setup
: A/B tests need a representative sample that can be used to provide evidence that one version is better than the other.
- Need to pre-calculate the sample size (e.g., by using an A/B test sample size calculator) → run the tests for a reasonable period to reach statistical significance of at least 95%.
Validity of results
: Several factors can skew the test results, including false positives, biased sampling, or external factors (e.g., seasonality or marketing promotions).
Observability
: When running multiple A/B tests on overlapping traffic, monitoring and troubleshooting can be difficult.
- e.g., If testing product page A versus product page B, or checkout page C versus checkout page D, distributed tracing becomes important to determine metrics such as the traffic split between versions.
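The sample-size pre-calculation mentioned above can be done with the standard two-proportion z-test formula. A minimal stdlib sketch; the function name and defaults are illustrative, not a specific calculator's API:

```python
import math
from statistics import NormalDist

def ab_sample_size(p1: float, p2: float,
                   alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per variant to detect a shift in conversion rate
    from p1 to p2 (two-sided test at 1-alpha significance, given power)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for 95% significance
    z_b = NormalDist().inv_cdf(power)          # e.g. 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)
```

With these defaults, detecting a lift from a 10% to a 12% conversion rate needs roughly 3,800 users per variant, which is why tests have to run for a reasonable period before calling a winner.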
Shadow test
Deploy and run a new version alongside the current version, but in such a way that the new version is hidden from the users

An incoming request is mirrored and replayed in a test environment. This process can happen either in real time or asynchronously, after a copy of the previously captured production traffic is replayed against the newly deployed service.
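The mirror-and-replay idea can be sketched as fire-and-forget duplication. A minimal illustration; `mirror_handler` and the callable services are hypothetical stand-ins for a real mirroring proxy:

```python
import threading

def mirror_handler(request, primary, shadow):
    """Return the live (primary) response to the user while replaying a
    copy of the request to the hidden shadow version in the background."""
    def replay():
        try:
            shadow(request)  # shadow output is only observed, never returned
        except Exception:
            pass             # shadow failures must never reach the user path
    threading.Thread(target=replay, daemon=True).start()
    return primary(request)  # the user only ever sees the primary response
```

Because the shadow call runs on its own thread and swallows its errors, the new version's latency and failures cannot impact the response, which is the defining property of the pattern.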
| Canary test | A/B test | Shadow test |
| --- | --- | --- |
|  | Targeted users |  |
| + Ability to test live production traffic. | + Several versions run in parallel. | + Zero production impact : does not alter the existing production environment or the user state. |
| + Fast rollback : redirect the user traffic to the older version of the application. | + Full control over the traffic distribution. | - Expensive, as it requires double the resources. |
| + Zero downtime. | - Complex setup : requires an intelligent load balancer. | - Requires a mocking service for certain cases. |
| - Slow rollout : each incremental release requires monitoring for a reasonable period and, as a result, might delay the overall release. Canary tests can often take several hours. | - Hard to troubleshoot errors for a given session; distributed tracing is mandatory. | - Cost and operational overhead : complex to set up. |
Summary
| Testing Pattern | Zero downtime | Real production traffic testing | Releasing to users based on conditions | Rollback duration | Impact on hardware and cloud costs |
| --- | --- | --- | --- | --- | --- |
| Recreate | ❌ | ❌ | ❌ | Fast but disruptive because of downtime | No extra setup required |
| Rolling update | ⭕ | ❌ | ❌ | Slow | Can require extra setup for surge upgrades |
| Blue/green | ⭕ | ❌ | ❌ | Instant | Need to maintain blue and green environments simultaneously |
| Canary | ⭕ | ⭕ | ❌ | Fast | No extra setup required |
| A/B | ⭕ | ⭕ | ⭕ | Fast | No extra setup required |
| Shadow | ⭕ | ⭕ | ❌ | Does not apply | Need to maintain parallel environments in order to capture and replay user requests |