Jun 6, 2024

How SMBs and startups scale on DigitalOcean Kubernetes: Best Practices Part IV - Scalability

Multi Author Blogs All from The DigitalOcean Blog View How SMBs and startups scale on DigitalOcean Kubernetes: Best Practices Part IV - Scalability on digitalocean.com

Introduction

This article is part of a 6-part series on DigitalOcean Kubernetes best practices targeted at SMBs and startups.

In Part 1 of the series, we covered the challenges in adopting and scaling Kubernetes, and “Developer Productivity” best practices. We explored how Kubernetes can streamline development processes, increase efficiency, and enable faster application time-to-market.

In Part 2 of the series, we covered “observability” best practices. We discussed the importance of monitoring, logging, and tracing in a Kubernetes environment and how these practices contribute to maintaining a healthy and performant system.

In Part 3 of the series, we covered “reliability” best practices. We discussed right-sizing your nodes and pods, defining appropriate Quality of Service (QoS) for pods, utilizing probes for health monitoring, employing suitable deployment strategies, optimizing pod scheduling, enhancing upgrade resiliency, and leveraging tags in your container images.

In this current segment (Part 4), we focus on scalability. The meaning and scope of scalability can be different for a small and medium-sized business (SMB) compared to an enterprise. SMBs often have limited resources and smaller-scale deployments, which present unique challenges in ensuring the scalability of their Kubernetes clusters. We start with a broad overview of the challenges. Then, we provide a set of checklists and best practices that SMBs can follow to ensure scalability in their Kubernetes environments. Our focus is primarily on SMB-scale clusters, typically consisting of fewer than 500 nodes.

Scalability Challenges

“There became a point where we had revenue coming in and there was a fear that we couldn’t deliver at scale for a large event, with many concurrent users and more than a single stream per user. CTO.ai came in and saw what our current infrastructure was and highlighted the solutions such as DigitalOcean Kubernetes that could scale up and down automatically.” - Andrew Lombardi, Snipitz Chief Product Officer

Scalability challenges are common within Kubernetes and occur when the cluster fails to scale or scales too slowly to meet the demand. These issues can include:

To address these challenges, adhering to well-established practices in Kubernetes and cloud-native computing while considering your application’s specific requirements and characteristics is crucial.

Defining Kubernetes Scale

Scaling brings its own set of challenges and is highly dependent on the dimension of scale. Our definition of Kubernetes scale focuses on cluster sizes up to 500 nodes and running different types of applications at scale (e.g., data analytics, web scraping, metaverse, video streaming, etc.).

We can define scale at different layers as follows. The list below is not exhaustive but contains some key points to consider when considering scalability.

1. Cluster Scalability:

2. Application Scalability:

Scalability is not just about adding more resources but also about designing your applications and infrastructure to be scalable from the ground up. By asking the right questions and proactively addressing scalability concerns, startups can build Kubernetes environments that can more effectively handle the demands of their growing business. Regular testing, monitoring, and optimization are crucial to ensure a smooth scaling experience.

Scaling Best Practices

Scaling is a complex and multifaceted domain encompassing various system aspects, including API, network, DNS, load balancer, and application. In this section, we will focus on the top problem areas that users commonly encounter when scaling their applications on DigitalOcean Kubernetes. We assume you have already implemented observability measures and followed the appropriate checklist described in the reliability section. Additionally, this series will cover disaster recovery and security topics in future blog posts.

Checklist: Know your Default Account Limits

DigitalOcean accounts have default limits, with newer accounts having tighter restrictions than established ones. Kubernetes users should pay attention to the limits on Droplets, load balancers, firewalls, firewall rules, and Volumes.

Teams can visit their respective team page to review and request adjustments to Droplet limits. Limits on other resources are not publicly exposed but can be modified by contacting DigitalOcean support.

As Kubernetes customers enable autoscaling, they may need to adjust their account limits to accommodate their cluster’s expected maximum size. When considering account limits, keep the following points in mind:

Checklist: Use Horizontal Pod Autoscaler

DigitalOcean supports the automatic scaling of both pods and worker nodes through the Horizontal Pod Autoscaler (HPA) and Cluster Autoscaler (CA), respectively. Here’s how these components work together to optimize your deployment:

By leveraging HPA and CA, you can ensure that your Kubernetes cluster scales efficiently in response to application demands, optimizing both resource usage and cost.

Checklist: Scaling applications rapidly using a buffer node

When applications need to scale rapidly, the time it takes to create a new node (approximately 5 minutes) can be too long. To address this issue, a good practice is to run low-priority buffer pods that utilize the Cluster Autoscaler to proactively create one or more buffer nodes.

Here’s how it works:

By employing this strategy, your applications can scale instantly without waiting for new nodes to be provisioned. The buffer nodes act as a pre-provisioned resource pool, allowing for rapid scaling when demand increases. This approach helps maintain the responsiveness and availability of your applications during sudden spikes in traffic or workload. Note that using a buffer node will incur additional charges.

Checklist: Optimize Application Start-up Time

When rapidly scaling your cluster from a small number of nodes to a large number (e.g., from 5 to 50 nodes) to run batch jobs or handle increased workload, fetching many container images from the registry can become a bottleneck. Even though your cluster may scale quickly, the applications must fetch the required images and be ready to serve requests. This process can introduce latency and impact the overall start-up time of your applications.

To optimize the application start-up time and mitigate the impact of image fetching, consider the following strategies:

Checklist: DNS Scaling

To ensure efficient DNS resolution and handle increased traffic in your Kubernetes cluster, consider scaling DNS at three layers:

To optimize DNS performance in your cluster:

Checklist: Caching and Database Scaling

Databases work very well but also require due diligence when scaling. Help ensure your system remains efficient and responsive under heavy loads by implementing these key strategies:

Checklist: Network Load Testing

Network load testing is essential for validating the performance and scalability of your application’s network infrastructure under realistic and peak load conditions. This testing helps identify potential bottlenecks and performance constraints, providing insights in advance.

Checklist: Resilience to Kubernetes API Latency and Failure

Interacting with the Kubernetes control plane, particularly the kube-apiserver, is a common requirement for applications that dynamically manage resources within the cluster. Given the complexity and distributed nature of the control plane, these interactions can be susceptible to various failure modes, including network latency and system errors.

Implement strong communication practices:

Handle failure responses gracefully:

In conclusion, scaling your applications on DigitalOcean Kubernetes requires careful consideration of various factors, including account limits, autoscaling mechanisms, application start-up time, DNS scaling, caching and database optimization, network load testing, and resilience to Kubernetes API latency and failures. By following the checklists and best practices outlined in this guide, you can more effectively scale your applications, handle increased traffic, and ensure optimal performance and reliability in your Kubernetes cluster.

Next Steps

As we continue to explore the ISV journey of Kubernetes adoption, our ongoing blog series will delve deeper into the resilience, efficiency, and security of your deployments.

Each of these topics is crucial for navigating the complexities of Kubernetes and enhancing your infrastructure’s resilience, scalability, and security. Stay tuned for insights to help empower your Kubernetes journey.

Ready to embark on a transformative journey and get the most from Kubernetes on DigitalOcean? Sign up for DigitalOcean Kubernetes here.

Scroll to top