Original blog

Good to learn.

Takeaways

Development Environments are very different from Production Environments

QUOTE

  • They are extremely stateful and interactive: Which means they cannot be moved from one node to another. The many gigabytes of source code, build caches, Docker container and test data are subject to a high change rate and costly to migrate. Unlike many production services, there’s a 1-to-1 interaction between the developer and their environment.
  • Developers are deeply invested in their source code and the changes they make: Developers don’t take kindly to losing any source code changes or to being blocked by any system. This makes development environments particularly intolerant to failure.
  • They have unpredictable resource usage patterns: Development Environments have particular and unpredictable resource usage patterns. They won’t need much CPU bandwidth most of the time, but will require several cores within a few 100ms. Anything slower than that manifests as unacceptable latency and unresponsiveness.
  • They require far-reaching permissions and capabilities: Unlike production workloads, development environments often need root access and the ability to download and install packages. What constitutes a security concern for production workloads, is expected behavior of development environments: getting root access, extended network capabilities and control over the system (e.g. mounting additional filesystems).

Indeed, nobody wants to be blocked or limited by some odd restriction or ACL during development. And setting up a development environment usually takes a lot of time, so developers get very angry if the environment suddenly breaks.

Wait, sometimes I think history just repeats itself. This is also similar to the difference between 'server' and 'PC', right? Sometimes it is odd to see people blindly follow a trend or pick a tool just because it is trendy. I guess Kubernetes is also a 'trend'.

QUOTE

When we started Gitpod, Kubernetes seemed like the ideal choice for our infrastructure. Its promise of scalability, container orchestration, and rich ecosystem aligned perfectly with our vision for cloud development environments

… However …

Kubernetes is built to run well controlled application workloads, not unruly development environments.

Managing Kubernetes at scale is complex. While managed services like GKE and EKS alleviate some pain points, they come with their own set of restrictions and limitations.

We found that many teams looking to operate a CDE underestimate the complexity of Kubernetes, which led to a significant support load for our previous self-managed Gitpod offering.

Fair enough.

Resource management

CPU

  • It is hard to predict how much CPU bandwidth will be needed.
    • We can’t predict when CPU bandwidth is needed; we can only see, after the fact, when it would have been needed (by observing nr_throttled in the cgroup’s cpu.stat; see the sketch after this list).
    • Even with a static CPU resource limit, challenges arise because many processes compete for CPU bandwidth inside one container (e.g. VS Code disconnects because the VS Code server is starved of CPU).
    • Dynamic resource allocation introduced with Kubernetes 1.26 means one no longer needs to deploy a DaemonSet and modify cgroups directly, possibly at the expense of the control loop speed and hence effectiveness.
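To make the nr_throttled point concrete, here is a minimal Go sketch (not Gitpod's code) that reads a cgroup v2 cpu.stat file and reports how often a workspace was throttled. The cgroup path is hypothetical; real paths depend on the kubelet's cgroup driver and the pod/container IDs.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

// readCPUStat parses a cgroup v2 cpu.stat file into key/value pairs.
// Typical keys: usage_usec, nr_periods, nr_throttled, throttled_usec.
func readCPUStat(path string) (map[string]uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	stats := make(map[string]uint64)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[fields[0]] = v
	}
	return stats, sc.Err()
}

func main() {
	// Hypothetical workspace cgroup path; adjust to the actual pod cgroup.
	const path = "/sys/fs/cgroup/kubepods.slice/kubepods-pod1234.slice/cpu.stat"

	stats, err := readCPUStat(path)
	if err != nil {
		log.Fatal(err)
	}
	if stats["nr_periods"] > 0 {
		ratio := float64(stats["nr_throttled"]) / float64(stats["nr_periods"])
		fmt.Printf("throttled in %.1f%% of periods (%d µs total)\n",
			100*ratio, stats["throttled_usec"])
	}
}
```

Note that this only tells you about throttling that already happened, which is exactly the point the article makes.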

Hmm, even CPU alone gave the Gitpod folks trouble.

Memory

In the cloud, RAM is one of the more expensive resources, hence the desire to overbook memory.

Yes, I can imagine. Lacking CPU just means things move slower, but lacking memory means things crash.

Also, these days, many architecture-level optimizations rely heavily on memory.

Until swap-space became available in Kubernetes 1.22, memory overbooking was near impossible to do, because reclaiming memory inevitably means killing processes. With the addition of swap space the need to overbook memory has somewhat gone away, since swap works well in practice for hosting development environments.

Hmm, so swap is now supported. Well, Google probably never had this need when running production workloads on Borg.
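As a rough way to see how far swap is actually being used per workspace, one could read the cgroup v2 memory counters. A minimal sketch, assuming cgroup v2 and a hypothetical workspace cgroup path:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

// readCounter reads a single-value cgroup v2 file such as
// memory.current or memory.swap.current (values are in bytes).
func readCounter(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	// Hypothetical workspace cgroup; the real path depends on the setup.
	cg := "/sys/fs/cgroup/kubepods.slice/kubepods-podabcd.slice"

	mem, err := readCounter(cg + "/memory.current")
	if err != nil {
		log.Fatal(err)
	}
	swap, err := readCounter(cg + "/memory.swap.current")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("resident: %d MiB, swapped out: %d MiB\n", mem>>20, swap>>20)
}
```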

Storage

(haha, you know Google (nearly) does not need ‘storage’ at all for Borg, right? I mean, nearly all ‘storage’ requirements are already solved by ‘Colossus’ and ‘Spanner’. Why would anyone spend much time on ‘storage’ in Borg, and hence in early Kubernetes versions?)

We experimented with various setups to find the right balance between speed and reliability, cost and performance.

SSD RAID 0: This offered high IOPS and bandwidth but tied the data to a specific node. The failure of any single disk would result in complete data loss. This is how gitpod.io operates today and we have not seen such a disk failure happen yet. A simpler version of this setup is to use a single SSD attached to the node. This approach provides lower IOPS and bandwidth, and still binds the data to individual nodes.

Holy… good to know! Today’s NVMe SSDs indeed fulfill the demands of development environments quite well.

Well, I guess that is the case. Development environments are not like production workloads; people probably won’t hit the disks that hard (unlike a genuinely busy production DB churning through transactions).

Block storage such as EBS volumes or Google persistent disks which are permanently attached to the nodes considerably broaden the different instances or availability zones that can be used. While still bound to a single node, and offering considerably lower throughput/bandwidth than local SSDs they are more widely available.

OK, now I understand why some people on the Mongo team refuse to migrate some replica sets to EBS. I was wrong to think of EBS as ‘so fast, basically as fast as local SSD’.

Persistent Volume Claims (PVCs) seem like the obvious choice when using Kubernetes. As abstraction over different storage implementations they offer a lot of flexibility, but also introduce new challenges:

  • Unpredictable attachment and detachment timing, leading to unpredictable workspace startup times. Combined with increased scheduling complexity they make implementing effective scheduling strategies harder.

This one is interesting. Yep, any new layer/wrapper can add new provision/cleanup steps, and can introduce quite a few surprises in places you did not expect.

  • Reliability issues leading to workspace failures, particularly during startup. This was especially noticeable on Google Cloud (in 2022) and rendered our attempts to use PVCs impractical.

OK… Google, specifically Google Cloud. I know…

  • Limited number of disks that could be attached to an instance, imposing additional constraints on the scheduler and number of workspaces per node.

Sigh… new layers/wrappers/abstractions come with new limitations. You either eat the consequences (fragmentation, etc.), or you just don’t use it.

  • AZ locality constraints which makes balancing workspaces across AZs even harder.

Yep, always another ‘item’ to take care of.

Yes… a self-contained ‘container’ is always better. It is already hard enough to handle the state of one entity correctly; the complexity of handling N entities in real life usually grows exponentially, and usually in unexpected ways.
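For readers who have not used PVCs, here is a minimal sketch of what the abstraction looks like from the workspace side: the claim only names a StorageClass and a size, while provisioning, attach/detach timing and placement are handled (and constrained) by the CSI driver behind it. The claim name, class name and size are made up, and the exact field types vary slightly across client library versions.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	class := "pd-ssd" // hypothetical StorageClass name

	// The workspace only declares what it needs; which disk gets provisioned,
	// where it can attach, and how long attach/detach takes is up to the
	// storage implementation behind the StorageClass.
	pvc := corev1.PersistentVolumeClaim{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "PersistentVolumeClaim"},
		ObjectMeta: metav1.ObjectMeta{Name: "workspace-abc123-content"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &class,
			// On older client libraries this field is of type ResourceRequirements.
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("30Gi"),
				},
			},
		},
	}

	out, _ := yaml.Marshal(pvc)
	fmt.Println(string(out))
}
```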

Backing up and restoring local disks proved to be an expensive operation…IO bandwidth on the node is shared across workspaces. We found that, unless we limited the IO bandwidth available to each workspace, other workspaces might starve for IO bandwidth and cease to function.

Yes… also, this shows up as ‘unexpected’ I/O delay from the user’s point of view.

We implemented cgroup-based IO limiter which imposed fixed IO bandwidth limits per environment to solve this problem.

Cool tool… someone out there will need it (anyone running I/O-heavy workloads on k8s).
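The blog does not show how their limiter works, but on cgroup v2 a fixed per-environment bandwidth cap boils down to writing one line to the io.max file. A minimal sketch of that idea (not Gitpod's implementation; cgroup path, device numbers and limits are made up):

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// setIOLimit writes a cgroup v2 io.max entry, capping read/write
// bandwidth (bytes per second) for one block device.
func setIOLimit(cgroupDir string, major, minor int, readBps, writeBps uint64) error {
	limit := fmt.Sprintf("%d:%d rbps=%d wbps=%d", major, minor, readBps, writeBps)
	return os.WriteFile(cgroupDir+"/io.max", []byte(limit), 0o644)
}

func main() {
	// Hypothetical workspace cgroup and block device (e.g. the node's NVMe SSD).
	cg := "/sys/fs/cgroup/kubepods.slice/kubepods-podabcd.slice"

	// 300 MiB/s read, 150 MiB/s write.
	if err := setIOLimit(cg, 259, 0, 300<<20, 150<<20); err != nil {
		log.Fatal(err)
	}
	fmt.Println("io.max updated")
}
```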

Startup time and Autoscaling

We initially thought that running multiple workspaces on one node would help with startup times due to shared caches. However, this didn’t pan out as expected. The reality is that Kubernetes imposes a lower bound for startup time because of all the content operations that need to happen, content needs to be moved into place, which takes time.

OK, for this one, the ‘autoscaler’ plugin introduced in June 2022 makes everything better. But why wasn’t the autoscaler plugin introduced earlier? Yeah… I guess because the big vendors do not really care that much? Or they already have in-house solutions?

Image pull optimization: a tale of many attempts

Yeah… I know, massive image pulls can be a pain.

They tried:

  • DaemonSet pre-pull: ineffective during scale-up operations. Also, the pre-pulls would compete for IO and CPU with starting workspaces. (Generally, still too slow.)
  • Layer reuse maximization: they built their own custom image builder (called dazzle)! However, layer reuse is very difficult to observe due to the high cardinality and the amount of indirection in the OCI manifest (see the sketch after this list). (Sigh, another example of ‘cool tech’ not delivering much benefit in real life.)
  • Stargazer and lazy pulling: I don’t know this one. (It seems to refer to stargz/eStargz-style lazy pulling, where layer content is fetched on demand instead of up front.)
  • Registry-facade + IPFS: worked well! They gave a KubeCon talk about this approach in 2022. But it is very complex.
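To make the "layer reuse is hard to observe" point concrete, one rough way to measure reuse is to count how many layer digests are shared across workspace image manifests. A minimal sketch that parses already-downloaded OCI image manifest JSON files (the file names are made up, and fetching manifests plus resolving manifest-list indirection is left out, which is exactly where the pain lives):

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// manifest captures only the "layers" array of an OCI image manifest.
type manifest struct {
	Layers []struct {
		Digest string `json:"digest"`
		Size   int64  `json:"size"`
	} `json:"layers"`
}

func main() {
	// Hypothetical: manifests already fetched to disk, one JSON file per image.
	files := []string{"ws-image-a.json", "ws-image-b.json", "ws-image-c.json"}

	seen := map[string]int{} // layer digest -> number of images containing it
	for _, f := range files {
		raw, err := os.ReadFile(f)
		if err != nil {
			log.Fatal(err)
		}
		var m manifest
		if err := json.Unmarshal(raw, &m); err != nil {
			log.Fatal(err)
		}
		for _, l := range m.Layers {
			seen[l.Digest]++
		}
	}

	shared := 0
	for _, n := range seen {
		if n > 1 {
			shared++
		}
	}
	fmt.Printf("%d distinct layers, %d reused by more than one image\n", len(seen), shared)
}
```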

There is no one-size-fits-all solution for image caching, but a set of trade-offs with respect to complexity, cost and restrictions imposed on users (images they can use). We have found that homogeneity of workspace images is the most straightforward way to optimize startup times.

Networking Complexity

by default the network of environments needs to be entirely isolated from one another, i.e. one environment cannot reach another. The same is true for the access of a user to the workspace. Network Policies go a long way in ensuring environments are properly disconnected from each other.

QUOTE

Initially we controlled the access to individual environment ports (such as the IDE, or services running in the workspace) using Kubernetes services, together with an ingress proxy that would forward traffic to the service, resolving it using DNS. This quickly became unreliable at scale because of the sheer number of services. Name resolution would fail, and if not careful (e.g. setting enableServiceLinks: false) one can bring entire workspaces down.

Hard… and then there is still the network bandwidth sharing issue.
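For context, the isolation mentioned above is typically expressed as a NetworkPolicy: select all workspace pods, deny ingress by default, and allow traffic only from the proxy. A minimal sketch using the Kubernetes Go types; the "component: workspace" and "component: ws-proxy" labels are made up:

```go
package main

import (
	"fmt"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Selects all workspace pods; the label is hypothetical.
	workspacePods := metav1.LabelSelector{MatchLabels: map[string]string{"component": "workspace"}}

	policy := networkingv1.NetworkPolicy{
		TypeMeta:   metav1.TypeMeta{APIVersion: "networking.k8s.io/v1", Kind: "NetworkPolicy"},
		ObjectMeta: metav1.ObjectMeta{Name: "workspace-isolation"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: workspacePods,
			// Declaring Ingress as a policy type means everything not
			// explicitly allowed below is denied.
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				From: []networkingv1.NetworkPolicyPeer{{
					// Only the proxy may reach workspaces; workspaces cannot
					// reach each other because they don't match this selector.
					PodSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"component": "ws-proxy"},
					},
				}},
			}},
		},
	}

	out, _ := yaml.Marshal(policy)
	fmt.Println(string(out))
}
```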

Security and isolation: balancing flexibility and protection

Root…can’t be naively given:

Giving users root access essentially provides them with root privileges on the node itself,

Kubernetes introduced support for user namespaces in version 1.25, we had already implemented our own solution starting with Kubernetes 1.22

  • Filesystem UID shift
  • Mounting a masked proc
  • FUSE support
  • Network capabilities
  • Enabling nested Docker
    • They registered a custom runc-facade which modifies the OCI runtime spec produced by Docker (a sketch of this pattern follows below).

I don’t understand all of them, but I know it is hard and complex. In the end, networking is one of the big pillars of Docker/containers. Cheating on that basically means you need to re-invent quite a few bits of the Docker/container OCI stack.

(container ~= cgroup + overlayfs + network namespace)
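The runc-facade bullet above hints at the general pattern: intercept the OCI runtime spec (config.json) just before the real runc sees it and rewrite it. A minimal sketch of that pattern using the runtime-spec Go types, with a made-up subset of tweaks (a user-namespace UID shift and some masked /proc paths); this is an illustration, not Gitpod's actual code:

```go
package main

import (
	"encoding/json"
	"log"
	"os"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// rewriteSpec loads an OCI runtime config.json, applies workspace-specific
// tweaks, and writes it back -- the interception point a runc wrapper
// ("runc-facade") would use before delegating to the real runc.
func rewriteSpec(path string) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	var spec specs.Spec
	if err := json.Unmarshal(raw, &spec); err != nil {
		return err
	}
	if spec.Linux == nil {
		spec.Linux = &specs.Linux{}
	}

	// User namespace: map container UID 0 to an unprivileged host UID range,
	// so "root" inside the workspace is not root on the node.
	spec.Linux.Namespaces = append(spec.Linux.Namespaces,
		specs.LinuxNamespace{Type: specs.UserNamespace})
	spec.Linux.UIDMappings = []specs.LinuxIDMapping{{ContainerID: 0, HostID: 100000, Size: 65536}}
	spec.Linux.GIDMappings = []specs.LinuxIDMapping{{ContainerID: 0, HostID: 100000, Size: 65536}}

	// Mask parts of /proc that should not leak host details into the workspace.
	spec.Linux.MaskedPaths = append(spec.Linux.MaskedPaths, "/proc/kcore", "/proc/keys")

	out, err := json.Marshal(spec)
	if err != nil {
		return err
	}
	return os.WriteFile(path, out, 0o644)
}

func main() {
	if err := rewriteSpec("config.json"); err != nil {
		log.Fatal(err)
	}
}
```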

It is hard due to:

  • Performance impact (e.g. fuse-overlayfs has a noticeable performance impact)
  • Compatibility
  • Complexity: no longer a simple containerized environment…
  • Keeping up with Kubernetes: (holy) you need to keep up with k8s releases while maintaining backward compatibility… I don’t even want to think about it.

The micro-VM experiment

(lol, the exciting bits)

QUOTE

The promise of micro-VMs

Micro-VMs offered several enticing benefits that aligned well with our goals for cloud development environments:

  • Enhanced resource isolation: uVMs promised better resource isolation compared to containers, albeit at the expense of overbooking capabilities. With uVMs, we would no longer have to contend with shared kernel resources, potentially leading to more predictable performance for each development environment.
  • Memory snapshots and fast resume: One of the most exciting features, particularly with Firecracker using userfaultfd, was the support for memory snapshots. This technology promised near-instant full machine resume, including running processes. For developers, this could mean significantly faster environment startup times and the ability to pick up exactly where they left off.
  • Improved security boundaries: uVMs offered the potential to serve as a robust security boundary, potentially eliminating the need for the complex user namespace mechanisms we had implemented in our Kubernetes setup. This could provide full compatibility with a wider range of workloads, including nested containerization (running Docker or even Kubernetes within the development environment).

QUOTE

Challenges with micro-VMs

  • Overhead
  • Image conversion: Converting OCI (Open Container Initiative) images into uVM-consumable filesystems required custom solutions.
  • Technology-specific limitations:
    • Firecracker: no GPU support, no virtiofs support (as of mid-2023)
    • Cloud Hypervisor: slower snapshot and restore processes due to the lack of userfaultfd support, negating one of the key advantages we hoped to gain from uVMs.
  • Data movement challenges: Moving data around became even more challenging with uVMs, as we now had to contend with large memory snapshots.
  • Storage considerations: new possibilities from attaching EBS volumes to micro-VMs:
    • Persistent storage: keeping workspace content on attached volumes reduced the need to pull data from S3 repeatedly, improving start times.
    • Performance considerations: While sharing high-throughput volumes among workspaces showed promise for improving I/O performance, it also raised concerns about implementing effective quotas, managing latency, and ensuring scalability.

Lessons from the uVM experiment

  • They liked the full workspace backup and the runtime-state suspend/resume it provided.
  • For the first time, they started to consider moving away from k8s.
  • They identified ‘storage’ as the crucial element for providing all three: reliable startup performance, reliable workspaces, and optimal utilization.

Final

QUOTE

for system workloads like development environments Kubernetes presents immense challenges in both security and operational overhead. Micro-VMs and clear resource budgets help, but make cost a more dominating factor.

They now run on a new architecture called Gitpod Flex.

QUOTE

In Gitpod Flex we carried over the foundational aspects of Kubernetes such as the liberal application of control theory and the declarative APIs whilst simplifying the architecture and improving the security foundation.

Yes, the ‘declarative’ API part is always good.

This new architecture allows us to integrate devcontainer seamlessly. We also unlocked the ability to run development environments on your desktop. Now that we’re no longer carrying the heavy weight of the Kubernetes platform, Gitpod Flex can be deployed self-hosted in less than three minutes and in any number of regions, giving more fine-grained control on compliance and added flexibility when modeling organizational boundaries and domains.

It gives back the flexibility originally offered by containers!

(Kubernetes just makes moving things around much harder in many cases.)