At Verkada, we continuously work on improving our developer experience so that our team can easily build reliable and scalable products. As a part of our dedication to this, we have integrated improvements to our tools and deployment processes. We are proud to contribute improvements back to the open-source community so that we can all benefit from them. Today, we will share the recent improvements we have made to ArgoCD.
What is ArgoCD?
ArgoCD is a declarative, GitOps-based continuous delivery (CD) tool that automates the deployment of objects to Kubernetes from predefined configurations. Besides Verkada, it’s used by 372 other organizations and counting, including names like IBM, Splunk, and Ticketmaster. In Verkada’s setup, the predefined configurations are stored in a separate repository and updated during deployment. Objects are organized into applications, each associated with a specific use case or microservice. Each ArgoCD application retrieves its configurations from designated paths within the repository.
ArgoCD monitors for configuration changes and flags the application as “out of sync” if discrepancies are detected. If you enable automated sync, ArgoCD will apply the changes to the Kubernetes cluster. If you don't, developers can manually sync or leave it as it is.
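The decision described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ArgoCD's actual implementation (which is written in Go and compares full Kubernetes manifests); the names `SyncDecision`, `changed_resources`, and `reconcile` are hypothetical, and manifests are simplified to plain dictionaries.

```python
from dataclasses import dataclass


@dataclass
class SyncDecision:
    status: str  # "Synced" or "OutOfSync"
    action: str  # "none", "apply", or "wait_for_manual_sync"


def changed_resources(desired: dict, live: dict) -> list:
    """Return the names of resources whose desired and live state differ."""
    names = set(desired) | set(live)
    return sorted(n for n in names if desired.get(n) != live.get(n))


def reconcile(desired: dict, live: dict, automated_sync: bool) -> SyncDecision:
    """Flag the application and decide what happens next (simplified)."""
    if not changed_resources(desired, live):
        return SyncDecision("Synced", "none")
    # With automated sync the change is applied; otherwise the application
    # stays flagged until a developer syncs it manually (or leaves it as is).
    action = "apply" if automated_sync else "wait_for_manual_sync"
    return SyncDecision("OutOfSync", action)


# Hypothetical example: Git asks for image v2, the cluster still runs v1.
desired = {"deployment/api": {"image": "api:v2"}}
live = {"deployment/api": {"image": "api:v1"}}
decision = reconcile(desired, live, automated_sync=False)
# decision.status == "OutOfSync", decision.action == "wait_for_manual_sync"
```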
Additionally, Argo Rollouts, a companion project in the Argo ecosystem, supports gradual rollouts of changes with canaries and rollbacks, providing a safety net against deploying a faulty change.
Below is a diagram showing the high-level flow (based on this article)
The improvements we introduced to ArgoCD
For this work, Verkada ran a project we called Argonauts, focused on enhancing the developer experience by introducing several improvements to ArgoCD, listed roughly in the order they were made:
Tuning System Parameters: we initially focused on tuning high-level system parameters to reduce load and queuing. The main actions implemented were increasing the number of parallel processors, implementing webhooks to improve efficiency, and extending the reconciliation timeout.
Logs and Metrics Optimization: we improved logging by persisting all logs, from debug level through errors, instead of sampling them, because a random subset of logs isn't effective for debugging the slowness of a specific application. We formatted ArgoCD logs as JSON so that Datadog could parse them natively, which proved more reliable than custom Grok parsers that missed some fields. We contributed to the open-source community by adding more logs with performance data and by ensuring the application field is included in more logs for easier filtering.
Sync Operation Optimization: we identified and fixed a race condition by ensuring that application sync operations are queued after the refresh completes rather than running in parallel with it. This prevents the sync from operating on stale data. We contributed these improvements back to the open-source community, further enhancing the developer experience. Here’s one of the shared pull requests.
Application Tree Construction: we discovered that constructing an application tree during reconciliation was extremely slow, consuming ~95% of the time for large applications. By analyzing the code, we identified an inefficient tree traversal process with time complexity O(tree_size * namespace_resource_count). We optimized this to O(namespace_resource_count) by pre-constructing a graph of parent-child namespace resources. These improvements reduced resource tree construction time to under 1 second and allowed for faster processing of Kubernetes watch events, cutting the processing time from as much as 30-60 minutes to less than 1 minute on a medium-sized cluster. We’ve contributed pull requests here and here to the open-source community.
Streamlining Application Management: we also updated our application management process by implementing the "App of Apps" pattern to manage application manifests more effectively.
Network Performance Issue: a regression caused the longest-running processor time to increase to 3-5 minutes due to inefficiencies in computing the diff between live and target states. By enabling extra logging with --gloglevel 6, we traced the issue to a slower-than-expected validation webhook for Deployment, which was hit especially hard during server-side diff dry runs and slowed down applications with many deployments. This insight guided us in optimizing the webhook.
The impact of the ArgoCD improvements
ArgoCD and deployment systems are crucial for developers to roll out features and fixes in a predictable, reliable, and fast manner.
Verkada is excited to take part in improving continuous delivery within the open-source community, and we will continue to look for ways to share more improvements in areas such as logs with performance data, tree construction, and sync operation optimization. These open-source contributions will be available to everyone in ArgoCD v2.13, published on the project's GitHub page and on other sites like quay.io.
Looking forward
ArgoCD has allowed us to significantly improve the developer experience by letting developers roll out and roll back their changes quickly and reliably. Thanks to these changes, we can continue delivering a great customer experience, ensuring our users have a reliable system that consistently meets their needs.