Category: Infrastructure

  • Creating a Centralized Developer Hub at CarGurus

    When CarGurus began decomposing our monolith into microservices, we recognized the need for clear ownership boundaries, so we created an internal tool, Showroom. What we didn’t predict at the time was that Showroom would become more than just a source of truth for ownership and an internal developer portal – it is now a centralized developer hub that enables us to work more efficiently in many different ways.

    Ownership Catalog

    The original plan for Showroom was to build a service and job catalog with clear ownership for each entity. At the time, not much was available on the market that met our requirements, so we decided to build in-house. The catalog itself was easy to build, but the information quickly became stale: team names changed and ownership moved around over time. We needed to explore ways to enforce accuracy in this system, and we determined there were two good ways to achieve this:

    1. Enforce registration at creation time
    2. Sync the data on a recurring basis

    For services, we were operating out of a monorepo at the time, which made it difficult to sync automatically without writing a custom code parser. Instead, we introduced a pipeline governance process called RoadTests, which enforced that any new service added to the repository was also registered in Showroom. This proved to be very successful and, from that point forward, all services were in our catalog. For jobs, we went with the second approach of syncing on a recurring basis. The system we already used to store and run our jobs exposed APIs, which made it easy to sync nightly to pick up changes and discover new jobs. Later on, we also introduced an internal library catalog and went with a combination of the two approaches: registration at creation time plus a recurring sync.
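    As a rough illustration of the recurring-sync approach, the sketch below pulls the job list from a scheduler’s API and upserts each entry into the catalog. The endpoints, payload shape, and names here are hypothetical, not the actual Showroom or job-system APIs.

    // Hypothetical nightly job sync; endpoint paths and field names are
    // illustrative only, not the real Showroom or scheduler APIs.
    interface JobRecord {
      name: string;
      owningTeam?: string;
    }

    async function syncJobs(schedulerUrl: string, showroomUrl: string): Promise<void> {
      // Fetch the current set of jobs from the system that stores and runs them.
      const response = await fetch(`${schedulerUrl}/api/jobs`);
      const jobs: JobRecord[] = await response.json();

      for (const job of jobs) {
        // Upsert each job into the catalog; new jobs show up for a team to claim,
        // existing entries simply get refreshed.
        await fetch(`${showroomUrl}/api/catalog/jobs/${encodeURIComponent(job.name)}`, {
          method: "PUT",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify(job),
        });
      }
    }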

    Showroom Service Catalog

    Data Collection

    As the product evolved, we found that developers at CarGurus were looking for the same information spread across numerous systems. This led us to the idea of creating a data collection feature which automatically aggregated critical information about services that people were commonly looking for. This included data such as:

    1. A link to the codebase
    2. Log links for the various environments with predefined queries
    3. Any associated jobs with a service
    4. The build and deploy pipelines for a service
    5. Links to the service’s various hosts
    6. Links to the service’s API documentation

    This gave developers a single spot for the critical information about their services, serving as a common bookmark aggregator. It also allowed developers to manually add links that they found useful for others, such as documentation, FAQs, critical dashboards, or user guides.

    Showroom Data Collection

    We then leveraged the same integration framework to gather statistics about services for more real-time data. These were simple data points, available in other places, but by centralizing them, developers could answer these commonly asked questions at a glance:

    1. How many instances are running?
    2. How much memory is my service using?
    3. How much CPU is my service currently using?
    4. When was my service last deployed?
    5. What was the last commit to my service?
    6. What does my service’s Snyk analysis show?
    7. How well is my service doing against DORA metrics?

    Both of these data collection features can be expanded with the critical information our developers need as they work day to day on the various artifacts within the catalog.

    Showroom Statistics

    Platform Integration Transparency

    In order to efficiently bootstrap our development, all of our services integrate with our platform offerings in some way to ensure they run smoothly in production; however, not all teams knew whether they were taking advantage of all the features we had to offer. Many teams leveraged the basic platform integrations, but several features had low adoption despite various communication avenues. As a result, the team created a framework called compliance checks, which enables teams to quickly see a score for how well integrated they are with our platform. Each compliance check analyzes the live service to determine which integrations are enabled and functional. This gives developers the transparency to see whether there is more they could take advantage of, further improving the reliability of their services.
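    A compliance check can be as simple as probing the running service and reporting pass or fail, with the score being the fraction of checks that pass. The sketch below is a minimal illustration of that idea; the interface, the endpoint probed, and the scoring are assumptions, not the actual framework.

    // Hypothetical compliance-check shape; names and the probed endpoint are
    // illustrative only.
    interface ComplianceCheck {
      name: string;
      // Probes the live service and reports whether the integration is
      // enabled and functional.
      run(serviceBaseUrl: string): Promise<boolean>;
    }

    const metricsEndpointCheck: ComplianceCheck = {
      name: "metrics-endpoint",
      run: async (serviceBaseUrl) => {
        try {
          const res = await fetch(`${serviceBaseUrl}/metrics`);
          return res.ok;
        } catch {
          return false;
        }
      },
    };

    // The score is the fraction of checks that pass for a given service.
    async function complianceScore(checks: ComplianceCheck[], serviceBaseUrl: string): Promise<number> {
      const results = await Promise.all(checks.map((check) => check.run(serviceBaseUrl)));
      return results.filter(Boolean).length / checks.length;
    }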

    Showroom Compliance Checks

    Self-service operations

    One of the critical responsibilities of any platform team is creating a self-service experience that enables developer productivity. At CarGurus, one of the levers we introduced was a workflows framework within Showroom that enabled us to automate and streamline the developer experience when performing various platform operations. Looking to introduce a new service, application, or library at the company? Go through Showroom. Want to transfer ownership of a service without breaking integrations and tagging across our stack? Go through Showroom. As we mentioned earlier, ownership changes naturally happen over time; leveraging this framework, with Showroom as the source of truth for ownership, helped us manage those changes as well. With this framework, we are able to keep a single pane of glass for all the platform operations you’d need to perform against the artifacts stored in our catalog. Having this in a single spot means developers look there first before resorting to manual tickets and cross-team dependencies. These self-service operations allow developers to complete in days, or even minutes, actions that previously took months end-to-end.

    Showroom Workflows

    Streamlined deployments

    Lastly, one of the key features of the product is visualizing and interacting with our deployments to production. We leverage traditional CI/CD tooling to do a lot of the heavy lifting, but we found that even with this tooling, the developer experience allowed far too many human errors, which significantly impacted our lead time for changes. Showroom already had many of the integrations with our platform, so it was able to provide an easy button for our developers, with a customized experience acting as a visual facade for our deployments at the company. Additionally, as we continued down the journey of microservices, it became critical to have a single spot to view all of the deployments: we were actively decomposing, but things weren’t fully decoupled just yet. This gave us a better pulse on any potential issues caused by cross-service deployments. There were previous attempts to modify the off-the-shelf tooling to eliminate human errors; however, it was this developer hub that allowed us to prevent them, especially during emergency situations. This feature was met with incredible excitement by our developers as it streamlined their experience. We were deploying 16+ times per day across all of our services, so needless to say, it was very actively used.

    Showroom Deployments

    What’s next?

    There is no shortage of ideas internally on how to continue enhancing Showroom to live up to its vision of being a full-fledged centralized developer hub and further improve our efficiency. Many of those ideas continue to be discussed within the team and are being evaluated to determine which ones add the most value. One thing we do know, though, is that the product we have built has already proven to be a great success in helping developers at CarGurus. We are excited to get it into more developers’ hands at the company. Maybe one day we’ll even share it with those outside the company as well.

  • Better Living Through Asset Isolation

    From its humble beginnings as a blog for the automotive community, CarGurus has grown over the last 15 years to offer almost two dozen independently deployable web applications, supported by roughly 180 backend services and systems. As our product offerings have expanded, our tech stack and system architecture have had to evolve to meet the needs of our engineering community. The cornerstone of this process has been the decomposition of our monorepo into a more tightly focused multi-repo environment. This process presented a significant front-end decomposition challenge: the efficient management, hosting, and delivery of static image assets.

    While we have been wildly successful at decomposing our back-end ecosystem (a fascinating topic that will be covered in a future post), our front-end application code, static assets, and build processes are still tightly coupled in many ways. Historically, devs from different application teams would store their image assets in a common directory in the CarGurus monorepo. When a new version of an application was deployed to production, the image directory was copied in its entirety to our production server. This was a reasonable approach when CarGurus was little more than a search engine for auto sales, but as our application ecosystem grew to include car listings, financing pre-qualification, sales, and other tools to assist buyers and sellers, the image directory unsurprisingly ballooned in size. Beyond steadily increasing build and deployment times, we were aware of the potential dangers inherent in allowing independent applications to share image assets, excepting brand-related images such as logos.

    The world that was behaviour diagram

    Isolated Asset Storage and Management

    The guiding principle behind our new static asset management strategy was the belief that static assets should be isolated by independently deployable application. The practice of adding assets to the common image directory in our monorepo has been replaced by storing assets in application specific buckets hosted on our cloud services provider. Buckets are accessed via a new internal dashboard, the Static Asset Manager (SAM), which provides developers and designers an easy drag-and-drop interface for uploading and managing an application’s assets. The bloating of the common images directory has been brought to a halt, and the migration of existing image assets from our repo to the cloud has resulted in a not insignificant decrease in the size of our monorepo.

    The world we want behaviour diagram

    The SAM interfaces with our cloud storage provider via a RESTful API. To streamline development, we leveraged the cloud provider’s serverless offerings for easy asset upload, modification, and deletion. The SAM was architected so that the more complex details of interacting with the cloud service’s REST API are hidden from the front-facing interface behind a set of helper classes (software design pattern aficionados will recognize this approach as the Gang of Four Facade pattern). The SAM does not need to know or care which cloud service it is interacting with, and additional cloud providers or services can be integrated easily into the tool by writing alternate helper class implementations of the same interface.
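    A minimal sketch of that facade idea, in TypeScript, might look like the following. The interface, class names, and the use of the AWS SDK here are illustrative assumptions, not the SAM’s actual implementation.

    import { S3Client, PutObjectCommand, DeleteObjectCommand } from "@aws-sdk/client-s3";

    // The front-facing dashboard depends only on this interface (hypothetical name).
    interface AssetStore {
      upload(bucket: string, key: string, body: Buffer): Promise<void>;
      remove(bucket: string, key: string): Promise<void>;
    }

    // One helper class per cloud service hides that provider's API details.
    class S3AssetStore implements AssetStore {
      private client = new S3Client({});

      async upload(bucket: string, key: string, body: Buffer): Promise<void> {
        await this.client.send(new PutObjectCommand({ Bucket: bucket, Key: key, Body: body }));
      }

      async remove(bucket: string, key: string): Promise<void> {
        await this.client.send(new DeleteObjectCommand({ Bucket: bucket, Key: key }));
      }
    }

    // Supporting another provider only requires another AssetStore implementation;
    // the dashboard code itself does not change.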

    Efficient Caching and Serving of Image Assets

    In addition to improving our build and deployment times, foremost in our mind was the need to efficiently version, cache, and serve our image assets. To achieve this, we applied a versioning strategy based on image fingerprinting. Image fingerprinting involves computing the checksum of an asset’s contents and appending that hash value to the asset filename. The hash is derived from the asset’s data, so it is consistent for identical contents and effectively unique across different contents. Updated versions of an asset can be uploaded via the SAM, and as long as the contents of the file have changed, a new checksum will be calculated. In addition to ensuring that an asset cannot be accidentally overwritten by another asset with the same name, the concatenated filename-hash provides implicit asset versioning. As such, there is no need to bust the cache when uploading an updated asset. The updated image may have the same base name, but thanks to the appended hash value, the CDN will treat it as a new resource and serve it appropriately. The previous version of the asset will still be cached, however, making it immediately available if an application ever needs to be rolled back.
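    The fingerprinting step itself is small. The sketch below shows one way to do it; the hash algorithm, hash length, and filename format are assumptions for illustration, not necessarily what the SAM uses.

    import { createHash } from "node:crypto";
    import { readFileSync } from "node:fs";
    import { basename, extname } from "node:path";

    // Compute a content checksum and splice it into the filename, e.g.
    // "hero-banner.png" -> "hero-banner.3f9c2a7b1d.png".
    function fingerprintedName(filePath: string): string {
      const contents = readFileSync(filePath);
      // Identical contents always produce the same hash; any change produces a new one.
      const hash = createHash("md5").update(contents).digest("hex").slice(0, 10);
      const ext = extname(filePath);
      const base = basename(filePath, ext);
      return `${base}.${hash}${ext}`;
    }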

    Results

    The new paradigm of image hosting and management was well received by the application dev teams. In the spirit of allowing each team to manage their assets as they see fit, we did not mandate how a given team handled moving their assets to the new service. The popular approach seems to be to host new assets in the SAM, while incrementally moving older assets over as time allows. Build and deployment times have stabilized, and even decreased for some applications.

    The updated caching strategy was evaluated by analyzing the quality and performance of one of our applications using Chrome’s Lighthouse tool. Prior to the fingerprinting and isolation of the application’s image assets, Lighthouse calculated a performance score of 80. This was not great. As the following image shows, our Time to Interactive was greater than 2 seconds, with the Largest Contentful Paint requiring a snail-like 3 seconds to complete. Given that it takes seconds for a webapp user to form an opinion about a site, these values were not acceptable. Lighthouse informed us in big red letters that we needed to “serve our static assets with an efficient caching policy”. Enter the SAM.

    pre-SAM Lighthouse score

    After fingerprinting and re-hosting the static image assets in their application’s new cloud storage bucket, we ran the Lighthouse analysis again. The result was a performance score of 99. The Largest Contentful Paint required only 0.8 seconds, while the Time to Interactive had been reduced to a blistering 0.5 seconds. The results speak for themselves.

    post-SAM Lighthouse score

    While our front-end development and asset hosting strategies continue to evolve, there is no question that this new approach to static asset management has been an overwhelming success. To quote one of the front-end devs from the product team that owned this application, “everyone in the company should be using this tool.” With adoption increasing daily, we look forward to seeing the increasing performance benefits to our users.

  • Prometheus and its Federation through Thanos at CarGurus – Part 1

    Introduction

    Prometheus is a monitoring and alerting toolkit. It is a standalone open-source project governed by the CNCF (Cloud Native Computing Foundation). Prometheus collects and stores metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. The metrics collected can be anything that the cluster or an application exposes. They can be application-level metrics such as HTTP status codes, numbers of HTTP requests, API requests, types of API requests, or caching metrics, or infrastructure-level metrics such as memory utilization, CPU utilization, CPU throttling, or disk utilization.

    Architecture at CarGurus

    For a long time, Prometheus was deployed on bare-metal on-prem servers, but once the push to move to AWS started, we moved our Prometheus servers to Kubernetes clusters in AWS.

    At CarGurus, our infrastructure is primarily divided between two regions: NA (North America) and EU (Europe). All the internal applications, like Vault, OpenTelemetry, Kyverno, cert-manager, ArgoCD, Prometheus, Thanos, etc., and a good chunk of production services are currently deployed in Kubernetes clusters. Every cluster has its own set of identical Prometheus pods that scrape the deployments in that cluster. These identical Prometheus pods scrape the same targets and endpoints, and their data is deduplicated at query time. Prometheus, Grafana, and Alertmanager are automatically deployed using the Prometheus Operator (kube-prometheus-stack), which is also responsible for updating the configuration of the three systems. As with all things in Helm, these charts allow us to easily configure the systems via simple changes to the YAML files in the chart. Happy Helming!!

    (P.S. Helm is a Kubernetes deployment tool for automating the creation, packaging, configuration, and deployment of applications and services to Kubernetes clusters.)
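    For example, the replicated Prometheus pods and the label used to deduplicate them can be expressed in the chart’s values file. The excerpt below is illustrative; the replica count and label name are examples, not our exact configuration.

    # Illustrative values.yaml excerpt for the kube-prometheus-stack chart.
    prometheus:
      prometheusSpec:
        replicas: 2                                   # identical pods scraping the same targets
        replicaExternalLabelName: prometheus_replica  # label used to deduplicate at query time
    grafana:
      enabled: true
    alertmanager:
      enabled: true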

    Service Discovery

    Service discovery is a mechanism by which services discover each other dynamically without the need to hard-code IP addresses or endpoint configuration. In the past, we performed service discovery by manually editing the Prometheus configuration to add scrape endpoints. This meant that developers would need to file a ticket and someone from the Engineering Platform team would modify a source-controlled configuration file by hand to add a job. We utilized Consul for service discovery, meaning that in order to be scraped, an application also had to register itself with Consul. Luckily we do not need to do this malarkey anymore with Kubernetes.

    In Kubernetes, service discovery can be accomplished by using a ServiceMonitor CRD (Custom Resource Definition) which is a part of the kube-prometheus-stack. The ServiceMonitor CRD is a Kubernetes object that specifies the endpoint and port that Prometheus should scrape. Any service deployed in Kubernetes would need a corresponding ServiceMonitor object that defines the metrics endpoint and a port. An example of a ServiceMonitor object for service-x in namespace-x could be as follows:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      annotations:
        meta.helm.sh/release-name: service-x
        meta.helm.sh/release-namespace: namespace-x
      labels:
        app: service-x
        app.kubernetes.io/managed-by: Helm
        chart: service-x-1.0.0
      name: service-x
      namespace: namespace-x
    spec:
      endpoints:
        - interval: 15s
          path: /metrics
          port: express
      selector:
        matchLabels:
          app: service-x
          chart: service-x-1.0.0

    The port specified under endpoints can be a number or a named port defined in the Service object that fronts the application or deployment. For the ServiceMonitor to work, a corresponding Service object also has to be created.
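    For completeness, a minimal Service that the ServiceMonitor above could select might look like the following; the port numbers here are placeholders.

    apiVersion: v1
    kind: Service
    metadata:
      name: service-x
      namespace: namespace-x
      labels:
        app: service-x
        chart: service-x-1.0.0
    spec:
      selector:
        app: service-x
      ports:
        - name: express      # the named port referenced by the ServiceMonitor
          port: 80
          targetPort: 3000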

    Prometheus is configured to look for ServiceMonitor objects in any namespace of the given cluster, so if one is present and configured correctly, the application will automatically be scraped when it is deployed to the cluster. Some Helm charts will automatically create the ServiceMonitor object for you; check a chart’s configuration before using it as an upstream chart.

    It’s worth noting that other service discovery mechanisms can be used as well, such as EC2 service discovery, which retrieves scrape targets from AWS EC2 instances. Prometheus has a built-in EC2 service discovery config that can be added as part of its spec to pick up services deployed on stand-alone EC2 instances.
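    As an illustration, a scrape config using Prometheus’s built-in EC2 service discovery (ec2_sd_configs) could look like the following; the job name, region, port, and tag are examples only.

    scrape_configs:
      - job_name: ec2-instances
        ec2_sd_configs:
          - region: us-east-1
            port: 9100
        relabel_configs:
          - source_labels: [__meta_ec2_tag_Name]
            target_label: instance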

    Rules

    Prometheus rules, also referred to as alert definitions, are YAML files that contain Prometheus query expressions that determine whether an alert should fire. Whenever an alert fires, it goes through Alertmanager, which, depending on the severity of the alert, can trigger an integrated incident response tool like OpsGenie or PagerDuty.

    At CarGurus, we divide the Prometheus rules files based on the priority of the service. For example, a P1 (Priority 1) service gets its own rules file, all the internal services share a common rules file, and so on.
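    Since we deploy via the kube-prometheus-stack, such a rules file can be expressed as a PrometheusRule object. The example below is purely illustrative; the alert name, expression, and thresholds are not taken from one of our actual rules files.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: service-x-rules
      namespace: namespace-x
    spec:
      groups:
        - name: service-x.alerts
          rules:
            - alert: ServiceXHighErrorRate
              expr: sum(rate(http_requests_total{service="service-x", code=~"5.."}[5m])) > 1
              for: 10m
              labels:
                severity: critical
              annotations:
                summary: service-x is returning 5xx responses at an elevated rate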

    Exporters

    There are a number of libraries and servers that help export existing metrics from third-party systems as Prometheus metrics. Prometheus supports integrations with a wide range of exporters covering databases, hardware, storage, HTTP, APIs, and a lot more; you can find more information on exporters in the official Prometheus documentation. The following is one example of a third-party exporter we deploy at CarGurus.

    Node Exporter at CG

    The Prometheus Node Exporter exposes a wide variety of hardware and kernel-related metrics like CPU, Memory, and other system statistics. At CarGurus the node exporter runs on the nodes themselves and is scraped by the Prometheus instance that spans the nodes rather than the Prometheus instance deployed within the cluster. This means that the information from the node exporter is in a different Prometheus instance (Infrastructure team focused) than would normally be expected.

    Prometheus Architecture Diagram

    Prometheus Federation

    At CarGurus, we have a number of Dev, Prod NA, and Prod EU Kubernetes clusters, each with its own Prometheus instance deployed within it. It can get difficult when a query needs metrics from multiple clusters or when a global view of a metric is needed. A global view, or some level of federation across Prometheus instances, solves these use cases. At CarGurus, we have deployed Thanos as our Prometheus federation tool, and we will discuss Thanos in more depth in Part 2 of this post.