Batch and streaming Spark jobs are an integral part of our data platform and like our other production applications, we need Datadog instrumentation. We rely on Databricks to power those Spark workloads, but integrating Datadog and Databricks wasn’t turn-key. In this post, I’ll share the two code snippets necessary to enable this integration: a custom cluster init script, and a special class to load into the Spark job.

Rather than relying on the Spark UI in Databricks, piping these metrics into Datadog lets us build extremely useful dashboards and, more importantly, monitors for our Spark workloads that tie into our alerting infrastructure.

Configuring the Databricks cluster

When creating a cluster in Databricks, we set up and configure the Datadog agent with the following init script on the driver node:

#!/bin/bash
# reference: https://docs.databricks.com/clusters/clusters-manage.html#monitor-performance
#
# This init script takes the following environment variables as input
#   * DATADOG_API_KEY
#   * ENVIRONMENT
#   * APP_NAME

echo "Running on the driver? $DB_IS_DRIVER"

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  echo "Setting up metrics for spark applicatin: ${APP_NAME}"
  echo "Driver ip: $DB_DRIVER_IP"

  cat << EOF >> /home/ubuntu/databricks/spark/conf/metrics.properties
*.sink.statsd.host=${DB_DRIVER_IP}
EOF

  DD_INSTALL_ONLY=true \
      DD_AGENT_MAJOR_VERSION=7 \
      DD_API_KEY=${DATADOG_API_KEY} \
      DD_HOST_TAGS="[\"env:${ENVIRONMENT}\", \"spark_app:${APP_NAME}\"]" \
      bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/datadog-agent/7.22.0/cmd/agent/install_script.sh)"

  cat << EOF >> /etc/datadog-agent/datadog.yaml
use_dogstatsd: true
# bind on all interfaces so it's accessible from executors
bind_host: 0.0.0.0
dogstatsd_non_local_traffic: true
dogstatsd_stats_enable: false
logs_enabled: false
cloud_provider_metadata:
  - "aws"
EOF

  # NOTE: for debugging purposes, you can set the following to true
  echo "dogstatsd_metrics_stats_enable: false" >> /etc/datadog-agent/datadog.yaml

  sudo service datadog-agent start
fi

The cluster also needs to be launched with the following environment variables to configure the integration:

  • ENVIRONMENT=development/staging/production
  • APP_NAME=your_spark_app_name
  • DATADOG_API_KEY=KEY

Once the cluster has been fully configured with the above init script, you can send metrics to Datadog from Spark through the StatsD port exposed by the agent. All of your Datadog metrics will be automatically tagged with env and spark_app tags.

In practice, you can also set all of this up using DCS (customized containers with Databricks Container Services). We decided against it in the end because we ran into many issues with DCS, including out-of-date base images and lack of support for built-in cluster metrics.

Sending custom metrics from Spark

Integrating StatsD with Spark is very simple. To reduce boilerplate, we built an internal helper utility that wraps the com.timgroup.statsd library:

import com.timgroup.statsd.{NonBlockingStatsDClientBuilder, StatsDClient}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener

import scala.collection.JavaConverters._

/** Datadog class for automating Databricks <> Datadog integration.
 *
 * NOTE: this package relies on datadog agent to be installed and configured
 * properly on the driver node.
 */
class Datadog(val appName: String)(implicit spark: SparkSession) extends Serializable {
  // The Datadog agent runs on the driver node, so StatsD clients must point at the driver host.
  val driverHost: String = spark.sparkContext.getConf
    .getOption("spark.driver.host")
    .orElse(sys.env.get("SPARK_LOCAL_IP"))
    .get

  /** Build a new StatsD client that reports to the agent running on the driver node. */
  def statsdcli(): StatsDClient = {
    new NonBlockingStatsDClientBuilder()
      .prefix("spark")
      .hostname(driverHost)
      .build()
  }

  val metricsTag = s"spark_app:$appName"

  /** Register a listener that forwards streaming query progress metrics to the Datadog agent. */
  def collectStreamsMetrics(): Unit = {
    spark.streams.addListener(new StreamingQueryListener() {
      val statsd: StatsDClient = statsdcli()
      override def onQueryStarted(queryStarted: StreamingQueryListener.QueryStartedEvent): Unit = {}
      override def onQueryTerminated(queryTerminated: StreamingQueryListener.QueryTerminatedEvent): Unit = {}
      override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {
        val progress = event.progress
        val queryNameTag = s"query_name:${progress.name}"
        statsd.gauge("streaming.batch_id", progress.batchId, metricsTag, queryNameTag)
        statsd.count("streaming.input_rows", progress.numInputRows, metricsTag, queryNameTag)
        statsd.gauge("streaming.input_rows_per_sec", progress.inputRowsPerSecond, metricsTag, queryNameTag)
        statsd.gauge("streaming.process_rows_per_sec", progress.processedRowsPerSecond, metricsTag, queryNameTag)
        progress.durationMs.asScala.foreach { case (op, v) =>
          statsd.gauge(
            "streaming.duration", v, s"operation:$op", metricsTag, queryNameTag)
        }
      }
    })
  }
}

Initializing the helper class takes two lines of code:

implicit val spark: SparkSession = SparkSession.builder().getOrCreate()
val datadog = new Datadog(AppName)

Then you can use datadog.statsdcli() to create StatsD clients from within both the driver and the executors to emit custom metrics:

val statsd = datadog.statsdcli()
statsd.count(s"${AppName}.foo_counter", 100)

Note: The Datadog agent flushes metrics on a preset interval that can be configured from the init script; by default, it’s 10 seconds. This means that if your Spark application, running in a job cluster, exits immediately after sending a metric to the Datadog agent, the agent won’t have enough time to forward that metric to Datadog before the Databricks cluster shuts down. To address this, add a short sleep at the end of the Spark application so the agent has enough time to flush the newly ingested metrics.
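
A rough sketch of that workaround is shown below; runSparkJob() is a placeholder for your application logic, and the 30-second pause is arbitrary (anything comfortably above the agent’s flush interval should do):

// Hypothetical sketch: keep the driver alive long enough for the Datadog
// agent to flush metrics before the job cluster shuts down.
try {
  runSparkJob() // placeholder for your application logic
} finally {
  Thread.sleep(30 * 1000)
}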

Instrumenting a Spark streaming app

Users of the Datadog helper class can also push all Spark streaming progress metrics to Datadog with one line of code:

datadog.collectStreamsMetrics()

This method registers a streaming query listener that collects streaming progress metrics and sends them to the Datadog agent. All streaming progress metrics will be tagged with spark_app and query_name tags. We use these metrics to monitor streaming lag, batch size issues, and a number of other actionable signals.
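
For context, here is a sketch of how this fits into a streaming application; the source, sink, and query name below are placeholders:

import org.apache.spark.sql.SparkSession

// Hypothetical end-to-end wiring of the helper into a streaming job.
implicit val spark: SparkSession = SparkSession.builder().getOrCreate()
val datadog = new Datadog("my_streaming_app")
datadog.collectStreamsMetrics()

val query = spark.readStream
  .format("rate")           // placeholder source
  .load()
  .writeStream
  .queryName("rate_counts") // becomes the query_name tag on each metric
  .format("console")        // placeholder sink
  .start()

query.awaitTermination()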

And that’s it for the application setup!


In the future, a more “native” integration between Databricks and Datadog would be nice, but these two code snippets have helped bridge a crucial instrumentation and monitoring gap in our Spark workloads. On the Core Platform and Data Engineering teams, we continue to invest in Spark and would love your help building out our reliable and high-performance data platform. Come join us!