
Spark Jobs

This section explains how to use the MONIT infrastructure to execute Spark jobs.

Introduction

Spark jobs are executed in the MONIT processing platform. This platform provides a fault-tolerant and fully orchestrated execution environment for processing monitoring data.

Spark jobs can be written in Python or Scala, implementing streaming or batch workflows. The following versions must be used:

  • Spark: version 2.3
  • Python: version 2.7.x, latest bugfix version
  • Scala: version 2.11.x, latest bugfix version
  • Java: Java 8+, latest bugfix version
  • SBT: version 0.13.x, latest packaged version for Red Hat
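
For a Scala project these pins typically end up in build.sbt. A minimal sketch (the project name below is a placeholder and the exact bugfix releases may differ):

    // build.sbt (fragment) -- illustrative version pins only
    name := "spark-job-example"   // placeholder project name
    scalaVersion := "2.11.12"     // a 2.11.x bugfix release
    val sparkVersion = "2.3.2"    // a Spark 2.3.x release
    // sbt itself is pinned in project/build.properties, e.g. sbt.version=0.13.18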

To develop Spark jobs you first need to set up a local environment. Afterwards you can run the job locally or in the MONIT processing platform.

Setup Environment

We recommend setting up an IDE. There are multiple solutions available for Scala and Python projects. Don't forget to put the IDE configuration files in your gitignore.

  • ScalaIDE: pre-configured Eclipse IDE with Scala, mvn, and git support
  • Eclipse: or derivatives with the Scala plugin (from Eclipse Marketplace -> ScalaIDE) and the Python plugin (from Eclipse Marketplace -> PyDev)
  • IntelliJ IDEA: use the free Community version with the Scala plugin; for Python the equivalent product is PyCharm

Start a new project

We provide a gitlab repository with several examples. It includes examples in Scala and Python to read/write data from Kafka, HDFS, and ES.

Despite the Python support, we strongly suggest using Scala as it is better supported for Spark. In this recipe we will use the Scala examples.
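
As an illustration only (the repository code may differ, and the broker list and topic name below are placeholders), a Spark 2.3 structured-streaming job that reads from Kafka looks roughly like this:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch in the spirit of the read_kafka example; broker list and
    // topic name are placeholders and the real example code may differ.
    object ReadKafkaSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("read_kafka")
          .getOrCreate()

        // Subscribe to a Kafka topic (requires the spark-sql-kafka-0-10 connector)
        val messages = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder brokers
          .option("subscribe", "my-topic")                     // placeholder topic
          .load()
          .selectExpr("CAST(value AS STRING) AS value")        // Kafka payload as text

        // Write incoming records to the console, which is handy for local runs
        val query = messages.writeStream
          .format("console")
          .outputMode("append")
          .start()

        query.awaitTermination()
      }
    }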

  • Clone the repository: git clone https://:@gitlab.cern.ch:8443/monitoring/spark-examples.git.
  • Check you have the correct versions of the dependencies set in build.sbt.
  • Implement your changes... or simply go ahead with the example code.
  • Package the application: sbt assembly (inside the project folder).

Do not package the dependencies (e.g. Spark) into your jar. Make sure the target/ directory is not added to version control.
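
One common way to achieve this is to declare the Spark dependencies with the "provided" scope in build.sbt, so that sbt assembly compiles against them but leaves them out of the fat jar. A sketch (the dependency list is illustrative, not taken from the examples repository):

    // build.sbt (fragment) -- keep Spark out of the jar produced by `sbt assembly`
    libraryDependencies ++= Seq(
      // "provided": available at compile time, supplied by the platform at runtime
      // (sparkVersion is the val pinned in the fragment shown earlier)
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql"  % sparkVersion % "provided"
    )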

Run in Local/Dev mode

Spark jobs can be executed in a local environment.

  • Make sure you successfully packaged your project (check Start Project).
  • Check the examples repository for project-specific configuration steps.
  • Run the project (read_kafka example): spark-submit --master local target/scala-2.11/read_kafka-assembly-1.0.jar.
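
For quick iteration without spark-submit you can also set a local master directly in the code. A sketch for development only (the LocalDevRunner object is hypothetical, not part of the examples repository):

    import org.apache.spark.sql.SparkSession

    // Development-only entry point: runs on all local cores without spark-submit
    // (e.g. via `sbt run` or the IDE). On the MONIT platform the master is
    // supplied by the infrastructure instead of being set in the code.
    object LocalDevRunner {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .appName("read_kafka-local")
          .master("local[*]")
          .getOrCreate()

        // ... invoke the same job logic as the packaged application ...

        spark.stop()
      }
    }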

Run in Monit/Prod mode

Spark jobs are executed as Docker containers. User jobs are provided via dedicated user gitlab projects and built using the gitlab CI and registry services.

The supervision and orchestration of the jobs in the MONIT processing platform is done by Nomad.

  • Place your project in a dedicated gitlab repository.
  • Make sure you successfully packaged your project (check Start Project).
  • Make sure your docker image runs using "user_id" 1001, as the infrastructure will force this user when running the container in order to limit docker permissions.
  • Enable "Container Registry" for your gitlab repository following these instructions.
  • Add the files required for gitlab CI. Example for dockerfile and gitlab-ci.
  • Commit and push all required files. Check the gitlab CI results.
  • Add the diaglab user with the Developer role to your project.
  • Open a SNOW request asking for your project to be added to MONIT.
    • You should receive access to a project under the "userstreaming" repository
  • Check the execution logs of your project in MONIT under the it-db cluster, using the "monit-userstreaming" OS tenant.
    • You can filter your job logs by using the "docker.attrs.NOMAD_JOB_NAME" field, which should point to "monit-<projectname>-<jobname>"