
Spark

What is Spark?

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Spark in Deepnote

Deepnote is a great place for working with Spark! This combination allows you to leverage:

  • Spark's rich ecosystem of tools and its powerful parallelization
  • Deepnote's beautiful UI, its generative AI tools, collaborative workspace, and data apps

Connecting to a remote cluster

A strong motivation for using Spark is its ability to process massive amounts of data, often on large clusters at the major cloud providers (AWS EMR, GCP Dataproc, Databricks, or Azure HDInsight) or managed internally by your staff. You can use those clusters as the back end for heavy computation while using Deepnote as the client, thanks to Spark Connect, the decoupled client-server architecture introduced in Spark 3.4.0.

Requirements

On your cluster:

  • Spark >= 3.4.0
  • Secure network connectivity between Deepnote and the cluster, using one of the options here
  • A running Spark server with Spark Connect enabled (see the Spark Connect docs and the sketch below)
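Starting a Spark Connect server on the cluster typically looks like this (a sketch based on the Spark documentation; adjust the Scala and Spark versions to match your installation):

./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0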

In your Deepnote project:

  • PySpark >= 3.4.0

For example, you can use the jupyter/all-spark-notebook Docker Hub image as a starting point and install PySpark during project initialization, although ideally PySpark is baked into the image itself.
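Installing the client during initialization might look like this (a sketch; pin the version to match your cluster, and note that the connect extra pulls in the Spark Connect client dependencies):

!pip install "pyspark[connect]>=3.4.0"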

General instructions

For AWS EMR, GCP Dataproc, Azure HDInsight or other clusters, follow the instructions in the Spark documentation.

from pyspark.sql import SparkSession

# This example uses a remote EMR cluster 
spark = SparkSession.builder.remote("sc://ec2-1-2-3-4.compute-1.amazonaws.com:15002").getOrCreate()
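Once the session is created, DataFrame operations execute on the remote cluster. A quick sanity check might look like this (a minimal sketch):

# Computed on the remote cluster; only the small result returns to Deepnote
spark.range(10).count()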

Databricks

For Databricks, you can leverage Databricks Connect.

!pip3 install --upgrade "databricks-connect==13.0.*"

Use X.Y.* matching the Databricks Runtime version of your cluster.

from databricks.connect import DatabricksSession

# host is your workspace URL, token a personal access token,
# and cluster_id the target cluster
spark = DatabricksSession.builder.remote(
  host       = "my_host",
  token      = "my_token",
  cluster_id = "my_cluster_id",
).getOrCreate()
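With the session established, queries run on the Databricks cluster. For example (a minimal sketch; the table name is a placeholder, substitute one from your workspace):

# The table name below is hypothetical
df = spark.read.table("samples.nyctaxi.trips")
df.select("trip_distance", "fare_amount").show(5)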

Interfacing with Deepnote features

Some features, such as Chart blocks, require the data to be in Pandas DataFrames. You can use the .toPandas() function to collect a remote Spark DataFrame as a local Pandas DataFrame. Make sure the data will fit into the memory of your Deepnote machine: either pick a larger machine, or aggregate or sample the data before the conversion.
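Reducing the data on the cluster before collecting it might look like this (a minimal sketch; the DataFrame and column names are hypothetical):

# Aggregate on the cluster so only the small result is collected locally
daily_totals = df.groupBy("order_date").sum("amount").toPandas()

# Or sample ~1% of the rows if raw records are needed for charting
sample_pdf = df.sample(fraction=0.01).toPandas()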