In this article I look at the extent of Clojure support for some popular on-cluster data processing tools that Clojure users might need for their data engineering or data science tasks. Then for Snowflake in particular I go further and present a new Clojure API.
Why is the level of Clojure support important? As an example, consider that Scicloj is mostly focused on in-memory processing. As such, if you need to work with a large dataset it will be necessary to compute on-cluster and extract a smaller result before continuing your data science task locally.
However, without sufficient Clojure support for on-cluster processing, anyone needing that facility for their data science or data engineering task would be forced to reach outside the Clojure ecosystem. That adds complexity in terms of interop, compatibility and overall stack requirements.
With that in mind, let's examine the level of Clojure support for some popular on-cluster data processing tools. For each tool I selected its official Clojure library if one exists, or if not the most popular and well-known community-supported alternative with at least 100 stars and 10 contributors on GitHub. I then used the following criteria against the library to classify it as "supported" or "support unknown":
If I couldn't find any such library at all, I classified it as having "no support".
Tool Category | Supported | Support Unknown | No Support |
---|---|---|---|
On-cluster batch processing | 1. Spark (see Spark Interop with Geni below) | ||
On-cluster stream processing | 2. Kafka Streams (see Kafka Interop with Jackdaw below) | 3. Spark Structured Streaming, 4. Flink | |
On-cluster batch and stream processing | 5. Databricks (see Spark Interop with Geni below), 6. Snowflake (see Snowflake Interop below) |
Please note, I don't wish to make any critical judgments based on either the summary analysis above or the more detailed analysis below. The goal is to understand the situation with respect to Clojure support and highlight any gaps, although I suppose I am also inadvertently highlighting the difficulties of maintaining open source software!
Geni is the go-to library for Spark interop. Some months back, I was motivated to evaluate the coverage of Spark features. In particular, I wanted to understand what would be involved to support Spark Connect as it would reduce the complexity of computing on-cluster directly from the Clojure REPL.
However, I found a number of issues that would need to be addressed in order to support Spark Connect and Databricks:
Also, in general by my criteria the support classification is "support unknown":
Jackdaw is the go-to library for Kafka interop. However, by my criteria the support classification is also "support unknown":
Although the Snowpark library has Java and Scala bindings, it doesn't provide anything for Clojure. As such, it's currently not possible to interact with Snowflake using the Clojure way.
To address this gap, I decided to try my hand at creating a Clojure API for Snowflake as part of a broader effort to improve the overall situation regarding Clojure support for popular data tools.
The aim is to validate this approach as a foundation for enabling a wide range of data science or data engineering use cases from the Clojure REPL, in situations where Snowflake is the data warehouse of choice.
The README provides usage examples for all the current features, but I've copied the essential ones here to illustrate the API:
(require '[snowpark-clj.core :as sp])
;; Sample data
(def employee-data
[{:id 1 :name "Alice" :age 25 :department "Engineering" :salary 75000}
{:id 2 :name "Bob" :age 30 :department "Marketing" :salary 65000}
{:id 3 :name "Charlie" :age 35 :department "Engineering" :salary 80000}])
;; Create session and save data
(with-open [session (sp/create-session "snowflake.edn")]
(-> employee-data
(sp/create-dataframe session)
(sp/save-as-table "employees" :overwrite)))
(with-open [session (sp/create-session "snowflake.edn")]
(let [table-df (sp/table session "employees")]
(-> table-df
(sp/filter (sp/gt (sp/col table-df :salary) (sp/lit 70000)))
(sp/select [:name :salary])
(sp/collect))))
;; => [{:name "Alice" :salary 75000} {:name "Charlie" :salary 80000}]
As an early-stage proof-of-concept, it only covers the essential parts of the underlying API without being too concerned with performance or completeness. Other more advanced features are noted and planned, pending further elaboration.
I hope you find it useful and I welcome any feedback or contributions!
Published: 2025-08-28
Tagged: dataengineering clojure spark datascience scicloj kafka flink snowflake