Cross Cluster HDFS API


How does it work?

The Cross Cluster API is simply the HDFS API, with the ability to read across clusters even when they are Kerberized.

For cross-cluster read/write to work, add the namenodes of all the clusters you are going to work with while starting the Spark application, by passing the following Spark conf. Example for working with several clusters:

```bash
"spark.yarn.access.namenodes":"hdfs://cluster_1_name_node:8020/,hdfs://cluster_2_name_node:8020/"
```
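For instance, the conf can be supplied on the command line at submit time. This is a minimal sketch; the cluster host names and application details below are placeholders:

```bash
spark-submit \
  --master yarn \
  --conf "spark.yarn.access.namenodes=hdfs://cluster_1_name_node:8020/,hdfs://cluster_2_name_node:8020/" \
  --class com.example.MyApp \
  my-app.jar
```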

Alluxio - the same API works with Alluxio as well, provided the property gimel.hdfs.nn starts with the string alluxio.

Under the hood, the Spark read API is used. Hence, wherever applicable, a filter on a partitioned column will result in partition pruning.
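As an illustration of pruning, the sketch below filters on a hypothetical partition column `dt` (not defined in this document); because Spark pushes the filter down, only the matching partition directories are listed and read from the remote cluster:

```scala
import com.paypal.gimel._
import org.apache.spark.sql._

val dataset = DataSet(sparkSession)
val options = Map[String, Any]()
val df = dataset.read("pcatalog.gimel_xcluster_pi", options)
// Assuming the data is partitioned by "dt": Spark prunes all other
// partitions, so only this day's files cross the cluster boundary.
val pruned = df.where("dt = '2024-01-01'")
```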


Cross-cluster reads have potential pitfalls too: if the source path holds terabytes of data and there is no pruning of partitions/paths, the result is a long-running job that attempts to transfer terabytes of data across clusters. The Cross Cluster API is not meant for this type of use case.

Preemptive Guard

With the above limitation in mind, there is a guard users can leverage when they are unsure of the volume that will be read from the source cluster. When the threshold property is set to, say, 250, the read-size threshold is taken to be 250 GB. Preemptively, before the API reads the source path, the size of the source path is compared against 250 GB. If the size exceeds 250 GB, an exception is thrown indicating the threshold was exceeded.
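The check the guard performs can be sketched as follows. This is not Gimel's actual implementation, just a minimal illustration using the Hadoop FileSystem API; the method name `checkThreshold` is made up for this example:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Compare the total size under a source path against a GB threshold,
// and fail fast before any data is read if it is exceeded.
def checkThreshold(pathUri: String, thresholdGb: Long, conf: Configuration): Unit = {
  val fs = FileSystem.get(new URI(pathUri), conf)
  val sizeBytes = fs.getContentSummary(new Path(pathUri)).getLength // total bytes under the path
  val thresholdBytes = thresholdGb * 1024L * 1024L * 1024L
  if (sizeBytes > thresholdBytes) {
    throw new IllegalStateException(
      s"Size of $pathUri ($sizeBytes bytes) exceeds threshold of $thresholdGb GB")
  }
}
```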

Create hive table

```sql
CREATE EXTERNAL TABLE `pcatalog.gimel_xcluster_pi`(
  `payload` string)
```

Catalog Properties

| Property | Mandatory? | Description | Example | Default |
| -------- | ---------- | ----------- | ------- | ------- |
|  | Y | Format of the data | text |  |
|  | Y | HDFS location | /tmp/examples/ |  |
| gimel.hdfs.nn | Y | Name Node URI | the namenode of the cluster where the Spark application is currently running |  |
|  | Y | Name of cluster |  |  |

Common Imports

```scala
import com.paypal.gimel._
import org.apache.spark.sql._
```

API Usage

Read from different Cluster

```scala
// Initiate DataSet
val dataset = DataSet(sparkSession)
// DataSet name
val datasetName = "pcatalog.gimel_xcluster_pi"
// Options set by default
// Maximum threshold of data to query is 50 GB by default
// To increase the threshold
val options = Map("" -> "250")
// Read some data
val data = dataset.read(datasetName, options)
```

SQL Support

```scala
val data = com.paypal.gimel.sql.GimelQueryProcessor.executeBatch("select * from pcatalog.gimel_xcluster_pi", sparkSession)
```