Cross Cluster HDFS API

How does it work?

The Cross Cluster API is simply the HDFS API with the ability to read data across clusters, even when those clusters are kerberized.

For cross cluster read/write to work, add the namenodes of all the clusters you are going to work with while starting the spark application. Pass the following spark conf when launching the spark app. Example for working with multiple clusters:

"spark.yarn.access.namenodes":"hdfs://cluster_1_name_node:8020/,hdfs://cluster_2_name_node:8020/"
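
Below is a minimal Scala sketch of supplying this conf when building the session, assuming a client-mode launch; the application name and namenode hosts are placeholders, and on recent Spark versions the equivalent conf may be named spark.yarn.access.hadoopFileSystems. When launching with spark-submit, the same setting is typically passed via --conf instead.

import org.apache.spark.sql.SparkSession

// Build a session that can obtain delegation tokens for both clusters.
// The cluster names and namenode hosts below are placeholders.
val sparkSession = SparkSession
  .builder()
  .appName("gimel-cross-cluster-example")
  .config("spark.yarn.access.namenodes",
    "hdfs://cluster_1_name_node:8020/,hdfs://cluster_2_name_node:8020/")
  .enableHiveSupport()
  .getOrCreate()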

Alluxio - The same API works with Alluxio as well, provided the property gimel.hdfs.nn starts with the string alluxio.

Under the hood, we use the spark read API. Hence, wherever applicable, a filter on a partitioned column will result in partition pruning, as sketched below.
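
A minimal sketch of partition pruning through the dataset API; the dataset name pcatalog.gimel_xcluster_partitioned_sample and the partition column dt are hypothetical, and sparkSession is assumed to be the active session.

import com.paypal.gimel._

// Hypothetical cataloged dataset whose HDFS location is partitioned by `dt`
val dataset = DataSet(sparkSession)
val df = dataset.read("pcatalog.gimel_xcluster_partitioned_sample",
  Map("gimel.hdfs.data.crosscluster.threshold" -> "250"))
// Filtering on the partition column lets spark prune partitions,
// so only the matching paths are read from the source cluster.
val pruned = df.where("dt = '2020-01-01'")
pruned.show(10)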

Limitations

Cross cluster reads have potential pitfalls too: if the source path holds terabytes of data and there is no pruning of partitions / paths, the result is a long-running job that attempts to transfer terabytes of data across clusters. The cross cluster API is not meant for this type of use case.

Preemptive Guard

With the above limitation in mind, we provide a guard that users can leverage when they are unsure about the volume of data that will be read from the source cluster. When the property gimel.hdfs.data.crosscluster.threshold is set to, say, 250, the read-size threshold is taken to be 250 GB. Preemptively, before the API reads the source path, the size of the source path is compared against 250 GB. If the size exceeds 250 GB, an exception is thrown stating that the threshold was exceeded. A sketch follows.
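
A sketch of the guard in action, assuming sparkSession and the Common Imports below are in scope; the exact exception type raised when the threshold is exceeded is not pinned down here, so a generic Exception is caught.

import com.paypal.gimel._

val dataset = DataSet(sparkSession)
// Abort the read preemptively if the source path is larger than ~250 GB
val guardedOptions = Map("gimel.hdfs.data.crosscluster.threshold" -> "250")
try {
  val df = dataset.read("pcatalog.gimel_xcluster_pi", guardedOptions)
  df.show(10)
} catch {
  case e: Exception =>
    // Reached when the size check on the source path exceeds the threshold
    println(s"Cross cluster read aborted: ${e.getMessage}")
}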


Create Hive Table

CREATE EXTERNAL TABLE `pcatalog.gimel_xcluster_pi`(
  `payload` string)
LOCATION
  'hdfs:///tmp/pcatalog/xcluster_remote_hadoop_cluster_text_smoke_test'
TBLPROPERTIES (
  'gimel.hdfs.data.format'='text',
  'gimel.hdfs.data.location'='/tmp/examples/pi.py',
  'gimel.hdfs.nn'='hdfs://remote_hadoop_cluster_namenode:8020',
  'gimel.hdfs.storage.name'='remote_hadoop_cluster',
  'gimel.storage.type'='HDFS'
  )

Catalog Properties

| Property | Mandatory? | Description | Example | Default |
| --- | --- | --- | --- | --- |
| gimel.hdfs.data.format | Y | Format of the data | text, parquet, csv, orc | text |
| gimel.hdfs.data.location | Y | HDFS location of the data | /tmp/examples/pi.py | |
| gimel.hdfs.nn | Y | Name Node URI | hdfs://remote_hadoop_cluster_name_node:8020, alluxio://namenode:19998 | namenode of the cluster where the spark application is currently running |
| gimel.hdfs.storage.name | Y | Name of the cluster | remote_hadoop_cluster, remote_hadoop_cluster11, alluxio | |

Common Imports

import com.paypal.gimel._
import org.apache.spark.sql._

API Usage

Read from a Different Cluster

//Initiate DataSet
val dataset = DataSet(sparkSession)
//DataSet Name
val datasetName = "pcatalog.gimel_xcluster_pi"
//Options set by default:
//Maximum threshold of data to query is 50 GB by default.
//To increase the threshold:
val options = Map("gimel.hdfs.data.crosscluster.threshold" -> "250")
//Read some data
val data = dataset.read(datasetName, options)
data.show(10)

SQL Support

sparkSession.sql("set gimel.hdfs.data.crosscluster.threshold=250");
val data = com.paypal.gimel.sql.GimelQueryProcessor.executeBatch("select * from pcatalog.gimel_xcluster_pi",sparkSession);
data.show