Gimel Data API

Contents

  • What is Gimel?
  • Gimel Overview
  • Stack & Version Compatibility
  • Getting Started
  • Gimel Catalog Providers
  • Questions
  • Contribution Guidelines
  • Adding a new connector

What is Gimel?

  • Gimel is a Big Data Abstraction framework built on Apache Spark and other open source connectors in the industry.
  • Gimel provides a unified Data API to read and write data to various stores.
  • Alongside the Data API, Gimel provides a unified SQL access pattern (GSQL) that works the same across all stores.
  • The APIs are available in both Scala and Python (PySpark).

Scala

/* Simple Data API example: read from Kafka, transform, and write to Elastic */

// Initiate API
val dataset = com.paypal.gimel.DataSet(spark)

// Read Data | Kafka semantics abstracted for the user.
// See "Gimel Catalog Providers" for how dataset details are abstracted
val df: DataFrame = dataset.read("kafka_dataset")

// Apply transformations (business logic, opaque to Gimel)
val transformed_df: DataFrame = df // ...your transformations...

// Write Data | Elastic semantics abstracted for the user
dataset.write("elastic_dataset", transformed_df)

/* GSQL Reference */

// Create Gimel SQL reference
val gsql: (String) => DataFrame = com.paypal.gimel.sql.GimelQueryProcessor.executeBatch(_: String, spark)

// your SQL
val sql = """
insert into elastic_dataset
select * from kafka_dataset
"""

gsql(sql)
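
Connector properties can also be set through the same GSQL entry point before a query runs, via a "set key=value" statement (the PySpark example below uses the identical pattern). A minimal sketch with the gsql reference defined above:

// Set a store-specific property, then run the batch SQL
gsql("set es.nodes.wan.only=true")
gsql(sql)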

Python | PySpark

# imports used to interoperate with the JVM API
from pyspark.sql import DataFrame, SparkSession, SQLContext

# fetch reference to the class in JVM
ScalaDataSet = sc._jvm.com.paypal.gimel.DataSet

# fetch reference to java SparkSession
jspark = spark._jsparkSession

# initiate dataset
dataset = ScalaDataSet.apply(jspark)

# Read Data | Kafka semantics abstracted for user.
# The JVM call returns a Java DataFrame; wrap it for use from PySpark.
df = DataFrame(dataset.read("kafka_dataset"), SQLContext(sc))

# Apply transformations (business logic, opaque to Gimel)
transformed_df = df # ...your transformations...

# Write Data | Elastic semantics abstracted for user.
# Pass the underlying Java DataFrame back to the JVM API.
dataset.write("elastic_dataset", transformed_df._jdf)

# fetch reference to GimelQueryProcessor Class in JVM
gsql = sc._jvm.com.paypal.gimel.scaas.GimelQueryProcessor

# your SQL
sql = """
insert into elastic_dataset
select * from kafka_dataset
"""

# Set connector properties via a GSQL "set" statement
gsql.executeBatch("set es.nodes.wan.only=true", jspark)

# execute GSQL; this can be any SQL of the form "insert into ... select ... join ... where ..."
gsql.executeBatch(sql, jspark)
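
The examples above read and write with default settings. Gimel's DataSet calls can also take per-call options as a properties map; treat the overload below as an assumption to verify against your Gimel version's DataSet signatures (shown here with the Scala API from the first example):

// Assumed overload: write(dataset, dataframe, props) taking an options map.
// Verify the exact signature in your Gimel version before relying on it.
val esProps = Map("es.nodes.wan.only" -> "true")
dataset.write("elastic_dataset", transformed_df, esProps)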

Gimel Overview

  • 2020 - Gimel @ Scale By The Bay, Online
  • 2020 - Gimel @ Data Orchestration Summit By Alluxio, Online
  • 2018 - Gimel @ QCon.ai, SF


Stack & Version Compatibility

| Compute/Storage/Language | Version | Grade | Documentation | Notes |
|---|---|---|---|---|
| Scala | 2.12.10 | PRODUCTION | | Data API is built on Scala 2.12.10; the library should be compatible as long as the Spark major version of the library and the environment match |
| Python | 3.x | PRODUCTION | PySpark Support | Data API works fully well with PySpark as long as the Spark version in the environment and the Gimel library match |
| Spark | 2.4.7 | PRODUCTION | | This is the recommended version |
| Hadoop | 2.10.0 | PRODUCTION | | This is the recommended version |
| S3 | 1.10.6 | PRODUCTION | S3 Doc | |
| Big Query | 0.17.3 | PRODUCTION | Big Query Doc | |
| Teradata | 14 | PRODUCTION | Teradata Doc | Uses JDBC Connector internally |
| Hive | 2.3.7 | PRODUCTION | Hive Doc | |
| Kafka | 2.1.1 | PRODUCTION | Kafka 2.2 Doc | V2.1.1 is PayPal's supported version of Kafka |
| SFTP | 0.82 | PRODUCTION | SFTP Doc | Read/write files from/to an SFTP server |
| ElasticSearch | 6.2.1 | PRODUCTION | ElasticSearch Doc | |
| Restful/Web-API | NA | PRODUCTION WITH LIMITATIONS | Restful/Web-API Doc | Allows accessing data from any source supporting a REST API |
| Aerospike | 3.1.5 | EXPERIMENTAL | Aerospike Doc | Experimental API for Aerospike reads/writes |
| Cassandra | 2.0 | EXPERIMENTAL | Cassandra Doc | Experimental API for Cassandra reads/writes; leverages the DataStax Connector |
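
Since compatibility hinges on the Spark and Scala versions matching between the environment and the Gimel build, the standard Spark and Scala APIs (nothing Gimel-specific) can confirm what the runtime is using:

// Print the runtime Spark and Scala versions to compare against the table above
println(spark.version)                        // e.g. 2.4.7
println(scala.util.Properties.versionString)  // e.g. version 2.12.10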

Gimel Serde

| Component | Version | Grade | Documentation | Notes |
|---|---|---|---|---|
| Gimel Serde | 1.0 | PRODUCTION | Gimel Serde Doc | Pluggable Gimel serializers and deserializers |
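
To illustrate how a pluggable serde is typically wired in, here is a hedged sketch using the GSQL "set" pattern from the examples above. The property key and deserializer class below are hypothetical placeholders, not confirmed Gimel identifiers; see the Gimel Serde Doc for the actual ones:

// Hypothetical property key and deserializer class, for illustration only;
// consult the Gimel Serde Doc for the real identifiers.
gsql("set gimel.deserializer.class=com.example.serde.MyJsonDeserializer")
val messages = gsql("select * from kafka_dataset")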

Questions