Gimel Data API

Contents

  • What is Gimel?
  • Gimel Overview
  • Stack & Version Compatibility
  • Getting Started
  • Gimel Catalog Providers
  • Questions
  • Contribution Guidelines
  • Adding a new connector

What is Gimel?

  • Gimel is a Big Data Abstraction framework built on Apache Spark and other open source connectors in the industry.
  • Gimel provides a unified Data API to read and write data to various stores.
  • Alongside the Data API, Gimel provides a unified SQL access pattern (GSQL) that works the same across all stores.
  • The APIs are available in both Scala and Python (PySpark).

Scala

/* Simple Data API example: read from Kafka, transform, and write to Elastic */

// Initiate API
val dataset = com.paypal.gimel.DataSet(spark)

// Read Data | Kafka semantics abstracted for the user.
// See "Gimel Catalog Providers" for how dataset details are abstracted
val df: DataFrame = dataset.read("kafka_dataset")

// Apply transformations (business logic, opaque to Gimel)
val transformed_df: DataFrame = df // ...your transformations...

// Write Data | Elastic semantics abstracted for the user
dataset.write("elastic_dataset", transformed_df)

/* GSQL Reference */

// Create Gimel SQL reference
val gsql: (String) => DataFrame = com.paypal.gimel.sql.GimelQueryProcessor.executeBatch(_: String, spark)

// your SQL
val sql = """
insert into elastic_dataset
select * from kafka_dataset
"""

gsql(sql)
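
Connector properties can also be set through the same GSQL entry point before a query runs, via a "set key=value" statement (the PySpark example below uses the identical pattern). A minimal sketch with the gsql reference defined above:

// Set a store-specific property, then run the batch SQL
gsql("set es.nodes.wan.only=true")
gsql(sql)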

Python | PySpark

# imports used to interoperate with the JVM API
from pyspark.sql import DataFrame, SparkSession, SQLContext

# fetch reference to the class in JVM
ScalaDataSet = sc._jvm.com.paypal.gimel.DataSet

# fetch reference to java SparkSession
jspark = spark._jsparkSession

# initiate dataset
dataset = ScalaDataSet.apply(jspark)

# Read Data | Kafka semantics abstracted for user.
# The JVM call returns a Java DataFrame; wrap it for use from PySpark.
df = DataFrame(dataset.read("kafka_dataset"), SQLContext(sc))

# Apply transformations (business logic, opaque to Gimel)
transformed_df = df # ...your transformations...

# Write Data | Elastic semantics abstracted for user.
# Pass the underlying Java DataFrame back to the JVM API.
dataset.write("elastic_dataset", transformed_df._jdf)

# fetch reference to GimelQueryProcessor Class in JVM
gsql = sc._jvm.com.paypal.gimel.scaas.GimelQueryProcessor

# your SQL
sql = """
insert into elastic_dataset
select * from kafka_dataset
"""

# Set connector properties via a GSQL "set" statement
gsql.executeBatch("set es.nodes.wan.only=true", jspark)

# execute GSQL; this can be any SQL of the form "insert into ... select ... join ... where ..."
gsql.executeBatch(sql, jspark)
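
The examples above read and write with default settings. Gimel's DataSet calls can also take per-call options as a properties map; treat the overload below as an assumption to verify against your Gimel version's DataSet signatures (shown here with the Scala API from the first example):

// Assumed overload: write(dataset, dataframe, props) taking an options map.
// Verify the exact signature in your Gimel version before relying on it.
val esProps = Map("es.nodes.wan.only" -> "true")
dataset.write("elastic_dataset", transformed_df, esProps)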

Gimel Overview

  • 2020 - Gimel @ Scale By The Bay, Online
  • 2020 - Gimel @ Data Orchestration Summit By Alluxio, Online
  • 2018 - Gimel @ QCon.ai, SF


Stack & Version Compatibility

| Compute/Storage/Language | Version | Grade | Documentation | Notes |
|---|---|---|---|---|
| Scala | 2.12.10 | PRODUCTION | | Data API is built on Scala 2.12.10; the library should be compatible as long as the Spark major version of the library and the environment match |
| Python | 3.x | PRODUCTION | PySpark Support | Data API works fully well with PySpark as long as the Spark version in the environment and the Gimel library match |
| Spark | 2.4.7 | PRODUCTION | | This is the recommended version |
| Hadoop | 2.10.0 | PRODUCTION | | This is the recommended version |
| S3 | 1.10.6 | PRODUCTION | S3 Doc | |
| Big Query | 0.17.3 | PRODUCTION | Big Query Doc | |
| Teradata | 14 | PRODUCTION | Teradata Doc | Uses JDBC Connector internally |
| Hive | 2.3.7 | PRODUCTION | Hive Doc | |
| Kafka | 2.1.1 | PRODUCTION | Kafka 2.2 Doc | V2.1.1 is PayPal's supported version of Kafka |
| SFTP | 0.82 | PRODUCTION | SFTP Doc | Read/write files from/to an SFTP server |
| ElasticSearch | 6.2.1 | PRODUCTION | ElasticSearch Doc | |
| Restful/Web-API | NA | PRODUCTION WITH LIMITATIONS | Restful/Web-API Doc | Allows accessing data from any source supporting a REST API |
| Aerospike | 3.1.5 | EXPERIMENTAL | Aerospike Doc | Experimental API for Aerospike reads/writes |
| Cassandra | 2.0 | EXPERIMENTAL | Cassandra Doc | Experimental API for Cassandra reads/writes; leverages the DataStax Connector |
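
Since compatibility hinges on the Spark and Scala versions matching between the environment and the Gimel build, the standard Spark and Scala APIs (nothing Gimel-specific) can confirm what the runtime is using:

// Print the runtime Spark and Scala versions to compare against the table above
println(spark.version)                        // e.g. 2.4.7
println(scala.util.Properties.versionString)  // e.g. version 2.12.10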

Gimel Serde

| Component | Version | Grade | Documentation | Notes |
|---|---|---|---|---|
| Gimel Serde | 1.0 | PRODUCTION | Gimel Serde Doc | Pluggable Gimel serializers and deserializers |
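
To illustrate how a pluggable serde is typically wired in, here is a hedged sketch using the GSQL "set" pattern from the examples above. The property key and deserializer class below are hypothetical placeholders, not confirmed Gimel identifiers; see the Gimel Serde Doc for the actual ones:

// Hypothetical property key and deserializer class, for illustration only;
// consult the Gimel Serde Doc for the real identifiers.
gsql("set gimel.deserializer.class=com.example.serde.MyJsonDeserializer")
val messages = gsql("select * from kafka_dataset")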

Questions