CSV API

Note

  • This API wraps the CSV reader/writer under the hood.
  • It reads CSV data from HDFS and writes data back to HDFS as CSV.

Create Hive Table

CREATE EXTERNAL TABLE `pcatalog.sample_csv`(
  `payload` string)
LOCATION '/user/LOCATION/csv_data11/csv12'
TBLPROPERTIES (
  'gimel.storage.type'='HDFS',
  'gimel.hdfs.data.format'='csv')
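
If you are working from a Spark shell instead of the Hive CLI, the same DDL can be issued through a Hive-enabled SparkSession; a minimal sketch, assuming the session has Hive support and the table does not already exist:

// Run the DDL through Spark SQL (requires a Hive-enabled SparkSession)
sparkSession.sql(
  """CREATE EXTERNAL TABLE `pcatalog.sample_csv`(
    |  `payload` string)
    |LOCATION '/user/LOCATION/csv_data11/csv12'
    |TBLPROPERTIES (
    |  'gimel.storage.type'='HDFS',
    |  'gimel.hdfs.data.format'='csv')""".stripMargin)
// Confirm the table properties were registered
sparkSession.sql("SHOW TBLPROPERTIES pcatalog.sample_csv").show(false)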

Catalog Properties

Property                            Mandatory?  Description                                             Example       Default
gimel.hdfs.save.mode                N           Append adds data to the path; Overwrite replaces        Append        Overwrite
                                                the data already in the path
gimel.hdfs.csv.data.headerProvided  N           Whether the API should infer the header from the data   true, false   true
gimel.hdfs.data.format              Y           Tells the API whether to invoke the CSV reader/writer   csv           (none)
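
Each property is passed per call through the options map of the read/write API, as shown in the examples below; an illustrative sketch of overriding the defaults:

// Illustrative options map; values here are examples, not requirements
val options: Map[String, String] = Map(
  "gimel.hdfs.save.mode" -> "Append",             // default: Overwrite
  "gimel.hdfs.csv.data.headerProvided" -> "true"  // default: true
)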

Common Imports

import com.paypal.gimel.DataSet
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._


Write to CSV

// Prepare test data
def stringed(n: Int) = s"""{"id": ${n}, "name": "MAC-${n}", "rev": ${n * 10000}}"""
val texts: Seq[String] = (1 to 100).map(x => stringed(x))
val rdd: RDD[String] = sparkSession.sparkContext.parallelize(texts)
val df: DataFrame = sparkSession.read.json(rdd)
// Initiate DataSet
val dataset = com.paypal.gimel.DataSet(sparkSession)
// DataSet name
val datasetName = "pcatalog.sample_csv"
// Option set by default: gimel.hdfs.save.mode = Overwrite
// The save mode can be changed via options
val options = Map("gimel.hdfs.save.mode" -> "Append")
// Write some data
dataset.write(datasetName, df, options)
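
As a quick sanity check, the written dataset can be read straight back (the full read example follows below); a minimal sketch, assuming the read accepts an empty options map:

// Optional: read the dataset back and verify the record count
val writtenDf: DataFrame = dataset.read(datasetName, Map.empty[String, String])
println(s"Records written: ${writtenDf.count()}")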

Read from CSV

// Initiate DataSet
val dataset = com.paypal.gimel.DataSet(sparkSession)
// DataSet name
val datasetName = "pcatalog.sample_csv"
// Option set by default: gimel.hdfs.csv.data.headerProvided = true
// Header inference can be turned off via options
val options = Map("gimel.hdfs.csv.data.headerProvided" -> "false")
// Read some data
val readDf: DataFrame = dataset.read(datasetName, options)
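
The read returns a Spark DataFrame, so the usual DataFrame operations apply; for example:

// Inspect the schema and a sample of rows
readDf.printSchema()
readDf.show(10, truncate = false)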