# CSV API

## Note

- This API wraps a CSV reader/writer under the hood.
- It reads CSV data from, and writes CSV data to, HDFS.
## Create Hive Table

```sql
CREATE EXTERNAL TABLE `pcatalog.sample_csv`(
  `payload` string)
LOCATION '/user/LOCATION/csv_data11/csv12'
TBLPROPERTIES (
  'gimel.storage.type'='HDFS',
  'gimel.hdfs.data.format'='csv');
```
## Catalog Properties

| Property | Mandatory? | Description | Example | Default |
|---|---|---|---|---|
| gimel.hdfs.save.mode | N | `Append` adds data to the path; `Overwrite` overwrites data in the path | Append | Overwrite |
| gimel.hdfs.csv.data.headerProvided | N | Whether the API should infer the header from the data | true / false | true |
| gimel.hdfs.data.format | Y | Helps the API infer whether to invoke the CSV reader/writer | csv | |
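The interplay between these defaults and user-supplied options can be pictured as a plain map merge. A minimal pure-Scala sketch (an illustration of the precedence only, not the actual Gimel implementation; the default values are taken from the table above):

```scala
// Sketch (assumption, not Gimel source): user-supplied options
// overriding the documented defaults via a map merge.
val defaults = Map(
  "gimel.hdfs.save.mode"               -> "Overwrite",
  "gimel.hdfs.csv.data.headerProvided" -> "true"
)
val userOptions = Map("gimel.hdfs.save.mode" -> "Append")

// The right-hand side wins on key collisions, so user options take precedence.
val effective = defaults ++ userOptions
// effective("gimel.hdfs.save.mode") == "Append"
// effective("gimel.hdfs.csv.data.headerProvided") == "true"
```

Any property not overridden (here, `gimel.hdfs.csv.data.headerProvided`) keeps its documented default.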
## Common Imports

```scala
import com.paypal.gimel.DataSet
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
```
## Write to CSV

```scala
// Prepare test data
def stringed(n: Int) = s"""{"id": ${n}, "name": "MAC-${n}", "rev": ${n * 10000}}"""
val texts: Seq[String] = (1 to 100).map { x => stringed(x) }
val rdd: RDD[String] = sparkSession.sparkContext.parallelize(texts)
val df: DataFrame = sparkSession.read.json(rdd)

// Initiate DataSet
val dataset = com.paypal.gimel.DataSet(sparkSession)

// DataSet name
val datasetName = "pcatalog.sample_csv"

// Option set by default: gimel.hdfs.save.mode = Overwrite
// Users can change the save mode via options
val options = Map("gimel.hdfs.save.mode" -> "Append")

// Write some data
dataset.write(datasetName, df, options)
```
## Read from CSV

```scala
// Initiate DataSet
val dataset = com.paypal.gimel.DataSet(sparkSession)

// DataSet name
val datasetName = "pcatalog.sample_csv"

// Option set by default: gimel.hdfs.csv.data.headerProvided = true
// Users can set the header flag to false via options
val options = Map("gimel.hdfs.csv.data.headerProvided" -> "false")

// Read some data
val readDf: DataFrame = dataset.read(datasetName, options)
```
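To illustrate what `gimel.hdfs.csv.data.headerProvided` toggles, here is a pure-Scala sketch of header handling (an illustration only, not Gimel's actual reader; the `_c0`-style fallback column names mirror Spark's convention for headerless CSV):

```scala
// Illustration only: how a header flag changes CSV parsing.
val lines = Seq("id,name", "1,MAC-1", "2,MAC-2")

def parse(lines: Seq[String], headerProvided: Boolean): (Seq[String], Seq[Array[String]]) =
  if (headerProvided)
    // First line supplies the column names; the rest are data rows.
    (lines.head.split(",").toSeq, lines.tail.map(_.split(",")))
  else
    // No header: synthesize Spark-style column names (_c0, _c1, ...)
    // and treat every line as a data row.
    (lines.head.split(",").indices.map(i => s"_c$i"), lines.map(_.split(",")))

val (cols, rows) = parse(lines, headerProvided = true)
// cols == Seq("id", "name"); rows holds the 2 remaining data rows
```

With `headerProvided = false`, the same input yields columns `_c0`, `_c1` and three data rows, which is why the flag must match how the files were actually written.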