LibSVM: Understanding the data format


I am currently experimenting with the LibSVM format as a standardized format for exchanging label/feature data sets between Python and Java in a Spark project. However, I am confused by the multiple files named 'part-000*' that are created when I save the data (originally a pandas DataFrame, converted to an RDD of LabeledPoints) using MLUtils.saveAsLibSVMFile() from pyspark.mllib.util.

Why is the data split across multiple files and how can I save it to a single text file?
Or, alternatively, how can I read these multiple 'part-0000*' files?

AFAICS, MLUtils.loadLibSVMFile() (also in pyspark.mllib.util) requires a single file, which is strange given that MLUtils.saveAsLibSVMFile() produces multiple files. Why this inconsistency?
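
As a stopgap, here is a minimal sketch of merging the part files outside Spark. It assumes the output directory sits on a local filesystem (not HDFS) and that every part file holds plain LibSVM lines of the form `<label> <index>:<value> ...`. The helper names merge_part_files and parse_libsvm_line are hypothetical and are not part of Spark.

```python
import glob
import os


def merge_part_files(output_dir, merged_path):
    """Concatenate Spark 'part-*' files from output_dir into one text file.

    Hypothetical helper (not a Spark API). Returns the number of
    non-empty lines written.
    """
    # Sort so that part-00000, part-00001, ... keep their original order.
    # The pattern skips marker files such as _SUCCESS and hidden .crc files.
    part_paths = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    written = 0
    with open(merged_path, "w") as merged:
        for part_path in part_paths:
            with open(part_path) as part:
                for line in part:
                    line = line.strip()
                    if line:  # skip blank lines
                        merged.write(line + "\n")
                        written += 1
    return written


def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features
```

Calling merge_part_files("out_dir", "merged.libsvm") would then give one file that any LibSVM reader can consume. This only works around the problem on the filesystem side; it does not explain why Spark writes one file per partition.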
