I want to control and reduce the training time of my large linear model on a 100 GB dataset. Is there a way to regulate the degree of parallelism of the LinearSVC model in pyspark.ml? Is there an analogue of the DataFrame.repartition(numPartitions) method?
I expect to find a single parameter or method that would let the LinearSVC model occupy more cluster resources and finish training faster. Looking at the Spark job log, I found that at some internal steps of training my data is consistently repartitioned from 1,000-10,000 partitions down to 100. I believe there must be some way to control this parallelism. LinearSVC has two parameters, aggregationDepth and maxBlockSizeInMB, which I think could play a role, but the Spark documentation gives little detail on them.
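For context, here is roughly what my setup looks like (the input path, column names, and parameter values below are placeholders, not my real job):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LinearSVC

spark = SparkSession.builder.appName("linear-svc-parallelism").getOrCreate()

# Placeholder path; the DataFrame has "features" and "label" columns.
train_df = spark.read.parquet("/data/train.parquet")

# Explicitly repartitioning before fit() is the only knob I have found so far,
# but the job log still shows the data being repartitioned down to ~100
# partitions during training.
train_df = train_df.repartition(2000)

svc = LinearSVC(
    maxIter=100,
    regParam=0.01,
    aggregationDepth=2,    # depth of the treeAggregate used when aggregating gradients
    maxBlockSizeInMB=0.0,  # 0.0 means Spark picks the block size automatically
)
model = svc.fit(train_df)
```

Is tuning aggregationDepth / maxBlockSizeInMB (or something else entirely) the intended way to scale this training out?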