Currently, my main application is built with Java Spring Boot, and this won't change because it's convenient.
@Autowired service beans implement, for example (a rough sketch follows this list):

  • Enterprise and establishment datasets. The first service can also return a list of Enterprise objects that each hold a Map of their establishments.
    So the service returns: Dataset<Enterprise>, Dataset<Establishment>, Dataset<Row>
  • Associations: Dataset<Row>
  • Cities: Dataset<Commune> or Dataset<Row>
  • Local authorities: Dataset<Row>
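
To make the shapes concrete, here is a rough sketch, in Scala notation, of what these services return (Enterprise, Establishment and Commune are my own model classes; the Spring wiring is omitted and the parameters are simplified):

    import org.apache.spark.sql.{Dataset, Row}

    // Stub model classes standing in for my Java beans.
    class Enterprise; class Establishment; class Commune

    // Illustrative signatures only; parameters are simplified.
    trait DatasetServices {
      def enterprises(year: Int): Dataset[Enterprise]       // each holding a Map of its establishments
      def establishments(year: Int): Dataset[Establishment]
      def associations(year: Int): Dataset[Row]
      def communes(year: Int): Dataset[Commune]             // or Dataset[Row]
      def localAuthorities(year: Int): Dataset[Row]
    }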

Many use-case functions are calls of this kind:

What are associations(year=2020)?

My application then forwards the call to datasetAssociation(2020), which operates on the enterprise and establishment datasets, and on the city and local authority ones, to provide a useful result.

Many people have recommended that I take advantage of Scala's abilities.

For this, I'm considering an operation that involves other operations between datasets:

  • Some made of Row,
  • Some carrying concrete objects.

I have this operation to implement, in terms of the datasets reached/involved:
associations.enterprises.establishments.cities.localauthorities

Will I be able to write the bold part in Scala? This means that:

  1. A Dataset<Row> built with Java code is sent to a Scala function to be completed.

  2. Scala creates a new dataset with Enterprise and Establishment objects.
    a) If the source of an object is written in Scala, I don't have to recreate a new source for it in Java.
    b) Conversely, if the source of an object is written in Java, I don't have to recreate a new source in Scala.
    c) I can use a Scala object returned by this dataset directly on the Java side.

  3. Scala will have to call functions that remain implemented in Java, sending them the underlying dataset it is creating (for example, to complete it with city information). A rough sketch of this round trip follows this list.
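
For concreteness, here is the round trip I have in mind, sketched in Scala (names and fields are illustrative; Enterprise stands in for one of my Java beans, and CityService stands in for a function kept in Java):

    import org.apache.spark.sql.{Dataset, Encoders, Row}

    // Stand-in for my Java bean model class (no-arg constructor + getters/setters).
    class Enterprise extends java.io.Serializable {
      @scala.beans.BeanProperty var siren: String = _   // illustrative field
    }

    // Stand-in for a function that stays implemented in Java.
    object CityService {
      def addCityInfo(ds: Dataset[Enterprise]): Dataset[Enterprise] = ds
    }

    object AssociationEnricher {
      // (1) The Dataset[Row] built on the Java side is passed in unchanged:
      //     Dataset<Row> in Java and Dataset[Row] in Scala are the same class.
      def complete(associations: Dataset[Row]): Dataset[Enterprise] = {
        // (2) Build a typed dataset of the Java bean (assumes matching columns);
        //     the bean encoder is passed explicitly.
        val enterprises = associations.as(Encoders.bean(classOf[Enterprise]))
        // (3) Hand the dataset being built back to a method kept in Java.
        CityService.addCityInfo(enterprises)
      }
    }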

Java should be able to call Scala methods at any time, and Scala should be able to call Java methods at any time too: an operation could follow a Java -> Scala -> Scala -> Java -> Scala -> Java -> Java path if desired, in terms of the native language of each method called, because I don't know in advance which parts I will find useful to port to Scala.

If these three points are satisfied, I will consider that Java and Scala are interoperable both ways and can benefit from one another.

But can I achieve this goal (in Spark 2.4.x, or more probably in Spark 3.0.0)?

Summarizing: are Java and Scala interoperable both ways, in such a manner that:

  • It does not make the source code too clumsy on one side or the other. Or worse: duplicated.
  • It doesn't strongly degrade performance (having to recreate a whole dataset, or to convert each of the objects it contains, on one side or the other, for example, would be prohibitive).

There are 2 answers below.

Juh_ (Best Answer)

As Jasper-M wrote, Scala and Java code are perfectly interoperable:

  • They both compile into .class files that are executed the same way by the JVM.
  • The Spark Java and Scala APIs work together, with a couple of specifics:
    • Both use the same Dataset class, so there is no issue there.
    • However, SparkContext and RDD (and all RDD variants) have Scala APIs that aren't practical in Java, mainly because the Scala methods take Scala types as input that are not the ones you use in Java. But there are Java wrappers for both of them (JavaSparkContext, JavaRDD); coding in Java, you have probably seen those wrappers already. (A conversion sketch follows this list.)
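
A small sketch of that wrapping (sc and rdd are assumed to already exist); conversion in either direction is a cheap wrapper around the same underlying RDD, not a copy of the data:

    import org.apache.spark.SparkContext
    import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
    import org.apache.spark.rdd.RDD

    object Wrappers {
      def demo(sc: SparkContext, rdd: RDD[String]): Unit = {
        val jsc: JavaSparkContext = new JavaSparkContext(sc)  // expose to Java callers
        val javaRdd: JavaRDD[String] = rdd.toJavaRDD()        // Scala RDD -> JavaRDD
        val backToScala: RDD[String] = javaRdd.rdd            // JavaRDD -> Scala RDD
      }
    }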

Now, as many have recommended: Spark being a Scala library first, and the Scala language being more powerful than Java (*), writing Spark code in Scala will be much easier. Also, you will find many more code examples in Scala; it is often difficult to find Java code examples for complex Dataset manipulation.

So, I think the two main issues you should take care of are:

  1. (Not Spark related, but necessary) Have a project that compiles both languages and allows two-way interoperability. I think sbt provides this out of the box, while with Maven you need to use the scala plugin and (from my experience) put both the Java and Scala files in the java folder. Otherwise one side can call the other, but not the reverse (Scala can call Java but Java cannot call Scala, or the other way around). A minimal sbt sketch follows this list.
  2. You should be careful about the encoders that are used each time you create a typed Dataset (i.e. Dataset[YourClass], not Dataset<Row>). In Java, and for Java model classes, you need to use Encoders.bean(YourClass.class) explicitly. In Scala, by default, Spark finds the encoder implicitly, and those implicit encoders are built for Scala case classes ("Product types") and Scala standard collections. So just be mindful of which encoders are used: for example, if you create a Dataset of YourJavaClass in Scala, I think you will probably have to pass Encoders.bean(classOf[YourJavaClass]) explicitly for it to work and not have serialization issues. (See the second sketch below.)
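
For point 1, a minimal build.sbt sketch (the project name and versions are placeholders; Spark 3.0.0 ships for Scala 2.12):

    // build.sbt: sbt compiles src/main/java and src/main/scala jointly, so
    // Java and Scala classes can reference each other in both directions.
    name := "spark-java-scala"
    scalaVersion := "2.12.10"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0" % "provided"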
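
And for point 2, the explicit-encoder pattern in Scala looks like this (YourJavaClass stands in for one of your Java model classes, and the column names are assumed to match the fields):

    import org.apache.spark.sql.{Dataset, Encoders, Row, SparkSession}

    // Stub standing in for a Java bean that would normally be compiled
    // from src/main/java (no-arg constructor plus getters/setters).
    class YourJavaClass extends java.io.Serializable {
      @scala.beans.BeanProperty var id: String = _
    }

    // A Scala case class, for contrast: its encoder is found implicitly.
    case class Pair(key: String, value: Long)

    object TypedViews {
      def demo(spark: SparkSession, rows: Dataset[Row]): Unit = {
        import spark.implicits._
        val scalaTyped: Dataset[Pair] = rows.as[Pair]       // implicit encoder
        val javaTyped: Dataset[YourJavaClass] =
          rows.as(Encoders.bean(classOf[YourJavaClass]))    // explicit bean encoder
      }
    }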

One last note: you wrote that you use Java Spring Boot. So:

  • Be aware that Spring's design goes completely against recommended Scala/functional practice, using null and mutable state all over the place. You can still use Spring, but it might feel strange in Scala, and the community will probably not accept it easily.
  • You can call Spark code from a Spring context, but you should not use Spring (the context) from Spark, especially inside methods distributed by Spark, such as in rdd.map. That would attempt to create the Spring context in each worker, which is very slow and can easily fail. (A sketch of the difference follows.)
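
A sketch of that difference (CityService and its lookup table are hypothetical):

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Hypothetical Spring-managed bean: fine to use on the driver,
    // but not something to capture in a distributed closure.
    trait CityService { def lookupTable(): Map[String, String] }

    object SpringAndSpark {
      def enrich(spark: SparkSession, codes: Dataset[String],
                 cityService: CityService): Dataset[String] = {
        import spark.implicits._
        // BAD: capturing the Spring bean in the closure would ship it (and
        // whatever it drags along) to every worker, or fail at serialization:
        // codes.map(c => cityService.lookupTable().getOrElse(c, c))

        // Better: extract plain data on the driver, broadcast it, and reference
        // only the broadcast value inside the distributed closure.
        val table = spark.sparkContext.broadcast(cityService.lookupTable())
        codes.map(c => table.value.getOrElse(c, c))
      }
    }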

(*) About "scala being more powerful than java": I don't mean that scala is better than java (well I do think so, but it is a matter of taste :). What I mean is that the scala language provides much more expressiveness than java. Basically it does more with less code. The main differences are:

  • implicits, which are heavily used by the Spark API;
  • monads and for-comprehensions;
  • and of course the powerful type system (read about covariance, for example: a List[Dog] is a subtype of List[Animal] in Scala, but not in Java; see the sketch below).
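
A minimal illustration of that last point:

    class Animal
    class Dog extends Animal

    object CovarianceDemo {
      val dogs: List[Dog] = List(new Dog)
      // Compiles because scala.List is declared covariant (List[+A]); the Java
      // equivalent would need a wildcard such as List<? extends Animal>.
      val animals: List[Animal] = dogs
    }
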
Jasper-M

Yes, it is possible without performance degradation or overly clumsy extra code. Scala and Java are almost perfectly interoperable, and moreover the Spark Dataset API is shared between Java and Scala: the Dataset class is exactly the same whether you are using Java or Scala. As you can see in the javadoc or scaladoc (note that they only differ in layout, not in content), the Java and Scala code is perfectly interchangeable. At most, the Scala code will be a bit more succinct.