Currently, my main application is built with Java Spring Boot, and this won't change because it's convenient.
Its @Autowired service beans implement, for example:
- Enterprise and establishment datasets. The first one is also able to return a list of Enterprise objects that have a Map of their establishments. So the service returns: Dataset<Enterprise>, Dataset<Establishment>, Dataset<Row>.
- Associations: Dataset<Row>.
- Cities: Dataset<Commune> or Dataset<Row>.
- Local authorities: Dataset<Row>.
Many use case functions are calls of this kind:
What are associations(year=2020)?
And my application forwards the call to datasetAssociation(2020), which operates with the enterprise and establishment datasets, as well as the city and local authority ones, to provide a useful result.
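To fix ideas, here is a minimal Scala sketch of the shape of this service; the signatures and field names are hypothetical, only the dataset and method names come from the description above:

```scala
import org.apache.spark.sql.{Dataset, Row}

// Illustrative stand-ins for the model classes (the real ones are Java beans).
case class Establishment(id: String, enterpriseId: String)
case class Enterprise(id: String, name: String, establishments: Map[String, Establishment])
case class Commune(code: String, name: String)

trait AssociationDatasets {
  def enterprises(year: Int): Dataset[Enterprise]
  def establishments(year: Int): Dataset[Establishment]
  def associations(year: Int): Dataset[Row]
  def cities(year: Int): Dataset[Commune]
  def localAuthorities(year: Int): Dataset[Row]

  // The use case "associations(year=2020)" forwards here; the implementation
  // combines the datasets above into one enriched result.
  def datasetAssociation(year: Int): Dataset[Row]
}
```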
Many people have recommended that I take advantage of Scala's abilities here.
For this, I'm considering an operation that involves others, between datasets:
- some made of Row,
- some carrying concrete objects.
I have this operation to do, in terms of the datasets reached/involved:
associations.enterprises.establishments.cities.localautorities
Will I be able to write the bold part (enterprises.establishments) in Scala? This means that:
a Dataset<Row> built with Java code is sent to a Scala function to be completed, and Scala creates a new dataset with Enterprise and Establishment objects.
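Written in Scala, that part could look roughly like the sketch below; the column names, the fields of Enterprise and Establishment, and the function name are illustrative assumptions:

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, Row}

// Illustrative stand-ins for the model classes; fields are assumptions.
case class Establishment(siret: String, enterpriseId: String)
case class Enterprise(id: String, name: String, establishments: Map[String, Establishment])

object EnterpriseEnrichment {

  // Receives the Dataset<Row> built by the Java code and returns a typed dataset.
  // Java can call this directly as EnterpriseEnrichment.toEnterprises(rows).
  def toEnterprises(rows: Dataset[Row]): Dataset[Enterprise] = {
    // For a case class the encoder is normally found implicitly (spark.implicits._);
    // Encoders.product is the explicit equivalent.
    implicit val enc: Encoder[Enterprise] = Encoders.product[Enterprise]
    rows.map { r =>
      Enterprise(
        id = r.getAs[String]("enterpriseId"),
        name = r.getAs[String]("name"),
        establishments = Map.empty // the real code would fill this from the establishments dataset
      )
    }
  }
}
```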
a) If the source of an object is written in Scala, I don't have to recreate a new source for it in Java.
b) Conversely, if the source of an object is written in Java, I don't have to recreate a new source in Scala.
c) I can use a Scala object returned by such a dataset on the Java side directly. Scala will have to call functions kept implemented in Java and send them the underlying dataset it is creating (for example, to complete them with cities information); see the sketch just below.
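A minimal sketch of point c), assuming a hypothetical Java-side helper CityEnrichment.addCities kept implemented in Java, and an assumed cityCode join column:

```scala
import org.apache.spark.sql.{Dataset, Row}

// Scala stand-in for a function kept implemented in Java, e.g.
//   public class CityEnrichment {
//     public static Dataset<Row> addCities(Dataset<Row> ds, Dataset<Row> cities) { ... }
//   }
object CityEnrichment {
  def addCities(ds: Dataset[Row], cities: Dataset[Row]): Dataset[Row] =
    ds.join(cities, Seq("cityCode"), "left") // "cityCode" is an assumed join column
}

object AssociationPipeline {

  // Scala builds the dataset, then hands it back to the Java-side helper to be
  // completed with cities information; no conversion is needed because both
  // languages share the same org.apache.spark.sql.Dataset class.
  def enrich(associations: Dataset[Row], cities: Dataset[Row]): Dataset[Row] = {
    val builtByScala: Dataset[Row] = associations // placeholder for the Scala-side steps
    CityEnrichment.addCities(builtByScala, cities)
  }
}
```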
Java calls Scala methods at any time, and Scala calls Java methods at any time too: an operation could follow a
Java -> Scala -> Scala -> Java -> Scala -> Java -> Java
path if wished, in terms of the native language of each method called.
This is because I don't know in advance which parts I will find useful to port to Scala.
If these three points are met, I will consider that Java and Scala are interoperable both ways and that each can benefit from the other.
But can I achieve this goal (in Spark 2.4.x, or more probably in Spark 3.0.0)?
Summarizing: are Java and Scala interoperable both ways, in such a manner that:
- it does not make the source code too clumsy on one side or the other, or worse: duplicated;
- it does not degrade performance strongly (having to recreate a whole dataset, or to convert each of the objects it contains, on one side or the other, would for example be prohibitive)?
As Jasper-M wrote, Scala and Java code are perfectly interoperable: both compile to regular JVM bytecode, so classes written in one language can be used from the other directly.
Now, as many have recommended, Spark being a Scala library first, and the Scala language being more powerful than Java (*), using Scala to write Spark code will be much easier. Also, you will find many more code examples in Scala; it is often difficult to find Java code examples for complex Dataset manipulation.
So, I think the two main issues you should be taking care of are:
- Dataset encoders: when you work with typed datasets (i.e. Dataset[YourClass] and not Dataset<Row>), an encoder is required. In Java, and for Java model classes, you need to use Encoders.bean(YourClass.class) explicitly. But in Scala, by default Spark finds the encoder implicitly, and those encoders are built for Scala case classes ("Product types") and Scala standard collections. So just be mindful of which encoders are used. For example, if you create a Dataset of YourJavaClass in Scala, I think you will probably have to give Encoders.bean(YourJavaClass.class) explicitly for it to work and not have serialization issues (a small sketch follows below).
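Here is a small sketch of that encoder difference; YourJavaClass is represented below by a Scala stand-in bean, but in your project it would be the actual Java model class:

```scala
import scala.beans.BeanProperty
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Stand-in for a Java model class (in the real project this is a plain Java bean).
class YourJavaClass {
  @BeanProperty var id: String = _
  @BeanProperty var name: String = _
}

// A Scala "Product type": its encoder is derived implicitly.
case class ScalaEnterprise(id: String, name: String)

object EncoderExamples {

  def build(spark: SparkSession, javaBeans: java.util.List[YourJavaClass]): Unit = {
    import spark.implicits._

    // Case class: the implicit encoder from spark.implicits._ is enough.
    val scalaDs: Dataset[ScalaEnterprise] = Seq(ScalaEnterprise("1", "a")).toDS()

    // Java bean: no implicit encoder exists, so pass Encoders.bean(...) explicitly,
    // otherwise you run into serialization issues.
    val javaDs: Dataset[YourJavaClass] =
      spark.createDataset(javaBeans)(Encoders.bean(classOf[YourJavaClass]))

    scalaDs.show()
    javaDs.show()
  }
}
```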
One last note: you wrote that you use Java Spring Boot. So be careful not to reference Spring beans inside Spark operations such as rdd.map: this will attempt to create the Spring context in each worker, which is very slow and can easily fail.

(*) About "Scala being more powerful than Java": I don't mean that Scala is better than Java (well, I do think so, but it is a matter of taste :). What I mean is that the Scala language provides much more expressiveness than Java: basically, it does more with less code. The main differences are: