How to achieve data locality with Spark and Apache Ozone? Is it possible?


If we deploy both Apache Ozone and Apache Spark on Kubernetes, is it possible to achieve data locality? Or will data always have to be transferred over the network on read?


1 Answer

Answer by Nanda:

tl;dr Yes. The Ozone client (used by Apache Spark) will prefer reading from the local node if a replica of the block is present on the same node.

Apache Spark uses the Hadoop FileSystem client (which in turn calls the Ozone client) to read data from Ozone.
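
As a rough sketch (not from the original answer), a Spark job reads from Ozone through this filesystem layer. The ofs:// scheme and the fs.ofs.impl class come from the Ozone filesystem client jar; the OM hostname "ozone-om", volume "vol1", bucket "bucket1" and file name below are placeholders:

    // Minimal Spark-on-Ozone read sketch. Requires the Ozone filesystem
    // client jar (e.g. ozone-filesystem-hadoop3) on the Spark classpath.
    import org.apache.spark.sql.SparkSession

    object OzoneReadExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ozone-locality-example")
          // Map the ofs:// scheme to the rooted Ozone filesystem implementation.
          .config("spark.hadoop.fs.ofs.impl",
                  "org.apache.hadoop.fs.ozone.RootedOzoneFileSystem")
          .getOrCreate()

        // The read goes through the Hadoop FileSystem API, which delegates
        // to the Ozone client under the hood.
        val df = spark.read.text("ofs://ozone-om/vol1/bucket1/data.txt")
        df.show(10, truncate = false)

        spark.stop()
      }
    }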

For reads, Apache Ozone sorts the block list by distance from the client node (if network topology is configured, the sorting is based on that topology).
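
The topology-aware sorting is driven by configuration that normally lives in ozone-site.xml / core-site.xml on the Ozone services, not in application code. The sketch below only illustrates the keys involved (exact names and defaults may vary by release), and the rack-mapping file path is a placeholder:

    // Illustrative topology-related settings, shown on a Hadoop Configuration
    // object purely for readability.
    import org.apache.hadoop.conf.Configuration

    object TopologyConfigSketch {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Ask Ozone to sort block replicas by network distance from the client.
        conf.setBoolean("ozone.network.topology.aware.read", true)
        // Resolve datanode host -> rack with a static table mapping
        // (a script-based mapper is another common option).
        conf.set("net.topology.node.switch.mapping.impl",
                 "org.apache.hadoop.net.TableMapping")
        conf.set("net.topology.table.file.name", "/etc/hadoop/topology.map")
        println(conf.get("ozone.network.topology.aware.read"))
      }
    }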

If Apache Ozone and Apache Spark are co-located and a local copy of the block exists on the node where the Spark task is running, the Ozone client will prefer reading that local copy. If there is no local copy, the read goes over the network (if network topology is configured, the Ozone client will prefer replicas from the same rack).
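
To see where this locality comes from, you can inspect the block locations the filesystem reports; these are the same hints Spark's scheduler uses when placing tasks. A minimal sketch, assuming the Ozone filesystem implementation populates block locations, with a placeholder ofs:// path:

    // Print the hosts holding each block of a file and whether the current
    // node has a local replica. The OM hostname and path are placeholders.
    import java.net.{InetAddress, URI}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BlockLocalityCheck {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val fs   = FileSystem.get(new URI("ofs://ozone-om/"), conf)
        val path = new Path("ofs://ozone-om/vol1/bucket1/data.txt")

        val status    = fs.getFileStatus(path)
        val locations = fs.getFileBlockLocations(status, 0, status.getLen)
        val localHost = InetAddress.getLocalHost.getHostName

        locations.foreach { loc =>
          val hosts = loc.getHosts.toSeq
          println(s"offset=${loc.getOffset} length=${loc.getLength} " +
                  s"hosts=${hosts.mkString(",")} local=${hosts.contains(localHost)}")
        }

        fs.close()
      }
    }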

This is implemented in HDDS-1586.