How to achieve data locality with Spark and Apache Ozone? Is it possible?


If we deploy both Apache Ozone and Apache Spark on Kubernetes, is it possible to achieve data locality? Or will data always have to be transferred over the network on read?


1 Answer

Answer by Nanda:

tl;dr Yes. The Ozone client (used by Apache Spark) will prefer reading from the local node if a replica of the block is present on the same node.

Apache Spark uses the Hadoop FileSystem client (which in turn calls the Ozone client) to read data from Ozone.
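
As a rough sketch (not from the original answer), a Spark job reads from Ozone through this filesystem layer. The ofs:// scheme and the fs.ofs.impl class come from the Ozone filesystem client jar; the OM hostname "ozone-om", volume "vol1", bucket "bucket1" and file name below are placeholders:

    // Minimal Spark-on-Ozone read sketch. Requires the Ozone filesystem
    // client jar (e.g. ozone-filesystem-hadoop3) on the Spark classpath.
    import org.apache.spark.sql.SparkSession

    object OzoneReadExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ozone-locality-example")
          // Map the ofs:// scheme to the rooted Ozone filesystem implementation.
          .config("spark.hadoop.fs.ofs.impl",
                  "org.apache.hadoop.fs.ozone.RootedOzoneFileSystem")
          .getOrCreate()

        // The read goes through the Hadoop FileSystem API, which delegates
        // to the Ozone client under the hood.
        val df = spark.read.text("ofs://ozone-om/vol1/bucket1/data.txt")
        df.show(10, truncate = false)

        spark.stop()
      }
    }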

For reads, Apache Ozone sorts the block list by distance from the client node (if network topology is configured, the sorting is based on that topology).
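
The topology-aware sorting is driven by configuration that normally lives in ozone-site.xml / core-site.xml on the Ozone services, not in application code. The sketch below only illustrates the keys involved (exact names and defaults may vary by release), and the rack-mapping file path is a placeholder:

    // Illustrative topology-related settings, shown on a Hadoop Configuration
    // object purely for readability.
    import org.apache.hadoop.conf.Configuration

    object TopologyConfigSketch {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Ask Ozone to sort block replicas by network distance from the client.
        conf.setBoolean("ozone.network.topology.aware.read", true)
        // Resolve datanode host -> rack with a static table mapping
        // (a script-based mapper is another common option).
        conf.set("net.topology.node.switch.mapping.impl",
                 "org.apache.hadoop.net.TableMapping")
        conf.set("net.topology.table.file.name", "/etc/hadoop/topology.map")
        println(conf.get("ozone.network.topology.aware.read"))
      }
    }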

If Apache Ozone and Apache Spark are co-located and a local copy of the block exists on the node where the Spark task is running, the Ozone client will prefer reading that local copy. If there is no local copy, the read goes over the network (if network topology is configured, the Ozone client will prefer replicas from the same rack).
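
To see where this locality comes from, you can inspect the block locations the filesystem reports; these are the same hints Spark's scheduler uses when placing tasks. A minimal sketch, assuming the Ozone filesystem implementation populates block locations, with a placeholder ofs:// path:

    // Print the hosts holding each block of a file and whether the current
    // node has a local replica. The OM hostname and path are placeholders.
    import java.net.{InetAddress, URI}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BlockLocalityCheck {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        val fs   = FileSystem.get(new URI("ofs://ozone-om/"), conf)
        val path = new Path("ofs://ozone-om/vol1/bucket1/data.txt")

        val status    = fs.getFileStatus(path)
        val locations = fs.getFileBlockLocations(status, 0, status.getLen)
        val localHost = InetAddress.getLocalHost.getHostName

        locations.foreach { loc =>
          val hosts = loc.getHosts.toSeq
          println(s"offset=${loc.getOffset} length=${loc.getLength} " +
                  s"hosts=${hosts.mkString(",")} local=${hosts.contains(localHost)}")
        }

        fs.close()
      }
    }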

This is implemented in HDDS-1586.