Suppose you are modelling railway tracks and train stations using the following classes:
sealed abstract class ConnectedElement extends java.io.Serializable {
def location: String
}
case class RailwayTrack(location: String, length: Double = 0.0)
extends ConnectedElement
case class TrainStation(location: String, isRural: Boolean = false)
extends ConnectedElement
In the first approximation, we can assume that the graph modelling the connections between these elements is a path graph, meaning that each RailwayTrack should sit between two TrainStation. For example, this is a possible graph (RT represents railway tracks and TS train stations):
TS - RT - RT - RT - TS - RT - RT - TS - RT - TS
We could evolve this simple model to also include other elements in the hierarchy defined above (i.e. elements that sit between railway tracks that are not train stations) or to allow train stations to be connected to multiple paths to other train stations, for example:
TS
|
RT
|
TS - RT - X - RT - TS - RT - RT - TS
Suppose the following RDD represent a GraphX Graph of the first type:
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
val sc = spark.sparkContext
val elements: RDD[(VertexId, ConnectedElement)] =
sc.parallelize(
Seq(
(1L, RailwayTrack("l1")),
(2L, RailwayTrack("l2")),
(3L, RailwayTrack("l3")),
(4L, RailwayTrack("l4")),
(5L, RailwayTrack("l1")),
(6L, RailwayTrack("l4")),
(7L, RailwayTrack("l5")),
(8L, TrainStation("l2")),
(9L, TrainStation("l4")),
(10L, TrainStation("l3"))
)
)
val connections: RDD[Edge[String]] =
sc.parallelize(
Seq(
Edge(8L, 1L),
Edge(1L, 2L),
Edge(2L, 3L),
Edge(3L, 9L),
Edge(9L, 4L),
Edge(4L, 5L),
Edge(5L, 6L),
Edge(6L, 7L),
Edge(7L, 10L)
)
)
val graph = Graph(elements, connections)
I would like to write an algorithm that returns something like:
val expected = sc.parallelize(Seq(
((1L, RailwayTrack("l1")), (8L, TrainStation("l2")), (9L, TrainStation("l4"))),
((2L, RailwayTrack("l2")), (8L, TrainStation("l2")), (9L, TrainStation("l4"))),
((3L, RailwayTrack("l3")), (8L, TrainStation("l2")), (9L, TrainStation("l4"))),
((4L, RailwayTrack("l4")), (9L, TrainStation("l4")), (10L, TrainStation("l3"))),
((5L, RailwayTrack("l1")), (9L, TrainStation("l4")), (10L, TrainStation("l3"))),
((6L, RailwayTrack("l4")), (9L, TrainStation("l4")), (10L, TrainStation("l3"))),
((7L, RailwayTrack("l5")), (9L, TrainStation("l4")), (10L, TrainStation("l3")))
))
That is: an RDD of tuples of three elements, where the first is the railway track vertex, the second and the third are the train station vertices that enclose the railway track vertex.
How can this problem be solved using GraphX? Can it be extended to more complex graphs, provided that a railway track always sits between two train stations?