Consider the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        "main": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "component": [
            [1, 2],
            [np.nan],
            [3, 8],
            [np.nan],
            [1, 5, 6],
            [np.nan],
            [7],
            [np.nan],
            [9, 10],
            [np.nan],
            [np.nan],
        ],
    }
)
The column main represents an approach. Each approach consists of components. A component can itself be an approach, in which case it is called a sub-approach.
I want to find all connected sub-approaches/components for a certain approach.
Suppose, for instance, I want to find all connected sub-approaches/components for the main approach '0'. Then, my desired output would look like this:
target = pd.DataFrame(
    {
        "main": [0, 0, 2, 2, 8, 8],
        "component": [1, 2, 3, 8, 9, 10],
    }
)
Ideally, I want to be able to just choose the approach and then get all sub-connections.
I am convinced that there is a smart approach to do so using networkx. Any hint is appreciated.
Ultimately, I want to create a graph that looks somewhat like this (for approach 0):
Additional information:
You can explode the data frame and then drop the rows whose main entry is a plain component, i.e. an approach that has no components of its own (its component value is NaN).
df_exploded = df.explode(column="component").dropna(subset=["component"])
The graph can be constructed as follows:
import networkx as nx
import graphviz
G = nx.Graph()
G.add_edges_from([(i, j) for i, j in target.values])
graph_attr = dict(rankdir="LR", nodesep="0.2")
g = graphviz.Digraph(graph_attr=graph_attr)
for k in G.nodes:
    g.node(str(k), shape="box", style="filled", height="0.35")
for n1, n2 in G.edges:
    g.edge(str(n2), str(n1))
g

You can use nx.dfs_edges, which yields the edges reachable from a chosen source node in depth-first order.
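A minimal sketch of this step, assuming the graph is directed (from main to component) and built from the exploded data frame with nx.from_pandas_edgelist. The variable names edges and result are illustrative and not taken from the original post.

```python
import networkx as nx
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "main": list(range(11)),
        "component": [
            [1, 2], [np.nan], [3, 8], [np.nan], [1, 5, 6], [np.nan],
            [7], [np.nan], [9, 10], [np.nan], [np.nan],
        ],
    }
)

# One row per (main, component) pair; rows without a component are dropped.
df_exploded = df.explode(column="component").dropna(subset=["component"])

# Directed graph: an edge points from an approach to each of its components.
G = nx.from_pandas_edgelist(
    df_exploded, source="main", target="component", create_using=nx.DiGraph
)

# Edges reachable from approach 0, in depth-first order.
edges = list(nx.dfs_edges(G, source=0))
result = pd.DataFrame(edges, columns=["main", "component"])
print(result)
```

Changing the source argument selects a different approach; for this data, source=0 reproduces the rows of target.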
To extract the subgraph, use:
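One possible way to do this, sketched under the assumption that "the subgraph" means the chosen approach together with everything reachable from it: collect the reachable nodes with nx.descendants and pass them to G.subgraph. The edge list is written out by hand from the example data to keep the block short, and the names source, nodes and H are illustrative.

```python
import networkx as nx

# Edges (main -> component) taken from the exploded example data.
G = nx.DiGraph(
    [(0, 1), (0, 2), (2, 3), (2, 8), (4, 1), (4, 5), (4, 6), (6, 7), (8, 9), (8, 10)]
)

source = 0
# descendants() does not include the source itself, so add it back.
nodes = nx.descendants(G, source) | {source}
# subgraph() returns a read-only view; copy() makes it an independent graph.
H = G.subgraph(nodes).copy()

print(sorted(H.edges))
```

Approach 4 also uses component 1, but since 4 is not reachable from 0, the edge from 4 to 1 is left out of H.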