I am trying to understand the "add" and "extract" methods of the FPTree class (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala).
- What is the purpose of the 'summaries' variable?
- Where is the Group list? I assume it is the following. Am I correct?
val numParts = if (numPartitions > 0) numPartitions else data.partitions.length
val partitioner = new HashPartitioner(numParts)
- What will 'summaries' contain for the 3 transactions {a,b,c}, {a,b}, {b,c}, where all items are frequent?
def add(t: Iterable[T], count: Long = 1L): FPTree[T] = {
  require(count > 0)
  var curr = root
  curr.count += count
  t.foreach { item =>
    val summary = summaries.getOrElseUpdate(item, new Summary)
    summary.count += count
    val child = curr.children.getOrElseUpdate(item, {
      val newNode = new Node(curr)
      newNode.item = item
      summary.nodes += newNode
      newNode
    })
    child.count += count
    curr = child
  }
  this
}

def extract(
    minCount: Long,
    validateSuffix: T => Boolean = _ => true): Iterator[(List[T], Long)] = {
  summaries.iterator.flatMap { case (item, summary) =>
    if (validateSuffix(item) && summary.count >= minCount) {
      Iterator.single((item :: Nil, summary.count)) ++
        project(item).extract(minCount).map { case (t, c) =>
          (item :: t, c)
        }
    } else {
      Iterator.empty
    }
  }
}
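To make the third question concrete, here is a minimal standalone sketch that mirrors the logic of `add` above. It is not Spark's class: `MiniNode`, `MiniSummary` and `MiniFPTree` are hypothetical names invented for illustration, and it assumes the three transactions are added as-is, in the order written, to a single tree. As far as I can tell, Spark first re-orders the items by frequency and distributes conditional transactions across partitions, so the contents of each real per-partition tree may differ.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for FPTree's Node and Summary.
final class MiniNode[T](val item: Option[T]) {
  var count: Long = 0L
  val children: mutable.Map[T, MiniNode[T]] = mutable.Map.empty[T, MiniNode[T]]
}

final class MiniSummary[T] {
  var count: Long = 0L // total count of the item across the whole tree
  val nodes: mutable.ListBuffer[MiniNode[T]] = mutable.ListBuffer.empty[MiniNode[T]]
}

final class MiniFPTree[T] {
  val root = new MiniNode[T](None)
  // One entry per distinct item: its total count plus every tree node holding it.
  val summaries: mutable.Map[T, MiniSummary[T]] = mutable.Map.empty[T, MiniSummary[T]]

  // Same steps as the add method quoted above.
  def add(t: Iterable[T], count: Long = 1L): MiniFPTree[T] = {
    require(count > 0)
    var curr = root
    curr.count += count
    t.foreach { item =>
      val summary = summaries.getOrElseUpdate(item, new MiniSummary[T])
      summary.count += count
      val child = curr.children.getOrElseUpdate(item, {
        val newNode = new MiniNode[T](Some(item))
        summary.nodes += newNode // a new node is registered only when a new path branch appears
        newNode
      })
      child.count += count
      curr = child
    }
    this
  }
}

// Written as script / REPL statements.
val tree = new MiniFPTree[String]
tree.add(Seq("a", "b", "c"))
tree.add(Seq("a", "b"))
tree.add(Seq("b", "c"))

// item -> (total count, number of tree nodes carrying that item)
val view: Map[String, (Long, Int)] =
  tree.summaries.map { case (item, s) => (item, (s.count, s.nodes.size)) }.toMap

println(view)
```

Under those assumptions, `summaries` acts like the header table of the FP-tree: for each item it keeps the total count and the list of nodes where the item occurs, which is what `extract` and `project` then walk.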
After a bit of experimentation, it turns out to be pretty straightforward:
1+2) The partition is indeed the Group representative. It is also how the conditional transactions are calculated: