How is it possible to find out which "represented sequence(s)" are represented by which “representative sequence(s)”?
For instance, in the example below, is there a way to retrieve the original 627 sequences represented by “r1”?
library(TraMineR)
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
## Computing the distance matrix
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs)
## Representative set using the neighborhood density criterion
biofam.rep <- seqrep(biofam.seq, diss=biofam.om, criterion="density")
biofam.rep
summary(biofam.rep)
[>] criterion: density
[>] 2000 sequence(s) in the original data set
[>] 4 representative sequences
[>] overall quality: 0.08113734
[>] statistics for the representative set:
        na na(%)  nb nb(%)    SD   MD    DC    V      Q
r1     627  31.4 225 11.25  4566 7.28  4856 4.73   5.97
r2     577  28.8 123  6.15  4305 7.46  5175 5.05  16.81
r3     411  20.5 115  5.75  2658 6.47  2394 4.34 -11.04
r4     385  19.2  93  4.65  3006 7.81  3393 5.57  11.42
Total 2000 100.0 556 27.80 14535 7.27 15818 7.91   8.11
na: number of assigned objects
nb: number of objects in the neighborhood
SD: sum of the na distances to the representative
MD: mean of the na distances to the representative
DC: sum of the na distances to the center of the complete set
V: discrepancy of the subset
Q: quality of the representative
A complementary question: it would help to have more explanation of how "na" and "nb" should be read and interpreted. For example, do the 4 representative sequences (r1, r2, r3, r4) represent all 2000 sequences or just the 556 sequences?
I tried to find answers to my questions.
The sequences assigned to each representative can be retrieved from the "Distances" attribute of the object returned by seqrep. Following up your example, that attribute is obtained with attr(biofam.rep, "Distances").
Regarding your complementary question:
Each sequence is assigned to its closest representative sequence, and na[i] is the total number of sequences assigned to ri.
Now, the neighborhood of each representative is defined by the pradius argument (by default, 10% of the maximum distance). nb[i] is the number out of the na[i] sequences that are in the neighborhood of ri.
A sequence can be assigned to a representative without being in its neighborhood. It can also be in the neighborhood of a representative but be assigned (i.e., closer) to another representative.
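To make the difference between na and nb concrete, here is a minimal base R sketch on a made-up distance matrix (not the biofam data). The object names d, radius, assigned, d.assigned, na and nb are hypothetical, the radius is set by hand rather than derived from pradius, and the strict < used for neighborhood membership is an assumption about the boundary case.

```r
## Toy distances of 6 sequences (rows) to 2 representatives (columns).
## These values are invented for illustration only.
d <- matrix(c(0, 1, 4, 9, 8, 6,
              9, 7, 5, 0, 2, 6.5),
            ncol = 2, dimnames = list(paste0("s", 1:6), c("r1", "r2")))

## Neighborhood radius, set by hand here (assumption); in seqrep it
## would be derived from pradius and the maximum distance.
radius <- 3

## Each sequence is assigned to its closest representative.
assigned <- apply(d, 1, which.min)

## na: number of sequences assigned to each representative.
na <- tabulate(assigned, nbins = ncol(d))

## Distance of each sequence to its own (assigned) representative.
d.assigned <- d[cbind(seq_len(nrow(d)), assigned)]

## nb: among the assigned sequences, those lying within the radius.
nb <- tabulate(assigned[d.assigned < radius], nbins = ncol(d))

na  # sums to the total number of sequences
nb  # sums to the number of covered sequences
```

In this toy case s3 and s6 are assigned to r1 but lie outside its neighborhood, so they count toward na but not toward nb.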
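For the example, the sum of the nb's tells us that 556 sequences are covered, i.e., in the neighborhood of at least one of the representatives. The sum of the na's is always the total number of sequences. So r1 to r4 together represent all 2000 sequences, and 556 of them (27.8%) lie in the neighborhood of their representative.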
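Returning to the first question, retrieving the sequences represented by r1 could be sketched as follows. This assumes TraMineR is installed and rebuilds the objects from the question (state labels omitted, since they do not affect the distances). The names dist.rep, assigned and r1.seq are hypothetical. Taking which.min over each row picks the closest representative whether or not the distances to the other representatives are stored as NA.

```r
library(TraMineR)

## Rebuild the objects from the question
data(biofam)
biofam.seq <- seqdef(biofam, 10:25)
costs <- seqsubm(biofam.seq, method = "TRATE")
biofam.om <- seqdist(biofam.seq, method = "OM", sm = costs)
biofam.rep <- seqrep(biofam.seq, diss = biofam.om, criterion = "density")

## One row per sequence, one column per representative
dist.rep <- attr(biofam.rep, "Distances")

## Index of the closest representative for each sequence
assigned <- apply(dist.rep, 1, which.min)

## Counts per representative, to compare with the na column
table(assigned)

## Sequences represented by r1 (the first representative)
r1.seq <- biofam.seq[assigned == 1, ]
```

The same vector can be stored alongside the data, for instance as a new column of biofam, to cross-tabulate the representatives with covariates.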