What is the TREC 2006 Spam Track Public Corpora Format?


I have downloaded the TREC 2006 Public Corpus -- 75MB (trec06p.tgz). Here is the folder structure:

.
└── trec06p/
    ├── data
    ├── data-delay
    ├── full
    ├── full-delay
    ├── ham25
    ├── ham25-delay
    ├── ham50
    ├── ham50-delay
    ├── spam25
    ├── spam25-delay
    ├── spam50
    └── spam50-delay

Some questions:

  1. What is the delay for? (e.g. data-delay, full-delay)
  2. What does full mean in this case? (is it just the labels?)
  3. What is the difference between HAM and ham in the full-delay subfolder?
  4. Why is the data-delay folder empty?
  5. Is there any special way to parse the contents in the data folder?

alvas (Best Answer)

Disclaimer

Before reading the answer, please note that since I did not participate in the TREC06 task and am not the data creator/provider, I can only make educated guesses about your questions on the dataset.


Educated Guessed Answers

First, reading the task paper helps https://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW16.pdf =)

Next, the right download link for future readers would be https://plg.uwaterloo.ca/~gvcormac/treccorpus06/

And now, some summary:

  • The TREC 2006 Spam Track dataset is "a set of chronologically ordered email messages" presented to a spam filter for classification
  • Four different forms of user feedback are modeled:
    • immediate feedback
      • the gold standard for each message is communicated to the filter immediately following classification;
    • delayed feedback
      • the gold standard is communicated to the filter sometime later (or potentially never), so as to model a user reading email from time to time and perhaps not diligently reporting the filter’s errors;
    • partial feedback
      • the gold standard for only a subset of email recipients is transmitted to the filter, so as to model the case of some users never reporting filter errors;
    • active on-line learning
      • the filter is allowed to request immediate feedback for a certain quota of messages which is considerably smaller than the total number.
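
The difference between the immediate and delayed regimes can be sketched with a toy feedback loop (all names here are illustrative, not the official TREC evaluation jig):

```python
# Toy model of feedback timing: at each step the filter classifies one
# message, and the gold label for a message arrives only `delay` steps
# after it was classified (delay=0 is immediate feedback).
def feedback_stream(gold_labels, delay=0):
    delivered = []
    for t in range(len(gold_labels)):
        due = t - delay
        if due >= 0:
            delivered.append(gold_labels[due])
    return delivered

gold = ['spam', 'ham', 'ham', 'spam']
assert feedback_stream(gold, delay=0) == gold             # immediate
assert feedback_stream(gold, delay=2) == ['spam', 'ham']  # delayed: the last labels never arrive
```

With a delay, the filter must keep classifying messages whose gold labels it has not yet seen, which is exactly what the delayed condition is designed to stress.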

Q: How are the above forms of feedback represented by the files in the dataset?

A: All the actual textual data is found in the trec06p/data/**/* files

/trec06p
  /data
    /000
       /000
       ...
       /299
    ...
    /126
       /000
       /021

The rest of the directories contain just indices pointing to subsets of the data, to emulate the different forms of evaluation.

Q: What does full mean in this case? (is it just the labels?)

  • trec06p/full/index: the index listing the label and path of every data point in trec06p/data/**/*
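
For illustration, each line of the index pairs a label with a relative path. This sample line is fabricated, but it matches the two-field format that the parsing code later in this answer assumes:

```python
# Fabricated sample index line; format: "<label> <relative path>"
line = "spam ../data/000/000"
label, fn = line.strip().split()
assert (label, fn) == ("spam", "../data/000/000")
```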

Q: What is the delay for? (e.g. data-delay, full-delay)

  • trec06p/full-delay/index: the indices that point to the delayed-feedback evaluation
    • trec06p/ham*-delay/index: the indices that point to only the non-spam-labelled emails in the delayed-feedback evaluation
    • trec06p/spam*-delay/index: the indices that point to only the spam-labelled emails in the delayed-feedback evaluation

So essentially, the union of trec06p/ham*-delay/index and trec06p/spam*-delay/index equals trec06p/full-delay/index.
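
That relationship can be checked with plain set arithmetic; here is a self-contained sketch using fabricated miniature index contents (the real files are far larger):

```python
# Fabricated miniature index contents; each line is "<label> <path>".
# The ham/spam delay subsets should union to the full-delay index.
ham_delay  = "ham ../data/000/001\nHAM ../data/000/003\n"
spam_delay = "spam ../data/000/000\nSPAM ../data/000/002\n"
full_delay = ("spam ../data/000/000\nham ../data/000/001\n"
              "SPAM ../data/000/002\nHAM ../data/000/003\n")

def ids(index_text):
    # Keep only the path; label casing is irrelevant for membership.
    return {line.split()[1] for line in index_text.splitlines() if line.strip()}

assert ids(ham_delay) | ids(spam_delay) == ids(full_delay)
```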

Q: Why is the data-delay folder empty?

For this, I don't have an answer... Got to ask the data provider/creator.

Q: Is there any special way to parse the contents in the data folder?

Now that's the fun coding part =)

Let's step back a little and think about what we essentially have:

  • A list of emails in trec06p/data/**/*
  • The spam/ham labels of each email in trec06p/full/index
  • The Spam/SPAM/Ham/HAM labels of a subset of emails in trec06/full-delay/index

So...

import pandas as pd
from tqdm import tqdm


from lazyme import find_files


data_rows = {}

# Assuming you're on `trec06p` directory.
# P.S.: you can use any other file-path listing function;
# I just use lazyme.find_files because I find it convenient.
for fn in tqdm(find_files('./data/**/*')):
    if fn.endswith('.DS_Store'):
        continue
    # Note that not all files are in a utf8/ascii charset,
    # so you'll have to read them in binary to store them.
    # Also note: THIS CAN BE DANGEROUS IF THERE ARE EXECUTABLES IN THE DATA!!!
    # Assuming that there aren't.
    with open(fn, 'rb') as fin:
        data_id = tuple(fn.split('/')[-2:])
        data_rows[data_id] = fin.read()
        
full_labels = {}

with open('./full/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        full_labels[data_id] = label
        
        
full_delay_labels = {}

with open('./full-delay/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        # You'll notice that labels can repeat per data point,
        # but they are exactly the same... -_-
        if data_id in full_delay_labels:
            assert label.lower() == full_delay_labels[data_id].lower()
        full_delay_labels[data_id] = label.lower()
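
To see why that assert never fires, here is a toy version of the same de-duplication with fabricated entries:

```python
# Toy version of the de-duplication above: the delay index can list the
# same data point twice, but always with the same label up to casing.
full_delay_labels = {}
for label, data_id in [('SPAM', ('000', '000')), ('spam', ('000', '000')),
                       ('Ham', ('000', '001'))]:
    if data_id in full_delay_labels:
        assert label.lower() == full_delay_labels[data_id]
    full_delay_labels[data_id] = label.lower()

assert full_delay_labels == {('000', '000'): 'spam', ('000', '001'): 'ham'}
```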

Q: What is the difference between the HAM, Ham, SPAM and Spam labels in trec06p/*-delay/index?

If we look carefully at the assert label.lower() == full_delay_labels[data_id].lower() check above, we see that the caps and non-caps labels always agree.

Q: So why is there a difference?

A: Not sure; best to ask the data provider/creator.

Q: Is there a difference between the labels from trec06p/full-delay/index and trec06p/full/index?

A: There doesn't seem to be any:

>>> any(full_labels[data_id] != full_delay_labels[data_id] for data_id in full_labels)
False

Q: How do I just read it into a pandas dataframe?

Given what we know above:

import pandas as pd
from tqdm import tqdm


from lazyme import find_files


data_rows = {}

for fn in tqdm(find_files('./data/**/*')):
    if fn.endswith('.DS_Store'):
        continue
    with open(fn, 'rb') as fin:
        data_id = tuple(fn.split('/')[-2:])
        data_rows[data_id] = fin.read()

full_labels = {}

with open('./full/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        full_labels[data_id] = label
        
df = pd.DataFrame({'binary':pd.Series(data_rows),'label':full_labels})
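
Here is the same construction on a self-contained toy corpus (the emails and ids below are fabricated), which shows how the (subdirectory, filename) tuples line up as the DataFrame index:

```python
import pandas as pd

# Toy stand-in for the real corpus: two fabricated "emails" keyed by the
# same (subdirectory, filename) tuples that the loading loop above builds.
data_rows = {('000', '000'): b'Subject: buy now!!!\n...',
             ('000', '001'): b'Subject: meeting at 3pm\n...'}
full_labels = {('000', '000'): 'spam', ('000', '001'): 'ham'}

df = pd.DataFrame({'binary': pd.Series(data_rows),
                   'label': pd.Series(full_labels)})

assert len(df) == 2
assert set(df['label']) == {'spam', 'ham'}
```

Passing both columns as pd.Series lets pandas align them on the shared tuple keys, so each row pairs an email's raw bytes with its label.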

Q: But the input columns are still binaries, can I somehow guess the encoding?

Not really; it's pretty hard/messy to guess the encoding of a binary file. But you can try this (though not all files specify charset=... in their content):

import re, mmap

def find_charset(fn):
    # Best-effort sniffing: look for the first "charset=..." declaration
    # in the raw bytes; returns None if no charset is declared.
    with open(fn, 'rb') as f:
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        match = re.search(br'charset=([!-~\s]{5,})\n', view)
        if match is None:
            return None
        charset = re.split(';|,|\n', match.group(1).decode('utf8'))[0]
        return charset.strip('"').strip("'")
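
To sanity-check the sniffing logic without downloading the corpus, here is a self-contained run on a fabricated email written to a temporary file (the function body is repeated so the snippet runs on its own):

```python
import re, mmap, os, tempfile

def find_charset(fn):
    # Same regex-based charset sniffing as above, inlined for a standalone run.
    with open(fn, 'rb') as f:
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        m = re.search(br'charset=([!-~\s]{5,})\n', view)
        if m is None:
            return None
        return re.split(';|,|\n', m.group(1).decode('utf8'))[0].strip('"').strip("'")

# Fabricated email headers with a quoted charset declaration.
raw = b'Content-Type: text/plain; charset="iso-8859-1"\n\nBody text here\n'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

assert find_charset(path) == 'iso-8859-1'
os.unlink(path)
```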