What is the TREC 2006 Spam Track Public Corpora Format?


I have downloaded the TREC 2006 Public Corpus -- 75MB (trec06p.tgz). Here is the folder structure:

.
└── trec06p/
    ├── data
    ├── data-delay
    ├── full
    ├── full-delay
    ├── ham25
    ├── ham25-delay
    ├── ham50
    ├── ham50-delay
    ├── spam25
    ├── spam25-delay
    ├── spam50
    └── spam50-delay

Some questions:

  1. What is the delay for? (e.g. data-delay, full-delay)
  2. What does full mean in this case? (is it just the labels?)
  3. What is the difference between HAM and ham in the full-delay subfolder?
  4. Why is the data-delay folder empty?
  5. Is there any special way to parse the contents in the data folder?

alvas (Best Answer)

Disclaimer

Before reading the answer, please note that since I did not participate in the TREC06 task and am not the data creator/provider, I can only make educated guesses about your questions on the dataset.


Educated Guessed Answers

First, reading the task paper helps https://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW16.pdf =)

Next, the right download link for future readers would be https://plg.uwaterloo.ca/~gvcormac/treccorpus06/

And now, some summary:

  • The TREC 2006 Spam Track dataset is "a set of chronologically ordered email messages" presented to a spam filter for classification
  • Four different forms of user feedback are modeled:
    • immediate feedback
      • the gold standard for each message is communicated to the filter immediately following classification;
    • delayed feedback
      • the gold standard is communicated to the filter sometime later (or potentially never), so as to model a user reading email from time to time and perhaps not diligently reporting the filter’s errors;
    • partial feedback
      • the gold standard for only a subset of email recipients is transmitted to the filter, so as to model the case of some users never reporting filter errors;
    • active on-line learning
      • the filter is allowed to request immediate feedback for a certain quota of messages which is considerably smaller than the total number.
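
The difference between the immediate and delayed regimes can be sketched with a toy feedback loop (all names here are illustrative, not the official TREC evaluation jig):

```python
# Toy model of feedback timing: at each step the filter classifies one
# message, and the gold label for a message arrives only `delay` steps
# after it was classified (delay=0 is immediate feedback).
def feedback_stream(gold_labels, delay=0):
    delivered = []
    for t in range(len(gold_labels)):
        due = t - delay
        if due >= 0:
            delivered.append(gold_labels[due])
    return delivered

gold = ['spam', 'ham', 'ham', 'spam']
assert feedback_stream(gold, delay=0) == gold             # immediate
assert feedback_stream(gold, delay=2) == ['spam', 'ham']  # delayed: the last labels never arrive
```

With a delay, the filter must keep classifying messages whose gold labels it has not yet seen, which is exactly what the delayed condition is designed to stress.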

Q: How are the above forms of feedback represented by the files in the dataset?

A: All the actual textual data is found in the trec06p/data/**/* files

/trec06p
  /data
    /000
       /000
       ...
       /299
    ...
    /126
       /000
       /021

The rest of the directories contain just indices pointing to subsets of the data, to emulate the different forms of evaluation.

Q: What does full mean in this case? (is it just the labels?)

  • trec06p/full/index: the index listing the label and path of every data point in trec06p/data/**/*
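
For illustration, each line of the index pairs a label with a relative path. This sample line is fabricated, but it matches the two-field format that the parsing code later in this answer assumes:

```python
# Fabricated sample index line; format: "<label> <relative path>"
line = "spam ../data/000/000"
label, fn = line.strip().split()
assert (label, fn) == ("spam", "../data/000/000")
```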

Q: What is the delay for? (e.g. data-delay, full-delay)

  • trec06p/full-delay/index: the indices that point to the delayed-feedback evaluation
    • trec06p/ham*-delay/index: the indices that point to only the non-spam-labelled emails in the delayed-feedback evaluation
    • trec06p/spam*-delay/index: the indices that point to only the spam-labelled emails in the delayed-feedback evaluation

So essentially, the union of trec06p/ham*-delay/index and trec06p/spam*-delay/index equals trec06p/full-delay/index.
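
That relationship can be checked with plain set arithmetic; here is a self-contained sketch using fabricated miniature index contents (the real files are far larger):

```python
# Fabricated miniature index contents; each line is "<label> <path>".
# The ham/spam delay subsets should union to the full-delay index.
ham_delay  = "ham ../data/000/001\nHAM ../data/000/003\n"
spam_delay = "spam ../data/000/000\nSPAM ../data/000/002\n"
full_delay = ("spam ../data/000/000\nham ../data/000/001\n"
              "SPAM ../data/000/002\nHAM ../data/000/003\n")

def ids(index_text):
    # Keep only the path; label casing is irrelevant for membership.
    return {line.split()[1] for line in index_text.splitlines() if line.strip()}

assert ids(ham_delay) | ids(spam_delay) == ids(full_delay)
```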

Q: Why is the data-delay folder empty?

For this, I don't have an answer... Got to ask the data provider/creator.

Q: Is there any special way to parse the contents in the data folder?

Now that's the fun coding part =)

Let's step back a little and think about what we essentially have:

  • A list of emails in trec06p/data/**/*
  • The spam/ham labels of each email in trec06p/full/index
  • The Spam/SPAM/Ham/HAM labels of a subset of emails in trec06/full-delay/index

So...

import pandas as pd
from tqdm import tqdm


from lazyme import find_files


data_rows = {}

# Assuming you're on `trec06p` directory.
# P.S.: you can use any other file-path listing function;
# I just use lazyme.find_files because I find it convenient.
for fn in tqdm(find_files('./data/**/*')):
    if fn.endswith('.DS_Store'):
        continue
    # Note that not all files are in a utf8/ascii charset,
    # so you'll have to read them in binary to store them.
    # Also note: THIS CAN BE DANGEROUS IF THERE ARE EXECUTABLES IN THE DATA!!!
    # Assuming that there aren't.
    with open(fn, 'rb') as fin:
        data_id = tuple(fn.split('/')[-2:])
        data_rows[data_id] = fin.read()
        
full_labels = {}

with open('./full/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        full_labels[data_id] = label
        
        
full_delay_labels = {}

with open('./full-delay/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        # You'll notice that labels can repeat per data point,
        # but they are exactly the same... -_-
        if data_id in full_delay_labels:
            assert label.lower() == full_delay_labels[data_id].lower()
        full_delay_labels[data_id] = label.lower()
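
To see why that assert never fires, here is a toy version of the same de-duplication with fabricated entries:

```python
# Toy version of the de-duplication above: the delay index can list the
# same data point twice, but always with the same label up to casing.
full_delay_labels = {}
for label, data_id in [('SPAM', ('000', '000')), ('spam', ('000', '000')),
                       ('Ham', ('000', '001'))]:
    if data_id in full_delay_labels:
        assert label.lower() == full_delay_labels[data_id]
    full_delay_labels[data_id] = label.lower()

assert full_delay_labels == {('000', '000'): 'spam', ('000', '001'): 'ham'}
```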

Q: What is the difference between the HAM, Ham, SPAM and Spam labels in trec06p/*-delay/index?

If we look carefully at the assert label.lower() == full_delay_labels[data_id].lower() check above, we see that the caps and non-caps labels always agree.

Q: So why is there a difference?

A: Not sure; best to ask the data provider/creator.

Q: Is there a difference between the labels from trec06p/full-delay/index and trec06p/full/index?

A: There doesn't seem to be any:

>>> any(full_labels[data_id] != full_delay_labels[data_id] for data_id in full_labels)
False

Q: How do I just read it into a pandas dataframe?

Given what we know above:

import pandas as pd
from tqdm import tqdm


from lazyme import find_files


data_rows = {}

for fn in tqdm(find_files('./data/**/*')):
    if fn.endswith('.DS_Store'):
        continue
    with open(fn, 'rb') as fin:
        data_id = tuple(fn.split('/')[-2:])
        data_rows[data_id] = fin.read()

full_labels = {}

with open('./full/index') as fin:
    for line in tqdm(fin):
        label, fn = line.strip().split()
        data_id = tuple(fn.split('/')[-2:])
        full_labels[data_id] = label
        
df = pd.DataFrame({'binary':pd.Series(data_rows),'label':full_labels})
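
Here is the same construction on a self-contained toy corpus (the emails and ids below are fabricated), which shows how the (subdirectory, filename) tuples line up as the DataFrame index:

```python
import pandas as pd

# Toy stand-in for the real corpus: two fabricated "emails" keyed by the
# same (subdirectory, filename) tuples that the loading loop above builds.
data_rows = {('000', '000'): b'Subject: buy now!!!\n...',
             ('000', '001'): b'Subject: meeting at 3pm\n...'}
full_labels = {('000', '000'): 'spam', ('000', '001'): 'ham'}

df = pd.DataFrame({'binary': pd.Series(data_rows),
                   'label': pd.Series(full_labels)})

assert len(df) == 2
assert set(df['label']) == {'spam', 'ham'}
```

Passing both columns as pd.Series lets pandas align them on the shared tuple keys, so each row pairs an email's raw bytes with its label.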

Q: But the input columns are still binaries, can I somehow guess the encoding?

Not really; it's pretty hard/messy to guess the encoding of a binary file. But you can try this (though not all files specify charset=... in their content):

import re, mmap

def find_charset(fn):
    # Best-effort sniffing: look for the first "charset=..." declaration
    # in the raw bytes; returns None if no charset is declared.
    with open(fn, 'rb') as f:
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        match = re.search(br'charset=([!-~\s]{5,})\n', view)
        if match is None:
            return None
        charset = re.split(';|,|\n', match.group(1).decode('utf8'))[0]
        return charset.strip('"').strip("'")
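
To sanity-check the sniffing logic without downloading the corpus, here is a self-contained run on a fabricated email written to a temporary file (the function body is repeated so the snippet runs on its own):

```python
import re, mmap, os, tempfile

def find_charset(fn):
    # Same regex-based charset sniffing as above, inlined for a standalone run.
    with open(fn, 'rb') as f:
        view = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        m = re.search(br'charset=([!-~\s]{5,})\n', view)
        if m is None:
            return None
        return re.split(';|,|\n', m.group(1).decode('utf8'))[0].strip('"').strip("'")

# Fabricated email headers with a quoted charset declaration.
raw = b'Content-Type: text/plain; charset="iso-8859-1"\n\nBody text here\n'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

assert find_charset(path) == 'iso-8859-1'
os.unlink(path)
```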