Data splits/organization

Data Set Organization

Each data set is formatted as follows:

The root directory contains two subdirectories: one holds the actual data for the data set, and the other holds index files that describe a split. For example, the 20 Newsgroups data set has a root directory named newsgroupsCS601R, which contains the subdirectories groups and indices. The raw data is in groups, while indices contains subdirectories that each describe one portion of the split: all, training, dev, and blind. The indices/training/ directory, for instance, contains one file per class, and each file lists the documents that belong to the training set for that class in this particular split. Here is a snippet from one of those files, indices/training/comp_graphics.txt:

groups/comp.graphics/38929
groups/comp.graphics/39063
groups/comp.graphics/38508
groups/comp.graphics/38481
groups/comp.graphics/38440
groups/comp.graphics/38526
groups/comp.graphics/38363
groups/comp.graphics/38576
groups/comp.graphics/38422
groups/comp.graphics/38902

Note that these paths are all relative to the root directory of the data set (e.g., the newsgroupsCS601R directory).
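Because index entries are relative to the root directory, a quick sanity check is to join each entry with the root and confirm that the file exists. The following is a minimal sketch, not part of the course tools: check_index is a hypothetical helper name, and it assumes one path per line, as in the snippet above.

```shell
# check_index ROOT INDEX_FILE
# Hypothetical helper (not part of the course tools).
# Prints every path listed in INDEX_FILE that does not exist under ROOT.
# Returns 0 if all listed files exist, 1 otherwise.
check_index() {
  ci_root=$1
  ci_index=$2
  ci_missing=0
  # Read one relative path per line; also handle a last line with no newline.
  while IFS= read -r ci_rel || [ -n "$ci_rel" ]; do
    [ -z "$ci_rel" ] && continue        # skip blank lines
    if [ ! -f "$ci_root/$ci_rel" ]; then
      printf '%s\n' "$ci_rel"           # report the missing entry
      ci_missing=1
    fi
  done < "$ci_index"
  return "$ci_missing"
}
```

For example, check_index /home/bob/newsgroupsCS601R /home/bob/newsgroupsCS601R/indices/training/comp_graphics.txt prints nothing and returns 0 when every listed file is present.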

Modifying Existing and Creating New Splits

You are welcome to create your own data splits for testing purposes. It is often helpful to train on very small sets while developing and debugging a classifier. One approach is to make a copy of the current split by running the following from the root directory of the data set:

cp -r indices small_split

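Then create index files that are subsets of the existing index files. For example, you might run a command like this:

head -n 100 indices/training/comp_graphics.txt > small_split/training/comp_graphics.txt

which writes just the first 100 lines of the original comp_graphics.txt into the copy. Read from the original file in indices rather than from the file being written: redirecting the output of head onto its own input file truncates that file before head reads it, leaving it empty.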
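To shrink every class at once rather than one file at a time, the copy-and-truncate steps can be wrapped in a small shell function. This is a minimal sketch, not part of the course tools: make_small_split is a hypothetical helper name, and it assumes that the index files end in .txt, like comp_graphics.txt. It reads each file from the original indices directory and writes into the copy, so no file is ever used as both input and output of the same command.

```shell
# make_small_split ROOT NAME N
# Hypothetical helper (not part of the course tools).
# Copies ROOT/indices to ROOT/NAME, then rewrites every training index
# file in the copy so that it holds only the first N lines of the original.
# The other portions of the split are copied unchanged.
make_small_split() {
  mss_root=$1
  mss_name=$2
  mss_n=$3
  mss_dest="$mss_root/$mss_name"
  # Refuse to overwrite an existing split; cp -r would nest indices inside it.
  if [ -e "$mss_dest" ]; then
    echo "make_small_split: $mss_dest already exists" >&2
    return 1
  fi
  cp -r "$mss_root/indices" "$mss_dest" || return 1
  for mss_src in "$mss_root"/indices/training/*.txt; do
    [ -f "$mss_src" ] || continue       # no .txt files matched the pattern
    # Read from the original, write to the copy.
    head -n "$mss_n" "$mss_src" > "$mss_dest/training/$(basename "$mss_src")"
  done
}
```

For example, make_small_split /home/bob/newsgroupsCS601R small_split 100 creates small_split next to indices. Taking the first lines of each file is not a random sample, which is usually acceptable for debugging but worth keeping in mind when interpreting results.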

To use your new split, simply change the value of the SPLIT parameter that you pass to the ant script. For example, instead of running:

ant lab1 -DDATA=/home/bob/newsgroupsCS601R -DSPLIT=/home/bob/newsgroupsCS601R/indices

you could run:

ant lab1 -DDATA=/home/bob/newsgroupsCS601R -DSPLIT=/home/bob/newsgroupsCS601R/small_split