Data splits/organization
Data Set Organization
Each data set is formatted as follows:
There is a root directory which contains two subdirectories: one contains the actual data for the data set, and the other contains index files that describe a split. For example, the 20 Newsgroups data set has a root directory named newsgroupsCS601R, which contains the subdirectories groups and indices. The raw data is in groups, while indices contains subdirectories that each describe one portion of the split: all, training, dev, and blind. For example, the indices/training/ directory contains one file per class, each listing the files that belong to the training set in this particular split. Here is a snippet from one of those files, indices/training/comp_graphics.txt:
groups/comp.graphics/38929
groups/comp.graphics/39063
groups/comp.graphics/38508
groups/comp.graphics/38481
groups/comp.graphics/38440
groups/comp.graphics/38526
groups/comp.graphics/38363
groups/comp.graphics/38576
groups/comp.graphics/38422
groups/comp.graphics/38902
Note that these paths are all relative to the root directory of the data set (e.g., the newsgroupsCS601R directory).
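Because index entries are resolved against the data set root, a quick sanity check is to confirm that every listed path actually exists there. The following is a minimal sketch, assuming each index file lists one relative path per line; check_split is a hypothetical helper name, not part of the course tooling.

```shell
#!/usr/bin/env bash
# Sketch: report index entries that do not resolve to a file under the root.
# Assumption: each index file holds one relative path per line.
# check_split is a hypothetical helper, not part of the provided scripts.
check_split() {
    local root="$1"      # data set root, e.g. /home/bob/newsgroupsCS601R
    local split="$2"     # split directory, e.g. "$root/indices"
    local portion="$3"   # one of: all, training, dev, blind
    local missing=0
    local index path
    for index in "$split"/"$portion"/*.txt; do
        # Skip the literal pattern when the glob matches nothing.
        if [ ! -e "$index" ]; then continue; fi
        # The "|| [ -n ... ]" clause keeps a final line without a newline.
        while IFS= read -r path || [ -n "$path" ]; do
            if [ -z "$path" ]; then continue; fi
            if [ ! -f "$root/$path" ]; then
                echo "missing: $path (listed in $index)"
                missing=$((missing + 1))
            fi
        done < "$index"
    done
    echo "$missing missing file(s) in $portion"
    # Exit status is 0 only when every listed file was found.
    [ "$missing" -eq 0 ]
}
```

For example, check_split /home/bob/newsgroupsCS601R /home/bob/newsgroupsCS601R/indices training would check the training portion of the default split.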
Modifying Existing and Creating New Splits
You are welcome to create your own data splits for testing purposes. Sometimes it is helpful to train on very small sets during the development and debugging stages of classifier creation. One approach to this would be to make a copy of the current split:
cp -r indices small_split
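Then create index files that are subsets of the existing index files. For example, you might run a command like this:
head -n 100 indices/training/comp_graphics.txt > small_split/training/comp_graphics.txt
which writes just the first 100 lines of the original comp_graphics.txt into the copy. Read from the original file in indices rather than from the copy itself: redirecting a file onto itself empties it, because the shell truncates the output file before head reads it.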
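To shrink every class file in every portion at once, the head command above can be wrapped in a loop. This is a minimal sketch, assuming the split directory contains one subdirectory per portion with .txt index files inside; make_small_split is a hypothetical helper name. It reads from the existing split and writes to a separate directory, so no file is redirected onto itself and the cp step is not required first.

```shell
#!/usr/bin/env bash
# Sketch: build a smaller split by keeping the first N lines of each index file.
# Assumption: the layout is <split>/<portion>/<class>.txt, as described above.
# make_small_split is a hypothetical helper, not part of the provided scripts.
make_small_split() {
    local src="$1"   # existing split directory, e.g. indices
    local dst="$2"   # new split directory, e.g. small_split
    local n="$3"     # maximum number of lines to keep per index file
    local index rel
    for index in "$src"/*/*.txt; do
        # Skip the literal pattern when the glob matches nothing.
        if [ ! -e "$index" ]; then continue; fi
        rel="${index#"$src"/}"              # e.g. training/comp_graphics.txt
        mkdir -p "$dst/$(dirname "$rel")"
        # Source and destination differ, so the input is never truncated.
        head -n "$n" "$index" > "$dst/$rel"
    done
}
```

Run from the data set root, make_small_split indices small_split 100 would produce the new split. Note that this truncates every portion independently, including all, which may or may not suit your experiments.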
To use your new split, simply change the value of the SPLIT parameter that you pass to the ant script. For example, instead of running:
ant lab1 -DDATA=/home/bob/newsgroupsCS601R -DSPLIT=/home/bob/newsgroupsCS601R/indices
you could run:
ant lab1 -DDATA=/home/bob/newsgroupsCS601R -DSPLIT=/home/bob/newsgroupsCS601R/small_split
