#########################################################################
############## Computational Linguistics Toolset v1.1.5 #################
#########################################################################
########           Copyright (C) 2005 Wybo Wiersma               ########
#########################################################################
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
#########################################################################
# It is kindly requested that you acknowledge the use of these tools in
# the publications reporting results produced with the help of them.

### What it is ###

The Computational Linguistics Toolset is a set of tools for computational
linguistics. It contains reusable code for cleaning, splitting, refining,
and taking samples from corpora (ICE, Penn, and a native one), for tagging
them using the TnT tagger, and for doing permutation statistics on N-grams
(useful for finding statistically significant syntactic differences
between any two sets of tagged texts), as well as various examination
tools.

The individual tools are well documented and versioned separately. Each
time I make a significant new release of the entire package (this includes
bugfix releases), I will increase the version number above.
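To illustrate what "permutation statistics on N-grams" means in general,
here is a minimal sketch in Python. It is not the code of the toolset's own
permutation tools, and it does not reproduce their normalization; the
function names, the test statistic (absolute difference in relative
frequency), and the sentence-level shuffling are all assumptions chosen for
illustration.

```python
import random
from collections import Counter


def ngram_counts(sentences, n):
    """Count tag n-grams over a list of sentences (each a list of tags)."""
    counts = Counter()
    for tags in sentences:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


def relative_frequency(sentences, ngram):
    """Occurrences of `ngram` divided by the number of n-grams of that length."""
    counts = ngram_counts(sentences, len(ngram))
    total = sum(counts.values())
    return counts[ngram] / total if total else 0.0


def permutation_p_value(group_a, group_b, ngram, rounds=1000, seed=0):
    """Estimate how often a random regrouping of the sentences yields a
    frequency difference at least as large as the observed one.

    Hypothetical helper: the statistic and the shuffling unit (whole
    sentences) are illustrative assumptions, not the toolset's method.
    """
    rng = random.Random(seed)
    observed = abs(relative_frequency(group_a, ngram)
                   - relative_frequency(group_b, ngram))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        perm_a = pooled[:len(group_a)]
        perm_b = pooled[len(group_a):]
        diff = abs(relative_frequency(perm_a, ngram)
                   - relative_frequency(perm_b, ngram))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)


if __name__ == "__main__":
    # Toy data: two small sets of tagged sentences (tags only).
    texts_a = [["DT", "NN", "VB"]] * 20
    texts_b = [["PRP", "VB", "RB"]] * 20
    print(permutation_p_value(texts_a, texts_b, ("DT", "NN")))
```

A small value suggests that the n-gram's frequency differs between the two
sets by more than random regrouping would explain. Whole sentences are
shuffled here so that n-grams from the same sentence stay together.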
### What is required ###

Perl (tested here on 5.8.4)

The following modules are needed but come standard with Perl:
FileHandle
FindBin
List::Util
File::Basename
Compress::Zlib

At a minimum, you need to set the $configbasedir variable in the central
config file (named config).

### What to find where ###

The dir-structure of the package is as follows:
tools/corpus/      the tools for preparing corpora
tools/examine/     tools for examining
tools/sensing/     tools for doing WordNet-related research
tools/tagging/     tools for tagging
tools/permstat/    tools for doing permutation statistics

Two special dirs are:
tools/mess/        quick-hack scripts that have little general use
tools/export/      tools for exporting (tarring & publishing these tools)

This structure is not guaranteed to remain the same forever...

### How to use it (in the easiest way) ###

To get more info on what a script does, run it with the -? option.

The *goall scripts are used to do runs in which the tools are chained
together, to allow adding corpora, or doing many tasks in sequence.

To use the tools in sequence without changing anything in the
configuration files, follow these instructions:

1 The tools dir should be unpacked inside another dir, for example:
  research/
2 The raw corpus should be stored in the following dir within this
  base dir:
  (research/)corpusData//raw
3 Some tools (like corpuslexiconreducer.pl) need lists of some sort.
  Those should be stored in
  (research/)taskData/lists
4 Other dirs, like corpusData and dirs within corpusData/ (for example
  the cleaned/ dir), are created automatically by the tools when needed.

You can always modify the *goall scripts to suit the needs of your
particular research. However, things might change between versions of this
tool package, so have a look at the changelog before overwriting your
current install.
Better still, drop me a note if you are using this toolset, so I can keep
an eye on possible update problems (although of course I cannot accept any
formal liability)...

### Changelog ###

1.1.4 -> 1.1.5
Added the CorpusTagsetReducer tool to the corpus task-set:
corpus/corpustagsetreducer.pl
The RowChecker, TableScaler & TableTurner tools were added to the
examine-set:
examine/rowchecker.pl
examine/tablescaler.rb
examine/tableturner.rb
Goall-tools were added to the examine and sensing sets.
Several smaller fixes and additions.

1.1.3 -> 1.1.4
Made compression the default for NgramPermutator and the PermutationStatter
and removed it as an option.
Fixed a bug in the compression of NgramPermutator that prevented the
creation of data since version 1.1.2.

1.1.2 -> 1.1.3
Full support for the manual n-gram search function (-n option) was added
to Tag Sample Finder.

1.1.1 -> 1.1.2
Added PermStatResultSelector as a proper tool
Added multiple ipnorm normalization rounds for extra precision
Fixed a few minor bugs

1.1.0 -> 1.1.1
Fixed and updated (sentence-length counting):
examine/rowstatter.pl
Also added some library functions.

1.0.5 -> 1.1.0
Added the following tools:
corpus/corpus2tagrow.pl
corpus/corpusrewritetagrow.pl
sensing/sensinggoall.pl
sensing/sentencesenser.pl
sensing/semanticgravitor.pl
Updated:
sensing/wordcombinationfinder.pl
- fixed a bug that caused some word-combinations not to be found
- changed the default window-size to 5
sensing/listsenser.pl
- implemented the option for using an existing database
- changed the database-format to cdb