Word Frequency Software
This suite of Perl scripts prepares statistical summaries of word frequency profiles for each individual text, for all texts within each manuscript, and for all texts, individually and jointly, within the poetry and prose classes. These scripts were used solely to prepare the corpus for our online tools. See below (Analysis Software) for a suite of scripts to help you run your own experiments. The results of this software can be viewed online using our Tools.
1_sort_Into_Directories.zip - ReadMe
Starting with a directory of SGML-encoded texts, sorts texts into a new directory hierarchy according to the class/genre (A_Poetry vs. B_Prose) and manuscript names (A1, A2, ... B1, etc).
2_countWords.zip - ReadMe
Starting with each text in the directory hierarchy from the previous program (1_sort_Into_Directories.pl), counts the number of times each word appears, the number of unique words, and the total number of words for each text. The output from this script can be viewed online and is available for download as a zipped Excel file (see Tools).
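As a rough illustration of the three counts described above, here is a minimal Python sketch. It is not the project's Perl script: the function name `count_words`, the tag-stripping regular expression, and the tokenization rule (lowercase, letters only) are all assumptions made for this example.

```python
import re
from collections import Counter

def count_words(text):
    """Return (per-word counts, number of unique words, total words)."""
    # Assumption: markup looks like <tag>...</tag> and is dropped before counting.
    without_tags = re.sub(r"<[^>]+>", " ", text)
    # Assumption: a "word" is a run of letters, compared case-insensitively.
    tokens = re.findall(r"[^\W\d_]+", without_tags.lower())
    counts = Counter(tokens)
    return counts, len(counts), len(tokens)

# Hypothetical sample, not taken from the corpus.
sample = "<l>the cat and the dog</l><l>the end</l>"
counts, unique_words, total_words = count_words(sample)
print(counts.most_common(1))      # [('the', 3)]
print(unique_words, total_words)  # 5 7
```

The actual script may tokenize differently (for example, in how it treats abbreviations or editorial markup), so counts from this sketch should not be expected to match the published spreadsheets.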
3_generate_stats.zip - ReadMe
Starting with the word counts produced above (2_countWords.pl), creates separate statistical summaries of word frequency profiles for: (i) all texts within each manuscript, (ii) the entire collection of poems, (iii) the entire collection of prose, and (iv) the combined corpus of poems and prose. Additional summary information includes average word usage and hapax legomena across manuscripts, classes (genres), and the corpus of poems and prose.
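To make the grouping concrete, the Python sketch below pools per-text counts for one group (a manuscript, a class, or the whole corpus) and reports a few summary figures. It is an illustration under assumptions, not the project's code: `summarize_group` is a hypothetical name, "average word usage" is read here as mean words per text, and a hapax legomenon is taken to be a word occurring exactly once in the pooled group.

```python
from collections import Counter

def summarize_group(counts_by_text):
    """Pool per-text word counts and summarize the group."""
    if not counts_by_text:
        raise ValueError("need at least one text")
    combined = Counter()
    for counts in counts_by_text.values():
        combined.update(counts)  # adds counts word by word
    total = sum(combined.values())
    return {
        "texts": len(counts_by_text),
        "total_words": total,
        "unique_words": len(combined),
        "mean_words_per_text": total / len(counts_by_text),
        # Assumption: hapax = exactly one occurrence in the whole group.
        "hapax_legomena": sorted(w for w, n in combined.items() if n == 1),
    }

# Hypothetical counts for two texts in one manuscript.
manuscript = {
    "text1": Counter({"the": 3, "king": 1}),
    "text2": Counter({"the": 2, "king": 1, "sea": 1}),
}
print(summarize_group(manuscript))
```

Calling the same function on all poetry texts, all prose texts, or every text together would give the class-level and corpus-level summaries.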
4_merge_corpus.zip - ReadMe 4a - ReadMe 4b
This pair of scripts loads the frequencies of every word in every text in the corpus (poetry and prose) and writes the results to one comma-delimited file. For each word, the following statistics are computed across all texts: minimum, mean, median, maximum, and standard deviation.
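The per-word statistics can be sketched in Python as follows. This is only an illustration: the names `word_stats` and `to_csv` are hypothetical, raw counts are used as the "frequency", a text that lacks a word contributes 0, and the population standard deviation is chosen arbitrarily since the page does not say which form the scripts use.

```python
import csv
import io
import statistics
from collections import Counter

FIELDS = ["word", "min", "mean", "median", "max", "stdev"]

def word_stats(counts_by_text):
    """One row of statistics per word, computed across all texts."""
    vocabulary = sorted(set().union(*counts_by_text.values()))
    rows = []
    for word in vocabulary:
        # Assumption: a missing word counts as 0 in that text.
        values = [counts.get(word, 0) for counts in counts_by_text.values()]
        rows.append({
            "word": word,
            "min": min(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "max": max(values),
            "stdev": statistics.pstdev(values),  # population form (assumed)
        })
    return rows

def to_csv(rows):
    """Render the rows as comma-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical counts for two texts.
corpus = {
    "t1": Counter({"the": 3, "king": 1}),
    "t2": Counter({"the": 2, "king": 1, "sea": 1}),
}
print(to_csv(word_stats(corpus)))
```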

Analysis Software (as)
This suite of software converts data into the formats needed for your own experimental analyses of texts, including statistical summaries of word usage across selected groups (or chunks) of texts, authorship attribution techniques, and clustering and classification methods.
as1_sort_Into_Directories.zip - ReadMe
Starting with a directory of SGML-encoded texts, sorts texts into a new directory hierarchy according to the class/genre (A_Poetry vs. B_Prose) and manuscript names (A1, A2, ... B1, etc).
as2_cutter.zip - ReadMe
This handy script "cuts" texts into user-specified chunks. For example, you could cut the poem Daniel into ten 450-word chunks; subsequent scripts will treat each of these chunks as an independent text.
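The cutting step might look like the following Python sketch. It is a stand-in for illustration, not the Perl script: `cut_into_chunks` is a hypothetical name, and keeping a shorter final chunk when the text does not divide evenly is an assumption (the real script may merge or drop the remainder).

```python
def cut_into_chunks(words, chunk_size):
    """Split a list of words into consecutive chunks of chunk_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Assumption: a leftover tail shorter than chunk_size becomes its own chunk.
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

# Hypothetical ten-word text cut into chunks of four words.
words = "one two three four five six seven eight nine ten".split()
chunks = cut_into_chunks(words, 4)
print([len(chunk) for chunk in chunks])  # [4, 4, 2]
```

Each chunk could then be written to its own file in a subfolder named after the source text, so that later steps treat it as an independent text.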
as3_countWords.zip - ReadMe
This script counts the number of words in each file. Input files are organized in subfolders, e.g., by author or by text name if the text was split into chunks using as2_cutter.pl (see above).
as4a_mergeWordCounts.zip - ReadMe
This script can be used either after you've created a Virtual Manuscript or following as3_countWords. The main goal of this script, in addition to collecting some statistics on your collection of texts, is to merge the counts into one file in preparation for further analysis, for example, in R (see below).
as4b_getStats_prepare4R.zip - ReadMe
This script is a follow-up to as4a. In short, this script generates additional statistical reports and rotates the merged data (rows to columns) for follow-up analyses in R (see below).
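Rotating rows to columns is a transpose of the merged table. The Python sketch below shows the idea only; the function name `rotate` and the assumed layout (one row per word, one column per text, turned into one row per text) are illustrative guesses about the merged file, not a description of the Perl script.

```python
def rotate(table):
    """Transpose a rectangular table given as a list of rows."""
    if len({len(row) for row in table}) > 1:
        raise ValueError("all rows must have the same number of cells")
    return [list(column) for column in zip(*table)]

# Hypothetical merged counts: one row per word, one column per text.
merged = [
    ["word", "text1", "text2"],
    ["the", 3, 2],
    ["king", 1, 1],
]
for row in rotate(merged):
    print(row)
# ['word', 'the', 'king']
# ['text1', 3, 1]
# ['text2', 2, 1]
```

A layout with one row per text and one column per word is a common shape for clustering and classification tools, which is presumably why the rotation is done before moving to R.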
kick_as5_R
We are presently working on a set of scripts in R to perform cluster, classification, and other analyses of sets of texts.
as6_DeltaBurrows.pl
We are presently working on a set of scripts in Perl to perform variations of Burrows's Delta authorship attribution technique.
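For readers unfamiliar with the technique, the basic form of Delta can be sketched in Python as below. This is a generic illustration of the commonly described method (z-scores of relative frequencies for the most frequent words, then the mean absolute difference between two texts), not the group's Perl implementation or any of its planned variations. The function names, the default of 150 words, and the use of the sample standard deviation are assumptions.

```python
import statistics
from collections import Counter

def relative_frequencies(counts):
    """Convert raw counts to proportions of the text's total words."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def burrows_delta(counts_by_text, text_a, text_b, top_n=150):
    """Mean absolute z-score difference over the corpus's most frequent words."""
    rel = {name: relative_frequencies(c) for name, c in counts_by_text.items()}
    corpus = Counter()
    for counts in counts_by_text.values():
        corpus.update(counts)
    differences = []
    for word, _ in corpus.most_common(top_n):
        values = [rel[name].get(word, 0.0) for name in rel]
        spread = statistics.stdev(values)  # sample form (assumed); needs 2+ texts
        if spread == 0:
            continue  # word is used identically everywhere, so it carries no signal
        mean = statistics.mean(values)
        z_a = (rel[text_a].get(word, 0.0) - mean) / spread
        z_b = (rel[text_b].get(word, 0.0) - mean) / spread
        differences.append(abs(z_a - z_b))
    if not differences:
        raise ValueError("no word varies across the texts")
    return sum(differences) / len(differences)

# Hypothetical counts for three texts.
corpus = {
    "a": Counter({"the": 5, "king": 1}),
    "b": Counter({"the": 2, "sea": 4}),
    "c": Counter({"the": 3, "king": 3}),
}
print(burrows_delta(corpus, "a", "b"))
```

A smaller value suggests that two texts use the common words in more similar proportions; in attribution work the disputed text is typically compared against each candidate's known writing.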
© 2009 Wheaton Lexomics Research Group, Norton MA | Last modified: Wed. May 13, 2009 11:07 AM EDT