Assignment Set 10 (Uploaded on November 09, 2015)

Clarification deadline date: November 16, 2015
Submission deadline date: November 23, 2015

-------------------------------------------------------------------------------
Problem 1 [Create a Term-Document Matrix]
-------------------------------------------------------------------------------

Background:

One may represent a book as a vector of words, taken from a fixed dictionary,
together with their respective frequencies of occurrence in the book. For
example, if there are three books:

    book01 = 'John and Bob are brothers.'
    book02 = 'John went to the store. The store was closed.'
    book03 = 'Bob went to the store too.'

and the dictionary is specified as the union of the unique words that occur in
these three books, then one may write each book as a term-vector, as follows.

    tvec01 = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
    tvec02 = [0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
    tvec03 = [0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]

where the dictionary, that is, the union of unique words in these three books,
looks like:

    ['and', 'the', 'brothers', 'to', 'are', 'closed',
     'bob', 'john', 'was', 'went', 'store', 'too']

The term-document matrix is the matrix formed by stacking the above
term-vectors as rows, giving an N x M matrix for N books and M unique
words/terms in total across all the books.

Note that one may use the English dictionary /usr/share/dict/words as well.
That may, however, result in a lot of 0's in the above vectors, and the matrix
may be quite sparse.

Prerequisites:

-- Learn how to read from a text file and write into a file in Python. Use
   online references or tutorials, read "Learn Python the Hard Way" or follow
   the "Google Class".

-- Visit the "Project Gutenberg" website (https://www.gutenberg.org/) and
   download Plain Text UTF-8 versions of 10 books of your choice, and store
   them in a folder named "books" as "book01.txt", "book02.txt", . . . ,
   "book10.txt".

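As an illustration of the Background example, the following minimal sketch
rebuilds the three term-vectors from the three sample books. The dictionary is
copied in the order listed above; the helper name term_vector and the
tokenizing rule (lowercase, runs of the letters a-z only) are assumptions made
for this sketch, not part of the assignment.

```python
import re
from collections import Counter

books = [
    'John and Bob are brothers.',
    'John went to the store. The store was closed.',
    'Bob went to the store too.',
]

# Dictionary in the same order as listed in the Background section.
dictionary = ['and', 'the', 'brothers', 'to', 'are', 'closed',
              'bob', 'john', 'was', 'went', 'store', 'too']


def term_vector(text, dictionary):
    """Count how often each dictionary word occurs in text (case-insensitive)."""
    # Assumed tokenizer: lowercase the text and keep runs of a-z only.
    counts = Counter(re.findall(r'[a-z]+', text.lower()))
    # A Counter returns 0 for words that never occur.
    return [counts[word] for word in dictionary]


# Stack the term-vectors as rows: 3 books x 12 terms.
td_matrix = [term_vector(book, dictionary) for book in books]
for row in td_matrix:
    print(row)
```

The three printed rows correspond to tvec01, tvec02 and tvec03. Note that this
toy example does not yet remove plurals ("brothers" stays as it is).
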
Task:

Write a Python program that takes as input the path to the folder "books", and
produces the term-document matrix for the books in that specific folder, in two
different ways:

(a) Create the term-document matrix where the dictionary is the union of
    unique words that occur in the books that you take as input from the
    "books" folder.

(b) Create the term-document matrix where the dictionary is the set of unique
    words in the built-in dictionary stored in your system at
    "/usr/share/dict/words".

Thus, your output should be TWO SEPARATE TEXT (CSV) FILES, as follows:

-- Output of (a) : 10 x M matrix, written in a comma-separated value (CSV)
   format in a text file named "td_matrix_a.txt", with the 10 rows
   representing the term-vectors corresponding to the 10 books in the "books"
   folder, and the M columns representing the M unique terms that occur in the
   books that you take as input.

-- Output of (b) : 10 x N matrix, written in a comma-separated value (CSV)
   format in a text file named "td_matrix_b.txt", with the 10 rows
   representing the term-vectors corresponding to the 10 books in the "books"
   folder, and the N columns representing the N unique terms that occur in the
   built-in dictionary "/usr/share/dict/words".

Make sure that you clean the words in the books, so that there are no
punctuation marks and no special characters within the unique words/terms that
you obtain for the dictionary. Make sure that there are no numerals in the
dictionary -- only English letters. Also make sure to get rid of plurals in the
whole document, so that the dictionary only contains words/terms which are
singular -- e.g., "boys" should be replaced by "boy", "pansies" should be
replaced by "pansy", etc.

The code should be executable as:

    python cs15xx-assign10-prog1.py books td_matrix_a.txt td_matrix_b.txt

Take command-line inputs (import sys) for the paths of the "books" folder and
the output files.

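The steps of the Task (read, clean, build the two matrices, write CSV) can be
sketched as below. This is only one possible outline under stated assumptions,
not a reference solution: all function names are hypothetical, the plural rule
is a naive suffix heuristic that will mishandle some words (e.g., "boxes"),
the columns are taken in sorted order, and applying the same cleaning to the
words of /usr/share/dict/words is an assumption made so that both dictionaries
are treated alike.

```python
import io
import os
import re
import sys


def singularize(word):
    """Naive plural stripping (an assumed heuristic, not a full stemmer)."""
    if len(word) > 4 and word.endswith('ies'):
        return word[:-3] + 'y'                      # pansies -> pansy
    if (len(word) > 3 and word.endswith('s')
            and not word.endswith(('ss', 'us', 'is'))):
        return word[:-1]                            # boys -> boy
    return word


def clean_words(text):
    """Lowercase, keep runs of English letters only, then singularize."""
    return [singularize(w) for w in re.findall(r'[a-z]+', text.lower())]


def read_text(path):
    """Read a UTF-8 text file, skipping bytes that cannot be decoded."""
    with io.open(path, encoding='utf-8', errors='ignore') as f:
        return f.read()


def read_books(folder):
    """Return one cleaned word list per .txt file, in sorted filename order."""
    names = sorted(n for n in os.listdir(folder) if n.endswith('.txt'))
    return [clean_words(read_text(os.path.join(folder, n))) for n in names]


def term_document_matrix(docs, dictionary):
    """One row of counts per document, one column per dictionary term."""
    index = {term: j for j, term in enumerate(dictionary)}
    matrix = []
    for words in docs:
        row = [0] * len(dictionary)
        for w in words:
            j = index.get(w)
            if j is not None:       # words outside the dictionary are ignored
                row[j] += 1
        matrix.append(row)
    return matrix


def write_csv(matrix, path):
    """Write the matrix as comma-separated integers, one row per line."""
    with open(path, 'w') as f:
        for row in matrix:
            f.write(','.join(str(c) for c in row) + '\n')


def main(argv, dict_path='/usr/share/dict/words'):
    books_dir, out_a, out_b = argv[1:4]
    docs = read_books(books_dir)
    # (a) dictionary = union of the cleaned words over all books
    dict_a = sorted(set(w for words in docs for w in words))
    write_csv(term_document_matrix(docs, dict_a), out_a)
    # (b) dictionary = cleaned words of the system word list (assumption)
    dict_b = sorted(set(clean_words(read_text(dict_path))))
    write_csv(term_document_matrix(docs, dict_b), out_b)


if __name__ == '__main__':
    if len(sys.argv) == 4:
        main(sys.argv)
    else:
        print('usage: python cs15xx-assign10-prog1.py '
              'books td_matrix_a.txt td_matrix_b.txt')
```

Building an index from term to column number once keeps each lookup cheap,
which matters for (b), where the dictionary is large and most counts are 0.
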
[Marks: 20 {reading words/terms} + 20 {cleaning the words/terms}
      + 40 {creating the matrices} + 20 {good coding practice} = 100]

---------------------------------------------------------------

The name of the program file should be "cs15xx-assign10-prog1.py".

Copy only the program file "cs15xx-assign10-prog1.py" into
~pdslab/2015/assign10/. DO NOT copy your "books" folder. Your code will be
tested with our own "books".

At the top of your program file, add the following as a comment. The /* ... */
delimiters shown are C-style and are not valid Python; in a Python file,
enclose the block in a docstring (triple quotes) or prefix each line with #.

/*------------------------------------------------------------------
Name:
Roll Number:
Date of Submission:
Deadline date:
Program description:
Acknowledgements:
--------------------------------------------------------------------*/