    # num_chars, num_words and num_sents are defined earlier for the same fileid
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

    5 25 26 austen-emma.txt
    5 26 17 austen-persuasion.txt
    5 28 22 austen-sense.txt
    4 34 79 bible-kjv.txt
    5 19 5 blake-poems.txt
    4 19 14 bryant-stories.txt
    4 18 12 burgess-busterbrown.txt
    4 20 13 carroll-alice.txt
    5 20 12 chesterton-ball.txt
    5 23 11 chesterton-brown.txt
    5 18 11 chesterton-thursday.txt
    4 21 25 edgeworth-parents.txt
    5 26 15 melville-moby_dick.txt
    5 52 11 milton-paradise.txt
    4 12 9 shakespeare-caesar.txt
    4 12 8 shakespeare-hamlet.txt
    4 12 7 shakespeare-macbeth.txt
    5 36 12 whitman-leaves.txt

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it stays at 4 or 5 across all the texts. (In fact, the average word length is really 3 not 4, since the num_chars variable counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.

The previous example also showed how we can access the "raw" text of the book. The raw() function gives us the contents of the file. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many characters occur in the text, including the spaces between words. The sents() function divides the text up into its sentences, where each sentence is a list of words.

The web text collection contains the following files, shown here with the start of each one:

    firefox.txt    Cookie Manager: "Don't allow sites that set removed cookies to se.
    grail.txt      SCENE 1: KING ARTHUR: Whoa there! [clop.
    overheard.txt  White guy: So, do you have any plans for this evening? Asian girl.
    pirates.txt    PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr.
    singles.txt    25 SEXY MALE, seeks attrac older single lady, for discreet encoun.
    wine.txt       Lovely delicate, fragrant Rhone wine.

There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. The corpus contains over 10,000 posts, anonymized by replacing usernames with generic names of the form "UserNNN", and manually edited to remove any other identifying information. The corpus is organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus one further chatroom). The filename contains the date, chatroom, and number of posts; e.g., 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chatroom on the date written as 10-19.

The Brown Corpus was the first million-word electronic corpus of English, created in 1961 at Brown University. This corpus contains text from 500 sources, and the sources have been categorized by genre, such as news, editorial, and so on.

Example Document for Each Section of the Brown Corpus (excerpt):

    Underwood: Probing the Ethics of Realtors
    US Office of Civil and Defence Mobilization: The Family Fallout Shelter
    Mosteller: Probability with Statistical Applications

We can access the corpus as a list of words, or a list of sentences (where each sentence is itself a list of words).
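The three Gutenberg statistics (average word length, average sentence length, lexical diversity) can be tried out without downloading any corpus. The sketch below is only an illustration under some assumptions: the helper name text_statistics and the toy sentences are made up for this example, and the arithmetic simply mirrors the num_chars, num_words, num_sents and num_vocab ratios from the program.

```python
def text_statistics(raw, words, sents):
    """Return (average word length, average sentence length,
    lexical diversity), each rounded to the nearest integer.

    raw   -- the text as one string (spaces are counted as characters)
    words -- the text as a list of tokens
    sents -- the text as a list of sentences, each a list of tokens
    """
    num_chars = len(raw)
    num_words = len(words)
    num_sents = len(sents)
    # Vocabulary is counted case-insensitively, as in the program.
    num_vocab = len(set(w.lower() for w in words))
    return (round(num_chars / num_words),
            round(num_words / num_sents),
            round(num_words / num_vocab))


# Hypothetical toy data standing in for a corpus file.
sample_raw = "The cat sat. The dog sat too. The cat ran."
sample_sents = [
    ["The", "cat", "sat", "."],
    ["The", "dog", "sat", "too", "."],
    ["The", "cat", "ran", "."],
]
sample_words = [w for sent in sample_sents for w in sent]

print(text_statistics(sample_raw, sample_words, sample_sents))  # (3, 4, 2)
```

With NLTK and the Gutenberg data installed, the same helper could presumably be fed gutenberg.raw(fileid), gutenberg.words(fileid) and gutenberg.sents(fileid) for each file. Note that the first figure is inflated a little because the raw string includes spaces, which is the same effect mentioned for num_chars.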