The Natural Language Toolkit (NLTK) is a fantastic toolkit for research on natural languages. It is open source and written in Python.
To install NLTK on Ubuntu, run the following command:
$ sudo apt-get install python-nltk
To avoid problems later on, you should also install the following:
$ sudo apt-get install python-tk
$ sudo apt-get install python-numpy
$ sudo apt-get install python-matplotlib
Next, download the data (books) to work with:
$ python
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
d) Download l) List c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> l
Packages:
[ ] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
[ ] abc................. Australian Broadcasting Commission 2006
[ ] brown............... Brown Corpus
...
Downloader> d
Download which package (l=list; x=cancel)?
Identifier> all
...
Done downloading collection 'all'
---------------------------------------------------------------------------
d) Download l) List c) Config h) Help q) Quit
---------------------------------------------------------------------------
Downloader> q
True
The option "l" lists all available packages, and "d" downloads a package. The identifier "all" downloads every package. The three dots (...) above mark output I left out to keep this page to a reasonable size.
Next, import the book module into Python:
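The interactive menu can also be skipped by passing an identifier straight to nltk.download(). The following is a minimal sketch, not the only way to do it. It assumes the "book" collection (the data behind nltk.book) is enough for your purposes; pass "all" instead to mirror the session above. The helper name download_book_data is hypothetical and used only for illustration.

```python
def download_book_data(identifier="book"):
    """Download an NLTK data package or collection without the interactive menu.

    `identifier` is an assumption: "book" fetches the data used by nltk.book,
    while "all" fetches every package, as in the interactive session.
    """
    # Imported here so that merely defining the helper does not require NLTK.
    import nltk

    # nltk.download() returns True when the download succeeds.
    return nltk.download(identifier)
```

Calling download_book_data() needs a network connection; calling download_book_data("all") takes considerably longer.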
>>> from nltk.book import *
This loads several books into Python. To print the title of a book, type its name:
>>> text1
Before trying any of these commands, you need to type the following line once:
from nltk.book import *
concordance
shows every occurrence of a given word in the text, together with its surrounding context
text1.concordance("whales")
similar
lists words that appear in contexts similar to those of the given word
text1.similar("ocean")
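common contexts
contexts shared by two or more words
text1.common_contexts(["ocean", "whale"])
dispersion plot
plots where the given words occur across the text
text1.dispersion_plot(["ocean", "whale"])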
count all tokens in the text
counts every token (words and punctuation marks, repeats included), not individual characters
len(text1)
count distinct tokens
counts each different word or punctuation mark only once
len(set(text1))
count the number of times a word appears in the text
text1.count("outset")
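A text loaded from nltk.book behaves like a list of tokens (words and punctuation marks), so the three counting calls above can be tried on a plain Python list without downloading anything. This is only a sketch of what the calls compute; the token list below is made-up illustrative data, not taken from any of the books.

```python
# A tiny hand-made token list standing in for an NLTK text (illustrative data).
tokens = ["the", "whale", "and", "the", "ocean", ",", "the", "whale", "."]

# Every token, with repeats and punctuation included.
total_tokens = len(tokens)

# Each different token counted once (the vocabulary size).
distinct_tokens = len(set(tokens))

# Occurrences of one specific token.
whale_count = tokens.count("whale")

print(total_tokens, distinct_tokens, whale_count)  # prints: 9 6 2
```

The same three expressions applied to text1 give the corresponding figures for the whole book.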