Brought to you by molecularsciences.org.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License.
This publication may not be redistributed without this notice.

Natural Language Toolkit

The Natural Language Toolkit (nltk) is a fantastic toolkit for research on natural languages. It is open source and written in Python.

Installing nltk on Ubuntu

To install nltk on Ubuntu, run the following command:

$ sudo apt-get install python-nltk

To avoid problems later on, you should also install the following:

$ sudo apt-get install python-tk
$ sudo apt-get install python-numpy
$ sudo apt-get install python-matplotlib

Then we download the data (books) to work with:

$ python
>>> import nltk
>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download      l) List      c) Config      h) Help      q) Quit
---------------------------------------------------------------------------
Downloader> l
Packages:
  [ ] maxent_ne_chunker... ACE Named Entity Chunker (Maximum entropy)
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] brown............... Brown Corpus
...
Downloader> d
Download which package (l=list; x=cancel)?
  Identifier> all
...
     Done downloading collection 'all'
---------------------------------------------------------------------------
    d) Download      l) List      c) Config      h) Help      q) Quit
---------------------------------------------------------------------------
Downloader> q
True

The option "l" lists all available packages, and "d" is the option to download. The identifier "all" downloads all packages. The three dots (...) above mark output that I did not copy from the command prompt, to keep this page to a reasonable size.

Next, we need to import the book module into Python:

>>> from nltk.book import *

This loads several books into Python. To print the title of a book, type its name:

>>> text1

nltk reference

Before trying any of these commands, you need to type the following line once:

from nltk.book import *

concordance
shows every occurrence of a given word in the text, together with its surrounding context

text1.concordance("whales")
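To make the idea of a concordance concrete, here is a minimal pure-Python sketch of a keyword-in-context listing. It is not how nltk implements concordance; the function name keyword_in_context, the width parameter, and the toy sentence are all assumptions made for illustration, and the sketch matches words case-insensitively.

```python
def keyword_in_context(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens on either side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]      # tokens before the match
            right = tokens[i + 1:i + 1 + width]     # tokens after the match
            lines.append(" ".join(left + [token] + right))
    return lines


# Toy input (hypothetical), split on whitespace
tokens = "the whale swam past the ship and the whale dived".split()

for line in keyword_in_context(tokens, "whale", width=2):
    print(line)
# the whale swam past
# and the whale dived
```

The real concordance method works on the tokens of a loaded text such as text1 and prints aligned lines rather than returning a list.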

similar
finds other words that appear in contexts similar to those of a given word

text1.similar("ocean")

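dispersion plot
plots where the given words occur across the text (requires matplotlib)

text1.dispersion_plot(["ocean", "whale"])

common contexts
contexts shared by two or more words

text1.common_contexts(["ocean", "whale"])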
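The notion of a context shared by two words can be sketched in a few lines of plain Python. Here a context is taken to be the pair (previous token, next token) around a word; that definition, the helper name contexts, and the toy sentence are assumptions for illustration. The nltk method common_contexts may differ in details such as case handling.

```python
def contexts(tokens, word):
    """Return the set of (previous token, next token) pairs around `word`."""
    return {
        (tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == word
    }


# Toy input (hypothetical), split on whitespace
tokens = "the ocean is deep and the sea is wide but a sea of sand is dry".split()

# Contexts that both words appear in
shared = contexts(tokens, "ocean") & contexts(tokens, "sea")
print(shared)
# {('the', 'is')}
```

Both "ocean" and "sea" occur between "the" and "is", so that pair is the only shared context; "a ... of" occurs only around "sea".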

count all tokens in the text
counts all tokens (words and punctuation symbols) in a text

len(text1)

count distinct tokens
counts each distinct word or punctuation symbol once (the vocabulary of the text)

len(set(text1))

count the number of times a word appears in the text

text1.count("outset")
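An nltk text behaves like a list of tokens, where a token is a word or a punctuation symbol. The three counting expressions can therefore be tried on an ordinary Python list. The token list below is a made-up toy example, not taken from any of the books.

```python
# Toy token list (hypothetical); punctuation marks are tokens too
tokens = ["Call", "me", "Ishmael", ".", "Call", "me", "soon", "."]

total_tokens = len(tokens)          # every token, repeats included
distinct_tokens = len(set(tokens))  # each distinct token counted once
call_count = tokens.count("Call")   # occurrences of one token (case-sensitive)

print(total_tokens)     # 8
print(distinct_tokens)  # 5
print(call_count)       # 2

# Share of tokens that are distinct, a rough measure of lexical diversity
print(distinct_tokens / total_tokens)  # 0.625
```

Replacing tokens with text1 gives the same three measurements for a whole book.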