Text Processing with NLTK
A hands-on walkthrough of core NLP text processing techniques using NLTK - from tokenization and stemming to named entity recognition.
Prerequisites
The code snippets on this page require the NLTK Python library. Install it with:
pip install nltk
Setup
NLTK ships with optional data packages (corpora, models, tokenizers) that you download once before use. Each package serves a specific purpose:
import nltk
nltk.download('punkt') # Pre-trained tokenizer models
nltk.download('stopwords') # Common stop words in various languages
nltk.download('wordnet') # Lexical database for synonym finding and word sense disambiguation
nltk.download('averaged_perceptron_tagger') # Pre-trained part-of-speech tagger models
nltk.download('maxent_ne_chunker') # Named entity chunker for people, organizations, locations
nltk.download('punkt_tab')                     # Punkt tokenizer models in the newer tabular data format
nltk.download('averaged_perceptron_tagger_eng') # English-specific POS tagger models (newer format)
nltk.download('maxent_ne_chunker_tab')         # Named entity chunker models in the newer tabular data format
nltk.download('words') # Corpus of English words for spell checking and validation
Tokenization
A computer cannot understand raw text directly - it is just a sequence of characters. The first step in any NLP pipeline is breaking text into manageable pieces: tokenization splits text into sentences, and sentences into individual words (tokens), so we can analyze it further.
import nltk
data = (
"Apple Inc. is an American multinational technology company "
"headquartered in Cupertino, California. "
"It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976. "
"Apple is known for its innovative products, "
"including the iPhone, iPad, and Mac computers."
)
# Sentence tokenization - splits text into a list of sentences
print(nltk.sent_tokenize(data))
Output:
['Apple Inc. is an American multinational technology company headquartered in Cupertino, California.',
'It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.',
'Apple is known for its innovative products, including the iPhone, iPad, and Mac computers.']
# Word tokenization - splits text into a list of individual words
print(nltk.word_tokenize(data))
Output:
['Apple', 'Inc.', 'is', 'an', 'American', 'multinational', 'technology', 'company',
'headquartered', 'in', 'Cupertino', ',', 'California', '.', 'It', 'was', 'founded',
'by', 'Steve', 'Jobs', ',', 'Steve', 'Wozniak', ',', 'and', 'Ronald', 'Wayne', 'in',
'1976', '.', 'Apple', 'is', 'known', 'for', 'its', 'innovative', 'products', ',',
'including', 'the', 'iPhone', ',', 'iPad', ',', 'and', 'Mac', 'computers', '.']
Stemming
Stemming reduces words to their root form by applying simple rule-based transformations. It is fast but crude - stemmers do not consider meaning, only pattern matching on suffixes.
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer('english')
words = ["running", "runner", "ran", "easily", "fairly"]
def stem_words(words, stemmer, name):
    print(f"\n{name}")
    print("-" * 20)
    for w in words:
        print(f"{w} -> {stemmer.stem(w)}")
stem_words(words, porter_stemmer, "Porter Stemmer")
stem_words(words, lancaster_stemmer, "Lancaster Stemmer")
stem_words(words, snowball_stemmer, "Snowball Stemmer")
Output:
Porter Stemmer
--------------------
running -> run
runner -> runner
ran -> ran
easily -> easili
fairly -> fairli
Lancaster Stemmer
--------------------
running -> run
runner -> run
ran -> ran
easily -> easy
fairly -> fair
Snowball Stemmer
--------------------
running -> run
runner -> runner
ran -> ran
easily -> easili
fairly -> fair
Porter Stemmer is the most common algorithm - simple, efficient, and widely used. Lancaster Stemmer is more aggressive, sometimes producing shorter but less recognizable stems. Snowball Stemmer is an improved version of Porter with better accuracy and multi-language support.
When to use stemming
- Search engines
- Log analysis and indexing
- Keyword grouping
When to avoid stemming
- Chatbots
- RAG pipelines
- Semantic search - use lemmatization instead
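As a sketch of the keyword-grouping use case, a stemmer can bucket related surface forms under one shared key - here the classic "connect" family from Porter's original paper:

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connect", "connected", "connecting", "connection", "connections"]

# Group surface forms by their shared stem
groups = defaultdict(list)
for w in words:
    groups[stemmer.stem(w)].append(w)

print(dict(groups))
# {'connect': ['connect', 'connected', 'connecting', 'connection', 'connections']}
```

Because every form maps to the same stem, a search for "connection" can match documents that only mention "connected" - exactly the recall boost that search engines and indexers want.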
Lemmatization
Lemmatization is a more sophisticated process than stemming. Instead of blindly chopping suffixes, it considers the context and part of speech of the word to determine its proper base form (lemma). It is more accurate but also more computationally expensive.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "runner", "ran", "easily", "fairly"]
def lemmatize_words(words, pos="n"):
    print(f"\nLemmatization (POS={pos})")
    print("-" * 25)
    for w in words:
        print(f"{w} -> {lemmatizer.lemmatize(w, pos)}")
lemmatize_words(words, pos="n") # nouns
lemmatize_words(words, pos="v") # verbs
lemmatize_words(words, pos="a") # adjectives
Output:
Lemmatization (POS=n)
-------------------------
running -> running
runner -> runner
ran -> ran
easily -> easily
fairly -> fairly
Lemmatization (POS=v)
-------------------------
running -> run
runner -> runner
ran -> run
easily -> easily
fairly -> fairly
Lemmatization (POS=a)
-------------------------
running -> running
runner -> runner
ran -> ran
easily -> easily
fairly -> fairly
Notice how lemmatization with POS=v (verbs) correctly maps "running" and "ran" back to "run", while the noun and adjective modes leave them unchanged. The part of speech you provide directly affects the result, which is why POS tagging (covered below) pairs so well with lemmatization.
Stop Words
Stop words are common words like "the", "is", "in", and "and" that carry little meaning on their own. Removing them reduces noise and dimensionality, which can improve the performance of NLP models.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)
Output (truncated):
{"won't", 'with', 'his', 'against', "you're", 'me', 'after', 'there',
"hadn't", 'out', "we've", 'which', "hasn't", 'on', 'those', 'under',
'were', 'because', 'weren', 've', 'yourselves', 'having', 'some',
"i've", 'does', 'above', 'ma', 'over', 'their', 'himself', "shan't",
...}
Part-of-Speech Tagging
POS tagging assigns a grammatical category (noun, verb, adjective, etc.) to each word in a sentence. This is essential for understanding meaning - for example, the word "bark" can be a noun (the sound a dog makes) or a verb (to make a sound). The tagger resolves this based on context.
NLTK uses a model trained on the Penn Treebank (Wall Street Journal corpus).
import nltk
sentence = "Apple Inc. is an American multinational technology company headquartered in Cupertino, California."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
print(tagged)
Output:
[('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'), ('an', 'DT'),
('American', 'JJ'), ('multinational', 'NN'), ('technology', 'NN'),
('company', 'NN'), ('headquartered', 'VBD'), ('in', 'IN'),
('Cupertino', 'NNP'), (',', ','), ('California', 'NNP'), ('.', '.')]
Common POS tags: NNP = proper noun, VBZ = verb (3rd person singular present), DT = determiner, JJ = adjective, NN = noun, IN = preposition.
See the full list of POS tags: Penn Treebank POS Tags
Putting It All Together: Text Cleaning Pipeline
A practical text cleaning function that combines lowercasing, tokenization, punctuation removal, and stop word filtering:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def clean_text(text: str) -> list[str]:
    punct = string.punctuation
    stop_words = set(stopwords.words("english"))
    text = text.lower()
    tokens = word_tokenize(text)
    words = []
    for word in tokens:
        word = word.strip(punct)
        if word and word not in stop_words:
            words.append(word)
    return words
Running it on a sample paragraph:
paragraph = """
Osmani is building an AI system that helps companies understand their data faster.
He was running multiple experiments, analyzing results, and improving the models every day.
The system connects users, generates insights, and simplifies complex workflows for developers and analysts.
"""
clean_paragraph = clean_text(paragraph)
print(clean_paragraph)
Output:
['osmani', 'building', 'ai', 'system', 'helps', 'companies', 'understand',
'data', 'faster', 'running', 'multiple', 'experiments', 'analyzing',
'results', 'improving', 'models', 'every', 'day', 'system', 'connects',
'users', 'generates', 'insights', 'simplifies', 'complex', 'workflows',
'developers', 'analysts']
All filler words are gone and the remaining tokens capture the actual meaning of the text.
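Once cleaned, the tokens are ready for simple statistics. For example, a Counter over the cleaned list (reproduced here from the output above) surfaces the most frequent content words:

```python
from collections import Counter

clean_paragraph = ['osmani', 'building', 'ai', 'system', 'helps', 'companies', 'understand',
                   'data', 'faster', 'running', 'multiple', 'experiments', 'analyzing',
                   'results', 'improving', 'models', 'every', 'day', 'system', 'connects',
                   'users', 'generates', 'insights', 'simplifies', 'complex', 'workflows',
                   'developers', 'analysts']

counts = Counter(clean_paragraph)
print(counts.most_common(3))
# [('system', 2), ('osmani', 1), ('building', 1)]
```

Without stop word removal, words like "the" and "and" would dominate this ranking and drown out the actual topic words.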
Named Entity Recognition (NER)
NER identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, and dates. NLTK combines POS tagging with chunking to detect these entities.
NER relies heavily on capitalization cues, so it should run on the original text - not on the lowercased, stop-word-stripped tokens from the cleaning pipeline, which would leave the chunker almost nothing to recognize:

import nltk
tokens = nltk.word_tokenize(paragraph)
pos_tags = nltk.pos_tag(tokens)
named_entities = nltk.ne_chunk(pos_tags)
for chunk in named_entities:
    if hasattr(chunk, 'label'):
        inside = " ".join(f"{word}/{tag}" for word, tag in chunk)
        print(f"({chunk.label()} {inside})")
    else:
        word, tag = chunk
        print(f"{word}/{tag}")

The output labels each token with its POS tag and groups recognized entities under labels such as PERSON, ORGANIZATION, or GPE (geo-political entity).
Key Takeaways
- Tokenization is always the first step - split raw text into sentences and words before anything else
- Stemming is fast and rule-based but crude; good for search and indexing
- Lemmatization is context-aware and more accurate; preferred for semantic tasks
- Stop word removal cuts noise by dropping frequent low-information words
- POS tagging reveals grammatical structure and resolves word ambiguity
- NER extracts structured entities (people, places, organizations) from unstructured text
- Combine these techniques into a cleaning pipeline before feeding text to downstream models