Text Processing with NLTK
A hands-on walkthrough of core NLP text processing techniques using NLTK - from tokenization and stemming to named entity recognition.
Prerequisites
The code snippets on this page require the NLTK Python library. Install it with:
pip install nltk
Setup
NLTK ships with optional data packages (corpora, models, tokenizers) that you download once before use. Each package serves a specific purpose:
import nltk
nltk.download('punkt') # Pre-trained tokenizer models
nltk.download('stopwords') # Common stop words in various languages
nltk.download('wordnet') # Lexical database for synonym finding and word sense disambiguation
nltk.download('averaged_perceptron_tagger') # Pre-trained part-of-speech tagger models
nltk.download('maxent_ne_chunker') # Named entity chunker for people, organizations, locations
nltk.download('punkt_tab')                     # Punkt tokenizer models in the newer tabular data format
nltk.download('averaged_perceptron_tagger_eng') # English-specific POS tagger models (newer format)
nltk.download('maxent_ne_chunker_tab')         # Named entity chunker models in the newer tabular data format
nltk.download('words') # Corpus of English words for spell checking and validation
Tokenization
A computer cannot understand raw text directly - it is just a sequence of characters. The first step in any NLP pipeline is breaking text into manageable pieces: tokenization splits text into sentences, and sentences into individual words (tokens), so we can analyze it further.
import nltk
data = (
"Apple Inc. is an American multinational technology company "
"headquartered in Cupertino, California. "
"It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976. "
"Apple is known for its innovative products, "
"including the iPhone, iPad, and Mac computers."
)
# Sentence tokenization - splits text into a list of sentences
print(nltk.sent_tokenize(data))
Output:
['Apple Inc. is an American multinational technology company headquartered in Cupertino, California.',
'It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in 1976.',
'Apple is known for its innovative products, including the iPhone, iPad, and Mac computers.']
# Word tokenization - splits text into a list of individual words
print(nltk.word_tokenize(data))
Output:
['Apple', 'Inc.', 'is', 'an', 'American', 'multinational', 'technology', 'company',
'headquartered', 'in', 'Cupertino', ',', 'California', '.', 'It', 'was', 'founded',
'by', 'Steve', 'Jobs', ',', 'Steve', 'Wozniak', ',', 'and', 'Ronald', 'Wayne', 'in',
'1976', '.', 'Apple', 'is', 'known', 'for', 'its', 'innovative', 'products', ',',
'including', 'the', 'iPhone', ',', 'iPad', ',', 'and', 'Mac', 'computers', '.']
Stemming
Stemming reduces words to their root form by applying simple rule-based transformations. It is fast but crude - stemmers do not consider meaning, only pattern matching on suffixes.
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer('english')
words = ["running", "runner", "ran", "easily", "fairly"]
def stem_words(words, stemmer, name):
    print(f"\n{name}")
    print("-" * 20)
    for w in words:
        print(f"{w} -> {stemmer.stem(w)}")
stem_words(words, porter_stemmer, "Porter Stemmer")
stem_words(words, lancaster_stemmer, "Lancaster Stemmer")
stem_words(words, snowball_stemmer, "Snowball Stemmer")
Output:
Porter Stemmer
--------------------
running -> run
runner -> runner
ran -> ran
easily -> easili
fairly -> fairli
Lancaster Stemmer
--------------------
running -> run
runner -> run
ran -> ran
easily -> easy
fairly -> fair
Snowball Stemmer
--------------------
running -> run
runner -> runner
ran -> ran
easily -> easili
fairly -> fair
Porter Stemmer is the most common algorithm - simple, efficient, and widely used. Lancaster Stemmer is more aggressive, sometimes producing shorter but less recognizable stems. Snowball Stemmer is an improved version of Porter with better accuracy and multi-language support.
When to use stemming
- Search engines
- Log analysis and indexing
- Keyword grouping
When to avoid stemming
- Chatbots
- RAG pipelines
- Semantic search - use lemmatization instead
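As a sketch of the keyword-grouping use case, a stemmer can bucket related surface forms under one shared key - here the classic "connect" family from Porter's original paper:

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connect", "connected", "connecting", "connection", "connections"]

# Group surface forms by their shared stem
groups = defaultdict(list)
for w in words:
    groups[stemmer.stem(w)].append(w)

print(dict(groups))
# {'connect': ['connect', 'connected', 'connecting', 'connection', 'connections']}
```

Because every form maps to the same stem, a search for "connection" can match documents that only mention "connected" - exactly the recall boost that search engines and indexers want.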
Lemmatization
Lemmatization is a more sophisticated process than stemming. Instead of blindly chopping suffixes, it considers the context and part of speech of the word to determine its proper base form (lemma). It is more accurate but also more computationally expensive.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "runner", "ran", "easily", "fairly"]
def lemmatize_words(words, pos="n"):
    print(f"\nLemmatization (POS={pos})")
    print("-" * 25)
    for w in words:
        print(f"{w} -> {lemmatizer.lemmatize(w, pos)}")
lemmatize_words(words, pos="n") # nouns
lemmatize_words(words, pos="v") # verbs
lemmatize_words(words, pos="a") # adjectives
Output:
Lemmatization (POS=n)
-------------------------
running -> running
runner -> runner
ran -> ran
easily -> easily
fairly -> fairly
Lemmatization (POS=v)
-------------------------
running -> run
runner -> runner
ran -> run
easily -> easily
fairly -> fairly
Lemmatization (POS=a)
-------------------------
running -> running
runner -> runner
ran -> ran
easily -> easily
fairly -> fairly
Notice how lemmatization with POS=v (verbs) correctly maps "running" and "ran" back to "run", while the noun and adjective modes leave them unchanged. The part of speech you provide directly affects the result, which is why POS tagging (covered below) pairs so well with lemmatization.
Stop Words
Stop words are common words like "the", "is", "in", and "and" that carry little meaning on their own. Removing them reduces noise and dimensionality, which can improve the performance of NLP models.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)
Output (truncated):
{"won't", 'with', 'his', 'against', "you're", 'me', 'after', 'there',
"hadn't", 'out', "we've", 'which', "hasn't", 'on', 'those', 'under',
'were', 'because', 'weren', 've', 'yourselves', 'having', 'some',
"i've", 'does', 'above', 'ma', 'over', 'their', 'himself', "shan't",
...}
Part-of-Speech Tagging
POS tagging assigns a grammatical category (noun, verb, adjective, etc.) to each word in a sentence. This is essential for understanding meaning - for example, the word "bark" can be a noun (the sound a dog makes) or a verb (to make a sound). The tagger resolves this based on context.
NLTK uses a model trained on the Penn Treebank (Wall Street Journal corpus).
import nltk
sentence = "Apple Inc. is an American multinational technology company headquartered in Cupertino, California."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
print(tagged)
Output:
[('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'), ('an', 'DT'),
('American', 'JJ'), ('multinational', 'NN'), ('technology', 'NN'),
('company', 'NN'), ('headquartered', 'VBD'), ('in', 'IN'),
('Cupertino', 'NNP'), (',', ','), ('California', 'NNP'), ('.', '.')]
Common POS tags: NNP = proper noun, VBZ = verb (3rd person singular present), DT = determiner, JJ = adjective, NN = noun, IN = preposition.
See the full list of POS tags: Penn Treebank POS Tags
Putting It All Together: Text Cleaning Pipeline
A practical text cleaning function that combines lowercasing, tokenization, punctuation removal, and stop word filtering:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
def clean_text(text: str) -> list[str]:
    punct = string.punctuation
    stop_words = set(stopwords.words("english"))
    text = text.lower()
    tokens = word_tokenize(text)
    words = []
    for word in tokens:
        word = word.strip(punct)
        if word and word not in stop_words:
            words.append(word)
    return words
Running it on a sample paragraph:
paragraph = """
Osmani is building an AI system that helps companies understand their data faster.
He was running multiple experiments, analyzing results, and improving the models every day.
The system connects users, generates insights, and simplifies complex workflows for developers and analysts.
"""
clean_paragraph = clean_text(paragraph)
print(clean_paragraph)
Output:
['osmani', 'building', 'ai', 'system', 'helps', 'companies', 'understand',
'data', 'faster', 'running', 'multiple', 'experiments', 'analyzing',
'results', 'improving', 'models', 'every', 'day', 'system', 'connects',
'users', 'generates', 'insights', 'simplifies', 'complex', 'workflows',
'developers', 'analysts']
All filler words are gone and the remaining tokens capture the actual meaning of the text.
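Once cleaned, the tokens are ready for simple statistics. For example, a Counter over the cleaned list (reproduced here from the output above) surfaces the most frequent content words:

```python
from collections import Counter

clean_paragraph = ['osmani', 'building', 'ai', 'system', 'helps', 'companies', 'understand',
                   'data', 'faster', 'running', 'multiple', 'experiments', 'analyzing',
                   'results', 'improving', 'models', 'every', 'day', 'system', 'connects',
                   'users', 'generates', 'insights', 'simplifies', 'complex', 'workflows',
                   'developers', 'analysts']

counts = Counter(clean_paragraph)
print(counts.most_common(3))
# [('system', 2), ('osmani', 1), ('building', 1)]
```

Without stop word removal, words like "the" and "and" would dominate this ranking and drown out the actual topic words.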
Named Entity Recognition (NER)
NER identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, and dates. NLTK combines POS tagging with chunking to detect these entities.
NER relies heavily on capitalization cues, so it should run on the original text - not on the lowercased, stop-word-stripped tokens from the cleaning pipeline, which would leave the chunker almost nothing to recognize:

import nltk
tokens = nltk.word_tokenize(paragraph)
pos_tags = nltk.pos_tag(tokens)
named_entities = nltk.ne_chunk(pos_tags)
for chunk in named_entities:
    if hasattr(chunk, 'label'):
        inside = " ".join(f"{word}/{tag}" for word, tag in chunk)
        print(f"({chunk.label()} {inside})")
    else:
        word, tag = chunk
        print(f"{word}/{tag}")

The output labels each token with its POS tag and groups recognized entities under labels such as PERSON, ORGANIZATION, or GPE (geo-political entity).
Key Takeaways
- Tokenization is always the first step - split raw text into sentences and words before anything else
- Stemming is fast and rule-based but crude; good for search and indexing
- Lemmatization is context-aware and more accurate; preferred for semantic tasks
- Stop word removal cuts noise by dropping frequent low-information words
- POS tagging reveals grammatical structure and resolves word ambiguity
- NER extracts structured entities (people, places, organizations) from unstructured text
- Combine these techniques into a cleaning pipeline before feeding text to downstream models