Intro Exercises
Hands-on exercises that apply the NLP pipeline concepts from the previous articles to a real piece of technical documentation - a ControlUp uninstallation guide. These exercises expose the practical strengths and weaknesses of each pipeline stage when operating on domain-specific IT text.
The Input Text
Both exercises process the same paragraph, which represents a typical procedural IT knowledge base article:
text = """
This article explains how to uninstall ControlUp for Apps.
Verify that the AppDXHelper.exe process is running on the machine you want to uninstall.
In ControlUp for Desktops, ensure the machine is assigned to a specific tag or device group.
This group should contain machines where ControlUp for Apps should not be installed.
Under ControlUp for Apps settings, verify that this tag is not included in the targeted device tags list.
Restart the ControlUp for Desktop service or reboot the machine and confirm that the AppDXHelper.exe process no longer appears in Task Manager.
"""
This text is a good test case because it contains product names (ControlUp), filenames (AppDXHelper.exe), Windows-specific terminology (Task Manager), and critical negation ("should not be installed") - all things that challenge a standard NLP pipeline.
Exercise 1: Full NLP Pipeline on Technical Text
Run the complete pipeline - sentence tokenization, word tokenization, stopword removal, stemming, lemmatization, POS tagging, and NER - on the input text, then analyze where each stage succeeds and where it breaks down.
Code
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
# --- Sentence Tokenization ---
sentences = sent_tokenize(text.strip())
print("Sentence Tokenization")
print("=" * 40)
for i, s in enumerate(sentences, 1):
    print(f" {i}. {s}")
print(f"\nTotal sentences: {len(sentences)}")
# --- Word Tokenization ---
tokens = word_tokenize(text)
print("\n\nWord Tokenization")
print("=" * 40)
print(tokens)
print(f"\nTotal tokens: {len(tokens)}")
# --- Remove Stopwords and Punctuation ---
stop_words = set(stopwords.words("english"))
punct = set(string.punctuation)
cleaned = [
    tok for tok in tokens
    if tok.lower() not in stop_words and tok not in punct
]
print("\n\nCleaned Tokens (no stopwords, no punctuation)")
print("=" * 40)
print(cleaned)
print(f"\nCleaned token count: {len(cleaned)}")
# --- Stemming vs Lemmatization (side by side) ---
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print("\n\nStemming (Porter) vs Lemmatization (verb)")
print("=" * 40)
print(f" {'Token':<25} {'Stemmed':<25} {'Lemmatized'}")
print(f" {'-'*25} {'-'*25} {'-'*25}")
for tok in cleaned:
    stemmed = stemmer.stem(tok)
    lemmatized = lemmatizer.lemmatize(tok.lower(), pos="v")
    print(f" {tok:<25} {stemmed:<25} {lemmatized}")
# --- POS Tagging ---
pos_tags = nltk.pos_tag(cleaned)
print("\n\nPOS Tagging")
print("=" * 40)
for word, tag in pos_tags:
    print(f" {word:<25} {tag}")
# --- Named Entity Recognition ---
named_entities = nltk.ne_chunk(pos_tags)
print("\n\nNamed Entity Recognition (NER)")
print("=" * 40)
for chunk in named_entities:
    if hasattr(chunk, 'label'):
        entity_text = " ".join(word for word, tag in chunk)
        print(f" [{chunk.label()}] {entity_text}")
    else:
        word, tag = chunk
        print(f" {word}/{tag}")
Sentence Tokenization
Output:
Sentence Tokenization
========================================
1. This article explains how to uninstall ControlUp for Apps.
2. Verify that the AppDXHelper.exe process is running on the machine you want to uninstall.
3. In ControlUp for Desktops, ensure the machine is assigned to a specific tag or device group.
4. This group should contain machines where ControlUp for Apps should not be installed.
5. Under ControlUp for Apps settings, verify that this tag is not included in the targeted device tags list.
6. Restart the ControlUp for Desktop service or reboot the machine and confirm that the AppDXHelper.exe process no longer appears in Task Manager.
Total sentences: 6
NLTK's sent_tokenize correctly split all six sentences, even handling the tricky AppDXHelper.exe period - the tokenizer recognized it as part of a filename rather than a sentence boundary.
Word Tokenization
Output:
Word Tokenization
========================================
['This', 'article', 'explains', 'how', 'to', 'uninstall', 'ControlUp', 'for',
'Apps', '.', 'Verify', 'that', 'the', 'AppDXHelper.exe', 'process', 'is',
'running', 'on', 'the', 'machine', 'you', 'want', 'to', 'uninstall', '.',
'In', 'ControlUp', 'for', 'Desktops', ',', 'ensure', 'the', 'machine', 'is',
'assigned', 'to', 'a', 'specific', 'tag', 'or', 'device', 'group', '.',
'This', 'group', 'should', 'contain', 'machines', 'where', 'ControlUp', 'for',
'Apps', 'should', 'not', 'be', 'installed', '.', 'Under', 'ControlUp', 'for',
'Apps', 'settings', ',', 'verify', 'that', 'this', 'tag', 'is', 'not',
'included', 'in', 'the', 'targeted', 'device', 'tags', 'list', '.', 'Restart',
'the', 'ControlUp', 'for', 'Desktop', 'service', 'or', 'reboot', 'the',
'machine', 'and', 'confirm', 'that', 'the', 'AppDXHelper.exe', 'process', 'no',
'longer', 'appears', 'in', 'Task', 'Manager', '.']
Total tokens: 100
Word tokenization produced 100 tokens. Punctuation marks (. and ,) are split into their own tokens, which is the expected behavior - they get filtered out in the next step.
Stopword and Punctuation Removal
Output:
Cleaned Tokens (no stopwords, no punctuation)
========================================
['article', 'explains', 'uninstall', 'ControlUp', 'Apps', 'Verify',
'AppDXHelper.exe', 'process', 'running', 'machine', 'want', 'uninstall',
'ControlUp', 'Desktops', 'ensure', 'machine', 'assigned', 'specific', 'tag',
'device', 'group', 'group', 'contain', 'machines', 'ControlUp', 'Apps',
'installed', 'ControlUp', 'Apps', 'settings', 'verify', 'tag', 'included',
'targeted', 'device', 'tags', 'list', 'Restart', 'ControlUp', 'Desktop',
'service', 'reboot', 'machine', 'confirm', 'AppDXHelper.exe', 'process',
'longer', 'appears', 'Task', 'Manager']
Cleaned token count: 50
The token count dropped from 100 to 50 - half the words were stopwords or punctuation. But this step introduced a serious problem: it deleted the negation words "not", "no", and similar terms.
| Original phrase | After stopword removal | Problem |
|---|---|---|
| should not be installed | installed | Meaning completely inverted |
| is not included in the targeted device tags list | included targeted... | Meaning completely inverted |
| no longer appears in Task Manager | longer appears in Task... | Negation lost |
NLTK's default English stopword list treats all negation words as stopwords because they are high-frequency. But in instructional and procedural text like this, negation is essential to understanding what the user should and should not do. The difference between "install" and "do not install" is the entire point of the document.
Recommendation: Customize the stopword list for this domain - specifically, remove "not", "no", and "nor" from the stopwords so that negation is preserved.
Stemming vs. Lemmatization
Output (selected rows):
Stemming (Porter) vs Lemmatization (verb)
========================================
Token Stemmed Lemmatized
------------------------- ------------------------- -------------------------
machine machin machine
service servic service
device devic device
specific specif specific
verify verifi verify
settings set settings
AppDXHelper.exe appdxhelper.ex appdxhelper.exe
Lemmatization produced significantly more readable and accurate results than stemming for this technical text. The Porter Stemmer aggressively truncated words into non-word fragments: "machine" became "machin", "service" became "servic", "device" became "devic". None of these are recognizable English words.
A particularly telling example is "settings", which the stemmer reduced to "set" - completely changing the meaning - while the lemmatizer correctly kept it as "settings". This difference arises because stemming applies blind suffix-stripping rules without any understanding of the word, while lemmatization consults a lexical dictionary (WordNet) and considers part of speech to find the true base form.
The lemmatizer did produce one oddity: "installed" was lemmatized to "instal" (a British variant spelling present in WordNet) rather than the expected "install".
Verdict: For domain-specific technical documentation where precise terminology matters, lemmatization is clearly the better choice.
POS Tagging
Output (selected rows):
POS Tagging
========================================
article NN
explains VBZ
uninstall JJ
ControlUp NNP
Apps NNP
Verify NNP
AppDXHelper.exe NNP
running VBG
machine NN
settings NNS
Restart NNP
Desktop NNP
Task NNP
Manager NNP
The POS tagger assigned reasonable tags for most tokens. Product names like ControlUp and Apps were tagged as NNP (proper nouns), which is correct. However, "uninstall" was tagged as JJ (adjective) rather than VB (verb) - the tagger struggled without surrounding function words that were stripped during stopword removal. Similarly, "Verify" and "Restart" were tagged as NNP because their initial capitalization (they start sentences) made them look like proper nouns after context was lost.
Named Entity Recognition
Output:
Named Entity Recognition (NER)
========================================
article/NN
explains/VBZ
uninstall/JJ
[ORGANIZATION] ControlUp Apps Verify
AppDXHelper.exe/NNP
process/NN
running/VBG
machine/NN
...
[ORGANIZATION] ControlUp Apps
installed/VBD
[ORGANIZATION] ControlUp Apps
settings/NNS
...
[PERSON] Restart ControlUp Desktop
service/NN
reboot/NN
machine/NN
...
[PERSON] Task Manager
The NER made several notable errors:
Correct identifications:
- [ORGANIZATION] ControlUp - correctly recognized as an organization

Errors:
- [ORGANIZATION] ControlUp Apps (two occurrences) - The actual product name is "ControlUp for Apps", but the stopword "for" was removed during preprocessing, leaving "ControlUp" and "Apps" adjacent. The NER merged them into a single entity with the wrong name - close enough to look correct, but the recognized entity does not match the real product name.
- [ORGANIZATION] ControlUp Apps Verify - "Verify" is a verb that begins a new sentence, but after stopword removal the sentence boundary was lost, causing the NER to absorb it into the preceding entity. This compounds the "ControlUp Apps" error above with an additional spurious token.
- [PERSON] Restart ControlUp Desktop - "Restart" is a verb and "ControlUp for Desktop" is a product name. The capital "R" (sentence-initial) misled the recognizer into treating it as a proper noun.
- [PERSON] Task Manager - This is a Windows application, not a person. The two capitalized words in sequence matched the recognizer's pattern for human names.
Root causes:
- Stopword and punctuation removal stripped away sentence structure and context that the NER relies on for disambiguation.
- NLTK's NER is trained primarily on newswire text and lacks familiarity with software product names, filenames, and IT terminology like "ControlUp", "AppDXHelper.exe", or "Task Manager".
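Since no amount of newswire training will teach the model these names, one pragmatic mitigation is a gazetteer/regex pass for known product names and filename patterns, run alongside (or before) statistical NER. A minimal sketch - the product list and entity labels here are illustrative choices, not from the article:

```python
import re

# Known product names and a filename pattern for this domain.
PRODUCTS = ["ControlUp for Apps", "ControlUp for Desktops", "Task Manager"]
FILENAME_RE = re.compile(r"\b[\w-]+\.(?:exe|dll|msi)\b", re.IGNORECASE)

def tag_domain_entities(text):
    """Return (label, span) pairs for exact product-name and filename matches."""
    entities = []
    for name in PRODUCTS:
        for m in re.finditer(re.escape(name), text):
            entities.append(("PRODUCT", m.group()))
    for m in FILENAME_RE.finditer(text):
        entities.append(("FILE", m.group()))
    return entities

print(tag_domain_entities(
    "Verify that the AppDXHelper.exe process is running, then open Task Manager."
))
```

Exact matching like this is brittle against spelling variants, but unlike the statistical model it can never invent a "[PERSON] Task Manager".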
Exercise 2: Domain-Specific Stopwords
The default NLTK stopword list is designed for general English. But within a specialized domain like IT documentation, certain content words appear so frequently that they stop being informative. This exercise explores what happens when you add domain-specific terms to the stopword list.
Setup
Three domain-specific words are added to the default stopwords: "machine", "process", and "service". The full pipeline runs twice - once with default stopwords and once with the custom set - to compare outputs.
Code
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('maxent_ne_chunker_tab', quiet=True)
nltk.download('words', quiet=True)
CUSTOM_STOPWORDS = ["machine", "process", "service"]
def run_pipeline(text, stop_words, label):
    """Run the full NLP pipeline with a given stopword set and print results."""
    print(f"\n{'#' * 60}")
    print(f" {label}")
    print(f"{'#' * 60}")

    tokens = word_tokenize(text)
    punct = set(string.punctuation)
    cleaned = [
        tok for tok in tokens
        if tok.lower() not in stop_words and tok not in punct
    ]
    print(f"\nCleaned Tokens ({len(cleaned)} tokens)")
    print("=" * 40)
    print(cleaned)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    print(f"\n{'Token':<25} {'Stemmed':<25} {'Lemmatized'}")
    print(f"{'-'*25} {'-'*25} {'-'*25}")
    for tok in cleaned:
        stemmed = stemmer.stem(tok)
        lemmatized = lemmatizer.lemmatize(tok.lower(), pos="v")
        print(f"{tok:<25} {stemmed:<25} {lemmatized}")

    pos_tags = nltk.pos_tag(cleaned)
    print("\nNamed Entity Recognition (NER)")
    print("=" * 40)
    named_entities = nltk.ne_chunk(pos_tags)
    for chunk in named_entities:
        if hasattr(chunk, 'label'):
            entity_text = " ".join(word for word, tag in chunk)
            print(f" [{chunk.label()}] {entity_text}")
        else:
            word, tag = chunk
            print(f" {word}/{tag}")
    return cleaned
def main():
    default_stop_words = set(stopwords.words("english"))
    custom_stop_words = default_stop_words | set(CUSTOM_STOPWORDS)

    default_cleaned = run_pipeline(text, default_stop_words, "DEFAULT STOPWORDS")
    custom_cleaned = run_pipeline(
        text, custom_stop_words,
        f"CUSTOM STOPWORDS (added: {CUSTOM_STOPWORDS})"
    )

    removed_tokens = [tok for tok in default_cleaned if tok.lower() in CUSTOM_STOPWORDS]
    print(f"\n{'#' * 60}")
    print(" COMPARISON")
    print(f"{'#' * 60}")
    print(f"\nTokens removed by custom stopwords: {removed_tokens}")
    print(f"Default token count: {len(default_cleaned)}")
    print(f"Custom token count: {len(custom_cleaned)}")
    print(f"Tokens eliminated: {len(default_cleaned) - len(custom_cleaned)}")

if __name__ == "__main__":
    main()
Token Reduction
The comparison output:
############################################################
COMPARISON
############################################################
Tokens removed by custom stopwords: ['process', 'machine', 'machine', 'service', 'machine', 'process']
Default token count: 50
Custom token count: 44
Tokens eliminated: 6
| Metric | Value |
|---|---|
| Default token count | 50 |
| Custom token count | 44 |
| Tokens eliminated | 6 |
Removed tokens: process, machine, machine, service, machine, process
"machine" appeared 3 times, "process" 2 times, and "service" once. These words are extremely common in IT/sysadmin documentation but carry almost no distinguishing meaning within that domain - they function as background noise rather than meaningful content words.
NER Ripple Effect
Removing these tokens subtly changed downstream NER behavior. Compare the custom-stopwords NER output against the default:
# Default stopwords NER (excerpt)
...
process/NN
longer/RB <-- tagged as adverb
appears/VBZ
...
# Custom stopwords NER (excerpt)
...
longer/JJR <-- re-tagged as comparative adjective
appears/VBZ
...
"longer" was re-tagged from RB (adverb) to JJR (comparative adjective) because removing "process" before it changed the surrounding context the POS tagger relied on. This demonstrates that stopword choices ripple through the entire pipeline - what you remove upstream affects tagging and entity recognition downstream.
What This Tells Us About Domain-Specific Preprocessing
Generic stopword lists are a starting point, not a final answer. NLTK's default list is built for general English and does a reasonable job of filtering high-frequency function words like "the", "is", and "in". But within a specialized corpus - such as IT uninstallation instructions - certain content words become so frequent that they stop being informative.
"Machine" in a sysadmin document is like "the" in general text: it appears everywhere and tells you nothing you don't already know.
Tailoring the stopword list to the vocabulary of the text's domain can significantly sharpen downstream tasks like keyword extraction, topic modeling, and NER by filtering out terms that are informative in general English but redundant within the specialized corpus.
The flip side: This requires domain knowledge. Blindly adding words risks removing terms that do carry meaning in certain contexts - for example, "process" in a document specifically about OS process management would be a key term, not noise.
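One way to keep that judgment call informed rather than blind is to let the corpus surface the candidates: rank the post-cleanup tokens by frequency and flag anything above a threshold for human review. A sketch with illustrative numbers (the token list and the 15% threshold are assumptions, standing in for real Exercise 1 output):

```python
from collections import Counter

# Stand-in for the tokens left after default stopword removal.
cleaned = ["machine", "process", "machine", "service", "tag", "machine",
           "ControlUp", "Apps", "ControlUp", "process", "tag", "device"]

counts = Counter(tok.lower() for tok in cleaned)

# Flag anything above ~15% of the cleaned tokens as a candidate stopword.
# A human still reviews the list: note that "controlup" gets flagged too,
# and it is the product name, not noise.
threshold = 0.15 * len(cleaned)
candidates = [tok for tok, n in counts.most_common() if n > threshold]
print(candidates)
```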
Key Takeaways
- Sentence tokenization handles technical text well, correctly distinguishing filenames like AppDXHelper.exe from sentence boundaries
- Stopword removal is a double-edged sword - it halved the token count but destroyed critical negation, inverting the meaning of safety-critical instructions
- Lemmatization outperforms stemming on technical text where precise terminology matters; stemming produced unrecognizable fragments like "machin" and "servic"
- NER struggles with IT terminology - NLTK's newswire-trained model misclassified "Task Manager" as a person and merged sentence-initial verbs into entity spans
- Preprocessing choices cascade downstream - removing stopwords strips context that POS taggers and NER models depend on, causing secondary errors
- Domain-specific stopwords can sharpen the pipeline by filtering words that are high-frequency noise within the target domain, but require careful selection to avoid removing terms that carry meaning
- Always validate pipeline output against the source text, especially in procedural documentation where negation and precise phrasing determine correctness