Intro Exercises
Hands-on exercises that apply the NLP pipeline concepts from the previous articles to a real piece of technical documentation - a ControlUp uninstallation guide. These exercises expose the practical strengths and weaknesses of each pipeline stage when operating on domain-specific IT text.
The Input Text
Both exercises process the same paragraph, which represents a typical procedural IT knowledge base article:
text = """
This article explains how to uninstall ControlUp for Apps.
Verify that the AppDXHelper.exe process is running on the machine you want to uninstall.
In ControlUp for Desktops, ensure the machine is assigned to a specific tag or device group.
This group should contain machines where ControlUp for Apps should not be installed.
Under ControlUp for Apps settings, verify that this tag is not included in the targeted device tags list.
Restart the ControlUp for Desktop service or reboot the machine and confirm that the AppDXHelper.exe process no longer appears in Task Manager.
"""
This text is a good test case because it contains product names (ControlUp), filenames (AppDXHelper.exe), Windows-specific terminology (Task Manager), and critical negation ("should not be installed") - all things that challenge a standard NLP pipeline.
Exercise 1: Full NLP Pipeline on Technical Text
Run the complete pipeline - sentence tokenization, word tokenization, stopword removal, stemming, lemmatization, POS tagging, and NER - on the input text, then analyze where each stage succeeds and where it breaks down.
Code
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
# --- Sentence Tokenization ---
sentences = sent_tokenize(text.strip())
print("Sentence Tokenization")
print("=" * 40)
for i, s in enumerate(sentences, 1):
    print(f" {i}. {s}")
print(f"\nTotal sentences: {len(sentences)}")
# --- Word Tokenization ---
tokens = word_tokenize(text)
print("\n\nWord Tokenization")
print("=" * 40)
print(tokens)
print(f"\nTotal tokens: {len(tokens)}")
# --- Remove Stopwords and Punctuation ---
stop_words = set(stopwords.words("english"))
punct = set(string.punctuation)
cleaned = [
    tok for tok in tokens
    if tok.lower() not in stop_words and tok not in punct
]
print("\n\nCleaned Tokens (no stopwords, no punctuation)")
print("=" * 40)
print(cleaned)
print(f"\nCleaned token count: {len(cleaned)}")
# --- Stemming vs Lemmatization (side by side) ---
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print("\n\nStemming (Porter) vs Lemmatization (verb)")
print("=" * 40)
print(f" {'Token':<25} {'Stemmed':<25} {'Lemmatized'}")
print(f" {'-'*25} {'-'*25} {'-'*25}")
for tok in cleaned:
    stemmed = stemmer.stem(tok)
    lemmatized = lemmatizer.lemmatize(tok.lower(), pos="v")
    print(f" {tok:<25} {stemmed:<25} {lemmatized}")
# --- POS Tagging ---
pos_tags = nltk.pos_tag(cleaned)
print("\n\nPOS Tagging")
print("=" * 40)
for word, tag in pos_tags:
    print(f" {word:<25} {tag}")
# --- Named Entity Recognition ---
named_entities = nltk.ne_chunk(pos_tags)
print("\n\nNamed Entity Recognition (NER)")
print("=" * 40)
for chunk in named_entities:
    if hasattr(chunk, 'label'):
        entity_text = " ".join(word for word, tag in chunk)
        print(f" [{chunk.label()}] {entity_text}")
    else:
        word, tag = chunk
        print(f" {word}/{tag}")
Sentence Tokenization
Output:
Sentence Tokenization
========================================
1. This article explains how to uninstall ControlUp for Apps.
2. Verify that the AppDXHelper.exe process is running on the machine you want to uninstall.
3. In ControlUp for Desktops, ensure the machine is assigned to a specific tag or device group.
4. This group should contain machines where ControlUp for Apps should not be installed.
5. Under ControlUp for Apps settings, verify that this tag is not included in the targeted device tags list.
6. Restart the ControlUp for Desktop service or reboot the machine and confirm that the AppDXHelper.exe process no longer appears in Task Manager.
Total sentences: 6
NLTK's sent_tokenize correctly split all six sentences, even handling the tricky AppDXHelper.exe period - the tokenizer recognized it as part of a filename rather than a sentence boundary.
Word Tokenization
Output:
Word Tokenization
========================================
['This', 'article', 'explains', 'how', 'to', 'uninstall', 'ControlUp', 'for',
'Apps', '.', 'Verify', 'that', 'the', 'AppDXHelper.exe', 'process', 'is',
'running', 'on', 'the', 'machine', 'you', 'want', 'to', 'uninstall', '.',
'In', 'ControlUp', 'for', 'Desktops', ',', 'ensure', 'the', 'machine', 'is',
'assigned', 'to', 'a', 'specific', 'tag', 'or', 'device', 'group', '.',
'This', 'group', 'should', 'contain', 'machines', 'where', 'ControlUp', 'for',
'Apps', 'should', 'not', 'be', 'installed', '.', 'Under', 'ControlUp', 'for',
'Apps', 'settings', ',', 'verify', 'that', 'this', 'tag', 'is', 'not',
'included', 'in', 'the', 'targeted', 'device', 'tags', 'list', '.', 'Restart',
'the', 'ControlUp', 'for', 'Desktop', 'service', 'or', 'reboot', 'the',
'machine', 'and', 'confirm', 'that', 'the', 'AppDXHelper.exe', 'process', 'no',
'longer', 'appears', 'in', 'Task', 'Manager', '.']
Total tokens: 100
Word tokenization produced 100 tokens. Punctuation marks (. and ,) are split into their own tokens, which is the expected behavior - they get filtered out in the next step.
Stopword and Punctuation Removal
Output:
Cleaned Tokens (no stopwords, no punctuation)
========================================
['article', 'explains', 'uninstall', 'ControlUp', 'Apps', 'Verify',
'AppDXHelper.exe', 'process', 'running', 'machine', 'want', 'uninstall',
'ControlUp', 'Desktops', 'ensure', 'machine', 'assigned', 'specific', 'tag',
'device', 'group', 'group', 'contain', 'machines', 'ControlUp', 'Apps',
'installed', 'ControlUp', 'Apps', 'settings', 'verify', 'tag', 'included',
'targeted', 'device', 'tags', 'list', 'Restart', 'ControlUp', 'Desktop',
'service', 'reboot', 'machine', 'confirm', 'AppDXHelper.exe', 'process',
'longer', 'appears', 'Task', 'Manager']
Cleaned token count: 50
The token count dropped from 100 to 50 - half the words were stopwords or punctuation. But this step introduced a serious problem: it deleted the negation words "not", "no", and similar terms.
| Original phrase | After stopword removal | Problem |
|---|---|---|
| should not be installed | installed | Meaning completely inverted |
| is not included in the targeted device tags list | included targeted... | Meaning completely inverted |
| no longer appears in Task Manager | longer appears in Task... | Negation lost |
NLTK's default English stopword list treats all negation words as stopwords because they are high-frequency. But in instructional and procedural text like this, negation is essential to understanding what the user should and should not do. The difference between "install" and "do not install" is the entire point of the document.
Recommendation: Customize the stopword list for this domain - specifically, remove "not", "no", and "nor" from the stopwords so that negation is preserved.
Stemming vs. Lemmatization
Output (selected rows):
Stemming (Porter) vs Lemmatization (verb)
========================================
Token Stemmed Lemmatized
------------------------- ------------------------- -------------------------
machine machin machine
service servic service
device devic device
specific specif specific
verify verifi verify
settings set settings
AppDXHelper.exe appdxhelper.ex appdxhelper.exe
Lemmatization produced significantly more readable and accurate results than stemming for this technical text. The Porter Stemmer aggressively truncated words into non-word fragments: "machine" became "machin", "service" became "servic", "device" became "devic". None of these are recognizable English words.
A particularly telling example is "settings", which the stemmer reduced to "set" - completely changing the meaning - while the lemmatizer correctly kept it as "settings". This difference arises because stemming applies blind suffix-stripping rules without any understanding of the word, while lemmatization consults a lexical dictionary (WordNet) and considers part of speech to find the true base form.
The lemmatizer did produce one oddity: "installed" was lemmatized to "instal" (a British variant spelling present in WordNet) rather than the expected "install".
Verdict: For domain-specific technical documentation where precise terminology matters, lemmatization is clearly the better choice.
POS Tagging
Output (selected rows):
POS Tagging
========================================
article NN
explains VBZ
uninstall JJ
ControlUp NNP
Apps NNP
Verify NNP
AppDXHelper.exe NNP
running VBG
machine NN
settings NNS
Restart NNP
Desktop NNP
Task NNP
Manager NNP
The POS tagger assigned reasonable tags for most tokens. Product names like ControlUp and Apps were tagged as NNP (proper nouns), which is correct. However, "uninstall" was tagged as JJ (adjective) rather than VB (verb) - the tagger struggled without surrounding function words that were stripped during stopword removal. Similarly, "Verify" and "Restart" were tagged as NNP because their initial capitalization (they start sentences) made them look like proper nouns after context was lost.
Named Entity Recognition
Output:
Named Entity Recognition (NER)
========================================
article/NN
explains/VBZ
uninstall/JJ
[ORGANIZATION] ControlUp Apps Verify
AppDXHelper.exe/NNP
process/NN
running/VBG
machine/NN
...
[ORGANIZATION] ControlUp Apps
installed/VBD
[ORGANIZATION] ControlUp Apps
settings/NNS
...
[PERSON] Restart ControlUp Desktop
service/NN
reboot/NN
machine/NN
...
[PERSON] Task Manager
The NER made several notable errors:
Correct identifications:
- [ORGANIZATION] ControlUp - correctly recognized as an organization

Errors:
- [ORGANIZATION] ControlUp Apps (two occurrences) - The actual product name is "ControlUp for Apps", but the stopword "for" was removed during preprocessing, leaving "ControlUp" and "Apps" adjacent. The NER merged them into a single entity with the wrong name - close enough to look correct, but the recognized entity does not match the real product name.
- [ORGANIZATION] ControlUp Apps Verify - "Verify" is a verb that begins a new sentence, but after stopword removal the sentence boundary was lost, causing the NER to absorb it into the preceding entity. This compounds the "ControlUp Apps" error above with an additional spurious token.
- [PERSON] Restart ControlUp Desktop - "Restart" is a verb and "ControlUp for Desktop" is a product name. The capital "R" (sentence-initial) misled the recognizer into treating it as a proper noun.
- [PERSON] Task Manager - This is a Windows application, not a person. The two capitalized words in sequence matched the recognizer's pattern for human names.
Root causes:
- Stopword and punctuation removal stripped away sentence structure and context that the NER relies on for disambiguation.
- NLTK's NER is trained primarily on newswire text and lacks familiarity with software product names, filenames, and IT terminology like "ControlUp", "AppDXHelper.exe", or "Task Manager".
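Since no amount of newswire training will teach the model these names, one pragmatic mitigation is a gazetteer/regex pass for known product names and filename patterns, run alongside (or before) statistical NER. A minimal sketch - the product list and entity labels here are illustrative choices, not from the article:

```python
import re

# Known product names and a filename pattern for this domain.
PRODUCTS = ["ControlUp for Apps", "ControlUp for Desktops", "Task Manager"]
FILENAME_RE = re.compile(r"\b[\w-]+\.(?:exe|dll|msi)\b", re.IGNORECASE)

def tag_domain_entities(text):
    """Return (label, span) pairs for exact product-name and filename matches."""
    entities = []
    for name in PRODUCTS:
        for m in re.finditer(re.escape(name), text):
            entities.append(("PRODUCT", m.group()))
    for m in FILENAME_RE.finditer(text):
        entities.append(("FILE", m.group()))
    return entities

print(tag_domain_entities(
    "Verify that the AppDXHelper.exe process is running, then open Task Manager."
))
```

Exact matching like this is brittle against spelling variants, but unlike the statistical model it can never invent a "[PERSON] Task Manager".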
Exercise 2: Domain-Specific Stopwords
The default NLTK stopword list is designed for general English. But within a specialized domain like IT documentation, certain content words appear so frequently that they stop being informative. This exercise explores what happens when you add domain-specific terms to the stopword list.
Setup
Three domain-specific words are added to the default stopwords: "machine", "process", and "service". The full pipeline runs twice - once with default stopwords and once with the custom set - to compare outputs.
Code
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('maxent_ne_chunker', quiet=True)
nltk.download('maxent_ne_chunker_tab', quiet=True)
nltk.download('words', quiet=True)
CUSTOM_STOPWORDS = ["machine", "process", "service"]
def run_pipeline(text, stop_words, label):
    """Run the full NLP pipeline with a given stopword set and print results."""
    print(f"\n{'#' * 60}")
    print(f" {label}")
    print(f"{'#' * 60}")

    tokens = word_tokenize(text)
    punct = set(string.punctuation)
    cleaned = [
        tok for tok in tokens
        if tok.lower() not in stop_words and tok not in punct
    ]
    print(f"\nCleaned Tokens ({len(cleaned)} tokens)")
    print("=" * 40)
    print(cleaned)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    print(f"\n{'Token':<25} {'Stemmed':<25} {'Lemmatized'}")
    print(f"{'-'*25} {'-'*25} {'-'*25}")
    for tok in cleaned:
        stemmed = stemmer.stem(tok)
        lemmatized = lemmatizer.lemmatize(tok.lower(), pos="v")
        print(f"{tok:<25} {stemmed:<25} {lemmatized}")

    pos_tags = nltk.pos_tag(cleaned)
    print("\nNamed Entity Recognition (NER)")
    print("=" * 40)
    named_entities = nltk.ne_chunk(pos_tags)
    for chunk in named_entities:
        if hasattr(chunk, 'label'):
            entity_text = " ".join(word for word, tag in chunk)
            print(f" [{chunk.label()}] {entity_text}")
        else:
            word, tag = chunk
            print(f" {word}/{tag}")
    return cleaned
def main():
    default_stop_words = set(stopwords.words("english"))
    custom_stop_words = default_stop_words | set(CUSTOM_STOPWORDS)

    default_cleaned = run_pipeline(text, default_stop_words, "DEFAULT STOPWORDS")
    custom_cleaned = run_pipeline(
        text, custom_stop_words,
        f"CUSTOM STOPWORDS (added: {CUSTOM_STOPWORDS})"
    )

    removed_tokens = [tok for tok in default_cleaned if tok.lower() in CUSTOM_STOPWORDS]
    print(f"\n{'#' * 60}")
    print(" COMPARISON")
    print(f"{'#' * 60}")
    print(f"\nTokens removed by custom stopwords: {removed_tokens}")
    print(f"Default token count: {len(default_cleaned)}")
    print(f"Custom token count: {len(custom_cleaned)}")
    print(f"Tokens eliminated: {len(default_cleaned) - len(custom_cleaned)}")

if __name__ == "__main__":
    main()
Token Reduction
The comparison output:
############################################################
COMPARISON
############################################################
Tokens removed by custom stopwords: ['process', 'machine', 'machine', 'service', 'machine', 'process']
Default token count: 50
Custom token count: 44
Tokens eliminated: 6
| Metric | Value |
|---|---|
| Default token count | 50 |
| Custom token count | 44 |
| Tokens eliminated | 6 |
Removed tokens: process, machine, machine, service, machine, process
"machine" appeared 3 times, "process" 2 times, and "service" once. These words are extremely common in IT/sysadmin documentation but carry almost no distinguishing meaning within that domain - they function as background noise rather than meaningful content words.
NER Ripple Effect
Removing these tokens subtly changed downstream NER behavior. Compare the custom-stopwords NER output against the default:
# Default stopwords NER (excerpt)
...
process/NN
longer/RB <-- tagged as adverb
appears/VBZ
...
# Custom stopwords NER (excerpt)
...
longer/JJR <-- re-tagged as comparative adjective
appears/VBZ
...
"longer" was re-tagged from RB (adverb) to JJR (comparative adjective) because removing "process" before it changed the surrounding context the POS tagger relied on. This demonstrates that stopword choices ripple through the entire pipeline - what you remove upstream affects tagging and entity recognition downstream.
What This Tells Us About Domain-Specific Preprocessing
Generic stopword lists are a starting point, not a final answer. NLTK's default list is built for general English and does a reasonable job of filtering high-frequency function words like "the", "is", and "in". But within a specialized corpus - such as IT uninstallation instructions - certain content words become so frequent that they stop being informative.
"Machine" in a sysadmin document is like "the" in general text: it appears everywhere and tells you nothing you don't already know.
Tailoring the stopword list to the vocabulary of the text's domain can significantly sharpen downstream tasks like keyword extraction, topic modeling, and NER by filtering out terms that are informative in general English but redundant within the specialized corpus.
The flip side: This requires domain knowledge. Blindly adding words risks removing terms that do carry meaning in certain contexts - for example, "process" in a document specifically about OS process management would be a key term, not noise.
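One way to keep that judgment call informed rather than blind is to let the corpus surface the candidates: rank the post-cleanup tokens by frequency and flag anything above a threshold for human review. A sketch with illustrative numbers (the token list and the 15% threshold are assumptions, standing in for real Exercise 1 output):

```python
from collections import Counter

# Stand-in for the tokens left after default stopword removal.
cleaned = ["machine", "process", "machine", "service", "tag", "machine",
           "ControlUp", "Apps", "ControlUp", "process", "tag", "device"]

counts = Counter(tok.lower() for tok in cleaned)

# Flag anything above ~15% of the cleaned tokens as a candidate stopword.
# A human still reviews the list: note that "controlup" gets flagged too,
# and it is the product name, not noise.
threshold = 0.15 * len(cleaned)
candidates = [tok for tok, n in counts.most_common() if n > threshold]
print(candidates)
```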
Key Takeaways
- Sentence tokenization handles technical text well, correctly distinguishing filenames like AppDXHelper.exe from sentence boundaries
- Stopword removal is a double-edged sword - it halved the token count but destroyed critical negation, inverting the meaning of safety-critical instructions
- Lemmatization outperforms stemming on technical text where precise terminology matters; stemming produced unrecognizable fragments like "machin" and "servic"
- NER struggles with IT terminology - NLTK's newswire-trained model misclassified "Task Manager" as a person and merged sentence-initial verbs into entity spans
- Preprocessing choices cascade downstream - removing stopwords strips context that POS taggers and NER models depend on, causing secondary errors
- Domain-specific stopwords can sharpen the pipeline by filtering words that are high-frequency noise within the target domain, but require careful selection to avoid removing terms that carry meaning
- Always validate pipeline output against the source text, especially in procedural documentation where negation and precise phrasing determine correctness