Gen AI Developer Week 1 — Day 6
Natural Language Processing (NLP) is the key to bridging the gap between human language and machine understanding. From chatbots to sentiment analysis, NLP enables computers to interpret, process, and generate human language with remarkable accuracy. As part of my Generative AI Developer Week journey, this article explores the fundamentals of NLP, its real-world applications, and how it forms the backbone of many AI-driven innovations today.
Let’s get started by first installing the necessary dependencies.
# Installing NLTK
pip install nltk
# Installing spaCy
pip install spacy
# Downloading the small English model (needed for the lemmatization step below)
python -m spacy download en_core_web_sm
Some concepts to get familiar with before proceeding:
Tokenization: Splitting text into smaller units like words or sentences.
Removing Stopwords: Eliminating common words that don’t add much meaning (e.g., “is”, “the”).
Stemming: Reducing words to their root forms (e.g., “playing” → “play”).
Lemmatization: Reducing words to their dictionary base forms (e.g., “mice” → “mouse”).
Tokenization with NLTK
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download necessary data
nltk.download('punkt')
nltk.download('punkt_tab')
text = "Natural Language Processing is amazing. It helps computers understand human language."
# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Word Tokenization
words = word_tokenize(text)
print("Words:", words)
Removing Stopwords
from nltk.corpus import stopwords
# Download necessary data
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
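# Note: punctuation tokens such as '.' are not stopwords, so they will survive this filter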
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)
Info: All the necessary NLTK data can be found here.
Stemming with NLTK
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print("Stemmed Words:", stemmed_words)
Lemmatization with spaCy
import spacy
nlp = spacy.load('en_core_web_sm') # Loads English model
doc = nlp(" ".join(filtered_words))
lemmatized_words = [token.lemma_ for token in doc]
print("Lemmatized Words:", lemmatized_words)
Info: You can find all available spaCy models here.
Practice Task — 1 — Process a New Text
Choose any paragraph you like.
Perform tokenization, stopword removal, stemming, and lemmatization. (One possible pipeline is sketched below.)
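If you want a starting point, here is one possible sketch that chains the steps from this lesson into a single helper. The function name preprocess and the extra isalpha() check (which drops punctuation and numbers) are my own choices, not part of the lesson:
# One possible end-to-end pipeline for Practice Task 1
def preprocess(paragraph):
    tokens = word_tokenize(paragraph)
    # Keep alphabetic, non-stopword tokens only
    filtered = [w for w in tokens if w.isalpha() and w.lower() not in stop_words]
    stems = [stemmer.stem(w) for w in filtered]
    lemmas = [token.lemma_ for token in nlp(" ".join(filtered))]
    return filtered, stems, lemmas

filtered, stems, lemmas = preprocess("Paste your chosen paragraph here.")
print("Filtered:", filtered)
print("Stemmed:", stems)
print("Lemmatized:", lemmas)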
Practice Task — 2 — Count the frequency of each unique word in the text after preprocessing.
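One way to do this, assuming the lemmas list from the Task 1 sketch above, is collections.Counter from the standard library:
from collections import Counter

# Count how often each preprocessed word appears
word_counts = Counter(lemmas)
print(word_counts.most_common(10))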
Practice Task — 3 — Create a bar chart of the top 10 most frequent words using Matplotlib.
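A minimal Matplotlib sketch, assuming the word_counts Counter from the previous task (install the library with pip install matplotlib if needed):
import matplotlib.pyplot as plt

# Plot the 10 most frequent words as a bar chart
top_words = word_counts.most_common(10)
labels = [word for word, count in top_words]
counts = [count for word, count in top_words]

plt.bar(labels, counts)
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.title("Top 10 Most Frequent Words")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()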
Happy Learning! 😊 For any questions or support, feel free to message me on LinkedIn.