Gen AI Developer Week 1 — Day 6
Natural Language Processing (NLP) is the key to bridging the gap between human language and machine understanding. From chatbots to sentiment analysis, NLP enables computers to interpret, process, and generate human language with remarkable accuracy. As part of my Generative AI Developer Week journey, this article explores the fundamentals of NLP, its real-world applications, and how it forms the backbone of many AI-driven innovations today.
Let’s get started by first installing the necessary dependencies.
# Installing NLTK
pip install nltk
# Installing spaCy
pip install spacy
# Downloading the small English model (needed for the lemmatization step below)
python -m spacy download en_core_web_sm
Some concepts to get familiar with before proceeding:
Tokenization: Splitting text into smaller units like words or sentences.
Removing Stopwords: Eliminating common words that don’t add much meaning (e.g., “is”, “the”).
Stemming: Reducing words to their root forms (e.g., “playing” → “play”).
Lemmatization: Reducing words to their dictionary base forms (e.g., “mice” → “mouse”).
Tokenization with NLTK
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download necessary data
nltk.download('punkt')
nltk.download('punkt_tab')
text = "Natural Language Processing is amazing. It helps computers understand human language."
# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Word Tokenization
words = word_tokenize(text)
print("Words:", words)
Removing Stopwords
from nltk.corpus import stopwords
# Download necessary data
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
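# Note: punctuation tokens such as '.' are not stopwords, so they will survive this filter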
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)
Info: All the necessary NLTK data can be found here.
Stemming with NLTK
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]
print("Stemmed Words:", stemmed_words)
Lemmatization with spaCy
import spacy
nlp = spacy.load('en_core_web_sm') # Loads English model
doc = nlp(" ".join(filtered_words))
lemmatized_words = [token.lemma_ for token in doc]
print("Lemmatized Words:", lemmatized_words)
Info: You can find all available spaCy models here.
Practice Task — 1 — Process a New Text
Choose any paragraph you like.
Perform tokenization, stopword removal, stemming, and lemmatization. (One possible pipeline is sketched below.)
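If you want a starting point, here is one possible sketch that chains the steps from this lesson into a single helper. The function name preprocess and the extra isalpha() check (which drops punctuation and numbers) are my own choices, not part of the lesson:
# One possible end-to-end pipeline for Practice Task 1
def preprocess(paragraph):
    tokens = word_tokenize(paragraph)
    # Keep alphabetic, non-stopword tokens only
    filtered = [w for w in tokens if w.isalpha() and w.lower() not in stop_words]
    stems = [stemmer.stem(w) for w in filtered]
    lemmas = [token.lemma_ for token in nlp(" ".join(filtered))]
    return filtered, stems, lemmas

filtered, stems, lemmas = preprocess("Paste your chosen paragraph here.")
print("Filtered:", filtered)
print("Stemmed:", stems)
print("Lemmatized:", lemmas)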
Practice Task — 2 — Count the frequency of each unique word in the text after preprocessing.
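One way to do this, assuming the lemmas list from the Task 1 sketch above, is collections.Counter from the standard library:
from collections import Counter

# Count how often each preprocessed word appears
word_counts = Counter(lemmas)
print(word_counts.most_common(10))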
Practice Task — 3 — Create a bar chart of the top 10 most frequent words using Matplotlib.
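A minimal Matplotlib sketch, assuming the word_counts Counter from the previous task (install the library with pip install matplotlib if needed):
import matplotlib.pyplot as plt

# Plot the 10 most frequent words as a bar chart
top_words = word_counts.most_common(10)
labels = [word for word, count in top_words]
counts = [count for word, count in top_words]

plt.bar(labels, counts)
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.title("Top 10 Most Frequent Words")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()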
Happy Learning! 😊 For any questions or support, feel free to message me on LinkedIn.