Natural Language Processing (NLP) is the application of Machine Learning models to text and language.
NLP is the branch of AI concerned with communicating with intelligent systems in a natural language such as English.
The focus of Natural Language Processing is teaching machines to understand spoken and written language. Both the input and the output of an NLP system can be speech or written text.
Whenever you dictate something into your iPhone or Android device and it is converted to text (Speech to Text), that’s an NLP algorithm in action.
NLP has two components:
- NLU - Natural Language Understanding
- NLG - Natural Language Generation
NLU is harder than NLG because natural language is ambiguous in several ways: lexical (is a word a noun or a verb?), syntactic, and referential.
You can also use NLP:
- On a text review, to predict whether the review is a good one or a bad one.
- On an article, to predict the category of the article you are trying to segment.
- On a book, to predict its genre (Education, Mystery, etc.).
- On email, to read messages and filter out spam.
- To build a machine translator or a speech recognition system; classification algorithms can also be used to classify language.
A very well-known model in NLP is the Bag of Words model. It is used to pre-process the texts to be classified, before the classification algorithms are fitted on the observations containing those texts.
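To make the idea concrete, here is a minimal sketch of a Bag of Words built by hand with the standard library only. The three short "reviews" are made up for illustration and are assumed to be already cleaned; the real code in this post uses scikit-learn's CountVectorizer instead.

```python
from collections import Counter

# Three made-up, already-cleaned reviews (illustrative data, not from the dataset).
docs = ["love place", "food bad", "love food love"]

# The vocabulary is the sorted set of unique words across all documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Each document becomes one row; each column counts one vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['bad', 'food', 'love', 'place']
for row in matrix:
    print(row)     # [0, 0, 1, 1], then [1, 1, 0, 0], then [0, 1, 2, 0]
```

Word order inside a review is thrown away; only the counts per word survive, which is why the model is called a "bag" of words.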
In this part, you will understand and learn how to:
- Clean texts to prepare them for the Machine Learning models,
- Create a Bag of Words model,
- Apply Machine Learning models to this Bag of Words model.
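As a rough illustration of the first step, cleaning, the sketch below processes a single made-up review with the standard library only. The `clean_review` helper and the tiny `STOP_WORDS` set are hypothetical names introduced here; the post's own code uses NLTK's full English stop word list and also stems each word, which this sketch leaves out.

```python
import re

# A tiny hand-picked stop word list, for illustration only; the post's code
# uses NLTK's much longer English list instead.
STOP_WORDS = {"a", "the", "this", "is", "was", "and", "so", "on", "i"}

def clean_review(text):
    """Keep letters only, lowercase, split into words, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in STOP_WORDS)

# A made-up review, not a row from the dataset.
print(clean_review("Wow... I Loved this place!!"))  # wow loved place
```

With stemming added, "loved" would be reduced further to "love", so that different forms of the same word share one column in the Bag of Words.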
Code: Natural Language Processing
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
# Cleaning the texts
# Remove non-alphabetic characters (digits and punctuation such as "...")
# Remove stop words such as a, the, this, on, so, and, had, they
# Apply stemming, e.g. convert "loved" to "love"
# Convert everything to lowercase
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
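corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(0, 1000):
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words and stem what remains
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)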
# Creating the Bag of Words model
# Take each unique word once (no repetition) and create one column per word
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)  # keep only the 1500 most frequent words
X = cv.fit_transform(corpus).toarray()  # up to 1500 independent variables here
y = dataset.iloc[:, 1].values
# Now Apply Classification Algorithm here (Naive Bayes, Decision Tree or Random Forest Classification)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed from scikit-learn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# No plot is possible here because there are 1500 independent variables (X), not just one or two
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix to evaluate the model: it counts the correct and incorrect predictions made by Naive Bayes
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
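For reading the result: with labels 0 and 1, scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]], with true labels in rows and predicted labels in columns. The sketch below recomputes those four counts by hand and derives accuracy, precision and recall from them. The `confusion_counts` helper and the toy labels are made up for illustration; they are not results from the restaurant reviews.

```python
def confusion_counts(true_labels, predicted_labels):
    """Return (tn, fp, fn, tp) for binary labels, where 1 is the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

# Made-up labels for illustration only.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(toy_true, toy_pred)
accuracy = (tp + tn) / len(toy_true)   # share of all predictions that are correct
precision = tp / (tp + fp)             # share of predicted positives that are right
recall = tp / (tp + fn)                # share of actual positives that were found

print(tn, fp, fn, tp)                  # 3 1 1 3
print(accuracy, precision, recall)     # 0.75 0.75 0.75
```

The same formulas apply to `cm` from the code above, taking TN as cm[0][0], FP as cm[0][1], FN as cm[1][0] and TP as cm[1][1].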
Hope this helps!!
Arun Manglick