Friday, July 14, 2017

Part 7: Natural Language Processing

Natural Language Processing (or NLP) is the application of Machine Learning models to text and language.
NLP refers to the AI methods that let people communicate with intelligent systems in a natural language such as English.

Teaching machines to understand the spoken and written word is the focus of Natural Language Processing. Both the input and the output of an NLP system can be speech or written text.
Whenever you dictate something into your iPhone / Android device and it is converted to text (Speech to Text), that’s an NLP algorithm in action.

NLP has two components:
  • NLU - Natural Language Understanding
  • NLG - Natural Language Generation 

NLU is generally considered harder than NLG because of several kinds of ambiguity: lexical (for example, a word that can be a noun or a verb), syntactic, and referential.

You can also use NLP for the tasks below:
  • On a text review, to predict whether the review is a good one or a bad one.
  • On an article, to predict which category it belongs to among the categories you are trying to segment.
  • On a book, to predict the type of the book (Education, Mystery, etc.).
  • On email, to read messages and filter out spam.
  • To build a machine translator or a speech recognition system, or to use classification algorithms to identify the language of a text.
Speaking of classification algorithms, many NLP algorithms are classification models. They include Logistic Regression, Naive Bayes, CART (a model based on decision trees), Maximum Entropy (closely related to Logistic Regression, not to decision trees), and Hidden Markov Models (models based on Markov processes).

A very well-known model in NLP is the Bag of Words model. It is used to pre-process the texts you want to classify, turning each text into a vector of word counts, before fitting a classification algorithm on the observations containing those texts.
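
To make the idea concrete, here is a minimal pure-Python sketch of a Bag of Words. The three reviews are made up for illustration, and the helper name `to_count_vector` is hypothetical; scikit-learn's CountVectorizer does roughly the same job with more options (tokenisation rules, a cap on the number of features, sparse output).

```python
from collections import Counter

# Made-up, already-cleaned reviews used only for illustration.
corpus = ["wow love place", "crust not good", "love food love place"]

# Vocabulary: every unique word, sorted so the column order is stable.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def to_count_vector(doc):
    """Return one row of the matrix: how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One row per review, one column per unique word.
matrix = [to_count_vector(doc) for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is what the classifier later sees as its independent variables: word order is discarded and only the counts remain.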

In this part, you will understand and learn how to:
  • Clean texts to prepare them for the Machine Learning models,
  • Create a Bag of Words model,
  • Apply Machine Learning models onto this Bag of Words model.
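
The cleaning step from the first bullet can be sketched on a single sentence using only the standard library. This is a simplified illustration, not the full pipeline: the stop-word set `STOP_WORDS` is a tiny hypothetical list (NLTK ships a much longer one), the sample sentence is made up, and stemming is left out because it needs a stemmer such as NLTK's PorterStemmer.

```python
import re

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "the", "this", "on", "so", "and", "was", "is", "i"}

def clean_review(text):
    """Keep letters only, lowercase the text, and drop stop words (no stemming)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # punctuation and digits become spaces
    words = letters_only.lower().split()            # lowercase, then split on whitespace
    return " ".join(word for word in words if word not in STOP_WORDS)

cleaned = clean_review("Wow... Loved this place!")
print(cleaned)  # wow loved place
```

With stemming added, "loved" would further be reduced to "love", so that different forms of a word share one column in the Bag of Words.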

Code: Natural Language Processing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

# Cleaning the texts
# Keep only alphabetic characters (this also removes punctuation such as "...")
# Get rid of capitalization (convert everything to lowercase)
# Remove stop words such as: a, the, this, on, so, and, had, they, etc.
# Apply stemming, e.g. convert "loved" to "love"
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
ps = PorterStemmer()                          # create the stemmer once, not per review
stop_words = set(stopwords.words('english'))  # build the stop-word set once
for i in range(len(dataset)):                 # the dataset used here has 1000 reviews
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

# Creating the Bag of Words model
# Take unique words only (no repetition) and create one column per unique word;
# each cell holds how many times that word occurs in the review
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)  # keep at most the 1500 most frequent words
X = cv.fit_transform(corpus).toarray()  # up to 1500 independent variables here
y = dataset.iloc[:, 1].values

# Now Apply Classification Algorithm here (Naive Bayes, Decision Tree or Random Forest Classification)
# Splitting the dataset into the Training set and Test set
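# train_test_split lives in sklearn.model_selection (sklearn.cross_validation has been removed from scikit-learn)
from sklearn.model_selection import train_test_split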
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# No plotting can be done here, as there are 1500 independent variables (X) and not just one
# So make the Confusion Matrix only, to evaluate the model by counting the correct/incorrect predictions made by Naive Bayes
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
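
Once a confusion matrix is available, the usual summary metrics can be read straight off it. The sketch below assumes a binary problem and the layout scikit-learn uses (rows are actual classes, columns are predicted classes). The counts are made up for illustration and are not the results of the model above.

```python
# Hypothetical 2x2 confusion matrix (NOT real results from the restaurant reviews):
# rows = actual class, columns = predicted class.
cm = [[40, 10],   # actual 0: 40 true negatives, 10 false positives
      [5, 45]]    # actual 1: 5 false negatives, 45 true positives

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # of the reviews predicted positive, share truly positive
recall = tp / (tp + fn)                     # of the truly positive reviews, share that were found

print(accuracy, precision, recall)
```

With the real `cm` from the code above, the same formulas apply by reading `cm[0][0]`, `cm[0][1]`, `cm[1][0]` and `cm[1][1]`.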


Hope this helps!!

Arun Manglick
