Simple Data Analysis Project (E-Commerce)



This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers.

Its nine supportive features offer a great environment to parse out the text through its multiple dimensions.

In this research, we analyze customer reviews from women's clothing e-commerce using statistical analysis and sentiment classification. We first analyze the non-text review features found in the dataset (e.g. which age groups are likely to recommend the clothing line, the probability that a highly rated clothing line gets recommended, the class of dress purchased, etc.) in an attempt to uncover any connection between them and the customer's recommendation of the product. Then, we implement a random forest model to classify whether a review text recommends the purchased product or not.
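The classification step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn is installed, it uses TF-IDF features (the write-up does not say how the text was vectorized), and it trains on a handful of made-up review strings standing in for the Review Text and Recommended IND columns.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for the "Review Text" and "Recommended IND" columns.
texts = [
    "Love this dress, fits perfectly and very comfortable",
    "Beautiful fabric and great fit, highly recommend",
    "Soft, flattering and true to size",
    "Poor quality, fell apart after one wash",
    "Runs very small and the material feels cheap",
    "Unflattering cut, I returned it",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = recommended, 0 = not recommended

# TF-IDF vectorization is an assumption; hyperparameters are illustrative.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

# Predict the recommendation flag for unseen review text.
new_reviews = ["Great fit and very comfortable", "Cheap material, returned it"]
predictions = model.predict(new_reviews)
```

On the full dataset, rows with missing review text would need to be dropped first, and a held-out test split would be needed to estimate how well the model generalizes.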


This dataset includes 23486 rows and 10 feature variables. Each row

corresponds to a customer review, and includes the variables:

● Clothing ID: Integer Categorical variable that refers to the specific piece

being reviewed.

● Age: Positive Integer variable of the reviewer's age.

● Title: String variable for the title of the review.

● Review Text: String variable for the review body.

● Rating: Positive Ordinal Integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).

● Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.


● Positive Feedback Count: Positive Integer documenting the number of

other customers who found this review positive.

● Division Name: Categorical name of the product's high-level division.
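The variables listed above could be loaded into typed records along these lines. This is a sketch under assumptions: the column headers are taken from the list above, the `Review` class and `parse_reviews` function are hypothetical names, and the two sample rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Review:
    clothing_id: int
    age: int
    title: str
    review_text: str
    rating: int             # 1 (worst) to 5 (best)
    recommended: bool       # "Recommended IND": 1 -> True, 0 -> False
    positive_feedback: int
    division: str


def parse_reviews(handle):
    """Parse CSV rows into Review records, checking the documented value ranges."""
    reviews = []
    for row in csv.DictReader(handle):
        rating = int(row["Rating"])
        flag = int(row["Recommended IND"])
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
        if flag not in (0, 1):
            raise ValueError(f"recommendation flag must be 0 or 1: {flag}")
        reviews.append(
            Review(
                clothing_id=int(row["Clothing ID"]),
                age=int(row["Age"]),
                title=row["Title"],
                review_text=row["Review Text"],
                rating=rating,
                recommended=bool(flag),
                positive_feedback=int(row["Positive Feedback Count"]),
                division=row["Division Name"],
            )
        )
    return reviews


# Made-up sample rows for illustration only; not taken from the dataset.
sample = io.StringIO(
    "Clothing ID,Age,Title,Review Text,Rating,Recommended IND,"
    "Positive Feedback Count,Division Name\n"
    "101,34,Great dress,Fits well,5,1,2,General\n"
    "102,52,Too small,Runs small,2,0,0,General\n"
)
reviews = parse_reviews(sample)
```

Validating Rating and Recommended IND at load time keeps later aggregates from silently absorbing malformed rows.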

Graph Evaluation

(i) Customers aged 24-66 are the most active in providing reviews.

(ii) After age 50, the number of reviews declines steadily with age.

(iii) In the ratings plot, the number of fully satisfied customers (5 stars) is nearly equal to the combined number of customers who gave 1 to 4 stars.

(iv) Most of the customers are satisfied.

(v) There isn’t a significant difference in the box-plots across various age groups.

(vi) Basically, all the age groups are satisfied to the same extent.

(vii) In the first plot, the Intimate division has a very high probability of being recommended. The General division is the least recommended.

(viii) Similarly, in the second plot, Bottoms are the most recommended, followed by Tops. Trend items are highly unlikely to be recommended.

(ix) In the third plot, Lounge and Knits are the top choices and are highly likely to be recommended. Jeans, Legwear, Outerwear, Shorts and Layering are among the least recommended.
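The recommendation probabilities discussed in these observations can be computed as a simple per-group share. The sketch below is illustrative only: the function names, the 10-year age band width, and the sample rows are all assumptions, not the code or data behind the plots.

```python
from collections import defaultdict


def recommendation_rate(rows, key):
    """Return {group: share of rows with recommended == 1} for a grouping function."""
    totals = defaultdict(int)
    recommended = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        recommended[group] += row["recommended"]
    return {group: recommended[group] / totals[group] for group in totals}


def age_band(row, width=10):
    """Bucket an age into a label such as '30-39' (the band width is an assumption)."""
    low = (row["age"] // width) * width
    return f"{low}-{low + width - 1}"


# Made-up rows for illustration; not taken from the dataset.
rows = [
    {"age": 28, "division": "General", "recommended": 1},
    {"age": 34, "division": "General", "recommended": 0},
    {"age": 37, "division": "Intimate", "recommended": 1},
    {"age": 55, "division": "Intimate", "recommended": 1},
]

by_division = recommendation_rate(rows, key=lambda r: r["division"])
by_age = recommendation_rate(rows, key=age_band)
```

The same function works for any categorical column, so one helper could back each of the three recommendation plots as well as the age-group comparison.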

Research Provided by Andrey Fateev