Introduction and Business Understanding
The data we will analyze are housing data from Louisiana. We will use clustering methods to group similar records, helping the analysis focus on the critical differences between data points rather than be overwhelmed by individual differences that have no associated pattern.
The file contains 2,150 houses for sale listed on Redfin.com in 2018, with several variables for each listing. We want to create clusters of houses within Baton Rouge, Louisiana, identify data issues and take appropriate action, and identify variables that should be excluded from the analysis, providing a rationale for doing so.
Overall, the main goal is to prepare the data set for better analysis by using clustering methods.
The “redfin_2018_Houses Raw Data” JMP file contains 2,150 records of houses across Louisiana. Before clustering could be performed, a significant amount of data preprocessing was necessary, including variable reduction, treatment of missing values, screening for outliers, and recoding. Our variables are Price, Beds, Baths, Square Feet, and $/Square Feet, with 0, 290, 288, 303, and 303 missing values, respectively.
Because “City” is nominal, we used Recode to separate Baton Rouge from all other cities, which we grouped as “Other”: Baton Rouge has 2,098 records and the other cities combined have 52, for the total of 2,150 records within Louisiana mentioned above. We then created a new data table and excluded 354 records, leaving 1,796 rows after removing outliers, missing values, and properties outside Baton Rouge.
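For illustration, the recode-and-filter step can be sketched outside JMP in Python with pandas; the column names and the tiny sample below are assumptions, not the actual Redfin export.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the Redfin export;
# the column names are assumptions.
df = pd.DataFrame({
    "CITY": ["Baton Rouge", "Baton Rouge", "Zachary", "Baton Rouge"],
    "PRICE": [250000, 310000, 199000, 275000],
    "BEDS": [3, None, 3, 4],
    "BATHS": [2, 2, 1.5, None],
    "SQUARE FEET": [1800, 2100, None, 2400],
})

# Recode: keep "Baton Rouge" as-is and group every other city as "Other".
df["CITY"] = df["CITY"].where(df["CITY"] == "Baton Rouge", "Other")

# Exclude non-Baton Rouge listings, then drop rows with missing values.
clean = df[df["CITY"] == "Baton Rouge"].dropna()
print(len(clean), "rows remain")
```

On the real file, a similar filter-and-drop sequence, plus outlier screening, is what takes the table from 2,150 rows down to 1,796.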
Two clustering techniques were performed: K-means and hierarchical. The K-means clustering method provided a manageable number of clusters while also providing insight, described below.
The first analysis is a K-means clustering with k = 5 using the following variables: Price, Beds, Baths, SF, and $/SF. The “Cluster Summary” shows that clusters 1 and 3 are the closest in size, with 263 and 233 records respectively (Appendix A). All variables are relatively close to each other in their “Cluster Means,” or distances.
Looking at the means, the five clusters divide into two main subgroups whose means (distances) are close to each other: clusters 1, 4, and 5 form the first group and clusters 2 and 3 the second, as the biplot of clusters also shows (Appendix B). A shorter distance between clusters means they contain similar house observations within the Baton Rouge area in the space spanned by the dimensions we chose.
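As a sketch of the same analysis outside JMP, here is K-means with k = 5 on five variables using scikit-learn and synthetic data; the column order, means, and spreads below are illustrative assumptions, not the actual listings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for Price, Beds, Baths, SF, and $/SF;
# the scales used here are assumptions for illustration only.
X = rng.normal(size=(200, 5)) * [80000, 1, 1, 600, 30] \
    + [250000, 3, 2, 1900, 130]

# Standardize so the dollar-scaled columns do not dominate the
# Euclidean distances that K-means minimizes.
Xs = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(Xs)
print(km.cluster_centers_.shape)  # one 5-dimensional mean per cluster
```

The rows of `cluster_centers_` play the role of JMP’s “Cluster Means,” so distances between clusters can be compared directly.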
We are not required to perform validation, but if we needed to validate the K-means solution, we would use two techniques: cross-validation, and comparing the total within-cluster sum of squared distances to the cluster means across candidate values of k, including k = 5.
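The second check can be sketched as an elbow analysis: compute the total within-cluster sum of squared distances (scikit-learn’s `inertia_`) for a range of k and look for the bend. The data here are a random placeholder for the standardized housing variables.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))  # placeholder for the standardized variables

# Within-cluster sum of squared distances for each candidate k;
# a pronounced "elbow" suggests a reasonable number of clusters.
wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(2, 8)}
for k, v in wss.items():
    print(k, round(v, 1))
```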
The K-means analysis offered insight into the problem and the variables. To increase the granularity of the analysis, a hierarchical clustering technique was employed with the same number of clusters, k = 5.
Hierarchical Clustering is a step-wise procedure that attempts to identify relatively homogeneous groups of cases based on selected characteristics using an agglomerative or divisive algorithm.
The hierarchical clustering means are shown in Appendix C.
Ward’s method was used, with the within-cluster sum of squares, summed over all variables, as the similarity measure between sets. It tends to join clusters with small numbers of observations and is strongly biased toward producing clusters with the same shape and similar numbers of observations.
Looking at the “Clustering History,” clusters 4 and 5 are the closest in means (distances). Cluster 4 also has the most observations joined together, with 184 and 29 leaders.
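Outside JMP, the same Ward-linkage hierarchy can be sketched with SciPy; the data are again a random placeholder for the standardized variables, and cutting the tree at five clusters mirrors the k = 5 K-means solution.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))  # placeholder for the standardized variables

# Ward linkage: at each step, merge the pair of clusters that minimizes
# the increase in total within-cluster sum of squares.
Z = linkage(X, method="ward")

# Cut the dendrogram to obtain five clusters.
labels = fcluster(Z, t=5, criterion="maxclust")
print(len(set(labels.tolist())), "clusters")
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the treelike structure that JMP reports alongside the clustering history.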
The constellation plot reveals three major clusters with branches to sub-clusters and several outliers, Appendix D.
We also created a map of five geographic clusters using K-means clustering with the same five variables: Price, Beds, Baths, SF, and $/SF. The map shows that clusters 4 and 5 lie closest to each other, given the properties we chose, along the smoother line on the latitude vs. longitude graph within the Baton Rouge area (Appendix E).
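A latitude-vs-longitude map of this kind can be sketched with matplotlib; the coordinates and cluster labels below are randomly generated stand-ins near Baton Rouge, not the actual listings.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# Hypothetical points scattered around Baton Rouge (~30.45 N, -91.15 W).
lat = 30.45 + rng.normal(scale=0.05, size=120)
lon = -91.15 + rng.normal(scale=0.05, size=120)
labels = rng.integers(0, 5, size=120)  # stand-in for the 5 K-means labels

fig, ax = plt.subplots()
sc = ax.scatter(lon, lat, c=labels, cmap="tab10", s=12)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("K-means clusters on a latitude vs. longitude map")
fig.savefig("clusters_map.png")
```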
Clustering in JMP provides insight into massive data sets by putting similar records into groups, letting the analysis focus on critical patterns in the grouped data rather than be overwhelmed by individual differences.
Clusters 1, 4, and 5 are the closest in the K-means method, and clusters 3 and 4 in the hierarchical approach.
By property type, condo and single-family residential account for the most significant percentages.
The advantages of the K-means method that we found are its ability to work with large data sets, as it clearly shows the pattern and numeric output as distances (means), and results that are less susceptible to outliers in the data.
The disadvantages of the K-means method are that it is not efficient when a large number of potential clusters exists, each cluster solution is a separate analysis, and the data can only be numerical.
The advantages of hierarchical clustering are that the data can be either numerical or categorical and that it joins small clusters step by step into a treelike structure.
The disadvantages of hierarchical clustering are that it is not suitable for extensive data sets and that it is affected by outliers. To reduce the effect of outliers, the analyst may wish to cluster the data several times.
Overall, I prefer the K-means method and the map, as they most clearly show the results we wanted.
Research Provided by Andrey Fateev
Appendix A – Table 1. Cluster distribution for K Means
Appendix B – Biplot, K-means, k=5
Appendix C – Hierarchical Clustering Means
Appendix D – Constellation Plot
Appendix E – Map of 5 Geographic Clusters using K-means Clustering.
Appendix F – Hierarchical Clustering with Location and Property Type and Constellation Plot.
Appendix G – Crosstab of Cluster vs. Property Type