Introduction and Business Understanding
The variables included in any statistical model will depend on its purpose, market segment (industry), and spatial/temporal perspective. The goal is to determine what type of variables are significant in predicting house prices within one state.
The four main key factors that we have to consider are demographics, simply the compositions of the population by race, gender, age, and income as a prominent example Beverly Hills in CA versus Baton Rouge. In addition, interest rates, the state and federal economy, and government policies with their determination of property demand and prices on the estate are all critical variables.
If we go online and start searching for a new house, the first questions we usually ask are location, then a type of estate we are looking for, home or condo, size in square feet, and how many rooms and restrooms are within the property.
3V’s. Big data incessantly contains important information or valuable insights, critical for business success in our case, its price predicting. The challenge we might face is analyzing and processing all the data to derive the set of significant variables. Big data is just the data generated from internet-connected devices, can be different types and forms, simply highly diverse.
We might face the size (volume) of the available estates on the website or in data storage. Due to the competing independence of companies with their policies and personal information protection, sometimes is not enough time to collect and analyze the data of different properties. On top of that, companies struggle with analysis due to the variety of data collected, which could be structured, semi-structured or unstructured data, as we all want to see, for example, pictures of the property and not just prices and how many rooms it has or any type of links for the website where this property is located.
All of the data is important to analyze the property but must be prepared correctly following the method chosen. For example, most agencies use regression models and what’s called the hedonic regression method of estimating the demand for a good.
In our research, we would use JMP to find what significant variables are in determining the real estate prices within the Baton Rouge area. Our focus will be on properties with a similar characteristic to avoid any biases.
Data Understanding and Analysis
The “Redfin_2018_Houses” JMP file contains 2,150 records of houses across Louisiana. Before analysis can be performed, a significant amount of data preprocessing was necessary, including variable reduction, treatment of missing values, screening for outliers, and recoding. Since “City” is nominal, we want to use Recode. Then, we excluded all properties outside Baton Rouge, and we have 2098 properties available to analyze within the Baton Rouge area. We can also use a query and filters to exclude properties we think might not be needed for the analysis.
Since there are several property types within the data set, as we stated before, we want to avoid any biases and make a model for properties with a similar characteristic to single-family residents and townhouses and Condo/Co-op. We won’t include multi-family or mobile/manufactured homes as the prices are different and don’t have similar characteristics.
Screening for outliers and missing values was performed.
We have a couple of outliers within the price, days on the market, HOA/month, and lit size, and we want to exclude these rows.
Our variables will be: price, beds, baths, square feet and $/square feet, lot size, year built, days on the market, HOA/month.
The number of missing values is: 0, 1, 0, 1 and 1, 1556, 61, 15, 787, respectively (Appendix A). From 1780 properties left in our data set, the biggest number of missing values would be within “lot size,” “year built,” and “HOA/month.”
The more missing values we have, the worst variable it is for predicting the prices of properties. Therefore, for beds and square feet, we used Automated Imputation to impute those two values. As a result, 3142 missing values were replaced, and dimension two approximation was empirically selected, 1570 values excluded.
Analyze distribution performed for price, beds, baths, square feet, and year built (Appendix B). From distribution analysis, we conclude that most of the properties within the data set are newer properties with 3-5 bedrooms, 2-4 baths, from 1000 to 3000 square feet, and a price range from 200k to 400k dollars. Most of the properties are relatively big, and the market is oversaturated with a big square foot, many beds, and bathes properties within the data set. Los size has a lot of missing values, wasn’t considered for distribution analysis, and year built, especially for the lower-priced properties.
From the map (Appendix C), we added zip code and price. We can conclude that zip codes of the properties are very randomized, show us different areas of the zip on the map, red dots on the left side, some of them on the right looks like a red triangle with a lot of blue dots (different zip code areas) in the center. Prices north of Baton Rouge have a smaller price on properties than in the middle of Baton Rouge.
The average annual income in Baton Rouge is $44,470.00, with the median cost of the houses of $339,950.00. It must be a big challenge to own your property. The average property has four bedrooms and three baths. For the sake of prices and average annual income, companies should have relatively small parcels if we are trying to avoid homelessness. Therefore, beds and bathes play a significant role in predicting the price of real estate. As well as location from the map appendix, zip codes and prices correlated, better community based on zip code meaning higher prices.
Research Provided by Andrey Fateev
Appendix A – Table info, after excluding all missing and outliers.
Appendix B – Distribution.
Appendix C – Map of Zip and Prices.