Canada Diamonds
By: Stenly • Research Paper • 4,112 Words • March 24, 2010
Introduction
The objective of this assignment is to find the best model for predicting the price of a diamond based on the four C’s: Cut, Carat, Clarity, and Color. Our goal is to see which of these variables has the greatest influence on the pricing of diamonds. To accomplish this, we will analyze a random selection of 44 round-cut and 6 princess-cut diamonds from http://canadadiamonds.com. After analyzing our random sample, we will apply our knowledge of model building to recommend the equation that we believe best predicts the prices of diamonds.
Statistical Methods used for Analysis
After collecting the data (a random sample of 44 out of 183 round diamonds between 0.4 and 1.6 carat, and 6 out of 25 princess-cut diamonds in the same carat range), our first step was to convert the data from text to numerical format so that it could be used in a regression model. The predictors of price that were given in text format on www.canadadiamonds.com were Color, Cut, and Clarity. For Color and Clarity we were able to use a scale provided on the case sheet, because both are ordered variables: some clarities are better than others, and some colors are rarer than others. The scale we followed was:
Code     1     2     3     4     5     6     7     8     9     10    11
Clarity  I3    I2    I1    SI3   SI2   SI1   VS2   VS1   VVS2  VVS1  F
Color    D     E     F     G     H     I     J     K     L     M     N+
As for cut, since we were only asked to sample princess and round diamonds, we used a 0/1 indicator. We chose round to be 0 because round is the more common, standard form of diamond; princess, on the other hand, is a special cut that we expected to have a more noticeable effect on price.
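The coding scheme above can be sketched in a few lines of Python. This is only an illustration of the mapping described in the text, not the procedure we actually ran; the function name encode_diamond and the example listing are hypothetical.

```python
# Ordinal codes taken from the scale in the text (first grade = 1, last = 11).
CLARITY_GRADES = ["I3", "I2", "I1", "SI3", "SI2", "SI1",
                  "VS2", "VS1", "VVS2", "VVS1", "F"]
COLOR_GRADES = ["D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N+"]

CLARITY_CODE = {grade: code for code, grade in enumerate(CLARITY_GRADES, start=1)}
COLOR_CODE = {grade: code for code, grade in enumerate(COLOR_GRADES, start=1)}

# 0/1 indicator for cut: round is the baseline, princess is the special cut.
CUT_CODE = {"round": 0, "princess": 1}


def encode_diamond(cut, color, clarity):
    """Return (Cut-Dum, Color-Dum, Clarity-Dum) for one listing.

    Hypothetical helper, written only to illustrate the coding scheme.
    """
    return (
        CUT_CODE[cut.lower()],
        COLOR_CODE[color.upper()],
        CLARITY_CODE[clarity.upper()],
    )


# Made-up listing for illustration; it is not a row from our sample.
print(encode_diamond("Princess", "G", "VS2"))  # (1, 4, 7)
```

Keeping the grades in ordered lists means the code of each grade is simply its position on the scale, which makes the ordinal assumption behind Color-Dum and Clarity-Dum explicit.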
With a numerical data set in hand, we looked at the basic statistics to get better acquainted with it.
Descriptive Statistics: Carat, Price, Cut-Dum, Color-Dum, Clarity-Dum
Variable      N   N*  Mean    SE Mean  StDev   Minimum  Q1      Median  Q3
Carat         50  0   0.6846  0.0352   0.2486  0.4000   0.4850  0.6000  0.9025
Price         50  0   2557    245      1731    600      1087    2315    3236
Cut-Dum       50  0   0.1200  0.0464   0.3283  0.0000   0.0000  0.0000  0.0000
Color-Dum     50  0   4.700   0.265    1.876   1.000    3.000   4.500   6.000
Clarity-Dum   50  0   5.760   0.248    1.756   3.000    5.000   6.000   7.000
From the descriptive statistics we took note of the mean of the response, Price, which is 2557. This will prove useful later for judging how well each candidate model fits, by comparing its S value against the mean. We also compared the mean of each data column to its median, and there seemed to be no significant skew. Next, we computed a correlation matrix to determine which variables would have the largest impact on price, and came up with the following:
Correlations: Price, Carat, Cut-Dum, Color-Dum, Clarity-Dum
                     Carat   Cut-Dum  Color-Dum  Clarity-Dum
Price (correlation)  0.894   0.037    -0.198     0.164
Price (p-value)      0.000   0.798    0.168      0.256
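The second row of the correlation output holds the p-values for testing whether each correlation with Price differs from zero. As a rough check, they can be approximated from the reported correlations and n = 50 with the usual t statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2), on n - 2 degrees of freedom. The sketch below is not the software's own routine; the function names are ours, and the t distribution is integrated numerically so that only the standard library is needed.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def correlation_p_value(r, n, steps=2000):
    """Approximate two-sided p-value for H0: correlation = 0.

    r is the sample correlation, n the sample size; steps must be even
    because Simpson's rule is used for the integration.
    """
    df = n - 2
    t = abs(r) * math.sqrt(df) / math.sqrt(1 - r * r)
    # Simpson's rule for the area under the density between 0 and |t|.
    h = t / steps
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3
    # Two-sided tail probability; clamp tiny negative rounding errors.
    return max(0.0, 1 - 2 * area)


# Correlations with Price as reported in the output above (n = 50).
reported = [("Carat", 0.894), ("Cut-Dum", 0.037),
            ("Color-Dum", -0.198), ("Clarity-Dum", 0.164)]
for name, r in reported:
    print(f"{name:12s} r = {r:6.3f}  p = {correlation_p_value(r, 50):.3f}")
```

Because the correlations are rounded to three decimals, the printed p-values should land close to, but not necessarily exactly on, the values in the table.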
The standout variable in the data set was Carat, which had by a good margin the largest absolute correlation with Price, at 0.894. Keeping this in mind, we then made a scatter plot of each predictor (carat, cut, color, and clarity) against price, to better determine what form of relationship (linear, quadratic, cubic) each variable might have with price.
Looking closely at this plot, we determined that Carat could have a non-linear relationship with Price. We then created two more columns of data: Carat^2 and Carat^.5.
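As a small illustration of the two transformed columns, the sketch below adds Carat^2 and Carat^.5 to a single row. The helper name add_carat_transforms is hypothetical, and the carat values used are the minimum and third quartile from the descriptive statistics rather than individual diamonds from the sample.

```python
import math


def add_carat_transforms(row):
    """Return a copy of row with Carat^2 and Carat^.5 columns added.

    Hypothetical helper; row is a dict that holds at least a "Carat" key.
    """
    carat = row["Carat"]
    out = dict(row)
    out["Carat^2"] = carat ** 2          # squared term
    out["Carat^.5"] = math.sqrt(carat)   # square-root term
    return out


# Minimum and Q3 of Carat from the descriptive statistics, not sample rows.
for carat in (0.4000, 0.9025):
    print(add_carat_transforms({"Carat": carat}))
```

Offering both terms to the model search lets the fitted relationship between carat and price bend in either direction instead of being forced onto a straight line.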
With all of our data assembled, including the transformed predictors, we ran a best subsets regression, with the following results:
C
l
C a
o r C
C l i a C
u o t r a