This is, however, still related to my interest in markets: how something is priced the way it is, and why. For personal reasons, and out of sheer curiosity, I decided to tackle how diamonds are priced. Going in, I already knew that diamonds are not rare at all, so price levels are driven purely by demand. Even the "3 months' salary" rule is just a brilliant marketing ploy: it used to be "1 month's salary" in the '30s, then "2 months' salary" in the '80s, and now it's "why the hell not 3 if we can make more money off a dick-measuring contest?"
[Image caption: The 1980s, when a diamond cost 2 months' salary]
So given today's demand for diamonds, which I can't control, what characteristics of the diamond are important in determining a diamond's value? The industry likes to tout the 4 C's of diamonds: Color, Clarity, Cut, and Carat. Which, if any, of these are important, and are there other characteristics? TLDR version: carat, color, clarity, polish, symmetry, and fluorescence. Plus you can get a good idea of how much a change in a particular rating should be worth.
With the help of my friends David Kelley and Vinod Cheriyan, I scraped and cleaned data from Borsheims' website for their round brilliant diamonds. I was able to get prices and characteristics for about 11,000 loose diamonds of all sizes and qualities. Message me if you would like details of what we did. The short version is that we tweaked the webpage's source to display 2,000 entries at a time (any more and their server started choking).
At this point, I did some Googling to see if other people had tried this as well. Many had, and some even had a much richer data set (up to 300,000 diamonds). But they all had serious flaws, such as the data being more than 5 years old or integers being assigned to the different quality grades. Why are these serious flaws? Seven or so years ago, the world was in a serious recession, and as the BBC article noted, recessions are terrible for the diamond industry, so prices from that period may say little about today's. The quality grades are somewhat arbitrary, and assigning consecutive integers to them assumes that every one-grade step is worth the same amount.
After much back and forth testing, playing around with different options, I settled on making binary variables out of all the categorical quality grades, regressing on the log price, and using stepwise linear regression to select which variables were important. The reasons I ended up doing what I did were a combination of practicality and technicality.
- Using the log price instead of the raw price was a technical choice, and it is well documented in texts on linear regression. Another thing I did was to throw out any diamonds priced over $100,000. They would merely serve as outliers that throw off the fit, and I couldn't afford those anyway.
- Linear regression was more of a practical issue. If you are familiar with statistical regressions, there are many assumptions needed for a linear regression to be valid, and in fact my diamond data violates a couple of those assumptions. But they were "close enough", and linear regression is very easy for a computer to run and very easy to use/interpret.
- The binary variables (for example, Clar_VS1 = 1 if a diamond has a clarity rating of VS1 and 0 if it has any other rating) made my linear model much more flexible and precise than what other people tried. Using the model in a jewelry store is also a lot easier when I just have to pick out the right binary variables. (One technicality is that I had to pick a "baseline" for each of the categorical grades, and I know I screwed up on the Polish rating, so that is something I could fix in the future.)
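The preparation steps in the list above can be sketched roughly as follows. This is only an illustration, not the code I actually ran: the grade lists, the choice of baselines, the record layout, and the function names `encode` and `prepare` are all assumptions made for the example, and the two sample diamonds in the usage note are made up.

```python
import math

# Assumed grade lists and baselines (hypothetical; not necessarily the ones used).
COLORS = ["D", "E", "F", "G", "H", "I", "J"]
CLARITIES = ["IF", "VVS1", "VVS2", "VS1", "VS2", "SI1", "SI2"]
BASELINE = {"color": "J", "clarity": "SI2"}
MAX_PRICE = 100_000  # diamonds priced above this are thrown out


def encode(diamond):
    """Turn one diamond record into (features, log_price) for a linear model."""
    features = {
        "Intercept": 1.0,
        "Carat": diamond["carat"],
        "Carat_sq": diamond["carat"] ** 2,
    }
    # One binary variable per grade, skipping the baseline grade so that
    # every coefficient is measured relative to that baseline.
    for grade in COLORS:
        if grade != BASELINE["color"]:
            features[grade] = 1.0 if diamond["color"] == grade else 0.0
    for grade in CLARITIES:
        if grade != BASELINE["clarity"]:
            features["Clar_" + grade] = 1.0 if diamond["clarity"] == grade else 0.0
    # Regress on log(price) rather than the raw price.
    return features, math.log(diamond["price"])


def prepare(diamonds):
    """Drop the over-$100,000 outliers and encode the rest."""
    return [encode(d) for d in diamonds if d["price"] <= MAX_PRICE]
```

Calling `prepare` on a list of dictionaries such as `{"carat": 0.5, "color": "G", "clarity": "VS1", "price": 1500.0}` gives one feature row per surviving diamond. A J-color diamond gets zeros in every color column, which is what "baseline" means here.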
Without further ado, these were the results I got for log(Price):
[Table: coefficient estimates for the log(Price) regression]
This achieved an R-squared of 0.96.
How to use this table: Say you have a diamond that is 0.5 carat, G color, VS1 clarity, Good cut, Very Good polish, no fluorescence, and Very Good symmetry. You would take the Intercept estimate + Carat estimate * 0.5 + Carat_sq estimate * 0.5*0.5 + G estimate + Clar_VS1 estimate + ..., and exponentiate the sum to get the predicted price (the model is fit on log(Price), so the sum is a log price). But this isn't what it's really good for.
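As a rough sketch of that arithmetic, here is the same calculation in Python. The coefficient values below are made up purely for illustration and are not the estimates from my table; the variable names other than Intercept, Carat, Carat_sq, G, and Clar_VS1 are also guesses, and `predict_price` is a hypothetical helper.

```python
import math

# Made-up coefficients for illustration only; substitute the real estimates.
coef = {
    "Intercept": 6.0,
    "Carat": 3.0,
    "Carat_sq": -0.5,
    "G": 0.30,
    "Clar_VS1": 0.40,
    "Cut_Good": 0.05,
    "Pol_VeryGood": 0.02,
    "Sym_VeryGood": 0.02,
}


def predict_price(coef, carat, grades):
    """Sum the relevant estimates on the log scale, then exponentiate."""
    log_price = (
        coef["Intercept"]
        + coef["Carat"] * carat
        + coef["Carat_sq"] * carat ** 2
    )
    # A grade missing from the table is treated as a baseline and adds 0.
    log_price += sum(coef.get(g, 0.0) for g in grades)
    return math.exp(log_price)


price = predict_price(
    coef, 0.5, ["G", "Clar_VS1", "Cut_Good", "Pol_VeryGood", "Sym_VeryGood"]
)
```

"No fluorescence" is assumed to be the baseline in this sketch, so it simply does not appear in the list of grades. One thing to watch for: because unknown names silently add 0, a misspelled grade looks exactly like a baseline.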
I am actually not all that excited about predicting the price of any particular diamond, since there will be some variability among stores, and since not all diamonds have polish, fluorescence, and symmetry ratings, so you'd have to guess. Some places like Tiffany's might have a tremendous markup on all their prices. The best part about this model is that if you compare two diamonds with different characteristics, you get a great idea of how much their prices should differ. Going from a J to an H color predicts a price increase of 24.2%. Going from a VS2 to an SI1 decreases the price by 11.8%. If you've done any shopping, you have no doubt heard salesmen say "You can get a better color for a little more" and "These are the same price, but it's a tradeoff in color and clarity." Now you actually know how much more a better color should cost, and exactly how much the tradeoff should be! One last observation is that the estimates for the Cut ratings are all out of order and very close to each other. That's a strong indication that differences between Cut ratings don't matter; what matters is only whether a diamond has a cut rating at all versus the baseline of "Fair or no Cut rating".
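To make the comparison idea concrete, here is a small sketch of how differences on the log scale turn into percentages and how two changes combine. It assumes the quoted percentages come from exponentiating a difference in estimates, and that the model has no interaction terms, so changes simply add on the log scale; the function names are made up for the example.

```python
import math


def pct_change(log_diff):
    """Percent price change implied by a difference in log-price estimates."""
    return (math.exp(log_diff) - 1.0) * 100.0


def log_diff_from_pct(pct):
    """Inverse of pct_change: recover the log-scale difference."""
    return math.log(1.0 + pct / 100.0)


# Back out the log-scale differences from the two percentages quoted above.
j_to_h = log_diff_from_pct(24.2)       # better color
vs2_to_si1 = log_diff_from_pct(-11.8)  # worse clarity

# Trading a J/VS2 stone for an H/SI1 stone: add the log differences,
# which is the same as multiplying 1.242 by 0.882.
combined = pct_change(j_to_h + vs2_to_si1)  # roughly +9.5%
```

So under these assumptions, an H/SI1 diamond should cost about 9.5% more than an otherwise identical J/VS2 one, not 24.2 - 11.8 = 12.4% more. The percentages multiply rather than add.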
There are several caveats to my model:
- This was only done on data from one jewelry store (Borsheims) and one particular cut of diamond (the round brilliant). While I'm pretty confident the results will hold up for most reputable dealers (luxury stores might be a different story), the different cuts are definitely vulnerable to the vagaries of tastes and fashions (the round brilliant is the most popular though).
- I still have some trouble fitting the high end of the price spectrum. It has a fatter tail than the log-normal distribution. Don't bother pricing the Queen's jewels with this. Plus, they know what the real rare stones are, and they aren't diamonds.
- Maybe a stepwise selection method isn't the best in this case. There is no hard rule on when to use it, but it's possible it left out some important characteristics.
- The bigger the differences between two diamonds you're comparing, the less accurate this will be.
Hope this was informative, and stay tuned for field results!
