Because of the pandemic, a large portion of the population has been forced indoors, and many people have been looking to pick up new hobbies that can be done during the deepest, darkest of night. One hobby that has grown considerably during the pandemic is the interest in, and construction of, custom mechanical keyboards. Many people have been switching from more commonplace keyboards, such as membrane or laptop keyboards, for the change in typing experience. Purchasing or building a mechanical keyboard lets a user adjust the sound and feel of the keyboard to their taste: a smoother keypress, a softer typing experience, and a sound signature tuned to the user's preference.
In the modern era, it is becoming more and more unusual to work a job that never touches a keyboard. Because of the pandemic, many working people have been forced to transition online, where meetings no longer take place in person and people spend an increasingly large amount of time in front of the computer. With that in mind, it makes sense that you would want this increased screen time to be more enjoyable; a worker takes pride in their tools.
Unfortunately, those entering the keyboard scene often find it difficult to break into, and the amount of knowledge required to feel confident in making an informed decision can be overwhelming. For those looking for an easy way to buy and sell keyboards, many turn to reddit's r/mechmarket, an extremely popular hub for people to trade used and unused keyboard items. Unfortunately, because of the nature of the keyboard market, almost all popular items are exclusive limited runs that are only sold once. This generates an economy much like trading cards, where older models seem to increase exponentially in price over time, with one user even listing a keyboard for \$7000 on r/mechmarket.
These seemingly outrageous prices raise the question: how much are keyboards actually worth, and how much might they appreciate in value? Although keyboards generally run through sales on each designer's website, we can scrape r/mechmarket to get an idea of what people are actually willing to spend on mechanical keyboards, and how much the hobby needs a consistent, high quality, in-stock product as a gateway into the hobby.
import requests
import pandas as pd
import numpy as np
import math
import json
import re
import matplotlib.pyplot as plt
import warnings
from datetime import datetime
Thankfully, because reddit is so well established, there are APIs available for querying the site. Reddit has its own API, which has many great options, but we are going to use the pushshift Reddit API. This will allow us to query all posts from a specific time period on a specific subreddit. We can load the data using requests and take a look at what information we can actually glean.
headers = {'User-Agent': 'cheb'}
groupbuys = requests.get('https://api.pushshift.io/reddit/search/submission/?subreddit=mechmarket&metadata=true&before=1620792000', headers=headers)
d = json.loads(groupbuys.text)
Below is an example of what a query like the one above returns. The response includes data about each post within the time period, limited to a maximum of 500 threads per query. This metadata includes information such as the name of the poster, the title of the thread, the text in the thread, what time it was posted, and much more. What we're interested in is the link_flair_text, which can identify a seller, and the selftext, which is the body text of the post. Redditors on the subreddit are required to either have a flair text that identifies them as a seller, or announce their selling status in their title (a requirement that was dropped once the flair text was implemented).
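If you want to see exactly which fields come back before committing to a full scrape, it only takes a couple of lines to peek at the first post in the response (a quick sketch, assuming the query above succeeded and returned at least one post):
first = d['data'][0]
# list every field pushshift returns for a post
print(sorted(first.keys()))
# the two fields we care about for identifying sellers
print(first.get('title'), '|', first.get('link_flair_text'))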
In order to query the API, we need to be able to grab all the results from each day. Because pushshift uses Unix time, we can grab the Unix time from the first day that the subreddit was created, November 3rd, 2012, and go until today. An easy place to do conversions without writing any code is this website.
# unix time is in seconds
start = 1351915200
end = 1620878400
sec = 86400
days = (end-start)/sec
print("Days to query: "+str(int(days)))
Unfortunately, people sell tons of stuff related to keyboards on mechmarket. However, if that were the only problem, we would all be super happy (mostly me). Even worse than quantity is the lack of required formatting; many sellers rely on buyers knowing what parts are what just by name. Names of such items include:
An older individual without keyboard knowledge might not be able to distinguish between some of these parts and Pokemon. This makes collecting data about what people are selling extremely hard. In order to query reddit effectively, we have to have some domain-specific knowledge to separate the wheat from the chaff. To have a chance of distilling any information, we are going to focus specifically on keyboard kits. Keyboard kits typically include:
We will specifically be ruling out selling switches alone, keycaps alone, and other keyboard paraphernalia without the board itself.
By identifying common indicators that are specifically not keyboard cases, we can try our best to filter out non-keyboard items.
# common abbreviations for keycaps
caps = ['GMK','pbt','ePBT','epbt','SA','XDA','DSA','SP','Tai Hao','OEM','PG','ABS','DCS','POM','local','Local',\
'EPBT','gmk','Gmk']
We can also use pandas to define the table of data that we actually want to store from each query: the price listing, the text associated with the listing, the time of the listing, and the link to the listing. pandas is an extremely common data storage/manipulation library, with tons of documentation that can be found here.
sellers = pd.DataFrame(columns = ['price','line_text','post_time','link'])
sellers
Finally, we can run our range of queries from the start of the subreddit to today by editing the HTTP request directly. The API doesn't like it when a user sends too many requests in a row, and will reject the request if too many are sent, so if an invalid response is returned we can just wait until the API is ready for us.
import time
current = start
# initialize the link
l = "https://api.pushshift.io/reddit/search/submission/?subreddit=mechmarket&metadata=true&size=500&"
after = "after="
before = "&before="
headers = {'User-Agent': 'cheb'}
# iterate through every day
for a in range(1,int(days)+1):
    next_day = current+sec
    link = l+after+str(current)+before+str(next_day)
    current = next_day
    groupbuys = requests.get(link, headers=headers)
    # if the API rejects us for sending too many requests, back off and retry
    while groupbuys.status_code != 200:
        time.sleep(.8)
        groupbuys = requests.get(link, headers=headers)
    d = json.loads(groupbuys.text)['data']
From the data, we want to determine whether the post has any body, since posts from deleted users can have their text removed. We also want to make sure the user is selling; on r/mechmarket, users can indicate whether they intend to sell, buy, trade, or a number of other options. We only care about those who are selling, and within that subset, only those selling keyboard kits. However, these flair categories were not implemented until later in the lifetime of the subreddit. Prior to these tags, users typically put [selling] in the title, and later [WTS] for 'want to sell'. We need to scan through both the title and the metadata to see whether items are being sold at all.
for a in d:
    if 'selftext' in a:
        # Check to see if the user is selling - sometimes people also mess around with capitalization
        if ("link_flair_text" in a and a["link_flair_text"]=='Selling') or \
        ('Selling' in a['title'] or 'selling' in a['title'] or 'wts' in a['title'] or 'WTS' in a['title']):
            post_time = a['created_utc']
            # take the unix time listed and turn it into a readable datetime string
            time_obj = datetime.utcfromtimestamp(post_time).strftime('%Y-%m-%d %H:%M:%S')
            artisan_cap = False
            text = a['selftext']
            split = text.splitlines()
            url = a['url']
Now that we're actually into the text itself, we have new problems. Some sellers have huge stores of items and sell multiple items in a single post. This brings up a problem: we can't just eliminate a post because it says 'keyset' somewhere in the text, and if a poster has multiple keycap sets listed in a table, we have to make sure none of them are counted. We need to go through each line to check for an individual pricing. We can also begin to filter out non-keyboard items by limiting the prices to over \$100. Even though high quality keycaps routinely go over \$150, this filtering is a start.
Below is a picture of a real posting, which includes around 20 items, not all of which are keyboards.
for line in split:
    temp = re.findall(r'\$(\d+)', line)
    # skip lines that don't list any dollar amounts
    if not temp:
        continue
    res = list(map(int, temp))
    # if the seller is selling something >$1000, it is almost never just accessories
    if max(res)>1000:
        filtered = list(filter(lambda x: x>100,res))
        for price in filtered:
            sellers.loc[len(sellers.index)] = [price,line,time_obj,url]
    else:
        # check to make sure we're not dealing with a table scenario
        if ('artisan' in line) or ('Artisan' in line) or ('cap' in line):
            artisan_cap = True
        else:
            if not artisan_cap or 'board' in line:
                artisan_cap = False
                if not any(cap in line for cap in caps):
                    # only get prices more than $100
                    # this helps filter out non-keyboard items, and sets a baseline for what
                    # quality of keyboard we're looking to track
                    filtered = list(filter(lambda x: x>100,res))
                    for price in filtered:
                        sellers.loc[len(sellers.index)] = [price,line,time_obj,url]
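As a quick sanity check of the price-matching regex, here is what it pulls out of a made-up listing line (the line itself is hypothetical, but the pattern is the same one used in the loop above):
sample = "65% keyboard build w/ lubed switches $250, extra keycap set $110"
# returns every digit group that immediately follows a dollar sign
print(re.findall(r'\$(\d+)', sample))  # ['250', '110']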
After querying for about an hour, through 390,000 total posts, we can finally get our list of items.
sellers.head()
len(sellers)
Looking at the data, we get nearly 18,000 price listings pulled from posts that were tagged by the user as selling keyboards.
Unfortunately, we still need to do a deep clean. For example, many people post multiple times, since posts can go unseen throughout the week due to sheer volume. We need to go in and do some manual cleaning of the data, including removal of duplicates and removal of specific items our first filter failed to catch. Unfortunately, much of this extra filtering requires domain knowledge. It is impossible to catch everything, since there are no standards, but we can try our best to filter out items based on their anticipated prices and titles.
Many of these deletions focus on low price - this is because people are less likely to be selling only accessories at the \$200 and up level. Many popular accessories can sell for >\$100, but at a higher price point sometimes these items are included in the build. For example, Gateron Black Inks are an extremely popular switch type, but because of this are often included in more expensive boards. However, if someone is selling a board below \$200 that has Gateron black inks included, usually either the board is very cheap, or the seller is selling the switches on their own.
# Remove Duplicates
sorted = sellers.drop_duplicates(subset='line_text').sort_values(by='price')
#remove instances of '90x ...'
sans_switch = sorted.drop(sorted[(sorted['line_text'].str.contains('\d\dx[^0-9]'))].index)
# At less than $150, including the number of Gateron switches often indicates the user is selling
# just switches
ess = sans_switch.drop(sans_switch[(sans_switch['line_text'].str.contains('\d\d Gateron')) & \
(sans_switch['price']<150)].index)
# gmk stabilizers are not the same as keycaps, and dont inflate the price drastically, but people like
# to mention that they are included
stabs = ess[(ess['line_text'].str.contains('gmk(.*)stab'))]
# drop many cases of keycap selling
ess2 = ess.drop(ess[(ess['line_text'].str.contains('base'))].index)
ess2 = ess2.drop(ess2[(ess2['line_text'].str.contains('gmk'))| ess2['line_text'].str.contains('Gmk')].index)
ess3 = ess2.append(stabs)
ess4 = ess3.drop(ess3[(ess3['line_text'].str.contains('keyset')) & (ess3['price']<350)].index)
ess4 = ess4.sort_values(by='price').reset_index(drop=True)
# Lets work in only USD
ess4 = ess4.drop(ess4[ess4['line_text'].str.contains('[Aa][Uu][Dd]|[Cc][Aa][Dd] ')].index)
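Because each of these filters is a judgment call, it can help to keep an eye on how many rows each one removes; a quick count over the intermediate DataFrames defined above:
for name, frame in [('deduplicated', sorted), ('no switch counts', sans_switch),
                    ('no cheap Gateron lots', ess), ('after keycap filters', ess4)]:
    print(name, len(frame))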
We still have more to remove. Looking through the list of items, we can identify specific cases where individuals frequently sell certain types of items with certain keywords, so we can try our best to do targeted removals.
# one guy jokingly offered to sell something worth $20 for $10000000
ess4 = ess4.drop([14250])
# more stabilizers that are high quality, but aren't just switches
z_stab = ess4[ess4['line_text'].str.contains('eal stab') & (ess4['price']<150)]
# specific instances of individuals selling only switches
ess4 = ess4.drop((ess4[((ess4['price']<200) & (ess4['line_text'].str.contains('x\d\d')) |\
((ess4['price']<150) & ((ess4['line_text'].str.contains('zeal') | \
ess4['line_text'].str.contains('ateron black') | ess4['line_text'].str.contains('Zeal') |\
ess4['line_text'].str.contains('ateron Black')))))]).index)
ess4 = ess4.drop(ess4[((ess4['price']<=105) & (ess4['line_text'].str.contains('ateron Black')))].index)
# keep some keyboards with stabilizers
ess4 = ess4.append(z_stab)
# remove instances of people selling just Tealios
ess4 = ess4.drop(ess4[((ess4['price']<=140) & (ess4['line_text'].str.contains('ealios')))].index)
# remove instances of people selling ultrasonicated (washed) switches
ess4 = ess4.drop(ess4[(ess4['price']<150) & ess4['link'].str.contains("ultrasonic")].index)
# FINALLY sort the values based on price
ess4 = ess4.sort_values(by='price').reset_index(drop=True)
This process would be extremely difficult without domain knowledge, which is unfortunately the nature of scraping a public forum such as reddit. Collecting data from a better formatted site would be easier.
Now that we finally have some cleaner data, we can take a look at it in its final form.
ess4.head()
len(ess4)
We managed to cut about 4000 more entries from the list, which leaves us with a whopping 14,000 sales to comb through.
Let's just start by plotting all of the data, price versus time. Since we used datetimes, we need to format the dates for use on an axis, so we can use mdates on the axes of a matplotlib plot. matplotlib is very well documented, and their webpage includes tons of examples for plotting data.
import matplotlib.dates as mdates
ess_temp = ess4.copy()
ess_temp['post_time'] = pd.to_datetime(ess_temp['post_time'], format='%Y-%m-%d %H:%M:%S')
plt.figure(figsize=(18,10))
# make a scatter plot of price vs time
plt.scatter(ess_temp['post_time'],ess_temp['price'])
plt.gca().xaxis.set_major_locator(mdates.YearLocator())
plt.xlabel("Posting Time")
plt.ylabel("Price ($)")
plt.title("Prices vs Time")
plt.show()
Ok, uhhh, not gonna lie - that sucks.
BUT
We can try to make it not suck. We can start by taking the time apart and splitting it into months, half years, and quarter years. Because there's so much data over such a long period of time, it will be difficult to visualize, so we can try a bunch of different bins to see what's easier on the eyes but doesn't obscure the data too much.
# split into months + year
ess4['month']=ess4['post_time'].apply(lambda x: x[:7])
# split into just year
ess4['year']=ess4['month'].apply(lambda x: x[:4])
# split into half years: Jan-Jun -> 20XX.0, Jul-Dec -> 20XX.5
ess4['half']=ess4['month'].apply(lambda x: int(x[:4]) + ((int(x[5:7])-1)//6)*0.5)
# split into quarter years
ess4['quarter']=ess4['month'].apply(lambda x: x[:4]+'Q'+str(math.ceil(int(x[5:7])/3)))
ess4.head()
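As a quick check that the binning behaves the way we want, we can run the same expressions on a few example month strings (the months here are made up purely for illustration):
for m in ['2014-03', '2014-07', '2014-12']:
    half = int(m[:4]) + ((int(m[5:7])-1)//6)*0.5
    quarter = m[:4]+'Q'+str(math.ceil(int(m[5:7])/3))
    print(m, '->', half, quarter)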
We can use matplotlib to visualize the bins using violin plots, which show the density of different prices along with their locations. It seems like using half years makes the densest, but still readable, distributions. Going to quarters increases the number of distributions too much, and years result in only a few.
# Takes a dataframe and plots the data
# in this case, this function will specifically use the 'half' time scale
# which is implemented by counting the earlier half of the year as 20XX.0, and
# the latter as 20XX.5
def plot_violin(df, t, label):
    warnings.filterwarnings("ignore")
    test = dict()
    df['half'] = df['half'].astype('float32')
    for index, row in df.iterrows():
        if not row['half'] in test:
            test[row['half']] = np.array(row[t])
        else:
            test[row['half']] = np.append(test[row['half']],row[t])
    plt.figure(figsize=(18,10))
    a = plt.violinplot(list(test.values()),list(test.keys()),showmedians=True,widths=.5);
    for b in a['bodies']:
        b.set_edgecolor('black')
    plt.xlabel("Year")
    plt.ylabel(label)
    plt.title("Violin Plot");
    return a
plot_violin(ess4,'price','Price ($)');
Unfortunately, the outliers make this plot extremely difficult to read. To help give us a better idea, we can take a look at some boxplots, with and without outliers. Thankfully, pandas provides out-of-the-box boxplot methods for us to use.
warnings.filterwarnings("ignore")
ax = ess4.boxplot(column = 'price',by = 'half',grid = False,figsize=(18,10));
ax.set_xlabel("Year")
ax.set_ylabel("Price ($)")
ax.set_title("")
This is still bad, but let's take a look without outliers just to get an idea.
warnings.filterwarnings("ignore")
ax = ess4.boxplot(column = 'price',by = 'half',grid = False,figsize=(18,10),showfliers=False);
ax.set_xlabel("Year")
ax.set_ylabel("Price ($)")
ax.set_title("")
Clearly, the maximum price for second-hand keyboards is increasing, and the median also seems to generally increase over time, save for the first year of the subreddit's existence.
Now that we can actually make sense of the data visually, we want to get some real quantitative conclusions from it. We can do some basic math to hopefully give us a better idea of the relationships within the data. First of all, because we capped the bottom at \$100 and prices have no maximum, we can try to scale down to account for massively inflated prices by taking the log.
ess5 = ess4.copy()
ess5['log'] = ess5['price'].apply(lambda x: np.log(x))
ess5.head()
Now that we've done some basic operations, we can take a look at what those operations have done to the data distribution itself.
plot_violin(ess5,'log','log(price)');
When we look at the data, it's clear that 2013 was an unusual year, since it was the very beginning of r/mechmarket. It seems that sellers were unsure of what their items could actually be valued at, so many either took a higher but conservative price, or a lower, more ordinary price, generating a more bimodal distribution with less variance. Because of this, we can say that 2013 generally does not match the current climate of r/mechmarket, and to find a more general pattern we are going to remove both 2013H1 and 2013H2.
sans_2013 = ess5[ess5['half']>=2014]
sans_2013.head()
plot_violin(sans_2013,'log','log(price)')
Looking at the data, some of the bins are clearly not perfectly Gaussian. Although some of the half-years are closer to Gaussian than others, many of them show a distinct bimodality rather than the single clean distribution I was originally hoping to see. Based on the data, it seems that there are a lot of people willing to buy and sell "budget-friendly" boards, but there is also a steadily growing group of "enthusiast" buyers, individuals looking to spend more money on a higher quality or more aesthetic product. There is also a third, slowly developing category, the "collectors" - people willing to spend exorbitant amounts of money on their keyboards.
Even though the data is not perfectly linear, we can still try to figure out if the price of keyboards is increasing over time. We can use Pearson's correlation test to see:
a) how strong the relationship is
b) how likely it is that there is no relationship
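For reference, the coefficient reported by the test is the standard Pearson formula applied to our (time, price) pairs, where $x_i$ is the posting time and $y_i$ the price:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$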
from scipy import stats
from datetime import datetime
from datetime import timezone
#datetime.strptime('2014-03-21 23:16:32', '%Y-%m-%d %H:%M:%S').replace(tzinfo=timezone.utc).timestamp()
unix_times = sans_2013.copy()
unix_times['u_time'] = unix_times['post_time'].apply(lambda x: int(datetime.strptime(x, '%Y-%m-%d %H:%M:%S').replace(tzinfo=timezone.utc).timestamp()))
time = list(unix_times['u_time'])
prices = list(unix_times['price'])
coefficient, p_value = stats.pearsonr(time,prices)
print("P-value : "+str(p_value))
print("Strength : "+str(coefficient))
We can see that the p-value is extremely low, where $p<.001$, but what does that actually mean? The low p-value means that if we assume the price of keyboards is not associated with time at all, the probability of seeing a correlation at least this strong in our data would be less than 0.1%. In other words, we can be quite confident that the price is increasing over time.
Of course, this confirms what we can see visually - looking at the original violin and box plots of the data, in general the center and spread of the data increase each year. This is even more clear in the plots with outliers removed and the log taken. The density of keyboard prices slowly creeps up and up over the years, much like a toothpaste bottle.
Now that we know what the data looks like in a cleaned up state, and can see some relationships, we can tell that the data seems to be forming around three groups: the "budget-friendlies", the "enthusiasts", and the "collectors". With that in mind, we can try to do some clustering to figure out where exactly these clusters might be forming, and at what prices. sklearn is a popular machine learning library that features many classification, regression, and clustering algorithms, and is thoroughly documented here.
Looking at the data, we can convince ourselves that the groups can be modelled with Gaussian distributions - each group has an average amount they are willing to spend, with an assumption of normality. Because of this, we can try to use a Gaussian mixture model from sklearn. Gaussian mixture models are used for exactly this purpose - the model will approximate areas where there appears to be some combination of Gaussian distributions, and use those distributions as class labels. GMMs have the advantage of being very fast, and offer a lot of flexibility.
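Concretely, a Gaussian mixture model assumes each observation (here, a log price) is drawn from a weighted sum of $K$ Gaussian components, and fitting the model estimates the weights, means, and variances:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \sigma_k^2), \qquad \sum_{k=1}^{K} \pi_k = 1$$

In our case $K=3$, one component for each of the budget-friendly, enthusiast, and collector groups.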
import numpy as np
from sklearn import mixture
Fitting a Gaussian mixture model is actually very easy. sklearn allows you to run most machine learning algorithms with only a couple of lines, and the presets are, in general, pretty good!
# grab the log prices to cluster as a numpy array
ted = np.array(sans_2013['log'])
# as long as the data is formatted in a 2D array, we're good to go
ted2 = ted.reshape(len(ted),1)
# the number of clusters you're anticipating is required
g = mixture.GaussianMixture(n_components=3)
g.fit(ted2)
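Once fitted, the cluster centers and mixture weights are available as attributes of the model; since we fit on log prices, exponentiating the means gives a rough dollar value for each cluster center (a small sketch using the g fitted above):
# approximate dollar value of each cluster center, and the fraction of listings in each cluster
print(np.exp(np.ndarray.flatten(g.means_)))
print(g.weights_)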
Unfortunately, the hard part is the visualization. Sure, it's great to be able to fit the data, but what we really want is to see the Gaussians directly on our violin plots. Doing this helps visually identify exactly where each mean is, and lets us see the relationship between the model output and the actual data distribution. We can start by making a similar violin plotting method to begin our visualization. We also want to include lists for our three classification groups - we want to keep track of the means identified for each group for later.
fancy = []
nice = []
standard = []
def plot_violin2(df, t, label):
    figure, axes = plt.subplots(figsize=(18,10))
Then, we can start to fit the data for each half year, and use the model's output to populate our lists.
    for cur_half in df.half.unique():
        # get just the current year
        ted = np.array(sans_2013[sans_2013['half']==cur_half]['log'])
        ted2 = ted.reshape(len(ted),1)
        # fit the model
        g = mixture.GaussianMixture(n_components=3)
        g.fit(ted2)
        means = np.ndarray.flatten(g.means_)
        means = list(zip(means,range(0,3)))
        means.sort()
        # make sure to keep all of your fitted means
        fancy.append([cur_half,means[2][0]])
        nice.append([cur_half,means[1][0]])
        standard.append([cur_half,means[0][0]])
Then, we need to actually draw circles where the means are. In this case, I chose to draw concentric circles using matplotlib to help visualize the Gaussian distributions, but other methods are available!
        weights = g.weights_
        colors = ['plum','mediumpurple','darkorchid']
        for a in range(len(means)):
            Drawing_uncolored_circle = plt.Circle((cur_half,means[a][0]),weights[means[a][1]]/max(weights)*.25,\
                fill = False,color=colors[0])
            axes.add_artist( Drawing_uncolored_circle )
            Drawing_uncolored_circle = plt.Circle((cur_half,means[a][0]),weights[means[a][1]]/max(weights)*.20,\
                fill = False,color=colors[1])
            axes.add_artist( Drawing_uncolored_circle )
            Drawing_uncolored_circle = plt.Circle((cur_half,means[a][0]),weights[means[a][1]]/max(weights)*.15,\
                fill = False,color=colors[2])
            axes.add_artist( Drawing_uncolored_circle )
            axes.scatter([cur_half],means[a][0],c='m')
Finally, we can draw the violin plot and see where this takes us.
    warnings.filterwarnings("ignore")
    test = dict()
    df['half'] = df['half'].astype('float32')
    # extract the half year data
    for index, row in df.iterrows():
        if not row['half'] in test:
            test[row['half']] = np.array(row[t])
        else:
            test[row['half']] = np.append(test[row['half']],row[t])
    # plot the actual violin plot
    m = axes.violinplot(list(test.values()),list(test.keys()),showmedians=True,widths=.5);
    # extra customization options!
    for b in m['bodies']:
        b.set_edgecolor('black')
        b.set_facecolor('#9DE9E4')
    axes.set_xlabel("Year")
    axes.set_ylabel(label)
    figure.suptitle("Violin Plot");
    # return, just in case we need to add some plots
    return figure,axes
And here's the plot.
plot_violin2(sans_2013,'log','log(price)');
It definitely looks pretty interesting. Over time, it's clear that the "collector" group is growing rapidly, and this year the number of extremely large sales makes up a sizable fraction of total sales.
Having the data for the three different groups is great, but what if we could do more? In general, these groups also follow a positive linear relationship. Not only that, but because we used a Gaussian mixture model, the clusters are already assumed to be Gaussian. Realistically, this is a great place to perform a linear regression, something that sklearn also offers. Performing the linear regression with sklearn is just as easy as using the Gaussian mixture model, and can be done in one line.
from sklearn.linear_model import LinearRegression
# fits the linear regression model
# in this case, we are specifically using the means from the fancy cluster
reg = LinearRegression().fit(np.array(fancy)[:,0].reshape(len(fancy),1),np.array(fancy)[:,1].reshape(len(fancy),1))
print("slope: " + str(reg.coef_[0]))
print("intercept: " + str(reg.intercept_))
Based on the linear regression, it seems like a "collector" can expect a $0.15 increase each year.
But wait, that doesn't seem right at all.
In fact, this linear regression is not a linear regression with regards to the price, it's a linear regression with regards to the logarithm of the price. The actual formula solved is:

$$\log(\text{price}) = \beta_1 \cdot \text{year} + \beta_0 \approx 0.15 \cdot \text{year} + \beta_0$$

So the actual increase per year is:

$$\frac{\text{price}_{\text{year}+1}}{\text{price}_{\text{year}}} = e^{\beta_1} \approx e^{0.15} \approx 1.16$$

From 2021 to 2022, that's a whopping \$140 increase from \$840 to \$980 for collectors!
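We can check that arithmetic directly against the fitted model (a small sketch using the reg fit on the collector means above; predict returns log prices, so we exponentiate to get dollars):
# predicted collector cluster centers for 2021 and 2022, converted back to dollars
pred_2021, pred_2022 = np.exp(reg.predict(np.array([[2021.0],[2022.0]]))).flatten()
print(round(pred_2021), round(pred_2022), round(pred_2022 - pred_2021))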
Of course, we still have to perform the linear regression on the remaining groups, and it would be great to visualize it, so let's reuse the violin plotter we just wrote:
fig, ax = plot_violin2(sans_2013,'log','log(Price)');
We're going to use a dictionary to store the predicted output of each linear regression - that way we can do it all in one go without needing to compute the residual right away.
price_points = [standard,nice,fancy]
names = ['standard','nice','fancy']
regressions = dict()
colors = ['b','g','r']
Now, let's actually perform the regression on each cluster.
# iterate through each cluster
for price in range(len(price_points)):
    l = len(price_points[price])
    p = price_points[price]
    # perform the linear regression
    reg = LinearRegression().fit(np.array(p)[:,0].reshape(l,1),np.array(p)[:,1].reshape(l,1))
    # output the coefficients
    print("slope: " + str(reg.coef_[0]))
    print("intercept: " + str(reg.intercept_))
    X = list(np.arange(2014,2021.5,.5))
    Y = reg.predict(np.array(X).reshape(len(X),1))
    # this is the important part: save the predictions!
    regressions[names[price]] = [X,Y]
    ax.plot(X,Y,color=colors[price],label=names[price])
ax.legend()
For the three categories, the expected increase is as follows:
Standard:
This year, the model predicts that the average for a budget-friendly board on r/mechmarket will increase from \$139 to \$144 - an increase of approximately \$5.
Enthusiast:
This year, the model predicts that the average for an enthusiast board on r/mechmarket will increase from \$327 to \$361 - an increase of approximately \$30.
Collector:
This year, the model predicts that the average for a collector's board on r/mechmarket will increase from \$840 to \$980 - an insane jump of \$140.
Now that we have our linear regressions, we can finally do our sanity check and look at the residuals. Because we stored all the predictions earlier, we don't need to run the models again; we can just assign each year's expected price to dictionaries, subtract it from the observed price, and send it to a pandas DataFrame for nice viewing.
# We want all three types of boards to be in the DataFrame
residual_frame = pd.DataFrame(columns=['fancy','nice','standard'])
reg = dict()
# iterate through each cluster
for a in range(len(names)):
    reg1 = dict()
    f1 = np.array(price_points[a])
    a0 = f1[:,0]
    a1 = f1[:,1]
    reg2 = {a0[i]: a1[i] for i in range(len(a0))}
    # calculate the residual for each year per cluster
    for half in range(len(regressions[names[a]][0])):
        year = regressions[names[a]][0][half]
        reg1[year] = reg2[year]-regressions[names[a]][1][half]
    # put the residuals into the DataFrame
    residual_frame[names[a]] = np.ndarray.flatten(np.array(list(reg1.values())))
residual_frame
Of course, why look at numbers when we can look at charts. These violin plots require no extra user-defined methods, as we only want to look at three violin plots, so let's just plot them directly.
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(18,10));
# Collector plot
ax1.set_title('Collectors Residuals')
ax1.violinplot(residual_frame['fancy'])
# Enthusiast plot
ax2.set_title('Enthusiast Residuals')
ax2.violinplot(residual_frame['nice'])
# Standard plot
ax3.set_title('Budget-Friendly Residuals')
ax3.violinplot(residual_frame['standard']);
Looking at the residuals, they mostly appear to be Gaussian and centered at zero. For the collectors, there is one extreme residual, stemming from an instance where the model struggled to separate the standard Gaussian from the enthusiast Gaussian cluster, and the same can be said for the enthusiast residuals. The standard residuals are the most unusual of the three, but part of this unusual shape comes from the fact that we capped the price at \$100. That cap is a decision we made arbitrarily to limit our results to higher end boards, but knowing what we know now, it might be a good idea to go back and redo the data collection stage to include boards selling for less than \$100. Of course, the data pruning stage would likely prove to be much more intense. These distributions help meet the Gaussian assumption of a linear regression. To look for homoscedasticity, we can plot the residuals as a scatterplot.
# grab all the years to plot with the residuals
years = np.arange(2014,2021.5,.5)
# plot all 3 residuals along side each other
plt.scatter(years,residual_frame['fancy'],c='r',label='Collector')
plt.scatter(years,residual_frame['nice'],c='b',label = 'Enthusiast')
plt.scatter(years,residual_frame['standard'],c='g',label = 'Budget')
plt.legend()
# plot the regression line at 0
plt.plot([2014,2021],[0,0],c='k')
The residuals are mostly evenly distributed around 0. There are two unusual spots at 2019.5, where the model struggled to separate the standard Gaussian from the enthusiast Gaussian clusters, but otherwise the residuals are fairly well behaved. With this plot we can see that the assumptions required for a linear regression are reasonably met: the relationship is roughly linear, the residuals are approximately Gaussian, and their variance is roughly constant over time (homoscedasticity).
Just because the assumptions are met doesn't mean that the regression is true. However, it does help justify our use of the regression to glean some information about the relationship between price and time.
Based on the data collected, it seems like yes, the prices of keyboards on the second hand market are increasing. However, these prices aren't increasing at the same rate for everyone. If you're just looking to buy a nice affordable keyboard, for now it seems like people will continue to sell at a lower price, and as the quality of keyboards goes up and the size of the hobby grows, buying a keyboard on a "budget" will likely offer increasingly more value for the price. For enthusiasts, yes, the prices are going to increase. Based on the model, it seems like this year an average board will increase in price by about \$30. When you're already spending around \$300-400 on your board, though, that roughly 10% increase might seem like a reasonable price to pay for a board you really like. The biggest price hikes are going to hit collectors, at a predicted increase of about \$140. Fortunately, this price increase affects the fewest people. For many collectors who are willing to shell out any amount of money for their 'end-game' board, that increase is not really a big deal; once you've already sunk such a large sum of money into keyboards, you're likely to either stop buying altogether or keep investing in the hobby.
For people in each category, finding deals that place you lower than the predicted average price for this year would be a great place to start if you're looking for a bargain. At least for ordinary people, or people who are just getting into the hobby, you can still afford to wait without worrying about breaking the bank to get in on the hype.
Bonus unusual plot