How Much Should you Really be Paying for a Custom Mechanical Keyboard?

r/chebmaster

bear.jpg


Part 0: Whats the deal with mechanical keyboards anyways

Introduction

 Because of the pandemic, a large population of individuals have been forced indoors, and many have been looking to pick up new hobbies that can be done during the deepest darkest of night. One of the hobbies that has been growing much more during the pandemic has been the interest and construction of custom mechanical keyboards. Many people have been switching from more commonplace keyboards, such as membrane keyboard or laptop keyboards, for the change in typing experience. Purchasing or making a mechanical keyboard allows a user to adjust the sound and feel of the keyboard to the user's taste, such as a smoother keypress, a softer typing experience, and sound that can be tuned to the users preference.

 In the modern era, it is becoming more and more unusual to be keyboard-free in the fulfillment of your profession. Because of the pandemic, many working people have been forced to transition online, where meetings no longer take place in person, and people spend an increasingly large amount of time in front of the computer. With that in mind, it makes sense that you would want this increased time to be more enjoyable; a worker takes pride in their tools.

 Unfortunately, those entering the keyboard scene often find it difficult to breach, and the amount of knowledge required to feel confident in making an informed decision can be overwhelming. For those looking for an easy way to buy and sell keyboards, many turn to reddit's r/mechmarket, an extremely popular hub for people to trade used and unused keyboard items. Unfortunately, because of the nature of the keyboard market, almost all popular items are exclusive, and are only sold once ever. This generates an economy much like trading cards, where the older models seem to exponentially increase in price over time, with one user even listing a keyboard for \$7000 on r/mechmarket.

 These seemingly outrageous prices raise the question: how much are keyboards actually worth, and how much might they appreciate in value? Although keyboards generally run through sales on each designers website, we can scrape through r/mechmarket to get an idea what people are actually willing to spend on mechanical keyboards, and how much the hobby needs a consistent, high quality product thats in-stock as a gateway to the hobby.

Part 1: Reddit Problems

Data collection

 We can start by setting up some necessary and popular libraries that will allow us to do data analysis and maniulation, such as pandas and numpy, two libraries that handle computation, and requests, which will handle contacting the API's.

In [659]:
import requests
import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt
import warnings

Thankfully, because reddit is so well established, there are API's available for querying the site. Reddit has its own API, which has many great options, but we are going to use the pushshiftReddit API . This will allow us to query all posts from a specific time period from a specific subreddit. We can load the data using requests, and take a look at what information we can actually glean.

In [ ]:
headers = {'User-Agent': 'cheb'}
groupbuys = requests.get('https://api.pushshift.io/reddit/search/submission/?subreddit=mechmarket&metadata=true&before=1620792000', headers=headers)
d = json.loads(groupbuys.text)

Below is an example of what a query like above returns. The website includes data about each post within the time period, limited to a maximum of 500 threads per query. This metadata includes information such as the name of the poster, the title of the thread, the text in the thread, what time it was posted, and much more. What we're interested in is the link_flair_text, which can identify a seller, and the selftext, which is the contents of the title. Redditors on the subreddit are required to either have a flair text that identifies them as a seller, or announce their selling status in their title (a requirement that is no longer used after the implementation of the flair text).

image_2021-05-15_043445.png

In order to query the API, we need to be able to grab all the results from each day. Because pushshift uses unix time, we can grab the unix time from the first day that the subreddit was created, November 3rd, 2012, and go until today. An easy place to do conversions without having to do any code is to go to this website.

In [ ]:
# unix time is in seconds
start = 1351915200
end = 1620878400
sec = 86400
days = (end-start)/sec
print("Days to query: "+str(int(days)))
Days to query: 3113

 Unfortunately, people sell tons of stuff related to keyboards on mechmarket. However, if that were the only problem, we would all be super happy (mostly me). Even worse than quantity is the lack of required formatting; many sellers rely on buyers knowing what parts are what just by name. Names of such items include:

  • KAT Milkshake
  • Kailh Box Brown
  • Think6.5
  • Gateron Ink Black's

An older individual without keyboard knowledge might not be able to distinguish between some of these parts and Pokemon. This makes collecting data about what people are selling extremely hard. In order to query reddit effectively, we have to have some domain-specific knowledge to filter out the wheat from the chaff. To have a chance of distilling any information, we are going to focus specifically on keyboard kits. Keyboard kits typically include:

  • The keyboard case (the shell)
  • The pcb (the brain)
  • The plate (the support)
  • People may also sell switches attached to the board or as a deal

We will specifically be ruling out selling switches alone, keycaps alone, and other keyboard paraphernalia without the board itself.

By identifying common indicators that are specifically not keyboard cases, we can try out best to filter out items.

In [661]:
# common abbreviations for keycaps
caps = ['GMK','pbt','ePBT','epbt','SA','XDA','DSA','SP','Tai Hao','OEM','PG','ABS','DCS','POM','local','Local',\
        'EPBT','gmk','Gmk'] 

We can also define our table of data using pandas that we actually want to store from each query: the price listing, the text associated with the listing, the time of the listing, and the link to the listing. pandas is an extremely common data storing/manipulation library, with tons of documentation that can be found here.

In [660]:
sellers = pd.DataFrame(columns = ['price','line_text','post_time','link'])
sellers
Out[660]:
price line_text post_time link

Finally, we can start our range of queries from the start of the subreddit to today by editing the https request directly. The API doesn't like it when a user sends too many requests in a row, and will reject the request if too many are sent, so if an invalid response is returned we can just wait until the API is ready for us.

In [ ]:
import time
current = start
# initialize the link 
l = "https://api.pushshift.io/reddit/search/submission/?subreddit=mechmarket&metadata=true&size=500&"
after = "after="
before = "&before="
headers = {'User-Agent': 'cheb'}

# iterate through every day
for a in range(1,int(days)+1):
    next = current+sec
    link = l+after+str(current)+before+str(next)
    current = next
    groupbuys = requests.get(link, headers=headers)
    while not '200' in str(groupbuys):
        time.sleep(.8)
        groupbuys = requests.get(link, headers=headers)
    d = json.loads(groupbuys.text)['data']

 From the data, we want to determine if the post has any body, as some posts with deleted users can have their data removed. We also want to make sure the user is selling; on r/mechmarket, users can indicate whether they intend to sell, buy, trade, or a number of other options. However, we only care about those who are selling, and in that subset we only care about those selling keyboard kits. However, these categories were not implemented until later on in the lifetime of the subreddit. Prior to these tags, users typically put [selling] in the title, and later [WTS] for 'want to sell'. We need to scan through the title and the metadata to see whether or not items are being sold at all.

In [ ]:
    for a in d:
        if 'selftext' in a:
            # Chcek to see if the user is selling - sometimes people also mess around with capitalization
            if ("link_flair_text" in a and a["link_flair_text"]=='Selling') or \
            ('Selling' in a['title'] or 'selling' in a['title'] or 'wts' in a['title'] or 'WTS' in a['title']):
                post_time = a['created_utc']
                
                # take the unix time listed and turn it into a
                time_obj = datetime.utcfromtimestamp(post_time).strftime('%Y-%m-%d %H:%M:%S')
                artisan_cap = False
                text = a['selftext']
                split = text.splitlines()
                url = a['url']

 Now that we're actually into the text itself, we have new problems. Some sellers have huge stores of items, and sell mutliple items per a single post. This brings up a problem: we can't just eliminate a post just because it says 'keyset' in the text, and if a poster has multiple keycaps listed in a table, we have to make sure none of them are counted. We need to go through each line to check for an individual pricing. We can also begin to filter out non-keyboard items by limited the prices to over \$100. Even though high quality keycaps routinely go over \$150, this filtering can be a start.

Below is picture of a real posting, which includes around 20 items, not all of which are keyboards

image_2021-05-15_050533.png

In [ ]:
                for line in split:
                    temp = re.findall(r'\$(\d+)', line)
                    # if the seller is selling something >$1000, it is almost never just accessories
                    if max(temp)>1000:
                        res = list(map(int, temp))
                        filtered = list(filter(lambda x: x>100,res))
                        for price in filtered:
                            sellers.loc[len(sellers.index)] = [price,line,time_obj,url]
                    else:
                        
                        # check to make sure we're not dealing with a table scenario
                        if ('artisan' in line) or ('Artisan' in line) or ('cap' in line):
                            artisan_cap = True
                        else:
                            if not artisan_cap or 'board' in line:
                                artisan_cap = False
                                if not any(cap in line for cap in caps):
                                    res = list(map(int, temp))

                                    # only get prices more than $100
                                    # this helps filter out keyboards, and sets a baseline for what
                                    # quality of keyboard we're looking to track
                                    filtered = list(filter(lambda x: x>100,res))
                                    for price in filtered:
                                        sellers.loc[len(sellers.index)] = [price,line,time_obj,url]

After querying for about an hour, through 390,000 total posts, we can finally get our list of items.

In [657]:
sellers.head()
Out[657]:
price line_text post_time link
0 120 Bought new in Dec. 2012, no visible wear/shine... 2013-01-14 19:17:22 http://www.reddit.com/r/mechmarket/comments/16...
1 130 I believe I paid a total of.. $130-$140 after ... 2013-04-23 15:13:51 http://www.reddit.com/r/mechmarket/comments/1c...
2 140 I believe I paid a total of.. $130-$140 after ... 2013-04-23 15:13:51 http://www.reddit.com/r/mechmarket/comments/1c...
3 160 However, I would really prefer selling it to a... 2013-04-28 02:26:47 http://www.reddit.com/r/mechmarket/comments/1d...
4 110 Looking for ~$110 bucks. I'm located in NH. 2013-06-28 18:26:43 http://www.reddit.com/r/mechmarket/comments/1h...
In [656]:
len(sellers)
Out[656]:
17640

Looking at the data, we get nearly 18,000 selling posts that have been tagged by the user as selling keyboards.

Unfortunately, we still need to do a deep clean. For example, many people also post multiple times, since many posts can be left unseen throughout the week due to sheer volume. We need to go in and do some manual cleaning up of the data, including removal of duplicates and removal of specific items our first filter failed to catch. Unfortunately, lots of this extra filtering requires domain knowledge. It is impossible to catch everything, since there are no standards, but we can try out best to filter out items based on their anticipated prices and titles.

Many of these deletions focus on low price - this is because people are less likely to be selling only accessories at the \$200 and up level. Many popular accessories can sell for >\$100, but at a higher price point sometimes these items are included in the build. For example, Gateron Black Inks are an extremely popular switch type, but because of this are often included in more expensive boards. However, if someone is selling a board below \$200 that has Gateron black inks included, usually either the board is very cheap, or the seller is selling the switches on their own.

In [ ]:
# Remove Duplicates
sorted = sellers.drop_duplicates(subset='line_text').sort_values(by='price')

#remove instances of '90x ...'
sans_switch = sorted.drop(sorted[(sorted['line_text'].str.contains('\d\dx[^0-9]'))].index)

# At less than $150, including the number of Gateron switches often indicates the user is selling
# just switches
ess = sans_switch.drop(sans_switch[(sans_switch['line_text'].str.contains('\d\d Gateron')) & \
                                   (sans_switch['price']<150)].index)

# gmk stabilizers are not the same as keycaps, and dont inflate the price drastically, but people like
# to mention that they are included
stabs = ess[(ess['line_text'].str.contains('gmk(.*)stab'))]

# drop many cases of keycap selling
ess2 = ess.drop(ess[(ess['line_text'].str.contains('base'))].index)
ess2 = ess2.drop(ess2[(ess2['line_text'].str.contains('gmk'))| ess2['line_text'].str.contains('Gmk')].index)
ess3 = ess2.append(stabs)
ess4 = ess3.drop(ess3[(ess3['line_text'].str.contains('keyset')) & (ess3['price']<350)].index)
ess4 = ess4.sort_values(by='price').reset_index(drop=True)

# Lets work in only USD
ess4 = ess4.drop(ess4[ess4['line_text'].str.contains('[Aa][Uu][Dd]|[Cc][Aa][Dd] ')].index)

We still have more to remove. Looking through the list of items, we can identify specific cases where individuals frequently sell certain types of items with certain keywords, so we can try out best to do targeted removals

In [ ]:
# one guy jokingly offered to sell something worth $20 for $10000000
ess4 = ess4.drop([14250])

# more stabilizers that are high quality, but aren't just switches
z_stab = ess4[ess4['line_text'].str.contains('eal stab') & (ess4['price']<150)]

# specific instances of individuals selling only switches
ess4 = ess4.drop((ess4[((ess4['price']<200) & (ess4['line_text'].str.contains('x\d\d')) |\
      ((ess4['price']<150) & ((ess4['line_text'].str.contains('zeal') | \
                                ess4['line_text'].str.contains('ateron black') | ess4['line_text'].str.contains('Zeal') |\
                               ess4['line_text'].str.contains('ateron Black')))))]).index)
ess4 = ess4.drop(ess4[((ess4['price']<=105) & (ess4['line_text'].str.contains('ateron Black')))].index)

# keep some keyboards with stabilizers
ess4.append(z_stab)

# remove instances of people selling just Tealios
ess4 = ess4.drop(ess4[((ess4['price']<=140) & (ess4['line_text'].str.contains('ealios')))].index)

# remove instances of people selling ultrasonicated (washed) switches
ess4 = ess4.drop(ess4[(ess4['price']<150) & ess4['link'].str.contains("ultrasonic")].index)

# FINALLY sort the values based on price
ess4 = ess4.sort_values(by='price').reset_index(drop=True)

This process would be extremely difficult without domain knowledge, which is just unfortunately the problem with scraping through a public forum such as reddit. Collecting data from a more well formatted site could be easier.


Now that we finally have some cleaner data, we can take a look at it in its final form.

In [ ]:
ess4.head()
Out[ ]:
price line_text post_time link
0 101 Also, link to the keyboard, it would be $92 + $9 shipping for $101 total. 2014-03-21 23:16:32 http://www.reddit.com/r/mechmarket/comments/211bux/wtt_wts_ducky_zero_mx_reds_for_any_mx_greenblue/
1 102 Rosewill Apollo: $75 shipped ($102 new) 2016-10-19 18:07:02 https://www.reddit.com/r/mechmarket/comments/58brwx/usmo_h_das_4_professional_cherry_mx_browns_and/
2 102 Rosewill Apollo: $85 shipped ($102 new) 2016-10-11 00:49:44 https://www.reddit.com/r/mechmarket/comments/56vgkx/usmo_h_das_4_professional_cherry_mx_browns_and/
3 103 * ~~Planck #1 is MIT layout, pcb, with a formed bottom. Cherry MX Reds everywhere and white LEDs already installed (no sip sockets). It's missing one of the 5 screws because one of the washer/spacer things that came with the board basically didn't fit. This has exactly 0 impact on the feel of the board and I'll throw that piece in anyway. **$103 shipped**~~ Sold 2015-11-24 22:01:57 https://www.reddit.com/r/mechmarket/comments/3u4v9k/usma_h_plancks_planck_keycaps_62g_springs_w/
4 103 16 - Bro Sparkler Blue Rustler | $103 2017-06-10 04:24:36 https://www.reddit.com/r/mechmarket/comments/6gdfix/au_h_brocaps_etf_artisans_gmk_skeletor_gmk/
In [658]:
len(ess4)
Out[658]:
13904

We managed to cut about 4000 more entries from the list, which leaves us with a whopping 14,000 sales to comb through.


Part 2: So what are we actually looking at here?

Data Visualization

Let's just start by plotting all of the data, price versus time. Since we used datetimes, we need to be able to format the dates to be able to use it on an axis, so we can us mdates to go on the axes of a matplotlib plot. matplotlib is very well documented, and their webpage includes tons of examples for plotting data.

In [193]:
import matplotlib.dates as mdates

ess_temp = ess4.copy()
ess_temp['post_time'] = pd.to_datetime(ess_temp['post_time'], format='%Y-%m-%d %H:%M:%S')
plt.figure(figsize=(18,10))

# make a scatter plot of price vs time
plt.scatter(ess_temp['post_time'],ess_temp['price'])
plt.gca().xaxis.set_major_locator(mdates.YearLocator())
plt.xlabel("Posting Time")
plt.ylabel("Price ($)")
plt.title("Prices vs Time")
plt.show()

Ok, uhhh, not gonna lie - that sucks.

BUT

We can try to make it not suck. We can start by taking the time apart and splitting it into months, half years, and quarter years. Because theres so much data over such a long period of time, it will be difficult to visualize the data, so we can try a bunch of different bins to see whats eassier on the eyes, but doesnt obscure the data too much.

In [ ]:
# split into months + year
ess4['month']=ess4['post_time'].apply(lambda x: x[:7])
# split into just year
ess4['year']=ess4['month'].apply(lambda x: x[:4])
# split
ess4['half']=ess4['month'].apply(lambda x: float(x[:4]+str(float(math.floor(int(x[5:7])/6)/2))[1:]))
ess4['quarter']=ess4['month'].apply(lambda x: x[:4]+'Q'+str(math.ceil(int(x[5:7])/4)))
In [ ]:
ess4.head()
Out[ ]:
price line_text post_time link month year half quarter
0 101 Also, link to the keyboard, it would be $92 + ... 2014-03-21 23:16:32 http://www.reddit.com/r/mechmarket/comments/21... 2014-03 2014 2014.0 2014Q1
1 102 Rosewill Apollo: $75 shipped ($102 new) 2016-10-19 18:07:02 https://www.reddit.com/r/mechmarket/comments/5... 2016-10 2016 2016.5 2016Q3
2 102 Rosewill Apollo: $85 shipped ($102 new) 2016-10-11 00:49:44 https://www.reddit.com/r/mechmarket/comments/5... 2016-10 2016 2016.5 2016Q3
3 103 * ~~Planck #1 is MIT layout, pcb, with a forme... 2015-11-24 22:01:57 https://www.reddit.com/r/mechmarket/comments/3... 2015-11 2015 2015.5 2015Q3
4 103 16 - Bro Sparkler Blue Rustler | $103 2017-06-10 04:24:36 https://www.reddit.com/r/mechmarket/comments/6... 2017-06 2017 2017.5 2017Q2

We can use matplotlib to visualize the bins using violin plots, which show the density of different prices, along with their locations. It seems like using half years makes the densest, but still visible, distributions. Going to quarter increases the number of distributions too much, and year results in only a few.

In [341]:
# Takes a dataframe and plots the data
# in this case, this function will specifically use the 'half' time scale
# which is implemented by counting the earlier half of the year as 20XX.0, and
# the latter as 20XX.5
def plot_violin(df, t,label):
    warnings.filterwarnings("ignore")
    test = dict()
    df['half'] = df['half'].astype('float32')
    for index, row in df.iterrows():
        if not row['half'] in test:
            test[row['half']] = np.array(row[t])
        else:
            test[row['half']] = np.append(test[row['half']],row[t])

    plt.figure(figsize=(18,10))
    a = plt.violinplot(list(test.values()),list(test.keys()),showmedians=True,widths=.5);
    for b in a['bodies']:
        b.set_edgecolor('black')
    plt.xlabel("Year")
    plt.ylabel(label)
    plt.title("Violin Plot");
    return a
In [343]:
plot_violin(ess4,'price','Price ($)');

Unfortunately, the outliers make this plot extremely difficult to read. To help give us a better idea, we can take a look at some boxplots, with and without outliers. Thankfully, pandas provides out of the box boxplot methods for us to use.

In [ ]:
warnings.filterwarnings("ignore")
ax = ess4.boxplot(column = 'price',by = 'half',grid = False,figsize=(18,10));
ax.set_xlabel("Year")
ax.set_ylabel("Price ($)")
ax.set_title("")
Out[ ]:
Text(0.5, 1.0, '')

This is still bad, but lets take a look without outliers just to get an idea.

In [ ]:
warnings.filterwarnings("ignore")
ax = ess4.boxplot(column = 'price',by = 'half',grid = False,figsize=(18,10),showfliers=False);
ax.set_xlabel("Year")
ax.set_ylabel("Price ($)")
ax.set_title("")
Out[ ]:
Text(0.5, 1.0, '')

Clearly we can see that the maximum price for second-hand keyboards is increasing, and it seems like generally the median is increasing over time, save for the first year of the subreddit's existence.


Part 3: So what?

Exploratory Data Analysis

Now that we can actually make sense of the data visually, we want to be able to get some real quantitative conclusions from it. We can do some basic math to hopefully give us a better idea relationships within the data. First of all, because we capped the bottom at 100, and prices have no maximum, we can try to scale down to account for massively inflated prices by taking the log.

In [155]:
ess5 = ess4.copy()
ess5['log'] = ess5['price'].apply(lambda x: np.log(x))
ess5.head()
Out[155]:
price line_text post_time link month year half quarter log
0 101 Also, link to the keyboard, it would be $92 + ... 2014-03-21 23:16:32 http://www.reddit.com/r/mechmarket/comments/21... 2014-03 2014 2014.0 2014Q1 4.615121
1 102 Rosewill Apollo: $75 shipped ($102 new) 2016-10-19 18:07:02 https://www.reddit.com/r/mechmarket/comments/5... 2016-10 2016 2016.5 2016Q3 4.624973
2 102 Rosewill Apollo: $85 shipped ($102 new) 2016-10-11 00:49:44 https://www.reddit.com/r/mechmarket/comments/5... 2016-10 2016 2016.5 2016Q3 4.624973
3 103 * ~~Planck #1 is MIT layout, pcb, with a forme... 2015-11-24 22:01:57 https://www.reddit.com/r/mechmarket/comments/3... 2015-11 2015 2015.5 2015Q3 4.634729
4 103 16 - Bro Sparkler Blue Rustler | $103 2017-06-10 04:24:36 https://www.reddit.com/r/mechmarket/comments/6... 2017-06 2017 2017.5 2017Q2 4.634729

Now that we've done some basic operations, we can take a look at what those operations has done to the data distribution itself.

In [75]:
plot_violin(ess5,'log','log(price)');

 When we look at the data, its clear that 2013 was an unusual year, since it was the very beginning of r/mechmarket. From what it seems like, sellers were unsure of what their items could actually be valued at, so many sellers either took a higher, but conservative pricing, or a lower, more ordinary pricing, generating a more bimodal distribution, and one with less variance. Because of this, we can say that in general, 2013 does not match the current climate of r/mechmarket, and to find a more general pattern we are going to remove both 2013H1 and 2013H2.

In [166]:
sans_2013 = ess5[ess5['half']>=2014]
sans_2013.head()
Out[166]:
price line_text post_time link month year half quarter log
0 101 Also, link to the keyboard, it would be $92 + ... 2014-03-21 23:16:32 http://www.reddit.com/r/mechmarket/comments/21... 2014-03 2014 2014.0 2014Q1 4.615121
1 102 Rosewill Apollo: $75 shipped ($102 new) 2016-10-19 18:07:02 https://www.reddit.com/r/mechmarket/comments/5... 2016-10 2016 2016.5 2016Q3 4.624973
2 102 Rosewill Apollo: $85 shipped ($102 new) 2016-10-11 00:49:44 https://www.reddit.com/r/mechmarket/comments/5... 2016-10 2016 2016.5 2016Q3 4.624973
3 103 * ~~Planck #1 is MIT layout, pcb, with a forme... 2015-11-24 22:01:57 https://www.reddit.com/r/mechmarket/comments/3... 2015-11 2015 2015.5 2015Q3 4.634729
4 103 16 - Bro Sparkler Blue Rustler | $103 2017-06-10 04:24:36 https://www.reddit.com/r/mechmarket/comments/6... 2017-06 2017 2017.5 2017Q2 4.634729
In [177]:
plot_violin(sans_2013,'log','log(price)')

 Looking at the data, it's obvious that some of the bins are clearly not perfectly gaussian. Although some of the half-years are slightly more gaussian than others, many of them show a distinct bimodality rather than the linear relationship I was originally hoping to see. Based on the data, it seems that in general, there are a lot of people willing to buy and sell "budget-friendly" boards, but there is also a steadily growing group of "enthusiast" buyers, individuals looking to spend more money on a higher quality or more aesthetic product. There is also a third category that is slowly developing, which is the "collectors" - people willing to spend exorbitant amounts of money on their keyboards.

Part 4: Are Prices Actually Increasing?

Hypothesis testing

Even though the data is not perfectly linear, we can still try to figure out if the price of keyboards is increasing over time. We can use the Pearson's Correlation Test to see:\  a) how strong the relationship is\  b) how likely it is that there is no relationship

In [202]:
from scipy import stats
from datetime import datetime
from datetime import timezone
#datetime.strptime('2014-03-21 23:16:32', '%Y-%m-%d %H:%M:%S').replace(tzinfo=timezone.utc).timestamp()
unix_times = sans_2013.copy()
unix_times['u_time'] = unix_times['post_time'].apply(lambda x: int(datetime.strptime(x, '%Y-%m-%d %H:%M:%S').replace(tzinfo=timezone.utc).timestamp()))
time = list(unix_times['u_time'])
prices = list(unix_times['price'])
coefficient, p_value = stats.pearsonr(time,prices)
print("P-value : "+str(p_value))
print("Strength : "+str(coefficient))
P-value : 1.597469755393927e-256
Strength : 0.28448941667107186

 We can see that the p-value is extremely low, where $p<.001$, but what does that actually mean? The low p-value means that is we assume the price of keyboards has not gone up at all or that the price is random and not associated with year, the probability of collecting this data would be less than .1%. This generally means that we are really quite confident that the price is increasing each year.

Of course, this confirms what we can see visually - looking at the original violin and box plots of the data, in general the center of the data and the distribution increase each year. This is even more clear in the plots with outliers removed and the log taken. The density of keyboard prices slowly creeps up and up over the years, much like a toothpaste bottle.


Part 5: What can we actually see?

Clustering

 Now that we know what the data looks like in a cleaned up state, and can see some relationships, we can tell that the data seems to be forming around these three groups, the "budget-friendlies", the "enthusiasts", and the "collectors". With that in mind, we can try to do some clustering to figure out where exactly these clusters might be forming, and at what prices. sklearn is a popular machine learning library that features many classification, regression, and clustering algorithms, and is thoroughly documented here.

 Looking at the data, we can kind of convince ourselves that the groups can be modelled with gaussian distributions - each group has an average amount they are willing to spend, with an assumption of normality. Because of this data form, we can try to use a Gaussian Mixture Model from sklearn. Gaussian mixture models are used for exactly this purpose - the model will approximate areas where there appears to be some combination of Gaussian distributions, and use the distributions as class labels. GMM's have an advantage of being very fast algorithms, and have a lot of flexibility.

In [ ]:
import numpy as np
from sklearn import mixture

Fitting a Gaussian mixure model is actually very easy. sklearn allows you to do most machine learning algorithms with only a couple lines, and the presets in general are pretty good!

In [ ]:
# as long as the data is formatted in a 2D array, we're good to go
ted2 = ted.reshape(len(ted),1)

# the number of clusters you're anticipating is required
g = mixture.GaussianMixture(n_components=3)
g.fit(ted2)

 Unfortunately, the hard part is the visualization. Sure its great to be able to fit the data, but what we really want to do is to be able to see the Gaussian directly on our violin plots. Doing this would help visually identify exactly where each mean is, and allows us to see the relationship between the model output and the actual data distribution. We can start by making a similar violin plotting method to begin our visualization. We also want to include lists for our three classification groups - we want to keep track of the means identified of each group for later.

In [ ]:
fancy = []
nice = []
standard = []
def plot_violin2(df, t,label):
    figure, axes = plt.subplots(figsize=(18,10))
    

Then, we can start to fit the data for each half year, and use the models output to populate our lists

In [ ]:
    for cur_half in df.half.unique():
        
        # get just the current year
        ted = np.array(sans_2013[sans_2013['half']==cur_half]['log'])
        ted2 = ted.reshape(len(ted),1)
        
        # fit the model
        g = mixture.GaussianMixture(n_components=3)
        g.fit(ted2)
        means = np.ndarray.flatten(g.means_)
        means = list(zip(means,range(0,3)))
        means.sort()
        
        # make sure to keep all of your fitted means
        fancy.append([cur_half,means[2][0]])
        nice.append([cur_half,means[1][0]])
        standard.append([cur_half,means[0][0]])

Then, we need to actually draw circles for where the means are. In this case, I chose to draw concentric circles using matplotlib to help visualize the gaussian distribution, but other methods are available!

In [ ]:
        weights = g.weights_
        colors = ['plum','mediumpurple','darkorchid']
        for a in range(len(means)):
            Drawing_uncolored_circle = plt.Circle((cur_half,means[a][0]),weights[means[a][1]]/max(weights)*.25,\
                                                  fill = False,color=colors[0])
            axes.add_artist( Drawing_uncolored_circle )
            Drawing_uncolored_circle = plt.Circle((cur_half,means[a][0]),weights[means[a][1]]/max(weights)*.20,\
                                                  fill = False,color=colors[1])
            axes.add_artist( Drawing_uncolored_circle )
            Drawing_uncolored_circle = plt.Circle((cur_half,means[a][0]),weights[means[a][1]]/max(weights)*.15,\
                                                  fill = False,color=colors[2])
            axes.add_artist( Drawing_uncolored_circle )
            axes.scatter([cur_half],means[a][0],c='m')

Finally, we can draw the violin plot, and see where this takes us

In [590]:
    warnings.filterwarnings("ignore")
    test = dict()
    df['half'] = df['half'].astype('float32')
    
    # extract the half year data
    for index, row in df.iterrows():
        if not row['half'] in test:
            test[row['half']] = np.array(row[t])
        else:
            test[row['half']] = np.append(test[row['half']],row[t])
    
    # plot the actual violin plot
    m = axes.violinplot(list(test.values()),list(test.keys()),showmedians=True,widths=.5);
    
    # extra customization options!
    for b in m['bodies']:
        b.set_edgecolor('black')
        b.set_facecolor('#9DE9E4')
    axes.set_xlabel("Year")
    axes.set_ylabel(label)
    figure.suptitle("Violin Plot");
    
    # return, just in case we need to add some plots
    return figure,axes

And here's the plot.

In [591]:
plot_violin2(sans_2013,'log','log(price)');

It definitely looks pretty interesting. Over time, its clear that the "collector" group is growing rapidly, and this year the number of extremely large sales is a large fraction of total sales.

Having a the data for the three different groups is great, but what if we could do more? In general, these groups also follow a positive linear relationship. Not only that, because we used a Gaussian mixture model, the clusters are already assumed to be Gaussian. Realistically, this is a great place to perform a linear regression, something that sklearn also offers. Performing the linear regression with sklearn is just as easy as using the Gaussian mixture model, and can be done in one line.

In [ ]:
from sklearn.linear_model import LinearRegression

# fits the linear regression model
# in this case, we are specifically using the means from the fancy cluster
reg = LinearRegression().fit(np.array(fancy)[:,0].reshape(len(fancy),1),np.array(fancy)[:,1].reshape(len(fancy),1))
print("slope: " + str(reg.coef_[0]))
print("intercept: " + str(reg.intercept_))
slope: [0.1552501]
intercept: [-307.05205473]

Based on the linear regression, it seems like a "collector" can expect a $0.15 increase each year.

But wait, that doesn't seem right at all.

In fact, this linear regression is not a linear re`gression with regards to the price, it's a linear regression with regards to the logarithm of the price. The actual formula solved is:

$log(y) = .154x - 304.52$

So the actual increase per year is:

$\Delta y = e^{(.154*(x_1))-304.52} - e^{(.154*(x_0))-304.52}$

From 2021 to 2022, thats a whopping \$140 increase from \$840 to \$980 for collectors!

Of course, we still have to perform the linear regression on the remaining groups, and it would be great to visualize it, so lets use our last used violin plotter:

In [ ]:
fig, ax = plot_violin2(sans_2013,'log','log(Price)');

We're going to use a dictionary to store the predicted output of each linear regression - that way we can do it all in one go without needing to compute the residual right away.

In [ ]:
price_points = [standard,nice,fancy]
names = ['standard','nice','fancy']
regressions = dict()
colors = ['b','g','r']

Now, lets actually perform the regression on each cluster

In [615]:
# iterate through each cluster
for price in range(len(price_points)):
    l = len(price_points[price])
    p = price_points[price]
    
    # perform the linear regression
    reg = LinearRegression().fit(np.array(p)[:,0].reshape(l,1),np.array(p)[:,1].reshape(l,1))
    
    # output the coefficients
    print("slope: " + str(reg.coef_[0]))
    print("intercept: " + str(reg.intercept_))
    
    X = list(np.arange(2014,2021.5,.5))
    Y = reg.predict(np.array(X).reshape(len(X),1))
    
    # this is the important part: save the predictions!
    regressions[names[price]] = [X,Y]
    ax.plot(X,Y,color=colors[price],label=names[price])
ax.legend()    
slope: [0.03876081]
intercept: [-73.28233445]
slope: [0.09771605]
intercept: [-191.65621095]
slope: [0.15400571]
intercept: [-304.52405147]
Out[615]:
<matplotlib.legend.Legend at 0x7f5a95ed85d0>

For the three categories, the expected increase is as follows:

Standard:

$\Delta y = e^{(.128*(x_1))-253.09} - e^{(.128*(x_0))-253.09}$

This year, the model predicts that the average for an enthusiast board on r/mechmarket will increase from \$139 to \$144 - an increase of approximately \$5.

Enthusiast:

$\Delta y = e^{(.0977*(x_1))-191.66} - e^{(.0977*(x_0))-191.66}$

This year, the model predicts that the average for an enthusiast board on r/mechmarket will increase from \$327 to \$361 - an increase of approximately \$30.

Collector:

$\Delta y = e^{(.154*(x_1))-304.52} - e^{(.154*(x_0))-304.52}$

This year, the model predicts that the average for a collectors board on r/mechmarket will increase from \$840 to \$980 - an insane jump of \$140.

Now that we have our linear regressions, we can finally do our sanity check and look at the residuals. Because we stored all the predictions earlier, we dont need to run the models again, we can just assign each years expected price to dictionaries, subtract it from the observed price, and send it to a pandas DataFrame for nice viewing.

In [653]:
# We want all three types of boards to be in the DataFrame
residual_frame = pd.DataFrame(columns=['fancy','nice','standard'])
reg = dict()

# iterate through each cluster
for a in range(len(names)):
    reg1 = dict()
    f1 = np.array(price_points[a])
    a0 = f1[:,0]
    a1 = f1[:,1]
    reg2 = {a0[i]: a1[i] for i in range(len(a0))}
    
    # calculate the residual for each year per cluster
    for half in range(len(regressions[names[a]][0])):
        year = regressions[names[a]][0][half]
        reg1[year] = reg2[year]-regressions[names[a]][1][half]
    
    # put the residuals into the DataFrame
    residual_frame[names[a]] = np.ndarray.flatten(np.array(list(reg1.values())))
residual_frame
Out[653]:
Collector Enthusiast Budget-Friendly
0 0.142638 0.143276 0.043348
1 -0.096390 -0.124393 -0.034787
2 0.095217 0.044313 0.014172
3 -0.020432 -0.029291 -0.026132
4 -0.251085 -0.135923 -0.077596
5 -0.031884 0.036140 -0.028422
6 -0.033657 -0.075293 -0.050068
7 0.130607 0.037112 0.112192
8 0.087917 0.074076 0.105559
9 -0.302682 -0.230367 -0.079429
10 -0.156882 0.081475 0.028363
11 0.779645 0.447113 0.110561
12 -0.145350 -0.106990 -0.021440
13 0.056299 0.032350 -0.028747
14 -0.253961 -0.193596 -0.067572

Of course, why look at numbers when we can look at charts. These violin plots require no extra user-defined methods, as we only want to look at three violin plots, so let's just plot them directly.

In [652]:
fig, (ax1,ax2,ax3) = plt.subplots(1,3,figsize=(18,10));

# Collector plot
ax1.set_title('Collectors Residuals')
ax1.violinplot(residual_frame['fancy'])

# Enthusiast plot
ax2.set_title('Enthusiast Residuals')
ax2.violinplot(residual_frame['nice'])

# Standard plot
ax3.set_title('Budget-Friendly Residuals')
ax3.violinplot(residual_frame['standard']);

 Looking at the residuals, they mostly appear to be centered with a Gaussian distribution at zero. For the residuals of the collectors, there is one extreme residual, stemming from an instance where the model struggled to separate the standard Gaussian from the enthusiast Gaussian clusters, and the same can be said about the enthusiast clusters. The standard residuals are the most unusual of the three, but part of this unusal shape is from the fact that we capped the price to \$100. That cap is a decision we made arbitrarily to limit our results to higher end boards, but knowing what we know now, it might be a good idea to go back and redo the data collection stage to include all boards selling for less than \$100. Of course, the data pruning stage would likely prove to be much more intense. These distributions help meet the Gaussian assumption of a linear regression. To look for homoscedasticity, we can plot the residuals as a scatterplot.

In [69]:
# grab all the years to plot with the residuals
years = np.arange(2014,2021.5,.5)

# plot all 3 residuals along side each other
plt.scatter(years,residual_frame['fancy'],c='r',label='Collector')
plt.scatter(years,residual_frame['nice'],c='b',label = 'Enthusiast')
plt.scatter(years,residual_frame['standard'],c='g',label = 'Budget')
plt.legend()

# plot the regression line at 0
plt.plot([2014,2021],[0,0],c='k')
Out[69]:
[<matplotlib.lines.Line2D at 0x7f2ae9b0c610>]

 The residuals are mostly evently distributed around 0. There are two unusual spots at 2019.5, where the model struggled to separate the standard Gaussian from the enthusiast Gaussian clusters, but otherwise the residuals are fairly well behaved. With this plot we can see that all the assumptions required for a linear regression are met:

  • Linear Relationship (seen after clustering)
  • Independence (residuals are parallel to the regression line)
  • Homescedasticity (the residuals are evenly spread out over time)
  • Normality (the residuals are normally distributed)

Just because the assumptions are met doesn't mean that the regression is true. However, it does help justify our use of the regression to glean some information about the relationship between price and time.


Part 6: Give it to me Straight, Doc

Conclusion

 Based on the data collected, it seems like yes, the prices of keyboards on the second hand market are increasing. However, these prices aren't increasing at the same rate for everyone. If you're just looking to buy a nice affordable keyboard, for now it seems like people will continue to sell at a lower price, and as the quality of keyboards goes up and up and the size of the hobby grows, buying a keyboard on a "budget" will likely be increasingly more value for price. For enthusiasts, yes the prices are going to increase. Based on the model, it seems like this year an average board will increase in price by about \$30 dollars. When you're already spending about \\$300-400 on your board though, that 10% increase might seem like a reasonable amount to wait for a board you really like. The biggest price hikes are going to be for collectors, at a predicted increase of about \$140. Fortunately, this price increase affects the least amount of people. For many collectors who are willing to shell out any amount of money for their 'end-game' board, that increase is not really a big deal, and when you've already sunk such a large sum of money into keyboards, its likely that a collector will either stop buying altogether or keep investing in their hobby.

 For people in each category, finding deals that place you lower than the predicted average price for this year would be a great place to start if you're looking for a deal. At least for ordinary people or people who are just getting into the hobby, you can still afford to wait without being worried about to breaking the bank to get in on the hype.

jelly%20poms.jpg

Big ups to my friend Jason for letting me use his keyboard photos.



Bonus unusual plot

wackjob.png