Using k-Means algorithm and a Principal Component Analysis (PCA) to cluster cryptocurrencies.

mmsaki/clustering-crypto

The Power of the Cloud and Unsupervised Learning

Table of Contents

  1. Crypto Clustering Overview
  2. Data Preprocessing
  3. Reducing Data Dimensions Using PCA
  4. Clustering Cryptocurrencies Using K-Means
  5. Visualizing Results
  6. Optional Challenge

File: Clustering Crypto File: Optional Challenge

Crypto Clustering Overview

  • In this assignment, I use the k-Means algorithm and Principal Component Analysis (PCA) to cluster cryptocurrencies.

  • The scenario: I am a Senior Manager on the Advisory Services team at a Big Four firm.

  • One of my most important clients, a prominent investment bank, is interested in offering a new cryptocurrency investment portfolio to its customers. However, they are lost in the immense universe of cryptocurrencies.

  • They ask me to help them make sense of it all by generating a report of which cryptocurrencies are available on the trading market and how they can be grouped using clustering.

  • I will put my new unsupervised learning and Amazon SageMaker skills into action by clustering cryptocurrencies and creating plots to present my results.

  • I am asked to accomplish the following main tasks:

Data Preprocessing

  • Using the requests library, retrieve the necessary data from the following CryptoCompare API endpoint: https://min-api.cryptocompare.com/data/all/coinlist. HINT: I will need to use the 'Data' key from the JSON response, then transpose the DataFrame. Name the DataFrame crypto_df.

    import pandas as pd
    import requests

    # Use the following endpoint to fetch JSON data
    url = "https://min-api.cryptocompare.com/data/all/coinlist"
    response = requests.get(url).json()

    # Create a DataFrame with one row per cryptocurrency
    crypto_df = pd.DataFrame(response["Data"]).T

  • With the data loaded into a Pandas DataFrame, continue with the following data preprocessing tasks.

  • Keep only the necessary columns: 'CoinName', 'Algorithm', 'IsTrading', 'ProofType', 'TotalCoinsMined', 'CirculatingSupply'

    # Keep only necessary columns
    crypto_df = crypto_df[['CoinName','Algorithm','IsTrading','ProofType','TotalCoinsMined','CirculatingSupply']]

  • Keep only the cryptocurrencies that are trading.

    # Keep only cryptocurrencies that are trading
    crypto_df = crypto_df[crypto_df["IsTrading"] == True]

  • Keep only the cryptocurrencies with a working algorithm.

    crypto_df = crypto_df[crypto_df["Algorithm"] != "N/A"]

  • Remove the IsTrading column.

    crypto_df = crypto_df.drop(columns = ["IsTrading"])

  • Remove all cryptocurrencies with at least one null value.

    crypto_df = crypto_df.dropna()

  • Remove all cryptocurrencies that have no coins mined.

    crypto_df = crypto_df[crypto_df["TotalCoinsMined"] > 0]

  • Drop all rows where there are 'N/A' text values.

    # Keep only rows in which no cell holds the text "N/A"
    crypto_df = crypto_df[(crypto_df != "N/A").all(axis=1)]

  • Store the names of all cryptocurrencies in a DataFrame named coins_name; use the crypto_df.index as the index for this new DataFrame.

    # Note: this keeps only the index of crypto_df, not the CoinName column
    coins_name = crypto_df.index

  • Remove the CoinName column.

    crypto_df = crypto_df.drop("CoinName", axis=1)

  • Create dummy variables for all the text features, and store the resulting data in a DataFrame named X.

    X = pd.get_dummies(data = crypto_df, columns = ["Algorithm", "ProofType"])

  • Use the StandardScaler from sklearn to standardize all the data in the X DataFrame. Remember, this is important before using the PCA and K-Means algorithms.

    from sklearn.preprocessing import StandardScaler

    X = StandardScaler().fit_transform(X)
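
The preprocessing steps above can be tried offline on a small hand-made payload. The sketch below assumes a dictionary shaped like the 'Data' key of the API response (keyed by ticker symbol); the four coins, their values, and the `toy_*` names are made up for illustration. It also casts the numeric columns explicitly, since every column is of dtype object after the transpose.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for response["Data"]: one entry per ticker symbol.
sample_data = {
    "AAA": {"CoinName": "Alpha", "Algorithm": "SHA-256", "IsTrading": True,
            "ProofType": "PoW", "TotalCoinsMined": 100.0, "CirculatingSupply": 90.0},
    "BBB": {"CoinName": "Beta", "Algorithm": "Scrypt", "IsTrading": True,
            "ProofType": "PoS", "TotalCoinsMined": 300.0, "CirculatingSupply": 250.0},
    "CCC": {"CoinName": "Gamma", "Algorithm": "N/A", "IsTrading": True,
            "ProofType": "PoW", "TotalCoinsMined": 50.0, "CirculatingSupply": 40.0},
    "DDD": {"CoinName": "Delta", "Algorithm": "Scrypt", "IsTrading": False,
            "ProofType": "PoW", "TotalCoinsMined": 10.0, "CirculatingSupply": 10.0},
}

# Tickers arrive as columns, so transpose to get one row per coin.
toy_df = pd.DataFrame(sample_data).T

# Apply the same filters as above: trading coins with a known algorithm.
toy_df = toy_df[toy_df["IsTrading"] == True]
toy_df = toy_df[toy_df["Algorithm"] != "N/A"]
toy_df = toy_df.drop(columns=["IsTrading", "CoinName"])

# After the transpose every column is object-typed; cast the numeric ones.
numeric_cols = ["TotalCoinsMined", "CirculatingSupply"]
toy_df[numeric_cols] = toy_df[numeric_cols].astype(float)

# One-hot encode the text features, then standardize every column.
toy_X = pd.get_dummies(toy_df, columns=["Algorithm", "ProofType"], dtype=float)
toy_scaled = StandardScaler().fit_transform(toy_X)
```

Only "AAA" and "BBB" survive the filters here, and each standardized column ends up with zero mean, which is the property PCA and K-Means rely on.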

Reducing Data Dimensions Using PCA

    from sklearn.decomposition import PCA

    # Reduce the standardized features to three principal components
    pca = PCA(n_components=3)
    crypto_pca = pca.fit_transform(X)

  • Once I have reduced the data dimensions, create a DataFrame named pcs_df using "PC 1", "PC 2" and "PC 3" as column names; use the crypto_df.index as the index for this new DataFrame.

    pcs_df = pd.DataFrame(
        crypto_pca,
        columns = ["PC 1", "PC 2", "PC 3"],
        index = coins_name
    )
    pcs_df.head(10)
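
Three components are convenient for plotting, but it can be worth checking how much of the variance they retain. A minimal sketch on synthetic data follows; the random matrix `demo_X` is a stand-in for the scaled X, not the project's data, so the printed ratios say nothing about the real result.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized feature matrix (50 rows, 8 features).
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(50, 8))

demo_pca = PCA(n_components=3)
demo_pcs = demo_pca.fit_transform(demo_X)

# Fraction of the total variance captured by each retained component,
# and the share retained by all three together.
ratios = demo_pca.explained_variance_ratio_
print(ratios, ratios.sum())
```

On the real data, the same attribute is available as `pca.explained_variance_ratio_` after fitting.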

Clustering Cryptocurrencies Using K-Means

  • Create an Elbow Curve to find the best value for k using the pcs_df DataFrame.
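import hvplot.pandas  # registers the .hvplot accessor on DataFrames
from sklearn.cluster import KMeans

inertia = []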
k = list(range(1, 11))

# Calculate the inertia for the range of k values
for i in k:
    k_model = KMeans(n_clusters=i, random_state=1)
    k_model.fit(pcs_df)
    inertia.append(k_model.inertia_)

# Create the Elbow Curve using hvPlot
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)

# Create Elbow plot
df_elbow.hvplot.line(
    x="k", 
    y="inertia", 
    title="Elbow Curve", 
    xticks=k
)

"Elbow Plot"

  • Once I have defined the best value for k, run the K-Means algorithm to predict the k clusters for the cryptocurrency data. Use pcs_df to run the KMeans algorithm.
# Initialize the K-Means model
model = KMeans(n_clusters = 10, random_state=0)

# Fit the model
model.fit(pcs_df)

# Predict clusters
k_10 = model.predict(pcs_df)
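
The number of clusters passed to KMeans is read off the elbow curve by eye. One rough heuristic, which is an assumption of this sketch and not part of the assignment, is to pick the k where the inertia curve bends most sharply, i.e. where the discrete second difference is largest. The helper name and the inertia values below are invented for illustration, and the helper assumes evenly spaced k values.

```python
def elbow_by_second_difference(k_values, inertia_values):
    """Return the k at which the inertia curve bends most sharply.

    Hypothetical helper: uses the discrete second difference
    inertia[i-1] - 2*inertia[i] + inertia[i+1] as a curvature proxy.
    """
    if len(k_values) != len(inertia_values) or len(k_values) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    # Interior points only: the endpoints lack a neighbor on one side.
    for i in range(1, len(k_values) - 1):
        bend = inertia_values[i - 1] - 2 * inertia_values[i] + inertia_values[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k


# Made-up inertia values with a visible bend at k = 4.
demo_k = list(range(1, 11))
demo_inertia = [1000, 700, 450, 200, 170, 150, 135, 125, 118, 112]
suggested_k = elbow_by_second_difference(demo_k, demo_inertia)
print(suggested_k)
```

A heuristic like this is best treated as a sanity check on the visual reading, not a replacement for it.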
  • Create a new DataFrame named clustered_df that includes the following columns: "Algorithm", "ProofType", "TotalCoinsMined", "CirculatingSupply", "PC 1", "PC 2", "PC 3", "CoinName", "Class". I should maintain the index of the crypto_df DataFrame, as shown below.
clustered_df = pd.concat([crypto_df, pcs_df], axis=1)
clustered_df["Class"] = k_10
clustered_df["CoinName"] = coins_name
clustered_df.head(20)
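
Because coins_name holds crypto_df.index (the keys of the 'Data' dictionary, which are ticker symbols), the "CoinName" column built this way contains symbols, not the readable names. If the names are wanted, one option is to set the CoinName column aside as its own DataFrame before dropping it and join it back by index. A sketch on made-up data, with hypothetical `toy_*` names:

```python
import pandas as pd

# Hypothetical two-coin frame, indexed by ticker symbol like crypto_df.
toy_crypto = pd.DataFrame(
    {"CoinName": ["Alpha", "Beta"], "Algorithm": ["SHA-256", "Scrypt"]},
    index=["AAA", "BBB"],
)

# Keep the names in their own DataFrame before dropping the column.
toy_names = toy_crypto[["CoinName"]]
toy_features = toy_crypto.drop(columns=["CoinName"])

# Stand-in for pcs_df: it must share the same index for the join to align.
toy_pcs = pd.DataFrame({"PC 1": [0.5, -0.5]}, index=toy_features.index)

# Index-aligned concatenation restores the names next to the other columns.
toy_clustered = pd.concat([toy_features, toy_pcs, toy_names], axis=1)
print(toy_clustered)
```

Since all three frames share the same index, `pd.concat(..., axis=1)` lines the rows up by ticker symbol without relying on row order.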

Visualizing Results

  • In this section, I will create some data visualizations to present the final results.
  • Create a scatter plot using hvplot.scatter to present the clustered cryptocurrency data, with x="TotalCoinsMined" and y="CirculatingSupply", to contrast the circulating supply with the total number of mined coins. Use the hover_cols=["CoinName"] parameter to include the cryptocurrency name on each data point.
# Plot Scatter plot
clustered_df.hvplot.scatter(
    x= "TotalCoinsMined", 
    y= "CirculatingSupply",
    hover_cols=["CoinName"]
)

"Hvplot Cluster"

  • Use hvplot.table to create a data table with all the current tradable cryptocurrencies. The table should have the following columns: "CoinName", "Algorithm", "ProofType", "CirculatingSupply", "TotalCoinsMined", "Class"
clustered_df.hvplot.table(columns=["CoinName", "Algorithm", "ProofType", "CirculatingSupply", "TotalCoinsMined", "Class"], sortable=True, selectable=True)

table

Optional Challenge

  • For the challenge section, I have to upload my Jupyter notebook to Amazon SageMaker and deploy it.

  • The hvplot library is not included in the built-in Anaconda environments, so for this challenge section I should use the altair library instead.

  • Upload my Jupyter notebook and rename it as crypto_clustering_sm.ipynb

  • Select the conda_python3 environment.

  • Install the altair library by running the following code before the initial imports.

    !pip install -U altair
  • Use altair to create the Elbow Curve as a line chart.

import altair as alt

inertia = []
k = list(range(1, 11))

# Calculate the inertia for the range of k values
for i in k:
    k_model = KMeans(n_clusters=i, random_state=1)
    k_model.fit(pcs_df)
    inertia.append(k_model.inertia_)

# Create the Elbow Curve using altair
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)

# Create Elbow plot
alt.Chart(df_elbow).mark_line().encode(
    x="k", 
    y="inertia"
)

Elbow Curve Visualization

  • Use the altair scatter plot to visualize the clusters. Since this is a 2D scatter plot, use x="PC 1" and y="PC 2" for the axes, and add the following columns as tooltips: "CoinName", "Algorithm", "TotalCoinsMined", "CirculatingSupply".
# Plot the clusters with x="PC 1" and y="PC 2"
alt.Chart(clustered_df).mark_circle(size=60).encode(
    x="PC 1",
    y="PC 2",
    color="Class:N",  # treat cluster labels as categories, not a numeric scale
    tooltip=['CoinName', 'Algorithm', 'TotalCoinsMined', 'CirculatingSupply']
).interactive()

Altair Cluster plot

