Skip to content

A distributed algorithm applied to the bitcoin blockchain that allows to create a new representation of the transaction - a clusterized graph that combines all the addresses belonging to the same owner/organization.

License

gbossi/BitcoinClusterAggregator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Bitcoin Cluster Aggregator

Author:

Giacomo Bossi

Abstract

The aim of the project is to parse and analyze the bitcoin block-chain in order to retrieve useful information about the transaction between known wallets, using a distributed approach. In this project input and output data are stored inside Hadoop Filesystem (hdfs) and the algorithms computation is managed by the Spark cluster computing system.

Approach

Since the bitcoin block-chain is heavily compressed, first we must unwrap all the transaction in a more readable way and after condensing all the addresses belonging to the same entity and build an entity graph.

Bitcoin Transaction

Transaction Composition

In order to retrieve all the data needed to compose each transaction it has been used the Hadoop Crypto-Ledger library (LINK), a multi-blockchain parser, that allow to translate the raw-byte data into an object. But this unwrap of the blockchain produce this result:

To recompose each transaction in a more readable way (Input,Output,Amount), it is possible to join two transaction using as key the transaction ID and the transaction index and reconstruct the flow of coins between two wallets in the following way:

So, given a current transaction and a previous one with the backward link of the current pointing to the previous, the resulting unwrapping it’s the following:

Entity

We define an entity as a cluster of addresses belonging to the same owner/organization.

Partial Entity

Since a transaction can involve multiple input and multiple output, it’s possible to state that all the inputs belong to the same entity. This cluster is formed by all the addresses that has been used simultaneously as an input in a transaction.
So, for example an entity is the following:

But this kind of entity results incomplete, since there could be another transaction that involves one of more addresses previously used. It’s clear the concept of partial entity, the inputs address of a single transaction describe a possible subset of all the entity’s addresses.

Complete Entity

A complete entity can be defined as the union of the intersection of the partial entities. This kind of clustering results into a recursive intersection between all the inputs of all the transaction, since it’s required to search all the intersections between the other partial entity, and in positive case we merge the two partial entity and start from the beginning again. But this kind of algorithm it’s just to complex to be computed in a feasible time, so we have developed an algorithm that translate the domain into a graph and find all the connected components inside it.

Complete Entity Algorithm

In order to compute the intersection over all the partial entity, we first transform the intersection over the transactions into a graph and then we have applied an optimized version of the connected component algorithm in order to accumulate all the intersection and provide a solution.

Generating chain subgraph

First, given all the partial entity we create a new tuple with the following values inside: (address,partialEntity) After we accumulate for each address all the entity in which it is present: (address,List[PartialEntity])

Now we got for each address all the intersection over the partial entities. From this list we create a list of edges that links all the partial entities together. In order to save memory and computational power, the shape of all the subgraphs defined by a partial entity is a chain.

So, given a list of edges it’s possible to run the connected components algorithm over the set of subgraphs.

Connected Components

The result of the connected components algorithm is the following:

Initial Graph Connected Component Result Resulting Entity

Every entity is the union of a set of partial entity. So, for each cluster, retrieving the content of each partial entity and reassembling together all the addresses we finally find a complete entity.

Entity Graph

The result of the previous computation over all the nodes of the block-chain combined with the all the transaction creates a new graph, the entity graph, a useful instrument to track the bitcoin exchanges between entities like bitcoin exchanges, e-commerce stores, cryptocurrency lottery and many other. A possible representation of the computation is the following

Inside the graph the (-1) node indicates all the exchanges that points a new (never-used) bitcoin address.

Inside the misc folder it's also possible to find two sample algorithms, a pagerank and a triangle counter. The pagerank algorithm returns the ordered list of entity nodes, in order of importance. The triangle counter nodes return the list of entity nodes ordered by the number of triangle found.

Project made during the Erasmus Programme @ National Technique University of Athens

Instruction

Build the project using maven

mvn clean install

Put the bitcoin blockchain raw data inside a hdfs:

hadoop fs -mkdir -p /user/bitcoin/input
hadoop fs -put ~./.bitcoin/blocks/blk*.dat /user/bitcoin/input

Run the cluster-based graph builder:

spark-submit --class Builder --master local[4] ./target/BitcoinExplorer-0.0.1.jar hdfs://localhost:9000/user/bitcoin/input hdfs://localhost:9000/user/bitcoin/output/vertices hdfs://localhost:9000/user/bitcoin/output/edges

Run the Pagerank Algorithm

spark-submit --class Pagerank --master local[4] ./target/BitcoinExplorer-0.0.1.jar hdfs://localhost:9000/user/bitcoin/output/vertices hdfs://localhost:9000/user/bitcoin/output/edges hdfs://localhost:9000/user/bitcoin/output/pagerank

Run the Triangle Count Algorithm

spark-submit --class TriangleCount --master local[4] ./target/BitcoinExplorer-0.0.1.jar hdfs://localhost:9000/user/bitcoin/output/vertices hdfs://localhost:9000/user/bitcoin/output/edges hdfs://localhost:9000/user/bitcoin/output/trianglecount
2018-2019

About

A distributed algorithm applied to the bitcoin blockchain that allows to create a new representation of the transaction - a clusterized graph that combines all the addresses belonging to the same owner/organization.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published