Project files for incremental data extraction from the CoinDesk Bitcoin Price Index RESTful API using a cryptographic hashing algorithm

richardogoma/incremental-hash-extractor

Data Extraction from the CoinDesk RESTful API Service

Purpose

This program extracts data from an API endpoint and saves it to a local file in CSV format. If the local file already exists, the program compares the hash value of the API resource with the stored hash of the local file to determine whether the data has changed. If it has, the program pre-processes the new data and updates the local file. The data is served by the CoinDesk Bitcoin Price Index API in real time.

A cryptographic hash function has the property that it is computationally infeasible to find two different files with the same hash value. Hash functions are commonly used with digital signatures and for data integrity. A hash value is a fixed-length value derived only from the content of the file (or stream), and in practice it is unique to that content. Metadata such as the file name, extension, timestamps, and permissions has no influence on the hash. However, changing even a single character in the contents of a file (or stream) changes its hash value.
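The properties described above can be checked with a short sketch. The temporary folder, file names, and sample content are illustrative assumptions and are not part of the project; only the documented Get-FileHash cmdlet is used.

```powershell
# Assumption: a throwaway temp folder and made-up file names, for illustration only.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null

$original = Join-Path $dir 'prices.csv'
$renamed  = Join-Path $dir 'same-content-other-name.txt'
$edited   = Join-Path $dir 'edited.csv'

# Same content under two names, then a copy with one character changed.
Set-Content -LiteralPath $original -Value 'Bitcoin,16641.507' -NoNewline
Copy-Item   -LiteralPath $original -Destination $renamed
Set-Content -LiteralPath $edited   -Value 'Bitcoin,16641.508' -NoNewline

$hashOriginal = (Get-FileHash -LiteralPath $original -Algorithm SHA256).Hash
$hashRenamed  = (Get-FileHash -LiteralPath $renamed  -Algorithm SHA256).Hash
$hashEdited   = (Get-FileHash -LiteralPath $edited   -Algorithm SHA256).Hash

$hashOriginal -eq $hashRenamed   # True: name and extension do not affect the hash
$hashOriginal -eq $hashEdited    # False: a one-character change alters the hash

Remove-Item -LiteralPath $dir -Recurse -Force
```

The comparison of two hash strings is the whole change-detection mechanism; the program never needs to diff the data just to learn whether something changed.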

Prerequisites

  • PowerShell 7 (pwsh.exe).
  • Check the script execution policy on your machine by running Get-ExecutionPolicy in a terminal. If the policy is Restricted, it has to be changed before PowerShell scripts can run, for example by running Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser in the terminal.
  • The script requires the Invoke-RestMethod cmdlet to retrieve data from the API endpoint, the Get-FileHash cmdlet to calculate the hash values of the data and the Compare-Object cmdlet to return the changed data. These cmdlets are part of the PowerShell core and should be available in any modern version of PowerShell.
  • Authentication is not required to access the API resource.
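The prerequisites above can be verified before the first run. This is a minimal sketch; the variable names are illustrative, and the check only confirms that the three cmdlets named in the list can be resolved in the current session.

```powershell
# Report the PowerShell version and the execution policy for the current user.
$PSVersionTable.PSVersion.ToString()
Get-ExecutionPolicy -Scope CurrentUser

# Confirm that the cmdlets the script relies on are available.
$required = 'Invoke-RestMethod', 'Get-FileHash', 'Compare-Object'
$missing  = @($required | Where-Object {
    -not (Get-Command -Name $_ -ErrorAction SilentlyContinue)
})

if ($missing.Count -gt 0) {
    Write-Warning "Missing cmdlets: $($missing -join ', ')"
}
else {
    'All required cmdlets are available.'
}
```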

Inputs

  • $algorithm: A string that specifies the cryptographic hash function used to compute the hash value of the contents of the specified file or stream. The following algorithms are supported: MD5, SHA1, SHA256, SHA384, SHA512. This program uses SHA256 by default. MD5 and SHA1 are no longer considered collision-resistant, so SHA256 or stronger is the safer choice.
  • cwd: A string representing the current working directory. It is used to locate the script file, Calculate-DataHash.ps1, which contains the Get-DataHash function. The batch program, Executor.bat, defaults to the current working directory; change this to the folder where the program was extracted.

Outputs

  • BitcoinPriceIndex.csv: A CSV file containing the extracted data in a tabular format. The file will be saved in the current working directory.
  • BitcoinPriceIndexHash.json: A JSON file containing the hash value of the local CSV file. The file will be saved in the current working directory.

Usage

# Execute or create a job to run the batch program
cmd /c Executor.bat

When the program is run for the first time, it executes the script Extract-FullDatav1.2.2.ps1 to perform a full batch extraction of the data from the API resource. On subsequent runs, it executes the script Extract-IncrementalDatav1.2.2.ps1, which performs an incremental extraction only if the data has changed, that is, if the hash value is different.
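The choice between the two scripts can be sketched as follows. How Executor.bat actually detects a first run is not shown in this README, so the rule used here (both output files already exist) is an assumption, and Select-ExtractionScript is a hypothetical helper name.

```powershell
# Hypothetical helper: decide which extraction script a run should use.
# Assumption: a "first run" is one where the CSV or the hash JSON is missing.
function Select-ExtractionScript {
    param(
        [string] $Cwd = (Get-Location).Path
    )
    $csvPath  = Join-Path $Cwd 'BitcoinPriceIndex.csv'
    $jsonPath = Join-Path $Cwd 'BitcoinPriceIndexHash.json'

    if ((Test-Path -LiteralPath $csvPath) -and (Test-Path -LiteralPath $jsonPath)) {
        'Extract-IncrementalDatav1.2.2.ps1'
    }
    else {
        'Extract-FullDatav1.2.2.ps1'
    }
}

# Usage: returns only the script name; running it is left to the caller.
Select-ExtractionScript -Cwd (Get-Location).Path
```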

[Figure: incremental data extraction]

Notes

Full batch data extraction is a process of extracting all the data from a data source in one go. It involves retrieving all the data from the source and saving it locally. This is useful in situations where the data is needed for the first time or when the data needs to be updated completely.

On the other hand, incremental data extraction is a process of extracting only the data that has changed since the last extraction. It involves comparing the data from the source with the locally saved data and extracting only the changed data. This is useful in situations where the data changes frequently and only the changes need to be processed.

The program performs both full batch and incremental data extraction from an API endpoint. The Get-DataHash function is used to calculate the hash value of the data from the API endpoint or a local file and return the data.
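The real Get-DataHash lives in Calculate-DataHash.ps1 and is not reproduced in this README. The sketch below only illustrates the idea of hashing in-memory content and returning both the hash and the data; the name Get-DataHashSketch, its parameters, and the output order are assumptions.

```powershell
# Hypothetical stand-in for Get-DataHash: hash a string and return hash + data.
function Get-DataHashSketch {
    param(
        [Parameter(Mandatory)]
        [string] $Content,

        [ValidateSet('MD5', 'SHA1', 'SHA256', 'SHA384', 'SHA512')]
        [string] $Algorithm = 'SHA256'
    )

    # Get-FileHash accepts a stream, so wrap the UTF-8 bytes in a MemoryStream.
    [byte[]] $bytes = [System.Text.Encoding]::UTF8.GetBytes($Content)
    $stream = [System.IO.MemoryStream]::new($bytes)
    try {
        $hash = (Get-FileHash -InputStream $stream -Algorithm $Algorithm).Hash
    }
    finally {
        $stream.Dispose()
    }

    # Emit two pipeline objects: the hash first, then the data itself.
    $hash
    $Content
}

# Usage: the caller receives a two-element result.
$result = @(Get-DataHashSketch -Content '{"chartName":"Bitcoin"}')
$result.Count   # 2
```

Returning the data together with its hash means the caller does not have to request the resource a second time after deciding that it changed.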

The full batch data extraction script starts by setting the variables for the API endpoint and local file name. It then starts a job to calculate the hash value and return the API endpoint resource using the Get-DataHash function. The script waits for the job to complete and gets the results of the job. It then checks if the count of the results is 2, indicating that both the hash value and data were returned. If the count is 2, the script writes the data to a CSV file and the hash value to a JSON file. It then displays a message indicating that the CSV and JSON files were created.
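The flow described above (start a job, wait for it, check for two results, write both files) might look like the following sketch. It is not the contents of Extract-FullDatav1.2.2.ps1: a fixed JSON string replaces the real API call, a temporary folder replaces the working directory, and the layout of the hash JSON file is an assumption.

```powershell
# Assumption: a temp folder stands in for the current working directory.
$outDir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $outDir | Out-Null
$csvPath  = Join-Path $outDir 'BitcoinPriceIndex.csv'
$jsonPath = Join-Path $outDir 'BitcoinPriceIndexHash.json'

# Assumption: a fixed payload replaces the Invoke-RestMethod request.
$samplePayload = '{"chartName":"Bitcoin","USD":"16641.507"}'

# The job hashes the payload and returns two items: the hash, then the data.
$job = Start-Job -ArgumentList $samplePayload, 'SHA256' -ScriptBlock {
    param($content, $algorithm)
    [byte[]] $bytes = [System.Text.Encoding]::UTF8.GetBytes($content)
    $stream = [System.IO.MemoryStream]::new($bytes)
    try     { (Get-FileHash -InputStream $stream -Algorithm $algorithm).Hash }
    finally { $stream.Dispose() }
    $content
}
$results = @($job | Wait-Job | Receive-Job)
Remove-Job -Job $job

# Only write the files when both the hash and the data came back.
if ($results.Count -eq 2) {
    $hash = "$($results[0])"   # plain strings, without job metadata attached
    $data = "$($results[1])"

    $data | ConvertFrom-Json | Export-Csv -LiteralPath $csvPath -NoTypeInformation
    [ordered]@{ Algorithm = 'SHA256'; Hash = $hash } |
        ConvertTo-Json | Set-Content -LiteralPath $jsonPath

    "Created $csvPath and $jsonPath"
}
```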

The incremental data extraction script starts by setting the variables for the API endpoint and local file name. It then starts a job to calculate the hash value and return the API endpoint resource using the Get-DataHash function. The script waits for the job to complete and gets the results of the job. It then reads the JSON file containing the hash value of the previously saved data. If the hash values of the API endpoint resource and the locally saved data are not equal, the script retrieves the changed data using the Compare-Object cmdlet. It then writes the changed data to a CSV file and the updated hash value to a JSON file. It then displays a message indicating that the CSV and JSON files were updated.
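The compare-then-append step can be sketched with in-memory data. This is an illustration under stated assumptions rather than the contents of Extract-IncrementalDatav1.2.2.ps1: the two hash strings are placeholders, the rows reuse sample values from this README, and a temporary folder replaces the working directory.

```powershell
# Assumption: a temp folder stands in for the current working directory.
$workDir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $workDir | Out-Null
$csvPath  = Join-Path $workDir 'BitcoinPriceIndex.csv'
$jsonPath = Join-Path $workDir 'BitcoinPriceIndexHash.json'
$columns  = 'chartName', 'USD', 'updatedtimeISO'

# Seed the local state as a previous run might have left it.
$previousRows = @(
    [pscustomobject]@{ chartName = 'Bitcoin'; USD = '16641.507';  updatedtimeISO = '04-Jan-23 1:49:00 AM' }
    [pscustomobject]@{ chartName = 'Bitcoin'; USD = '16645.9468'; updatedtimeISO = '04-Jan-23 2:02:00 AM' }
)
$previousRows | Export-Csv -LiteralPath $csvPath -NoTypeInformation
@{ Hash = 'placeholder-hash-of-previous-data' } | ConvertTo-Json | Set-Content -LiteralPath $jsonPath

# Stand-ins for the job result: a new hash and the current remote rows.
$newHash    = 'placeholder-hash-of-current-data'
$remoteRows = $previousRows + [pscustomobject]@{
    chartName = 'Bitcoin'; USD = '16677.7624'; updatedtimeISO = '04-Jan-23 2:16:00 AM'
}

$storedHash = (Get-Content -LiteralPath $jsonPath -Raw | ConvertFrom-Json).Hash
$changed = @()
if ($newHash -ne $storedHash) {
    $localRows = @(Import-Csv -LiteralPath $csvPath)

    # Keep only rows that exist remotely but not locally ('=>').
    $changed = @(
        Compare-Object -ReferenceObject $localRows -DifferenceObject $remoteRows -Property $columns -PassThru |
            Where-Object SideIndicator -eq '=>' |
            Select-Object -Property $columns
    )

    $changed | Export-Csv -LiteralPath $csvPath -Append -NoTypeInformation
    @{ Hash = $newHash } | ConvertTo-Json | Set-Content -LiteralPath $jsonPath
    "Updated $csvPath and $jsonPath with $($changed.Count) new row(s)"
}
```

Select-Object drops the SideIndicator property that Compare-Object -PassThru attaches, so the appended rows match the existing CSV columns.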

To use the program, the user needs to provide the algorithm to be used for calculating the hash value and the current working directory as arguments when running the script. The script will then handle the rest of the data extraction process.

Pre-processed Data Shape

This program could easily become part of either a data integration or data management system, functioning as a microservice. The pre-processed data could be used by other services within the data management system to perform various tasks, such as data analysis, reporting, app development and more.

The data is structured in a columnar format and can be easily consumed. The data schema (column headers) is described in docschema.json.

| chartName | EUR | GBP | USD | updatedtimeISO |
| --- | --- | --- | --- | --- |
| Bitcoin | 16211.2575 | 13905.5101 | 16641.507 | 04-Jan-23 1:49:00 AM |
| Bitcoin | 16215.5825 | 13909.2199 | 16645.9468 | 04-Jan-23 2:02:00 AM |
| Bitcoin | 16246.5755 | 13935.8048 | 16677.7624 | 04-Jan-23 2:16:00 AM |
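A downstream consumer might load these rows and convert the text fields to typed values, as in the sketch below. The quoted CSV layout and the date format string are assumptions inferred from the sample rows above; docschema.json remains the authoritative description of the columns.

```powershell
# Assumption: the CSV uses the quoted layout that Export-Csv produces.
$csvText = @'
"chartName","EUR","GBP","USD","updatedtimeISO"
"Bitcoin","16211.2575","13905.5101","16641.507","04-Jan-23 1:49:00 AM"
"Bitcoin","16215.5825","13909.2199","16645.9468","04-Jan-23 2:02:00 AM"
"Bitcoin","16246.5755","13935.8048","16677.7624","04-Jan-23 2:16:00 AM"
'@

$invariant = [System.Globalization.CultureInfo]::InvariantCulture

# Convert prices to numbers and the timestamp to a DateTime.
$rows = @($csvText | ConvertFrom-Csv | ForEach-Object {
    [pscustomobject]@{
        chartName = $_.chartName
        EUR       = [double]::Parse($_.EUR, $invariant)
        GBP       = [double]::Parse($_.GBP, $invariant)
        USD       = [double]::Parse($_.USD, $invariant)
        Updated   = [datetime]::ParseExact($_.updatedtimeISO, 'dd-MMM-yy h:mm:ss tt', $invariant)
    }
})

# Example query: the highest USD price in the sample.
$maxUsd = ($rows | Measure-Object -Property USD -Maximum).Maximum
$maxUsd   # 16677.7624
```

For the real file, `Import-Csv -LiteralPath BitcoinPriceIndex.csv` would replace the here-string and ConvertFrom-Csv.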
