How do we go about finding these anomalies?
To help answer this question, I used the Great Expectations library, which applies "expectations" to the data pipeline in order to look for data that does not meet them. An example expectation applied to the pipeline is shown below.
sdf.expect_column_max_to_be_between("MAX_PRICE", 1, 500, result_format="BOOLEAN_ONLY")
This expectation asserts that the maximum value of the MAX_PRICE column should be no greater than 500 and no less than 1. If the expectation is not met, the user is shown a flag (see dashboard image) so they can take appropriate action. Employing Great Expectations in a streaming environment proved challenging, as it does not currently support streaming. To overcome this, several data processing steps had to be performed within Spark in order to use Great Expectations.

S3, Kafka, Spark, Great Expectations, PostgreSQL, Dash
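
To make the MAX_PRICE check concrete, here is a minimal sketch in plain Python of running a batch-style check on successive chunks of a stream. It does not use Spark or Great Expectations, and it is not the processing this project performs, which is not spelled out here. The helper names `column_max_between` and `check_batches`, the sample rows, and the choice to flag an empty batch are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Mapping, Tuple


def column_max_between(rows: List[Mapping], column: str,
                       min_value: float, max_value: float) -> bool:
    """Return True if the maximum of `column` lies in [min_value, max_value]."""
    values = [row[column] for row in rows if row.get(column) is not None]
    if not values:
        # Assumption for this sketch: a batch with no values is flagged.
        return False
    return min_value <= max(values) <= max_value


def check_batches(batches: Iterable[List[Mapping]], column: str = "MAX_PRICE",
                  min_value: float = 1, max_value: float = 500
                  ) -> Iterator[Tuple[int, bool]]:
    """Yield (batch_index, passed) for each chunk of the stream."""
    for index, batch in enumerate(batches):
        yield index, column_max_between(batch, column, min_value, max_value)


# Invented sample data: the second batch contains a price above 500.
batches = [
    [{"MAX_PRICE": 12.5}, {"MAX_PRICE": 99.0}],
    [{"MAX_PRICE": 42.0}, {"MAX_PRICE": 750.0}],
]
results = list(check_batches(batches))
for index, passed in results:
    print(f"batch {index}: {'ok' if passed else 'FLAG'}")
```

Each chunk is evaluated on its own, so a failing check points to the specific batch that contained the out-of-range price.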
The Deutsche Börse Public Dataset is a near real-time stream of stock data stored in an external S3 bucket. The data dictionary can be viewed in the dataset's GitHub repo.
The Xetra data is stored in an S3 bucket at the following location:
s3://deutsche-boerse-xetra-pds
Each Xetra CSV file within the bucket is defined as follows:
- ISIN: ISIN of the security (string)
- Mnemonic: Stock exchange ticker symbol (string)
- SecurityDesc: Description of the security (string)
- SecurityType: Type of security (string)
- Currency: Currency in which the product is traded, ISO 4217 (string; see https://en.wikipedia.org/wiki/ISO_4217)
- SecurityID: Unique identifier for each contract (int)
- Date: Date of trading period (date)
- Time: Minute of trading to which this entry relates (time, hh:mm)
- StartPrice: Trading price at the start of period (float)
- MaxPrice: Maximum price over the period (float)
- MinPrice: Minimum price over the period (float)
- EndPrice: Trading price at the end of the period (float)
- TradedVolume: Total value traded (float)
- NumberOfTrades: Number of distinct trades during the period (int)
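
As an illustration of the schema above, the following sketch parses one CSV row into a typed record using only the Python standard library. It assumes that each file starts with a header row carrying the column names listed above and that dates are formatted as YYYY-MM-DD; neither is confirmed here. The `XetraRecord` class, the `parse_row` helper, and the sample row are invented for this example and do not come from the dataset.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime, time


@dataclass
class XetraRecord:
    isin: str
    mnemonic: str
    security_desc: str
    security_type: str
    currency: str
    security_id: int
    trade_date: date
    trade_time: time
    start_price: float
    max_price: float
    min_price: float
    end_price: float
    traded_volume: float
    number_of_trades: int


def parse_row(row: dict) -> XetraRecord:
    """Convert one CSV row (as a dict of strings) into a typed record."""
    return XetraRecord(
        isin=row["ISIN"],
        mnemonic=row["Mnemonic"],
        security_desc=row["SecurityDesc"],
        security_type=row["SecurityType"],
        currency=row["Currency"],
        security_id=int(row["SecurityID"]),
        # Assumption: dates are ISO formatted (YYYY-MM-DD).
        trade_date=date.fromisoformat(row["Date"]),
        # The schema gives the time format as hh:mm.
        trade_time=datetime.strptime(row["Time"], "%H:%M").time(),
        start_price=float(row["StartPrice"]),
        max_price=float(row["MaxPrice"]),
        min_price=float(row["MinPrice"]),
        end_price=float(row["EndPrice"]),
        traded_volume=float(row["TradedVolume"]),
        number_of_trades=int(row["NumberOfTrades"]),
    )


# Invented sample: a header row plus one made-up record.
SAMPLE_CSV = (
    "ISIN,Mnemonic,SecurityDesc,SecurityType,Currency,SecurityID,Date,Time,"
    "StartPrice,MaxPrice,MinPrice,EndPrice,TradedVolume,NumberOfTrades\n"
    "XX0000000000,ABC,EXAMPLE AG,Common stock,EUR,1234567,2018-01-02,08:00,"
    "10.5,10.8,10.4,10.7,1500,12\n"
)

records = [parse_row(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(records[0])
```

Converting the fields up front means that a malformed price or date raises an error at parse time rather than surfacing later in the pipeline.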