
Feature - Distributed storage of market data #3540

Open
NAmorim opened this issue Jan 7, 2022 · 4 comments
Labels
help wanted Extra attention is needed improvement New feature or request

Comments

@NAmorim
Contributor

NAmorim commented Jan 7, 2022

Is your feature request related to a problem? Please describe.
The platform needs a distributed storage solution so that historical market data is always available to all users.

Describe the solution you'd like
Many exchanges do not provide a very long history of market data. This hinders backtesting scenarios for testing strategies and validating indicators.
With distributed storage such as IPFS, market data that is downloaded or processed (e.g. indicators) by one user can become available to all users.

Describe alternatives you've considered
Currently each installation keeps its own set of market data as a collection of JSON files. This could potentially be further optimized using a time-series DB.

@NAmorim NAmorim added improvement New feature or request help wanted Extra attention is needed labels Jan 7, 2022
@NAmorim NAmorim added this to To Do in Data-Storage Project via automation Jan 7, 2022
@devosonder
Contributor

Which users will be authorized to write to the decentralized database?

@NAmorim
Contributor Author

NAmorim commented Jan 10, 2022

I would say everyone is allowed to write data. Perhaps only users above a certain reputation threshold should be the ones able to write.

I'm having second thoughts on this feature. Users that are creating strategies do need historical data for backtesting, and they can download it from the exchanges.
The majority of users will simply consume trading signals, where there is no need for past data.

Thoughts?

EDIT
This is related to issue #3171

@Luis-Fernando-Molina
Member

This is a chat I just had that is relevant to this issue, for those of you looking into this problem:

Kara Lama, [1/10/2022 10:12 AM]
hi bro.

Kara Lama, [1/10/2022 10:14 AM]
I want to talk about distributed storage of market data.

Kara Lama, [1/10/2022 10:14 AM]
#3540

Kara Lama, [1/10/2022 10:15 AM]
I think OrbitDB can be used for this task

Kara Lama, [1/10/2022 10:27 AM]
I need to ask you some questions.
1- Is there a need to keep different data for each crypto exchange?
2- In a decentralized database architecture, the databases are synchronized. After a while, the data set will become very large; for example, 5 years of data from 10 crypto exchanges will be very large. How will users cope with this data size?
3- Assuming everyone has write access to the DB, how can we prevent malicious people from corrupting the database?
4- One of the biggest challenges of decentralized databases is the lack of an always-on node. Therefore, it will often be necessary to keep a replication of this database on a server that is never shut down until a sufficient number of replications is obtained. How do you intend to handle this?

Luis Molina, [1/10/2022 10:38 AM]
hey, happy to see someone interested in this subject. I don't have answers to all the questions, but I have some ideas of how it should be.

Luis Molina, [1/10/2022 10:39 AM]
I haven't heard about OrbitDB, but I have heard of IPFS

Luis Molina, [1/10/2022 10:39 AM]
Let's formulate the problems we want to solve; I believe there are two.

Luis Molina, [1/10/2022 10:40 AM]
Number 1: Users today need to process 100% of the data. Being able to share your processed data would allow other users not to have to download and process it themselves.

Luis Molina, [1/10/2022 10:43 AM]
Number 2: Teams trading together in Trading Armies (a concept that has not emerged yet, but that we are close to enabling) need to have a common trading operation in which they do Data Mining together. They don't need a decentralized database at that point, but there are ideas of allowing more people to join the army by providing Data Mining (i.e. data already processed, based on the set of indicators needed by the army). In that future, they would become part of the trading army not because they are building or backtesting the strategy but because they provide data mining processing.

Luis Molina, [1/10/2022 10:43 AM]
So, back to your questions:

Luis Molina, [1/10/2022 10:44 AM]

1. As mentioned for problem Number 1, it would be cool not to need to process all your data; I would not say it is a need.

Luis Molina, [1/10/2022 10:46 AM]
2. I never thought about a decentralized database, but about enabling IPFS for each user doing data mining, so that whatever they calculate then becomes available. What would be needed is a place to look up the file hashes in order to actually fetch those files from IPFS when needed.

Kara Lama, [1/10/2022 10:47 AM]
[In reply to Luis Molina]
OrbitDB uses the IPFS infrastructure. https://github.com/orbitdb/orbit-db

Luis Molina, [1/10/2022 10:47 AM]
In that vision, each user would keep the hash tables in their own data storage, probably a dedicated GitHub repo for this, and people using the system could get the hashes from there, and then the content from IPFS.
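A minimal sketch of the publishing side of that idea, assuming the ipfs-http-client package and a locally running IPFS daemon; the hash-table file path and data-key format are hypothetical:

```typescript
// Sketch: publish a calculated data file to IPFS and record its CID in a
// local hash table that the user could later commit to their dedicated repo.
import { create } from 'ipfs-http-client'
import { promises as fs } from 'fs'

const ipfs = create() // connects to the local daemon at 127.0.0.1:5001 by default

async function publishDataFile(localPath: string, dataKey: string): Promise<void> {
  const content = await fs.readFile(localPath)
  const { cid } = await ipfs.add(content)      // content-addressed: the CID is derived from the bytes

  // Append the mapping to a local hash table (hypothetical location).
  const tablePath = './my-data-hashes.json'
  const table = JSON.parse(await fs.readFile(tablePath, 'utf8').catch(() => '{}'))
  table[dataKey] = cid.toString()
  await fs.writeFile(tablePath, JSON.stringify(table, null, 2))
}

// Example: publish one day of candles for BTC/USDT on Binance (path is illustrative).
publishDataFile('./Data-Storage/Binance/BTC-USDT/Candles/2022-01-01.json',
                'Binance/BTC-USDT/Candles/2022-01-01')
```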

Luis Molina, [1/10/2022 10:47 AM]
[In reply to Kara Lama]
yeah, I quickly read about it

Luis Molina, [1/10/2022 10:48 AM]
3. If it is a database, then you will have that problem. If each user publishes their own hash tables, then each user might be able to decide whom to trust, probably based on token holdings.

Luis Molina, [1/10/2022 10:50 AM]
4. Since most people would be calculating similar stuff, I guess a lot of the essential data (candles, for instance) might be calculated and added to IPFS by multiple users, creating redundancy. And even if the algorithm we implement is so efficient that only one guy produces a certain piece of data, any consumer of that data would be publishing it back to IPFS, generating redundancy.

Luis Molina, [1/10/2022 10:53 AM]
So in my imagination, what we have is IPFS running together with the platform. The charts would read data locally, and only if it is not there try to fetch it from IPFS; if found, this data would be recycled back into IPFS by the new user that now has it.

Luis Molina, [1/10/2022 10:54 AM]
Data Mining should be modified to fetch data from the local network if it exists, otherwise to fetch it from IPFS, and if there is none there, to tell the user to calculate it himself, something like this.
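A minimal sketch of that read path, under the same assumptions as above (ipfs-http-client, local daemon); the function name and fallback behaviour are illustrative only:

```typescript
// Sketch: try the local file first, then IPFS (via a CID looked up in the
// merged hash table), otherwise report that the data must be calculated locally.
import { create } from 'ipfs-http-client'
import { promises as fs } from 'fs'

const ipfs = create()

async function loadDataFile(localPath: string, cid?: string): Promise<Buffer | null> {
  try {
    return await fs.readFile(localPath)          // 1. local copy
  } catch { /* not on disk, fall through */ }

  if (cid) {                                     // 2. fetch from IPFS
    const chunks: Uint8Array[] = []
    for await (const chunk of ipfs.cat(cid)) chunks.push(chunk)
    const data = Buffer.concat(chunks)
    await fs.writeFile(localPath, data)          // cache locally; re-providing it adds redundancy
    return data
  }

  return null                                    // 3. caller should ask the user to calculate it
}
```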

Luis Molina, [1/10/2022 10:55 AM]
My guess is that to solve Problem Number 2 of the future, we first need to solve Problem Number 1.

Kara Lama, [1/10/2022 10:55 AM]
Ok. Let's imagine that we are doing this job without using a database, just using IPFS. If I delete or modify files (maliciously) and they sync with IPFS, they will be distributed to all users. Isn't that a huge consistency problem?

Luis Molina, [1/10/2022 10:57 AM]
If you modify a candles file on your PC and publish it to IPFS, that does not mean it will reach other users. Your modified file will have a different hash, which is the file name and the key for finding that file.

Luis Molina, [1/10/2022 10:57 AM]
Let's use an example:

Luis Molina, [1/10/2022 10:57 AM]
You downloaded the candles file for 1 Jan 2022 for BTC/USD on Binance.

Luis Molina, [1/10/2022 10:57 AM]
That file is already on IPFS, calculated by someone else, with hash ABC.

Luis Molina, [1/10/2022 10:58 AM]
You modify it, and now it has hash 123; you publish it to IPFS.
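To illustrate the content addressing being described here: the identifier is derived from the file's bytes, so modified content necessarily gets a different identifier. A plain SHA-256 stands in for the CID computation in this sketch:

```typescript
// Different content always yields a different identifier, so a tampered file
// can never be served to someone requesting the original hash.
import { createHash } from 'crypto'

const original = Buffer.from('{"candles":[...original data...]}')
const tampered = Buffer.from('{"candles":[...modified data...]}')

const hashOf = (data: Buffer) => createHash('sha256').update(data).digest('hex')

console.log(hashOf(original))   // the "ABC" hash in the example above
console.log(hashOf(tampered))   // a completely different hash, the "123" in the example
```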

Luis Molina, [1/10/2022 10:58 AM]
All users that need that file need the hash in order to request it from IPFS. The question is: where do they get the hash?

Luis Molina, [1/10/2022 10:58 AM]
from the original creator or from you?

Luis Molina, [1/10/2022 10:59 AM]
my guess is that everyone should get the hashes from the users with the most SA tokens / reputation

Luis Molina, [1/10/2022 10:59 AM]
so unless you are the one with the most SA tokens, algorithmically no one would get the hash of that file from you

Luis Molina, [1/10/2022 11:00 AM]
makes sense?

Luis Molina, [1/10/2022 11:01 AM]
We will still have to make sure that entities with a lot of SA do calculate the files themselves and don't take them from very low SA entities and republish them blindly.

Kara Lama, [1/10/2022 11:04 AM]
SA can be used to understand which user's data is more reliable. But let's consider a scenario like this:

Kara Lama, [1/10/2022 11:10 AM]
1- I need data between 2020.01.01 - 2020.12.31 for backtesting.
2- The data between 2020.01.01 - 2020.01.31 is in the IPFS space of user A.
The data between 2020.02.01 - 2020.02.28 is in the IPFS space of user B.
The data between 2020.03.01 - 2020.03.31 is in the IPFS space of user C.
…
If the range I want is spread across many IPFS spaces (hashes), and I don't know what is in all the hashes on IPFS, how do I find them?

Kara Lama, [1/10/2022 11:13 AM]
So I'll have to trust all users A, B, C, ... because I can't know what's in the data until I get it myself. Maybe the pinning structure can be used, but I don't know exactly.

Luis Molina, [1/10/2022 11:14 AM]
We can assume that user A would have a dedicated repository where he saves the hashes of the data he calculated, and the same for users B and C, who have their own repos.

User X running the platform has access to all the User Profile plugins of all users, and those explicitly declare the repositories each user has for storing their hashes.

The way I can imagine this working is to load the hashes from multiple sources and create a local hash table for all the available data. That local hash table would prioritize users with more SA for the same files.
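A minimal sketch of building that local hash table from several users' published tables, keeping for each data key the entry from the publisher with the most SA; the SAEntry shape and field names are hypothetical:

```typescript
// Merge hash tables published by multiple users, preferring higher-SA publishers.
interface SAEntry {
  dataKey: string      // e.g. "Binance/BTC-USDT/Candles/2020-01"
  cid: string          // IPFS CID published by this user
  sa: number           // publisher's SA token balance / reputation
}

function mergeHashTables(sources: SAEntry[][]): Map<string, SAEntry> {
  const merged = new Map<string, SAEntry>()
  for (const table of sources) {
    for (const entry of table) {
      const current = merged.get(entry.dataKey)
      if (!current || entry.sa > current.sa) {
        merged.set(entry.dataKey, entry)   // keep the highest-reputation publisher per key
      }
    }
  }
  return merged
}
```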

Kara Lama, [1/10/2022 11:19 AM]
Yes.

Kara Lama, [1/10/2022 11:19 AM]
Yes, we need an index. That's why I thought of it like this: creating a database instead of keeping such an index. But this database has two phases: 1- BulkData, 2- RealData.
Anyone can write to BulkData, but write permissions for RealData are limited. A tool with write permissions to RealData examines the BulkData and writes it to RealData accordingly. All Superalgos users can connect to RealData (read-only).
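A rough sketch of that two-phase idea, purely to illustrate the flow; the names and the trust check are hypothetical:

```typescript
// BulkData: open for writes. RealData: written only by a permissioned validator.
interface DataRecord { dataKey: string; cid: string; submittedBy: string }

const bulkData: DataRecord[] = []                 // anyone can append here
const realData = new Map<string, DataRecord>()    // all users read this (read-only)

function submitToBulkData(record: DataRecord): void {
  bulkData.push(record)                           // unvalidated submissions land here
}

// Run by the few identities holding write permission on RealData.
function promoteValidRecords(isTrusted: (submitter: string) => boolean): void {
  for (const record of bulkData) {
    if (isTrusted(record.submittedBy)) {
      realData.set(record.dataKey, record)
    }
  }
  bulkData.length = 0                             // clear processed submissions
}
```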

Luis Molina, [1/10/2022 11:20 AM]
So the database would be a database of hashes, not of file content, right?

Luis Molina, [1/10/2022 11:20 AM]
define BulkData and RealData please

Kara Lama, [1/10/2022 11:21 AM]
hashes and candle data etc

Kara Lama, [1/10/2022 11:22 AM]
Even if such tools are written, there is a problem. We will need to store, at a central point, the users who have permission to synchronize BulkData and RealData using this tool.

Luis Molina, [1/10/2022 11:24 AM]
Also, who would decide which data to store? Are you going to store all data produced by every user, even if it is whatever?

Luis Molina, [1/10/2022 11:26 AM]
With just IPFS you don't have those problems, because everyone publishes the data they produce or consume, and it does not affect anyone else. If I want to produce 100,000 terabytes of data, no one would care, because the data is not forced onto anyone.

Kara Lama, [1/10/2022 11:26 AM]
[In reply to Luis Molina]
No, not users' private data, only data that concerns all of Superalgos.

Kara Lama, [1/10/2022 11:27 AM]
I'll think about whether these problems can be solved with just IPFS. I will share the results with you.

Luis Molina, [1/10/2022 11:28 AM]
ok great. I am not opposing a database or a decentralized database; it's just that I don't see it as appropriate at the moment, with my current knowledge and current understanding of the problem. See you later then.

Kara Lama, [1/10/2022 11:29 AM]
After this conversation, I understand it better. You do not want a common data area; you want a sharing system. Is that right?

Luis Molina, [1/10/2022 11:38 AM]
[In reply to Kara Lama]
It is not that I don't want it; it is that I see problems with no clear solution for a common database, like the ones mentioned, the same problems that you saw and did not have a clear answer to. Who will decide this or that? How do we prevent malicious actors from contaminating the data or flooding the storage with unusable data? There are probably no solutions for that in an open system like this.

Luis Molina, [1/10/2022 11:41 AM]
The system should allow you to choose between using data calculated by other users or downloading and calculating it yourself. Users will then choose which to use on a case-by-case basis.

Maybe you would like to browse calculated data on the charts, for any exchange and any market, including indicators and so on, but you would like to calculate everything yourself for your live trading.

@NAmorim
Contributor Author

NAmorim commented Jan 10, 2022

There's GUNjs (https://gun.eco/), which presents solutions for some of these issues.
The best case would be to leverage some of these technologies for the developments being done, such as distributed storage, the social graph, and the p2p network. It appears that IPFS and GUN fit the need.

As stated before, this is based on some investigation I did last month. I'm not actively working on this now.
