
Data Ingestion & Testing #35

Open
skopula opened this issue Apr 14, 2020 · 4 comments
Labels
Data Ingestion for testing related to getting more test data

Comments

@skopula
Contributor

skopula commented Apr 14, 2020

The links below are great for getting sample log data (proxy, weblog, DNS logs, etc.):
https://www.secrepo.com/
https://log-sharing.dreamhosters.com/
We can start testing with sample user data (user1, user2, etc.) and sample proxy log data. I will create a new "data" folder and populate it with sample proxy log data and user data.
Let me know your thoughts. Or is it too early for data ingestion and testing?

@Jovonni
Collaborator

Jovonni commented Apr 14, 2020

good idea! we have the toy datasets in:
https://github.com/GACWR/OpenUBA/tree/master/test_datasets

We can add another folder alongside "toy_1" @skopula, feel free to add whatever sets of data you want in a branch. We also have the toy_1 datasets sitting in Hadoop, and we simply read them in PySpark.

The data sources are simply declared in the scheme.json file:
https://github.com/GACWR/OpenUBA/blob/master/core/storage/scheme.json

{
  "mode": "test",
  "folder": "../test_datasets/toy_1",
  "type": "local_folder",
  "data":
  [
    {
      "log_name": "proxy",
      "type": "csv",
      "location_type": "disk",
      "folder": "proxy",
      "id_feature": "cs-username",
      "filename_scheme": "mm-dd-yyy"
    }
  ]
}
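A minimal sketch of how a consumer might load this scheme.json-shaped config and enumerate its log sources, using only the standard library. The field names ("mode", "data", "log_name", etc.) come from the example above; the parsing code itself is an illustration, not the project's actual loader:

```python
import json

# The config text mirrors the scheme.json example above.
scheme_text = """
{
  "mode": "test",
  "folder": "../test_datasets/toy_1",
  "type": "local_folder",
  "data": [
    {
      "log_name": "proxy",
      "type": "csv",
      "location_type": "disk",
      "folder": "proxy",
      "id_feature": "cs-username",
      "filename_scheme": "mm-dd-yyy"
    }
  ]
}
"""

scheme = json.loads(scheme_text)

# Each entry in "data" describes one log source: its name, file format,
# and where it lives (disk, hdfs, es).
sources = [(d["log_name"], d["type"], d["location_type"]) for d in scheme["data"]]
print(sources)  # [('proxy', 'csv', 'disk')]
```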

@kaiiyer and jed may have ideas on other datasets, but these are enough for now until we finish pipelining. We also have these files being sent to a local Elastic cluster, and are reading the Elastic data in Python. Will push that up.

Also, the DataSourceFileType enum in process.py defines the data source file type.

OpenUBA/core/process.py

Lines 38 to 42 in 5636a7b

class DataSourceFileType(Enum):
    CSV = "csv"
    FLAT = "flat"
    PARQUET = "parquet"
    JSON = "json"
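In practice this enum lets the pipeline turn the "type" string from scheme.json into a typed value. A small self-contained sketch (the enum values are copied from process.py above; the lookup usage is illustrative):

```python
from enum import Enum

# Mirror of the DataSourceFileType enum shown above (values from process.py).
class DataSourceFileType(Enum):
    CSV = "csv"
    FLAT = "flat"
    PARQUET = "parquet"
    JSON = "json"

# Enum("csv") looks a member up by value, so the raw string from
# scheme.json's "type" field maps directly onto the enum.
file_type = DataSourceFileType("csv")
print(file_type is DataSourceFileType.CSV)  # True
```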

The LogSourceType enum in dataset.py defines the location from which we fetch the data:

OpenUBA/core/dataset.py

Lines 88 to 91 in 5636a7b

class LogSourceType(Enum):
    DISK = "disk"
    HDFS = "hdfs"
    ES = "es"
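A hedged sketch of how the "location_type" field from scheme.json could dispatch to a fetch strategy. The enum values are copied from dataset.py above; the `describe_fetch` helper is hypothetical, shown only to illustrate the dispatch pattern:

```python
from enum import Enum

# Mirror of the LogSourceType enum shown above (values from dataset.py).
class LogSourceType(Enum):
    DISK = "disk"
    HDFS = "hdfs"
    ES = "es"

# Hypothetical helper: pick a fetch strategy from scheme.json's
# "location_type" string.
def describe_fetch(location_type: str) -> str:
    source = LogSourceType(location_type)
    if source is LogSourceType.DISK:
        return "read from local folder"
    elif source is LogSourceType.HDFS:
        return "read from Hadoop via PySpark"
    else:  # LogSourceType.ES
        return "query the local Elastic cluster"

print(describe_fetch("disk"))  # read from local folder
```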

@Jovonni Jovonni added the Data Ingestion for testing related to getting more test data label Apr 14, 2020
@anupamme

What is the status of this issue?

I am interested in the data for UBA and was poking around the toy_1 data folder, but it is not clear to me how this data can be used for any machine learning task, because the data has no labels (e.g. True/False, which we would need to build a classifier).

I can take up this issue, but I would first like to understand how any existing dataset can be used for an ML task, so any guidance would be appreciated.

@jedwafu

jedwafu commented Feb 27, 2022

Hey @anupamme, sorry for the very late response. I just returned from the grave; I was separated from the team for a long time. Looking back at this issue now. Thanks for bumping it.

@anupamme

anupamme commented Apr 3, 2022

Hey @jedwafu, just checking whether this issue is being looked at, and if there is any progress or a timeline?

p.s. I also just came from a trip to the grave :).
