
Data Ingestion & Testing #35

Open
skopula opened this issue Apr 14, 2020 · 4 comments
Labels
Data Ingestion for testing related to getting more test data

Comments

@skopula
Contributor

skopula commented Apr 14, 2020

The links below are great for getting sample log data (proxy, weblog, DNS logs, etc.):
https://www.secrepo.com/
https://log-sharing.dreamhosters.com/
We can start testing with sample user data (user1, user2, etc.) and sample proxy log data. I will create a new "data" folder and populate it with sample proxy log data and user data.
Let me know your thoughts. Or is it too early for data ingestion and testing?

@Jovonni
Collaborator

Jovonni commented Apr 14, 2020

good idea! we have the toy datasets in:
https://github.com/GACWR/OpenUBA/tree/master/test_datasets

We can add another folder alongside "toy_1" @skopula, feel free to add whatever sets of data you want in a branch. We also have the toy_1 datasets sitting in Hadoop, and we simply read them in PySpark.

The data sources are simply declared in the scheme.json file:
https://github.com/GACWR/OpenUBA/blob/master/core/storage/scheme.json

{
  "mode": "test",
  "folder": "../test_datasets/toy_1",
  "type": "local_folder",
  "data":
  [
    {
      "log_name": "proxy",
      "type": "csv",
      "location_type": "disk",
      "folder": "proxy",
      "id_feature": "cs-username",
      "filename_scheme": "mm-dd-yyy"
    }
  ]
}
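A minimal sketch of how a consumer might load this scheme.json-shaped config and enumerate its log sources, using only the standard library. The field names ("mode", "data", "log_name", etc.) come from the example above; the parsing code itself is an illustration, not the project's actual loader:

```python
import json

# The config text mirrors the scheme.json example above.
scheme_text = """
{
  "mode": "test",
  "folder": "../test_datasets/toy_1",
  "type": "local_folder",
  "data": [
    {
      "log_name": "proxy",
      "type": "csv",
      "location_type": "disk",
      "folder": "proxy",
      "id_feature": "cs-username",
      "filename_scheme": "mm-dd-yyy"
    }
  ]
}
"""

scheme = json.loads(scheme_text)

# Each entry in "data" describes one log source: its name, file format,
# and where it lives (disk, hdfs, es).
sources = [(d["log_name"], d["type"], d["location_type"]) for d in scheme["data"]]
print(sources)  # [('proxy', 'csv', 'disk')]
```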

@kaiiyer and jed may have ideas on other datasets, but these are enough for now until we finish pipelining. We also have these files being sent to a local Elastic cluster, and are reading the Elastic data in Python. Will push that up.

Also, the DataSourceFileType enum in process.py defines the data source file type.

OpenUBA/core/process.py

Lines 38 to 42 in 5636a7b

class DataSourceFileType(Enum):
    CSV = "csv"
    FLAT = "flat"
    PARQUET = "parquet"
    JSON = "json"
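In practice this enum lets the pipeline turn the "type" string from scheme.json into a typed value. A small self-contained sketch (the enum values are copied from process.py above; the lookup usage is illustrative):

```python
from enum import Enum

# Mirror of the DataSourceFileType enum shown above (values from process.py).
class DataSourceFileType(Enum):
    CSV = "csv"
    FLAT = "flat"
    PARQUET = "parquet"
    JSON = "json"

# Enum("csv") looks a member up by value, so the raw string from
# scheme.json's "type" field maps directly onto the enum.
file_type = DataSourceFileType("csv")
print(file_type is DataSourceFileType.CSV)  # True
```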

The LogSourceType enum in dataset.py defines the location from which we fetch the data:

OpenUBA/core/dataset.py

Lines 88 to 91 in 5636a7b

class LogSourceType(Enum):
    DISK = "disk"
    HDFS = "hdfs"
    ES = "es"
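A hedged sketch of how the "location_type" field from scheme.json could dispatch to a fetch strategy. The enum values are copied from dataset.py above; the `describe_fetch` helper is hypothetical, shown only to illustrate the dispatch pattern:

```python
from enum import Enum

# Mirror of the LogSourceType enum shown above (values from dataset.py).
class LogSourceType(Enum):
    DISK = "disk"
    HDFS = "hdfs"
    ES = "es"

# Hypothetical helper: pick a fetch strategy from scheme.json's
# "location_type" string.
def describe_fetch(location_type: str) -> str:
    source = LogSourceType(location_type)
    if source is LogSourceType.DISK:
        return "read from local folder"
    elif source is LogSourceType.HDFS:
        return "read from Hadoop via PySpark"
    else:  # LogSourceType.ES
        return "query the local Elastic cluster"

print(describe_fetch("disk"))  # read from local folder
```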

@Jovonni Jovonni added the Data Ingestion for testing related to getting more test data label Apr 14, 2020
@anupamme

What is the status of this issue?

I am interested in the data for UBA and was poking around the toy_1 data folder, but it is not clear to me how this data can be used for any machine learning task, because the data has no labels (e.g. True/False, which we would need to build a classifier).

I can take up this issue, but I would first like to understand how any existing dataset can be used for an ML task, so any guidance would be appreciated.

@jedwafu

jedwafu commented Feb 27, 2022

Hey @anupamme, sorry for the very late response. I just returned from the grave; I was separated from the team for a long time. Looking back at this issue now. Thanks for bumping it.

@anupamme

anupamme commented Apr 3, 2022

Hey @jedwafu, just checking whether this issue is being looked at, and if there is any progress or a timeline?

p.s. I also just came from a trip to the grave :).
