Simple class for CSR matrix generating from text data

Useful python script for converting csv-log files to compressed sparse row.

Compressed sparse matrice preparation
Sliding window reading capability
Numpy realization

Sparse matrices for efficient machine learning

Sparse matrices are regular used in machine learning. But it have lots of zero values, we can apply special algorithms that will do two important things:

compress the memory footprint of our matrix object
speed up many machine learning routines

Since storing all those zero values is a waste, we can apply data compression techniques to minimize the amount of data we need to store. That is not the only benefit, however. Users of sklearn will note that all native machine learning algorithms require data matrices to be in-memory. Said another way, the machine learning process breaks down when a data matrix (usually called a dataframe) does not fit into RAM. One of the perks of converting a dense data matrix to sparse is that in many cases it is possible to compress it so that it can fit in RAM.[1]

The compressed sparse row (CSR) or compressed row storage (CRS) format represents a matrix M by three (one-dimensional) arrays, that respectively contain nonzero values, the extents of rows, and column indices. It is similar to COO, but compresses the row indices, hence the name. This format allows fast row access and matrix-vector multiplications (Mx). The CSR format has been in use since at least the mid-1960s, with the first complete description appearing in 1967.[2]

Snippet class example

Trivial task:

Read provided log files and prepare it for machine learning model that predicts user behaviour

Dataset of web-browsed logs was used from the paper: A Tool for Classification of Sequential Data. Giacomo Kahn, Yannick Loiseau and Olivier Raynaud

Solution:

Python script reads csv files using sliding window and creates dataframe. For learning efficiency and memory usage optimization script also converts data frame to compressed sparse matrice.

Using:

Read csv data to dataframe. By default session size is 10 sites. Window is equal to session size (no sliding).

PATH_TO_DATA = './data/test'
reader = SequentialData2Sparse()
df = reader.test_csv_read(PATH_TO_DATA)
print(df.head())

             timestamp             site  user_id
0  2013-11-15 09:28:17           vk.com        1
1  2013-11-15 09:33:04       oracle.com        1
2  2013-11-15 09:52:48       oracle.com        1
3  2013-11-15 11:37:26  geo.mozilla.org        1
4  2013-11-15 11:40:32       oracle.com        1

Convert dataframe to sparse matrice

X, y = reader.convert_dataframe(df)
print (X) #sparse
print (X.todense()) #densed
print (y) #target values

  (0, 0)	1
  (0, 1)	3
  (0, 2)	1
  (0, 4)	1
  (0, 7)	1
  (0, 8)	1
  
[[1 3 1 0 1 0 0 1 1 1 1]
 [3 0 1 0 0 0 0 0 0 0 0]
 [0 2 1 0 0 2 0 0 0 0 0]
 [4 2 0 2 1 0 1 0 0 0 0]
 [1 1 0 1 0 0 0 0 0 0 0]]
 
 [1. 1. 2. 3. 3.]
...

How to start?

git clone https://github.com/alm4z/sparse-from-sequential-data.git
cd sparse-from-sequential-data
python data2sparse.py

Contact

If you have any further questions or suggestions, please, do not hesitate to contact by email at a.sadenov@gmail.com.

References

Great article from David Ziganto - Sparse Matrices For Efficient Machine Learning
Wikipedia article - Sparse Matrices

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
data/test		data/test
LICENSE		LICENSE
README.md		README.md
data2sparse.py		data2sparse.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data/test

data/test

LICENSE

LICENSE

README.md

README.md

data2sparse.py

data2sparse.py

Repository files navigation

Simple class for CSR matrix generating from text data

Sparse matrices for efficient machine learning

Snippet class example

How to start?

Contact

References

License

About

Releases

Packages

Languages

License

alm4z/sparse-from-sequential-data

Folders and files

Latest commit

History

Repository files navigation

Simple class for CSR matrix generating from text data

Sparse matrices for efficient machine learning

Snippet class example

How to start?

Contact

References

License

About

Topics

Resources

License

Stars

Watchers

Forks

Languages