Skip to content

jackr276/Sparse-Matrix-Utilities

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

49 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Sparse Matrix Utilities

Author: Jack Robbins

This project provides 3 C programs that are designed to generate, print and convert binary files containing sequences of 4-byte integers. The binary files are treated as sparse matrices, that is, matrices who contain at least 50% zeroes, making them truly sparse. A full description of each of these utilities, and of sparse matrices in general, is given below.

What is a Sparse Matrix?

A sparse matrix is a matrix whose contents are at least 50% 0's. Here is a small 5 X 5 example:

0 1 0 0 0
0 0 2 0 1
0 4 0 0 0
12 0 0 0 0
3 0 0 0 0

Notice how we only have 6 nonzero values(1, 2, 3, 4 and 12) out of 25 total values, making this matrix truly sparse.

Unlike this small 5 X 5 example, these matrices are often quite large, and therefore storing them efficiently in memory/on disk provides an interesting challenge to solve. Consider a sparse matrix of unsigned integers that is 1000 rows by 1000 columns. If we know that this matrix is sparse, we know that out of the $1,000 * 1,000 = 1,000,000$ elements in the matrix, at least 50% of them, $500,000$, will be 0's. If we were to store all of these 0's in memory, we are guaranteed, in our example, to be storing at least $500,000 * 4$ Bytes/Unisgned Integer = $2,000,000$ Bytes = $2$ MegaBytes(MB) of 0's in memory, or in a file somewhere, a gross waste of resources. Remember, this is the best case scenario, as we guarantee that the matrix is at least 50% 0's, so in all likelihood we would be wasting more than $2$ MB's of memory. The problem of efficient storage is addressed below in the CSR Format conversion section.

Sparse matrices appear in several distinct areas of science and mathematics. Machine learning, combinatorics and numerical analysis are a few examples of where the concept of sparse matrices appear.

Generating Sparse Matrices

The C program generate_sparse_mat.c allows users to generate a sparse matrix of any dimension that they choose. The number of rows and columns is provided as user-input to this program upon runtime(see more about running these programs below). Once a user enters the desired dimensions, the program creates a binary file, matrix.bin, in the current directory. This program guarantees that all binary files generated will be at least 50% zeroes, so that when they are interpreted as sparse matrices, they are truly sparse.

Binary files generated by generate_sparse_mat.c follow these conventions:

  1. The first 4 bytes of the file are the number of rows the user has entered, and the second 4 bytes are the number of columns, making the first 8 bytes of the file the dimensions of the sparse matrix, and not actual values in the sparse matrix.
  2. The program will write row * column randomly generated unsigned integers to the file after writing the dimensions in the first 8 bytes. The program was written with unsigned integers in mind, but users could easily choose to interpret each of these 4-byte values as signed integers, floats, etc.
  3. The file created will always be entitled matrix.bin. If no file titled matrix.bin in the current directory exists, one will be created. Otherwise, the current matrix.bin file will be overwritten.

Additionally, it is possible to modify the source code to change the density of the sparse matrices that are generated. The constant RATIO at the top of the file may be changed to any number between 0 and 1 to adjust how dense/sparse the matrix is. The closer RATIO is to 1, the more dense the matrix will be. RATIO is set by default to 0.2, which is a safe defualt if users would like a truly sparse matrix.

Displaying Contents of Binary Files

The C program display_contents.c provides a way for users to "dump out" the contents of the binary files that generate_sparse_mat.c creates. This simple program reads in a binary file that the user gives it, and will go through the file, 4 bytes at a time, and display the unsigned integer value, hexadecimal value, and the char value of each byte for every 4 byte chunk. For example, if this program were to read in the hexadecimal value: 0x33c368e1, it would display the following:

Integer: Hex: 0x33c368e1 Dec: 868444385 
Byte 1: Hex: 0xe1 Char: \225
Byte 2: Hex: 0x68 Char: h
Byte 3: Hex: 0xc3 Char: \195
Byte 4: Hex: 0x33 Char: 3

The program diplays a message in the above format for every 4 bytes in the binary file provided, in sequential order. This program is useful for sanity checks/debugging when working with the sparse matrix binary files generated by create_sparse_mat.c.

Note

If you are more interested in signed integer values, or float values, the source code is very well documented and incredibly simple to modify.

Converting Sparse Matrices into Compressed Sparse Row(CSR) Format

What is Compressed Sparse Row(CSR) Format?

Compressed Sparse Row(CSR) format is an efficient way of storing sparse matrices. As mentioned above, since sparse matrices contain have at least half of their values set to 0, they have the potential to waste kilobytes or megabytes of space in memory or on disk. CSR format provides a way of efficiently storing sparse matrices, while also preserving all of the important information about them.

CSR format works by storing three arrays:

  1. A values array that stores all of the nonzero values
  2. A column_indices array that stores the column index of each corresponding value in their respective rows
  3. A row_start array that stores the index in the values array at which each row starts. The last value in the row_start array also happens to be the number of nonzero elements that the sparse matrix has

This way of storing sparse matrices is very efficient, as only meaningful data(values and their positions) is stored. It is also very easy to use these three arrays to reconstruct a sparse matrix should the need arise, as all necessary information is provided to do so.

The description of the row_start array in particular can be a little abstract without a real example. To make it more concrete, let's take our example from above and convert it into CSR format:

0 1 0 0 0
0 0 2 0 1
0 4 0 0 0
1 0 0 0 0
3 0 0 0 0

In compressed row format, this sparse matrix would be stored as:

dimensions 5 5
values = [1, 2, 1, 4, 1, 3]
column_indices = [1, 2, 4, 1, 0, 0]
row_start = [0, 1, 3, 4, 5, 6]

Properly reading CSR format would go as follows: index 0 in row start has a value of 0, meaning that row 0 starts at index 0 in values, and goes up to but does not include index 1 in values. values[0] is 1, and the corresponding integer in column_indices at index 0 is 1. This tells us that the value 1 is the only value in row 0, and it is located at column index 1 in row 0. Knowing this we can reconstruct row 0 as 0 1 0 0 0. This can be done for each value in the values array to reconstruct the entire sparse matrix.

The utility program convert_to_csr.c

The C program convert_to_csr.c takes in a binary file as a command line argument, and programmatically converts that file into compressed sparse row format, saving the result into a text file matrix.txt. It is important to note that convert_to_csr.c relies on binary sparse matrix files following the convention of having the first 8 bytes in the binary file represent the number of rows and columns in the sparse matrix. Binary files that violate this convention will still be interpreted, just not correctly. The program itself is very well documented, so I will not reiterate its functionality here. I would encourage anyone interested to view the source code for themselves.

Running these programs

For convenience, a runner script run.sh has been provided that will compile, take input, and run all of the programs in this project. Some examples of how to use it are below.

Note

After downloading run.sh, be sure to grant executable permissions by running chmod +x run.sh

To generate a 5X5 sparse matrix:

example@bash:~$ ./run.sh create_sparse_mat.c
If your program requires input, enter it here: 5 5

To display the contents of the file matrix.bin in the current directory

example@bash:~$ ./run.sh display_contents.c
If your program requires input, enter it here: matrix.bin

To convert the binary file in matrix.bin into compressed sparse row format

example@bash:~$ ./run.sh convert_to_csr.c
If your program requires input, enter it here: matrix.bin

About

A collection of C programs for creating, displaying and converting sparse matrices into efficient formats

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published