Minimal setup takes ridiculous amounts of disk space #124

Open
lietu opened this issue Feb 5, 2017 · 8 comments

@lietu

lietu commented Feb 5, 2017

I created the most minimal database to test tiedot a bit.

I created one collection, test, and inserted one document into it:

{"test": "yep"}

I checked the filesystem, and there's 768MB of files created by tiedot.

I can understand some preallocation, but having the minimal setup take 3/4 of a GIGABYTE of disk space is quite excessive.


@mmindenhall
Collaborator

I've pasted some text from a wiki page I wrote when I was evaluating tiedot for a project. The line numbers are no longer correct, but it looks like all the settings are still the same. Hope that helps!

Tiedot configuration

Tiedot pre-allocates files for all of the data structures it uses, and grows them when necessary. The default config creates a 32MB file for the data, and then one 32MB file per index (there's an id index by default, plus whatever indices we would create). That's a lot of wasted space for our constrained devices. Fortunately, the author has defined constants for all of the settings, and by choosing different initial values it's easy to start with a small disk footprint that can grow as needed. The settings below assume that there's not a lot of flash space available for storing reports, and as such no more than a few thousand reports will be stored at any given time.

There are two settings in data/collection.go that control the size of the data file and maximum document size (lines 16-17):

	COL_FILE_GROWTH = 32 * 1048576 // Collection file initial size & size growth (32 MBytes)
	DOC_MAX_ROOM    = 2 * 1048576  // Max document size (2 MBytes)

I changed these values to 4MB and 1MB respectively, which causes tiedot to pre-allocate a 4MB data file initially, and then grow that in 4MB increments.
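For reference, the edited constants would presumably look like the fragment below. This is a sketch reconstructed from the description above, not a copy of the modified source; the `main` wrapper is only there so the fragment compiles on its own.

```go
package main

import "fmt"

const (
	COL_FILE_GROWTH = 4 * 1048576 // Collection file initial size & size growth (4 MBytes)
	DOC_MAX_ROOM    = 1 * 1048576 // Max document size (1 MByte)
)

func main() {
	// Print both values in bytes as a sanity check.
	fmt.Println(COL_FILE_GROWTH, DOC_MAX_ROOM)
}
```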

There are several settings in data/hashtable.go that control the behavior of the on-disk hashtable implementation (lines 16-22):

	HT_FILE_GROWTH  = 32 * 1048576                          // Hash table file initial size & file growth
	ENTRY_SIZE      = 1 + 8 + 8                             // Hash entry size: validity (single byte), key (uint64), value (uint64)
	BUCKET_HEADER   = 8                                     // Bucket header size: next chained bucket number (uint64)
	PER_BUCKET      = 16                                    // Entries per bucket
	HASH_BITS       = 16                                    // Number of hash key bits
	BUCKET_SIZE     = BUCKET_HEADER + PER_BUCKET*ENTRY_SIZE // Size of a bucket
	INITIAL_BUCKETS = uint64(65536)                         // Initial number of buckets == 2 ^ HASH_BITS

I changed the following settings:

	HT_FILE_GROWTH  = 1 * 1048576  // Hash table file initial size & file growth
	HASH_BITS       = 11           // Number of hash key bits
	INITIAL_BUCKETS = uint64(2048) // Initial number of buckets == 2 ^ HASH_BITS

With those initial settings, tiedot initially allocates a 1MB file per index.
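The arithmetic behind lowering HASH_BITS together with HT_FILE_GROWTH can be sketched as follows. The constants are the ones quoted above; `initialTableBytes` is a hypothetical helper, and the idea that the initial buckets should fit inside the first file increment is an assumption rather than something confirmed from the tiedot source.

```go
package main

import "fmt"

const (
	ENTRY_SIZE    = 1 + 8 + 8                             // validity byte + key + value
	BUCKET_HEADER = 8                                     // next chained bucket number
	PER_BUCKET    = 16                                    // entries per bucket
	BUCKET_SIZE   = BUCKET_HEADER + PER_BUCKET*ENTRY_SIZE // 280 bytes
)

// initialTableBytes returns the space needed for the initial
// 2^hashBits buckets of one hash table file.
func initialTableBytes(hashBits uint) int {
	buckets := 1 << hashBits
	return buckets * BUCKET_SIZE
}

func main() {
	// Default: 65536 buckets, well inside a 32 MB file.
	fmt.Println(initialTableBytes(16), "of", 32*1048576)
	// Tweaked: 2048 buckets, inside a 1 MB file.
	fmt.Println(initialTableBytes(11), "of", 1*1048576)
}
```

With HASH_BITS left at 16, the initial buckets alone would need about 17.5 MB, far more than a single 1 MB increment, which is presumably why both values were reduced together.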

@HouzuoGuo
Owner

Good morning!

I entirely agree with you: the initial data file size can be quite large. You probably have a 6-core Xeon E3 or an Intel i7 Extreme Edition.
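The link between core count and the 768MB figure can be sketched with some arithmetic. It assumes that a new database gets one partition per logical CPU, that each partition holds one data file plus one file per index, and that every file starts at its full growth increment. `initialFootprint` is a hypothetical helper, and the 12 logical CPUs (6 cores with hyper-threading) are a guess at the reporter's machine.

```go
package main

import "fmt"

const mib = 1 << 20

// initialFootprint estimates the bytes pre-allocated for one collection:
// partitions * (data file + one hash table file per index).
func initialFootprint(partitions, indexes int, colFileGrowth, htFileGrowth int64) int64 {
	perPartition := colFileGrowth + int64(indexes)*htFileGrowth
	return int64(partitions) * perPartition
}

func main() {
	// Default constants, 12 partitions, only the id index.
	fmt.Println(initialFootprint(12, 1, 32*mib, 32*mib)/mib, "MB")
	// Same machine with the 4 MB / 1 MB constants from the earlier comment.
	fmt.Println(initialFootprint(12, 1, 4*mib, 1*mib)/mib, "MB")
}
```

Under those assumptions the defaults give 12 × (32 + 32) = 768 MB, matching the report, while the reduced constants would give 60 MB on the same machine.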

Those numbers were written down as constants because altering them after creation of a collection is quite a challenging task. If tiedot used a tree or skip list data structure for indexes, those huge numbers could be avoided. That reminds me of the famous quote "today's constant is tomorrow's variable".

tiedot has seen very infrequent updates in recent months, so it may be viable to maintain a fork with tweaked constants. I hope that helps.

@d1ngd0
Collaborator

d1ngd0 commented Nov 14, 2017

Is there any reason these couldn't be configured using environment variables? If you want me to make a pull request, I'm sure I could get around to it in the next couple of days.

@HouzuoGuo
Owner

It was an incorrect decision to make them constants in the beginning.

How about this: write the collection and hash table parameters to a JSON or text file under the database directory. If the file exists, its parameters will be used to operate on collections; if it does not exist, the default values (the current constants) will be used instead, and the file will be created with those defaults written to it.

@d1ngd0
Collaborator

d1ngd0 commented Nov 15, 2017

Sounds like a plan. I will start working on a PR.

@HouzuoGuo
Owner

Wonderful to hear! Many thanks for your help.

@d1ngd0
Collaborator

d1ngd0 commented Nov 24, 2017

#157 added this; let me know what you think.

@yanmingsohu

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"strconv"

	"github.com/HouzuoGuo/tiedot/db"
)

// NUM_PARTS is the number of partitions written to the partition-count file.
const NUM_PARTS = 2

//
// Pre-write the partition count to prevent excessive hard disk pre-allocation.
// Call this before the database is opened for the first time.
//
func writeDBConfig(dbname string) error {
	dbDir := filepath.Join("./db_base_path", dbname)
	numFile := filepath.Join(dbDir, db.PART_NUM_FILE)

	// Ignore if the file already exists
	if _, err := os.Stat(numFile); err == nil {
		return nil
	}
	// The database directory must exist before the file can be written
	if err := os.MkdirAll(dbDir, 0700); err != nil {
		return err
	}
	return ioutil.WriteFile(numFile, []byte(strconv.Itoa(NUM_PARTS)), 0600)
}
