upload file: allow user to specify the schema #31

stechu · 2014-12-03T19:34:20Z

Automatically inferring the schema is very cool. But sometimes users may still want to specify the schema, because:

To fail quickly if there is an incorrect value.
To work around some bad values by specifying every column as a string.

Just my personal opinion.
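
To make the two motivations concrete, here is a minimal sketch of parsing a CSV against a user-supplied schema. It uses only the Python standard library; the function `parse_with_schema`, the `CONVERTERS` table, and the type names are hypothetical and are not part of the upload tool or of messytables.

```python
import csv
import io

# Hypothetical mapping from type names to converters (an assumption,
# not the upload tool's actual type system).
CONVERTERS = {"INT": int, "DOUBLE": float, "STRING": str}


def parse_with_schema(text, schema):
    """Parse CSV text using an explicit schema of (name, type_name) pairs.

    Raises ValueError naming the row and column of the first bad value,
    so a bad file fails quickly instead of being silently re-typed.
    """
    rows = []
    for row_num, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != len(schema):
            raise ValueError(
                f"row {row_num}: expected {len(schema)} columns, got {len(row)}"
            )
        converted = []
        for (name, type_name), cell in zip(schema, row):
            try:
                converted.append(CONVERTERS[type_name](cell))
            except ValueError as exc:
                raise ValueError(
                    f"row {row_num}, column {name!r}: bad {type_name} value {cell!r}"
                ) from exc
        rows.append(converted)
    return rows
```

Declaring every column as `STRING` in this sketch accepts any cell unchanged, which corresponds to the second workaround.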

dhalperi · 2014-12-03T19:36:46Z

Can you say more, with a concrete example?

Daniel Halperin
Director of Research for Scalable Data Analytics
eScience Institute
University of Washington


stechu · 2014-12-03T19:52:20Z

One use case:

We are trying to use Myria to process the Google cluster usage data; each table has 10-20 columns.

There are a lot of missing values in the CSV files (Google just places nothing between two commas). During pre-processing, we replaced each missing value with -1. The reason for using -1 is that most fields hold positive numeric values when data is present, so after the replacement these values can easily be filtered out in later analytical queries. Now the problems are:

Google specifies some columns as Boolean, and the current data upload tool does not seem to support Booleans very well.
Some columns that we do not really care about contain incorrect data. We just want to mark these columns as strings by specifying the schema, but messytables somehow infers them to be INT or DOUBLE.
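
For illustration, the pre-processing step described above (filling empty fields with -1) could look roughly like the following sketch. It assumes plain comma-separated input and uses only the standard library; the function name `fill_missing` is hypothetical and is not the script actually used for the Google cluster data.

```python
import csv
import io


def fill_missing(text, placeholder="-1"):
    """Return CSV text with every empty field replaced by `placeholder`."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        # An empty string is what csv.reader yields for "nothing between two commas".
        writer.writerow([cell if cell != "" else placeholder for cell in row])
    return out.getvalue()
```

Note that this fills Boolean and string columns with -1 as well, so a real script would probably restrict the replacement to the numeric columns.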

dhalperi · 2014-12-03T19:58:03Z

Why not do the cleaning inside of Myria?

CASE WHEN s == '' then -1 else int(s) end
boolean(x), or something like x=='True' or x==1 or x != 0.
Is the concern that the strings occur after the preview, so messytables crashes? Pull request welcome.
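
For readers who want to try the same cleaning logic outside Myria, here is a rough Python equivalent of the two expressions above. The function names `clean_int` and `clean_bool`, the accepted Boolean spellings, and the choice to map an empty string to `None` are all assumptions made for this sketch; they do not describe Myria's own functions.

```python
def clean_int(s, missing=-1):
    """Python analogue of: CASE WHEN s == '' then -1 else int(s) end."""
    return missing if s == "" else int(s)


def clean_bool(s):
    """Interpret a raw string cell as a Boolean.

    Accepts 'true'/'false' in any letter case and '1'/'0'; an empty
    cell is treated as missing (None). Anything else raises ValueError.
    """
    text = s.strip().lower()
    if text == "":
        return None
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"not a recognizable Boolean: {s!r}")
```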


stechu · 2014-12-03T20:14:45Z

Does cleaning inside Myria mean ingesting the data with all columns as strings? If so, that again requires manually specifying the schema.

If we simply use the upload tool, I am not sure what messytables will do, since many null values appear after the preview. I will update this issue once I have more data or bugs.

dhalperi · 2014-12-03T20:26:53Z

When the tool encounters an empty cell in the dataset, the column's schema is automatically changed to string type and the empty cell becomes an empty string.
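
As a rough model of the behavior described above, the sketch below infers a column type from raw string cells and falls back to string as soon as it sees an empty cell. It is an illustration only: the function `infer_column_type` and the type names are hypothetical, and this is not the upload tool's or messytables' actual inference code.

```python
def infer_column_type(cells):
    """Return 'INT', 'DOUBLE', or 'STRING' for a column of raw string cells."""
    inferred = "INT"
    for cell in cells:
        if cell == "":
            # Per the description above, an empty cell forces the column to string.
            return "STRING"
        try:
            int(cell)
            continue  # still consistent with the current numeric guess
        except ValueError:
            pass
        try:
            float(cell)
            inferred = "DOUBLE"  # widen INT to DOUBLE
        except ValueError:
            return "STRING"  # non-numeric text also forces string
    return inferred
```

Under this model, a column that was filled with -1 during pre-processing stays numeric, while a column with genuinely empty cells is ingested as strings and can then be cleaned inside Myria.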

