upload file: allow user to specify the schema #31

Open
stechu opened this issue Dec 3, 2014 · 5 comments

Comments

@stechu
Contributor

stechu commented Dec 3, 2014

Automatically inferring the schema is very cool. But sometimes a user may still want to specify the schema:

  1. To fail quickly if there is an incorrect value.
  2. To work around bad values by specifying every column as a string.

Just my personal opinion.
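
To illustrate both points, here is a minimal Python sketch of loading CSV text against a user-supplied schema. The function `load_with_schema`, the `CONVERTERS` table, and the type names are hypothetical and are not part of the upload tool; the sketch only shows the fail-fast behavior and the all-strings workaround.

```python
import csv
import io

# Hypothetical type names mapped to Python converters (an assumption for
# illustration, not the upload tool's actual API).
CONVERTERS = {"int": int, "float": float, "string": str}


def load_with_schema(text, schema):
    """Parse CSV text using a schema given as a list of (name, type) pairs.

    Raises ValueError on the first cell that does not match its declared type.
    """
    rows = []
    for line_no, record in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(record) != len(schema):
            raise ValueError(
                f"line {line_no}: expected {len(schema)} columns, got {len(record)}"
            )
        row = []
        for (name, type_name), cell in zip(schema, record):
            try:
                row.append(CONVERTERS[type_name](cell))
            except ValueError:
                # Fail quickly, naming the offending line and column.
                raise ValueError(
                    f"line {line_no}, column {name!r}: "
                    f"{cell!r} is not a valid {type_name}"
                ) from None
        rows.append(row)
    return rows
```

Declaring a column as `"int"` or `"float"` makes a bad value fail on the line where it occurs, while declaring every column as `"string"` accepts any cell unchanged.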

@dhalperi
Member

dhalperi commented Dec 3, 2014

Can you say more, with a concrete example?


Daniel Halperin
Director of Research for Scalable Data Analytics
eScience Institute
University of Washington

@stechu
Contributor Author

stechu commented Dec 3, 2014

One use case:

We are trying to use Myria to process the Google cluster usage data; each table has 10-20 columns.

There are a lot of missing values in the CSV files (Google simply places nothing between two commas). During pre-processing, we replaced each missing value with -1. We chose -1 because most fields hold positive numeric values when data is present, so after the replacement these values can be easily filtered out in later analytical queries. Now the problems are:

  1. Google specifies some columns as Boolean, and the current data upload tool does not seem to support booleans very well.
  2. Some columns that we do not really care about contain incorrect data. We just want to mark these columns as strings by specifying the schema, but messytables somehow infers them to be INT or DOUBLE.

@dhalperi
Member

dhalperi commented Dec 3, 2014

Why not do the cleaning inside of Myria?

  1. CASE WHEN s == '' THEN -1 ELSE int(s) END
  2. boolean(x), or something like x=='True' or x==1 or x != 0.
  3. Is the concern that the strings occur after the preview, so messytables crashes? Pull request welcome.
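
For reference, the logic of the first two suggestions can be sketched in plain Python, assuming the columns were ingested as strings. This is not Myria syntax; the function names `int_or_missing` and `to_bool` are hypothetical, and treating an empty cell as False is an assumption made only for this sketch.

```python
def int_or_missing(s, missing=-1):
    """Python analogue of: CASE WHEN s == '' THEN -1 ELSE int(s) END."""
    return missing if s == "" else int(s)


def to_bool(s):
    """Interpret a string cell as a boolean.

    'True'/'False' are matched case-insensitively; any other non-empty
    cell must be numeric and is True when nonzero. An empty cell maps to
    False (an assumption for illustration).
    """
    text = s.strip().lower()
    if text in ("true", "false"):
        return text == "true"
    if text == "":
        return False
    # Raises ValueError for cells that are neither boolean words nor numbers.
    return float(text) != 0
```

Both helpers raise `ValueError` on a cell they cannot interpret, so bad values surface at cleaning time instead of being silently coerced.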

@stechu
Contributor Author

stechu commented Dec 3, 2014

Does cleaning inside Myria mean ingesting the data with all columns as strings? Then that again requires manually specifying the schema.

If we simply use the upload tool, I am not sure what messytables will do, since many null values appear after the preview. I will update this issue once I have more data or bugs.

@dhalperi
Member

dhalperi commented Dec 3, 2014

When the tool encounters an empty cell in the dataset, the column's schema is automatically changed to string type, and the empty cell becomes an empty string.
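
The behavior described above can be sketched roughly as follows. This is only an illustration of the rule that an empty cell demotes a column to string, not the tool's or messytables' actual code; the function `infer_column_type` and the labels `"INT"`, `"DOUBLE"`, and `"STRING"` are hypothetical.

```python
def infer_column_type(cells):
    """Guess a column type from its string cells.

    Any empty or non-numeric cell makes the whole column a string column;
    otherwise the column is INT unless some cell needs a floating-point value.
    """
    inferred = "INT"
    for cell in cells:
        if cell == "":
            # An empty cell demotes the column to string.
            return "STRING"
        try:
            int(cell)
            continue  # Still consistent with the current guess.
        except ValueError:
            pass
        try:
            float(cell)
            inferred = "DOUBLE"
        except ValueError:
            return "STRING"
    return inferred
```

Under this rule, a column with missing values never fails on ingest; it arrives as strings and can be converted afterwards.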

