Issue when doing any join except full outer #4

Open
carlaustin opened this issue Sep 26, 2013 · 7 comments
Comments

@carlaustin

Hadoop Version: 1.3
When using the Accumulo storage manager to do simple joins (example below), an IllegalArgumentException is thrown with the message "null columnMapping not allowed".
I have worked around this by modifying initAccumuloSerdeParameters to set COLUMN_MAPPINGS and LIST_COLUMN_TYPES on conf to the values found in properties.
It also appears that type strings can be either colon- or comma-separated, so I created a new method in AccumuloHiveUtils that splits the column types string using the pattern ":|,".
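
The splitting step described above could look roughly like the following. This is a minimal sketch, not the actual AccumuloHiveUtils code: the class name `ColumnTypeSplitter` and the method name `splitColumnTypes` are hypothetical, and the sketch assumes only primitive type names appear in the string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the real AccumuloHiveUtils method. */
final class ColumnTypeSplitter {
    // Matches either separator observed in the column types string.
    private static final Pattern SEPARATOR = Pattern.compile(":|,");

    private ColumnTypeSplitter() {}

    /** Splits a column types string on ':' or ',' and drops empty tokens. */
    static List<String> splitColumnTypes(String columnTypes) {
        List<String> types = new ArrayList<>();
        if (columnTypes == null || columnTypes.trim().isEmpty()) {
            return types;
        }
        for (String token : SEPARATOR.split(columnTypes)) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                types.add(trimmed);
            }
        }
        return types;
    }
}
```

One caveat with this approach: Hive complex type names such as map<string,int> themselves contain commas (and struct types contain colons), so a plain split like this is only safe for tables whose columns are all primitive types.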

This now enables all types of joins. I'm happy to send the changes your way, though I wonder whether this is a workaround rather than a fix for the root issue. My knowledge of Hive storage managers is very limited, so I can't really tell whether I should be doing it differently.

Example join:
SELECT * FROM tablea a JOIN tableb b ON a.id = b.id

@bfemiano
Owner

Both of those properties are already exposed and configurable via AccumuloSerde.java and serdeConstants.java, respectively. The example join you demonstrated should be possible when joining on simple scalar value types (int, double, etc.).

@carlaustin
Author

I know that they are exposed, but the null column mapping error still occurs when doing the join. After adding a lot of logging, I found that .get(COLUMN_MAPPINGS) returns null, but only when doing a join, even though it should have been set. I debugged into AccumuloSerde.initAccumuloSerdeParameters and it was clear that COLUMN_MAPPINGS was correctly present on the properties object but was not being set on the job conf, hence the null.
This means that the join doesn't work. I tried it many times and in many ways with very simple data (two tables with just a couple of columns). FULL OUTER JOIN and any non-join query worked, but any other join threw the error. Note that the joins I did were on strings.
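
The symptom above (value present on the table properties, missing from the job conf) and the workaround from the original report can be sketched as follows. This is an illustration under stated assumptions, not the actual diff: a plain `Map` stands in for the Hadoop job configuration so the example has no dependencies, and the class name `ConfPropagation`, the method name `copyIfMissing`, and any key strings passed to it are hypothetical.

```java
import java.util.Map;
import java.util.Properties;

/** Hypothetical sketch; a Map stands in for the Hadoop job configuration. */
final class ConfPropagation {
    private ConfPropagation() {}

    /**
     * Copies each named key from the table properties into the job
     * configuration, but only when the configuration has no value for it
     * and the properties do. Returns the number of keys copied.
     */
    static int copyIfMissing(Properties tableProps,
                             Map<String, String> jobConf,
                             String... keys) {
        int copied = 0;
        for (String key : keys) {
            String fromProps = tableProps.getProperty(key);
            if (jobConf.get(key) == null && fromProps != null) {
                jobConf.put(key, fromProps);
                copied++;
            }
        }
        return copied;
    }
}
```

With a real Hadoop `Configuration`, the equivalent calls would be `conf.get(key)` and `conf.set(key, value)`. Copying only when the value is missing leaves any explicitly configured setting untouched, which matters if the root cause is later fixed elsewhere.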

I can send you the diffs I used to make it work for me if you would like.

For info, I was using the Hortonworks HDP 1.3 sandbox with Accumulo installed on top to replicate and debug this issue.

I said Hadoop v1.3 in the OP by mistake; I meant 1.2.

@bfemiano
Owner

I am about to revisit the codebase and will see if I can reproduce this. Many of my initial test cases in the ACLED scripts did inner joins similar to the one you describe as not working, although not necessarily on strings. I will try to reproduce it on CDH4.5.

Thanks and sorry this took so long.

@carlaustin
Author

No worries, I actually fixed it myself in my codebase.
I've also implemented basic INSERT INTO in my codebase, but this is tied to other code. I could have a look at making it more generic and providing it if you would be interested?

@bfemiano
Owner

Sure. That would be great. I'm going to implement a simple Mutation-based output format.

Will you be at the June 12th summit?

On Wed, May 14, 2014 at 11:06 AM, carlaustin notifications@github.com wrote:

No worries, I actually fixed it myself in my codebase.
I've also implemented basic INSERT INTO in my codebase, but this is tied
to other code. I could have a look at making it more generic and providing
it if you would be interested?


@carlaustin
Author

I won't be at the summit unfortunately.

I've already created an OutputFormat and RecordWriter that write mutations from rows of data serialized in the AccumuloSerde. I'll look into replacing the non-generic bits so I can share it with you.

@joshelser

For full closure, I've run a few joins so far with success in the code heading towards Hive. I'll try to add some more to exhaust the join types, but I think I have this fixed already.


3 participants