Implemented a where that has an individual clause for each row. #1053

Open
wants to merge 1 commit into
base: master
from

Conversation

crakjie
Contributor

@crakjie crakjie commented Nov 28, 2016

Added a per-row clause to the join's where.
This allows this kind of syntax:

rdd.joinWithCassandraTable(ks, tableName).where("timestampMilis = ?", (k : KVRow) => Seq(k.timestampSecond * 1000))

Of course the syntax in the where can be enhanced, which is why it's open to review.
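
To make the intent concrete, here is a minimal plain-Scala sketch of what a per-row where clause amounts to: one CQL fragment shared by all rows, plus a function that computes the bind values from each left-hand element. It does not use Spark or the connector. The `key` field of `KVRow`, `PerRowWhere`, and `bindAll` are hypothetical names for illustration only, not the PR's actual types.

```scala
object PerRowWhereSketch {
  // Hypothetical left-hand row; only timestampSecond appears in the PR example.
  final case class KVRow(key: String, timestampSecond: Long)

  // A CQL fragment with `?` placeholders, plus a function that yields
  // the values to bind for one particular row (hypothetical type).
  final case class PerRowWhere[T](cql: String, values: T => Seq[Any])

  // Pair the shared clause with the values that would be bound for each row.
  def bindAll[T](rows: Seq[T], where: PerRowWhere[T]): Seq[(String, Seq[Any])] =
    rows.map(row => (where.cql, where.values(row)))

  val where: PerRowWhere[KVRow] =
    PerRowWhere[KVRow]("timestampMilis = ?", k => Seq(k.timestampSecond * 1000))

  // Each row gets the same clause but its own bind value.
  val bound: Seq[(String, Seq[Any])] =
    bindAll(Seq(KVRow("a", 1L), KVRow("b", 2L)), where)
}
```

The point of the sketch is only that the clause text stays constant while the bound values vary per element; how the connector would actually execute it is not shown.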

@datastax-bot

Hi @crakjie, thanks for your contribution!

In order for us to evaluate and accept your PR, we ask that you sign the Spark Cassandra Connector CLA. It's all electronic and will take just minutes.

@datastax-bot

Thank you @crakjie for signing the Spark Cassandra Connector CLA.

@RussellSpitzer
Contributor

Can you write a little bit about the use case for this API? I took a brief look today (sorry, so busy) and I think it's a very cool idea, but I'm having a hard time thinking about how someone would actually use it.

@LukaszZu

Hmm, I think it makes it possible to do a join like the standard SQL query below:
select * from tb1 join tb2 on tb1.id = tb2.id where tb1.eventTime between tb2.from and tb2.to
Say I have the data below:

tb1 (RDD)
id|eventTime|others
1|12:30|foo

tb2 (table in Cassandra)
id|from|to|asset name
1|11:10|11:40|Bar
1|11:41|15:30|What Im looking

As far as I know, when I do joinWithCassandraTable today I receive
both of these rows, but after this patch I will receive only one.
For me it would be a very nice feature, but I don't know if I understood this code correctly.
Please correct me if I'm wrong.
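
The difference described above can be sketched with plain Scala collections standing in for the RDD and the Cassandra table. This is only an in-memory illustration of the expected result, not the connector's behavior: the case classes `Event` and `Asset` and the helper `t` are hypothetical, and the range check is applied here as a filter after the key match rather than as part of a query.

```scala
import java.time.LocalTime

object RangeJoinSketch {
  // Hypothetical row types mirroring tb1 (RDD) and tb2 (Cassandra table).
  final case class Event(id: Int, eventTime: LocalTime, others: String)
  final case class Asset(id: Int, from: LocalTime, to: LocalTime, assetName: String)

  // Parse an "HH:mm" string into a LocalTime.
  def t(s: String): LocalTime = LocalTime.parse(s)

  val tb1: Seq[Event] = Seq(Event(1, t("12:30"), "foo"))
  val tb2: Seq[Asset] = Seq(
    Asset(1, t("11:10"), t("11:40"), "Bar"),
    Asset(1, t("11:41"), t("15:30"), "What Im looking")
  )

  // Join on id only: every tb2 row with a matching id is returned.
  val keyOnly: Seq[(Event, Asset)] =
    for (e <- tb1; a <- tb2 if a.id == e.id) yield (e, a)

  // Join on id plus a predicate that depends on the left row:
  // keep only pairs where from <= eventTime <= to.
  val withRange: Seq[(Event, Asset)] =
    keyOnly.filter { case (e, a) =>
      !e.eventTime.isBefore(a.from) && !e.eventTime.isAfter(a.to)
    }
}
```

With the sample data, `keyOnly` contains both tb2 rows, while `withRange` keeps only the row whose interval covers 12:30.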

@crakjie
Contributor Author

crakjie commented Dec 22, 2016

I had this idea because I have to do a join on a timestamp, but not on timestamp equality.

The database contained timestamps older than those in the left RDD, and each element of the left RDD carried the information about how old the rows it has to be joined with should be. So to do that, I needed to use information contained in each left element.

So the general idea was to be able to modify the where clause depending on each input element.

I still don't know if the type of the "fwhere" function is good or if it can be simplified. At the moment the function has to return an internal SCC object...

@RussellSpitzer
Contributor

I'm wondering if we might be better off with just another API, like a generic "RunPreparedStatements".

Which would be something like

RDD[BoundParameters].runPreparedStatements[ReturnType]("CQL HERE with ? PARAMETERS ?")

The joins would then become a child class of that?
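
As a rough illustration of the shape being proposed, here is a plain-Scala sketch in which each input element supplies its own bind parameters and one statement is executed per element. Nothing here is an existing connector API: `runPreparedStatements`, `BoundParameters`, `Executor`, the in-memory `table`, and the `ks.tbl` name in the CQL string are all assumptions, and the fake executor ignores the CQL text entirely.

```scala
object RunPreparedStatementsSketch {
  // Hypothetical: the values to bind to the `?` placeholders for one input element.
  type BoundParameters = Seq[Any]

  // Stand-in for running a prepared statement; a real version would query Cassandra.
  type Executor[R] = (String, BoundParameters) => Seq[R]

  // Run the same statement once per input element, keeping each element's results.
  def runPreparedStatements[R](params: Seq[BoundParameters], cql: String)(
      execute: Executor[R]): Seq[(BoundParameters, Seq[R])] =
    params.map(p => (p, execute(cql, p)))

  // Fake in-memory "table" of (id, value) rows.
  val table: Seq[(Int, String)] = Seq((1, "a"), (2, "b"), (1, "c"))

  // Fake executor: ignores the CQL and returns values whose id equals the first parameter.
  val fakeExecute: Executor[String] =
    (_, p) => table.collect { case (id, v) if id == p.head => v }

  val results: Seq[(BoundParameters, Seq[String])] =
    runPreparedStatements(Seq(Seq(1), Seq(3)), "SELECT value FROM ks.tbl WHERE id = ?")(fakeExecute)
}
```

In this model a join is just the special case where the statement selects by the table's key, which is one way to read the "child class" remark.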

@etspaceman
Contributor

@RussellSpitzer +1 to that idea. This would be a really great addition, giving users very strong flexibility in RDD processing.

@crakjie
Contributor Author

crakjie commented Jan 17, 2017

Back from holidays.
Why not, @RussellSpitzer, but how would the validity of the request be checked?

@RussellSpitzer
Contributor

I think we should pause on this and instead focus on making a completely flexible function, like I described above. That way we don't increase the complexity of the code as it is, and we are able to introduce a greater amount of flexibility.

5 participants