Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using data selector as wrapper over raw data #815

Closed
wants to merge 7 commits into from

Conversation

vikasgupta78
Copy link
Collaborator

using data selector as wrapper over raw data

@@ -0,0 +1,75 @@
package zingg.common.core.data.df;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to ZData


public static final Log LOG = LogFactory.getLog(ZFrameDataSelector.class);

public ZFrameDataSelector(ZFrame<D, R, C> rawData, IArguments args, Context<S,D,R,C,T> context,StopWordsRemover<S, D, R, C, T> stopWordsRemover) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this class shuld not be aware of the blocking tree loading and context etc. it should get the blocking tree and have a method to apply it.


public ZFrame<D,R,C> getBlocked() throws Exception, ZinggClientException{
if (blockedData==null) {
preprocessedRepartitionedData = stopWordsRemover.preprocessForStopWords(getFieldDefColumnsDS());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be a separate method

//testData = dropDuplicates(testData);
long count = preprocessedRepartitionedData.count();
LOG.info("Read " + count);
Analytics.track(Metric.DATA_COUNT, count, args.getCollectMetrics());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

count should be a method on this and Analytics should not be a part of this object - it should be driven from the executor

Copy link
Member

@sonalgoyal sonalgoyal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added few comments. Broadly these classes should not know about args, context etc. BlockedFrame can be passed the blockingtreeutil. We need to be able to test these classes and setup for that has to be smooth/mockable


import zingg.common.client.ZFrame;

public interface IZFrameProcessor<S, D, R, C, T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it need S and T?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BlockingTreeUtil<S, D,R,C,T>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StopWordsRemover<S,D,R,C,T>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we should have one frame<D,R,C> and then another that extends this and adds S and T.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what benefit it would give? ZData in any case has <S,D,R,C,T>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessarily two very similar interfaces to be maintained whose objects can't be assigned to one another

@@ -0,0 +1,11 @@
package zingg.common.core.data.df;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be better to introduce a method process() so that the logic is not in the constructor.

super();
this.originalDF = originalDF;
this.args = args;
this.processedDF = getOriginalDF().select(new ZidAndFieldDefSelector(args.getFieldDefinition()).getCols());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to have an overloaded constructor here which can take the fielddefselectors. we have a few so may need to pass those in the future


public static final Log LOG = LogFactory.getLog(RepartitionFrame.class);

public RepartitionFrame(ZFrame<D, R, C> originalDF, IArguments args) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should pass the numberofPartitions and column on which we need to partition

import zingg.common.client.ZinggClientException;
import zingg.common.core.preprocess.StopWordsRemover;

public class PreprocessedFrame<S, D, R, C, T> implements IZFrameProcessor<S, D, R, C, T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change stopword design - have a processor there. so preprocessedframe can be passed any processor(stopwords, tomorrow lowercase etc)

@vikasgupta78
Copy link
Collaborator Author

I have added few comments. Broadly these classes should not know about args, context etc. BlockedFrame can be passed the blockingtreeutil. We need to be able to test these classes and setup for that has to be smooth/mockable

context.getBlockingTreeUtil().readBlockingTree(args) => other places args can be avoided but blockedframe needs it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants