New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
using data selector as wrapper over raw data #815
Conversation
@@ -0,0 +1,75 @@ | |||
package zingg.common.core.data.df; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to ZData
|
||
public static final Log LOG = LogFactory.getLog(ZFrameDataSelector.class); | ||
|
||
public ZFrameDataSelector(ZFrame<D, R, C> rawData, IArguments args, Context<S,D,R,C,T> context,StopWordsRemover<S, D, R, C, T> stopWordsRemover) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this class shuld not be aware of the blocking tree loading and context etc. it should get the blocking tree and have a method to apply it.
|
||
public ZFrame<D,R,C> getBlocked() throws Exception, ZinggClientException{ | ||
if (blockedData==null) { | ||
preprocessedRepartitionedData = stopWordsRemover.preprocessForStopWords(getFieldDefColumnsDS()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be a separate method
//testData = dropDuplicates(testData); | ||
long count = preprocessedRepartitionedData.count(); | ||
LOG.info("Read " + count); | ||
Analytics.track(Metric.DATA_COUNT, count, args.getCollectMetrics()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
count should be a method on this and Analytics should not be a part of this object - it should be driven from the executor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added few comments. Broadly these classes should not know about args, context etc. BlockedFrame can be passed the blockingtreeutil. We need to be able to test these classes and setup for that has to be smooth/mockable
|
||
import zingg.common.client.ZFrame; | ||
|
||
public interface IZFrameProcessor<S, D, R, C, T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it need S and T?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BlockingTreeUtil<S, D,R,C,T>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StopWordsRemover<S,D,R,C,T>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we should have one frame<D,R,C> and then another that extends this and adds S and T.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what benefit it would give? ZData in any case has <S,D,R,C,T>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessarily two very similar interfaces to be maintained whose objects can't be assigned to one another
@@ -0,0 +1,11 @@ | |||
package zingg.common.core.data.df; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it may be better to introduce a method process() so that the logic is not in the constructor.
super(); | ||
this.originalDF = originalDF; | ||
this.args = args; | ||
this.processedDF = getOriginalDF().select(new ZidAndFieldDefSelector(args.getFieldDefinition()).getCols()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to have an overloaded constructor here which can take the fielddefselectors. we have a few so may need to pass those in the future
|
||
public static final Log LOG = LogFactory.getLog(RepartitionFrame.class); | ||
|
||
public RepartitionFrame(ZFrame<D, R, C> originalDF, IArguments args) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should pass the numberofPartitions and column on which we need to partition
import zingg.common.client.ZinggClientException; | ||
import zingg.common.core.preprocess.StopWordsRemover; | ||
|
||
public class PreprocessedFrame<S, D, R, C, T> implements IZFrameProcessor<S, D, R, C, T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change stopword design - have a processor there. so preprocessedframe can be passed any processor(stopwords, tomorrow lowercase etc)
context.getBlockingTreeUtil().readBlockingTree(args) => other places args can be avoided but blockedframe needs it |
using data selector as wrapper over raw data