New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fuse, split, MultiStreamScoreMean, and merge all #767
Conversation
src/unitxt/fusion.py
Outdated
""" | ||
|
||
origins: List[SourceOperator] | ||
origins: Dict[str, SourceOperator] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
origins: Dict[str, SourceOperator] | |
origins: Union[List[SourceOperator], Dict[str, SourceOperator]] |
I think we should allow also anonymus sources merging which are merged without given name per source
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/unitxt/fusion.py
Outdated
if "past_groups" in instance: | ||
instance["past_groups"] = ( | ||
instance["group"] + "/" + instance["past_groups"] | ||
) | ||
else: | ||
instance["past_groups"] = instance["group"] | ||
instance["group"] = origin_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if "past_groups" in instance: | |
instance["past_groups"] = ( | |
instance["group"] + "/" + instance["past_groups"] | |
) | |
else: | |
instance["past_groups"] = instance["group"] | |
instance["group"] = origin_name | |
group = origin_name + "/" + instance["group"] | |
else: | |
group = origin_name | |
instance["group"] = group |
Why not like this? more simple and dont require additional variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that code that reads the "group" (e.g, the splitter) will not need to check if any past is glued to this group, and if glued -- peel it off. Only if the consuming code is explicitly interested in the past, it will look into "past_groups", and do whatever it wants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed per your request, @elronbandel. We will know what to do if the past haunts us..
src/unitxt/fusion.py
Outdated
if "group" in instance: | ||
if "past_groups" in instance: | ||
instance["past_groups"] = ( | ||
instance["group"] + "/" + instance["past_groups"] | ||
) | ||
else: | ||
instance["past_groups"] = instance["group"] | ||
instance["group"] = origin_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here dont think past_groups are necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, same thing: trying to work here, in order to save work from the consure (e.g. Splitter)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments, in general:
- I think we should have only one field "group" containing all the grouping information.
- I think we should test also hirarchial group naming works within the unittests
- I think we should allow also Fusion of streams without names
- We should test the tests actually reutrn score per group with the existing mecahinsm (i linked the exact line in the metric pipeline in the related issue)
OK. I just thought to save work from the consuming code, but accept your judgement, of course. Exactly this line that you linked to in the issue, reads: As it is, it can not work when group carries the whole history, unless you would like indeed to split them all by the whole history, and not just by the most recent fuse. |
a1976e7
to
03c2897
Compare
809f987
to
a9dc1ff
Compare
e479174
to
210b4d5
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #767 +/- ##
==========================================
+ Coverage 91.48% 92.14% +0.65%
==========================================
Files 100 103 +3
Lines 10169 10614 +445
==========================================
+ Hits 9303 9780 +477
+ Misses 866 834 -32 ☔ View full report in Codecov by Sentry. |
b2203d4
to
7210ed8
Compare
src/unitxt/operators.py
Outdated
@@ -1560,6 +1560,48 @@ def process(self, multi_stream: MultiStream) -> MultiStream: | |||
return MultiStream(result) | |||
|
|||
|
|||
class SplitByGroup(MultiStreamOperator): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
class SplitByGroup(MultiStreamOperator): | |
class SplitByNestedGroup(MultiStreamOperator): | |
field: str = "group" |
7da0bfc
to
fceaeca
Compare
360bd25
to
60fc7b0
Compare
Hi @elronbandel , I hope this is more in the direction you had in mind. |
splits: List of splits to include. If None, all splits are included. | ||
origins: Dict of named SourceOperator objects (each to yield a MultiStream), or a list thereof | ||
splits: List of splits (stream_names) to include, over all input multistreams. If None, all splits are included. | ||
max_instances_per_origin_split: Number of instances to take from each input split of each input multistream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! the old name was disambgious!
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
… the operation done Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
…of just one stream - it returns it as is. No need to generate a nested structure for just one leaf. Signed-off-by: dafnapension <dafnashein@yahoo.com>
9ae5ee7
to
e829415
Compare
of each instance
per issue #764