Use number of clusters as an RFI detector #49

Open
caseyjlaw opened this issue Dec 13, 2018 · 7 comments

Comments

@caseyjlaw
Contributor

We sometimes find that RFI triggers many detections that do not cluster well. They tend to be found at many different values of (l, m, DM, dt).
Could we use the number of clusters in a segment as a trigger for rejecting RFI? For example, we could set a parameter in the preferences (e.g., max_clusters) that is tested after clustering. If more than that many clusters are found, then reject all candidates in the segment. Alternatively, one could use that test to trigger the generation of a single candidate plot, rather than plots for all candidates in the segment.
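
A minimal sketch of the test described above, assuming the clustering step returns one integer label per candidate and marks unclustered points with -1 (the HDBSCAN convention). The function names `count_clusters` and `segment_is_rfi` are hypothetical and are not part of any existing code; only `max_clusters` comes from the proposal.

```python
from typing import Sequence

NOISE_LABEL = -1  # assumption: unclustered candidates carry the label -1


def count_clusters(labels: Sequence[int]) -> int:
    """Count distinct cluster labels, ignoring noise points."""
    return len({lab for lab in labels if lab != NOISE_LABEL})


def segment_is_rfi(labels: Sequence[int], max_clusters: int) -> bool:
    """Flag the segment if more than max_clusters clusters were found."""
    return count_clusters(labels) > max_clusters


labels = [0, 0, 1, -1, 2, 2, 3]
print(count_clusters(labels))                  # 4
print(segment_is_rfi(labels, max_clusters=3))  # True
```

Whether a flagged segment is dropped entirely or reduced to a single candidate plot is left to the caller here, since both options are still open.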

@caseyjlaw
Contributor Author

@KshitijAggarwal Do you have an opinion on what would be a good value for a parameter like this?

@KshitijAggarwal
Collaborator

Yes, such a parameter could be useful, provided we are confident that the clustering parameters are appropriate for that image size.

A value of a few hundred clusters should be high enough for RFI. I would still be a little cautious, as I have seen cases in which a particular combination of preferences would cause an injected transient to trigger hundreds of clusters, which could then be solved by increasing the clustering parameters.

I think it would be better to set such a max_clusters parameter to a high number, say 500, and then to recluster the candidates at a higher value of min_cluster_size (repeating the process until the number of candidates falls below that threshold). This way one can limit the number of plots generated from a single segment.
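
The reclustering loop could look roughly like the sketch below. It is only an illustration under stated assumptions: `recluster_until_below`, `cluster_fn`, `step`, and `max_min_cluster_size` are hypothetical names, and the upper bound on min_cluster_size is an added safeguard so the loop always terminates. `cluster_fn` stands in for whatever clustering call is actually used, and noise points are assumed to be labelled -1.

```python
from typing import Callable, List, Sequence, Tuple

NOISE_LABEL = -1  # assumption: unclustered candidates carry the label -1


def recluster_until_below(
    points: Sequence,
    cluster_fn: Callable[[Sequence, int], List[int]],
    max_clusters: int = 500,
    min_cluster_size: int = 2,
    step: int = 1,
    max_min_cluster_size: int = 20,
) -> Tuple[List[int], int, int]:
    """Recluster with growing min_cluster_size until few enough clusters remain.

    Returns (labels, min_cluster_size used, number of clusters found).
    Stops early at max_min_cluster_size even if the threshold is not met.
    """
    while True:
        labels = list(cluster_fn(points, min_cluster_size))
        nclusters = len({lab for lab in labels if lab != NOISE_LABEL})
        if nclusters <= max_clusters or min_cluster_size >= max_min_cluster_size:
            return labels, min_cluster_size, nclusters
        min_cluster_size += step


# With the hdbscan package, cluster_fn could be (not run here):
#   lambda pts, mcs: hdbscan.HDBSCAN(min_cluster_size=mcs).fit_predict(pts)
```

If the returned cluster count is still above max_clusters, the loop hit its upper bound and the segment remains a candidate for rejection. Each increase of min_cluster_size also discards clusters smaller than the new value, so the starting value and upper bound need to be chosen with the smallest real events in mind.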

@KshitijAggarwal
Collaborator

Typically one can observe a "knee" in the plot of number of clusters vs. min_cluster_size: beyond a certain value of min_cluster_size, the number of clusters drops sharply. My understanding, though, is that HDBSCAN is relatively robust compared to other clustering algorithms like kNN or FoF.
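
One simple way to locate such a knee is to take the largest single drop in cluster count between consecutive min_cluster_size values. The sketch below does exactly that and nothing more sophisticated; `find_knee` is a hypothetical helper, and the numbers in the example are made up purely for illustration, not measured from any data.

```python
from typing import Sequence


def find_knee(min_cluster_sizes: Sequence[int], n_clusters: Sequence[int]) -> int:
    """Return the min_cluster_size right after the sharpest drop in cluster count.

    Both sequences must have the same length, ordered by increasing
    min_cluster_size.
    """
    if len(min_cluster_sizes) != len(n_clusters) or len(n_clusters) < 2:
        raise ValueError("need two equal-length sequences with at least 2 entries")
    # Drop in cluster count between each pair of consecutive settings.
    drops = [n_clusters[i] - n_clusters[i + 1] for i in range(len(n_clusters) - 1)]
    steepest = max(range(len(drops)), key=drops.__getitem__)
    return min_cluster_sizes[steepest + 1]


# Illustrative (made-up) curve: the count collapses between sizes 3 and 4.
sizes = [2, 3, 4, 5, 6]
counts = [400, 380, 40, 35, 33]
print(find_knee(sizes, counts))  # 4
```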

@caseyjlaw
Contributor Author

Interesting.
So you suggest we cluster once and see how many clusters are found. If too many are found (e.g., a few hundred), then we cluster again with a larger min_cluster_size?
What do we learn if there are fewer clusters the second time? Does that help us understand if it is RFI or not?

@KshitijAggarwal
Collaborator

Fewer clusters would demonstrate that the clustering has been done properly, at least on the real events, if any, in that data. This is primarily to avoid rejecting real events that generated lots of clusters because of non-optimal clustering parameters.
Also, I have noticed that once HDBSCAN has identified all the obvious clusters, it won't over-cluster the candidates if min_cluster_size is increased a little, so it is inherently somewhat robust to that.

@caseyjlaw
Contributor Author

caseyjlaw commented Dec 19, 2018

If you can suggest the steps we can use to identify good/bad clustering, I could try it. So far, I am not too worried about this issue, since we are doing pretty well clustering RFI and good transients in the newest observations.

One important thing to remember is that we want to be sensitive to clusters with only a few candidates (e.g., 2 or 3). Be sure to include those kinds of simulated transients in your tests.

@KshitijAggarwal
Collaborator

I will do some tests, and will let you know soon.

2 participants