More appropriate usage of NThreshold #21

Open
wants to merge 1 commit into
base: master

@Jorfun Jorfun commented Sep 23, 2016

NThreshold (the normalization threshold) can be used in SAX to handle one special case, the normalization of a subsequence that is almost constant, as mentioned in "Experiencing SAX: a novel symbolic representation of time series". It assigns the entire word to the middle-range symbol of the alphabet when the standard deviation of the current subsequence is lower than NThreshold.
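To make the rule above concrete, here is a minimal sketch of a thresholded z-normalization followed by PAA and discretization for an alphabet of size 3. The class and method names (`SaxThresholdSketch`, `znorm`, `paa`, `sax`) are hypothetical and are not the library's actual code; the sketch also assumes the sample standard deviation, a subsequence of at least two points, and a length that is a multiple of the PAA size.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical, not the library's API. */
final class SaxThresholdSketch {

  // Breakpoints that cut N(0,1) into three equiprobable regions (alphabet size 3).
  private static final double[] CUTS_3 = {-0.43, 0.43};

  /** Z-normalizes a series; returns all zeros if its standard deviation is below nThreshold. */
  static double[] znorm(double[] series, double nThreshold) {
    double mean = Arrays.stream(series).average().orElse(0.0);
    double sumSq = 0.0;
    for (double v : series) {
      sumSq += (v - mean) * (v - mean);
    }
    double sd = Math.sqrt(sumSq / (series.length - 1)); // sample standard deviation (assumption)
    double[] out = new double[series.length];           // Java initializes this to zeros
    if (sd < nThreshold) {
      return out; // almost constant subsequence: treated as a flat line
    }
    for (int i = 0; i < series.length; i++) {
      out[i] = (series[i] - mean) / sd;
    }
    return out;
  }

  /** Piecewise aggregate approximation; assumes series.length is a multiple of segments. */
  static double[] paa(double[] series, int segments) {
    int width = series.length / segments;
    double[] out = new double[segments];
    for (int s = 0; s < segments; s++) {
      double sum = 0.0;
      for (int i = s * width; i < (s + 1) * width; i++) {
        sum += series[i];
      }
      out[s] = sum / width;
    }
    return out;
  }

  /** Builds the SAX word: each PAA value is mapped to 'a', 'b' or 'c' by the breakpoints. */
  static String sax(double[] subsequence, int paaSize, double nThreshold) {
    StringBuilder word = new StringBuilder();
    for (double v : paa(znorm(subsequence, nThreshold), paaSize)) {
      int idx = 0;
      while (idx < CUTS_3.length && v > CUTS_3[idx]) {
        idx++;
      }
      word.append((char) ('a' + idx));
    }
    return word.toString();
  }

  public static void main(String[] args) {
    double[] flat = {1.000, 1.001, 0.999, 1.000, 1.002, 0.998, 1.000, 1.001, 0.999, 1.000};
    double[] ramp = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    System.out.println(sax(flat, 5, 0.01)); // prints "bbbbb": every symbol is the middle one
    System.out.println(sax(ramp, 5, 0.01));
  }
}
```

With a threshold of 0.01 the nearly constant input collapses to the all-middle word, which is the behaviour the paper describes for SAX.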

However, I don't think it is appropriate to apply it in the z-normalization step when we compute actual distances between pairs of subsequences. In your implementation, the "znorm" method of class "TSProcessor" returns an array of zeros when the input subsequence has a small standard deviation (smaller than the NThreshold value, e.g. 0.01). As a result, a subsequence with little fluctuation is treated as a horizontal straight line and its original shape information is lost, so the distances computed between subsequences are incorrect.

And what if we need to set a higher NThreshold value in some scenarios? Many subsequences would lose their shape information, and we would get incorrect distance values and poor discord results.

In conclusion, I think using the normalization threshold in the z-normalization step before creating the SAX representation is a good way to handle extreme cases. If we set unsuitable SAX-related parameters, that only affects the efficiency of HOT-SAX (SAX only influences the heuristic ordering of the outer and inner loops), not the final discord results. But using this value in the z-normalization step before calculating actual distances is bad, because it distorts the original information in the subsequences and leads to poor results. In that case, changing the value of NThreshold can change the discord results.

Therefore I think it's better to have two versions of the "znorm" method for these two situations: (1) z-normalization for SAX, and (2) z-normalization for actual distance calculation.

I hope my explanation makes sense to you : )
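A minimal sketch of what such a split could look like, and of why it matters for distances. The names `TwoZNormSketch`, `znormForSax` and `znormForDistance` are hypothetical and do not reflect the actual change in this pull request; returning zeros for an exactly constant input in the distance variant is also an assumption, made only to avoid dividing by zero.

```java
/** Illustrative sketch only; names are hypothetical, not the library's API. */
final class TwoZNormSketch {

  /** Shared helper: returns zeros when sd is below minSd or exactly zero. */
  private static double[] znorm(double[] s, double minSd) {
    double mean = 0.0;
    for (double v : s) {
      mean += v;
    }
    mean /= s.length;
    double sumSq = 0.0;
    for (double v : s) {
      sumSq += (v - mean) * (v - mean);
    }
    double sd = Math.sqrt(sumSq / (s.length - 1)); // sample standard deviation (assumption)
    double[] out = new double[s.length];           // zeros by default
    if (sd < minSd || sd == 0.0) {
      return out;
    }
    for (int i = 0; i < s.length; i++) {
      out[i] = (s[i] - mean) / sd;
    }
    return out;
  }

  /** Variant 1: for building SAX words; flattens almost constant subsequences. */
  static double[] znormForSax(double[] s, double nThreshold) {
    return znorm(s, nThreshold);
  }

  /** Variant 2: for distance computation; keeps the shape of low-variance subsequences. */
  static double[] znormForDistance(double[] s) {
    return znorm(s, 0.0); // only an exactly constant input becomes zeros
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    double[] rising = {0.000, 0.001, 0.002, 0.003};
    double[] falling = {0.003, 0.002, 0.001, 0.000};
    // Thresholded normalization flattens both, so opposite shapes look identical.
    System.out.println(euclidean(znormForSax(rising, 0.01), znormForSax(falling, 0.01)));
    // Plain normalization keeps the shapes, so the distance is clearly non-zero.
    System.out.println(euclidean(znormForDistance(rising), znormForDistance(falling)));
  }
}
```

Both inputs have a standard deviation below 0.01, so the thresholded variant reports a distance of zero between a rising and a falling subsequence, while the plain variant still tells them apart.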

@codecov-io

codecov-io commented Sep 23, 2016

Current coverage is 88.97% (diff: 100%)

Merging #21 into master will increase coverage by 0.98%

@@             master        #21   diff @@
==========================================
  Files            25         25          
  Lines          1690       1696     +6   
  Methods           0          0          
  Messages          0          0          
  Branches        311        312     +1   
==========================================
+ Hits           1487       1509    +22   
+ Misses          133        117    -16   
  Partials         70         70          

Powered by Codecov. Last update 97969d0...b1f0c39

@seninp
Member

seninp commented Sep 26, 2016

Thank you for your thoughts... I'm not quite sure about this change, however... I guess concrete examples of discord finding in various time series would help. It would be really helpful if you have time to run a few experiments... I'll also have time to do it later on, and will then decide on the change... Thanks!

@Jorfun
Author

Jorfun commented Sep 27, 2016

I used the datasets listed at http://www.cs.ucr.edu/~eamonn/discords/.

For the ECG dataset xmitdb_x108_0, there are two sequences in the dataset. I tried NThreshold values from 0.01 to 0.2 in steps of 0.01 (0.01, 0.02, 0.03, ..., 0.2) on both of them. The other parameters were fixed: discordsNumToReport = 1, windowSize = 200, paaSize = 5, alphabetSize = 3, strategy = NumerosityReductionStrategy.NONE.

Here are the results obtained from your implementation:
First sequence:
NThreshold: 0.01 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.02 Brute force discord: 4115 SAX word: acabc
NThreshold: 0.03 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.04 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.05 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.06 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.07 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.08 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.09 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.1 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.11 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.12 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.13 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.14 Brute force discord: 3996 SAX word: cbcab
NThreshold: 0.15 Brute force discord: 3998 SAX word: cbbab
NThreshold: 0.16 Brute force discord: 3928 SAX word: acbab
NThreshold: 0.17 Brute force discord: 3926 SAX word: acbbb
NThreshold: 0.18 Brute force discord: 3925 SAX word: acbbb
NThreshold: 0.19 Brute force discord: 5180 SAX word: bccba
NThreshold: 0.2 Brute force discord: 3921 SAX word: acbbb

Second sequence:
NThreshold: 0.01 Brute force discord: 4241 SAX word: bbabb
NThreshold: 0.02 Brute force discord: 4228 SAX word: bbbbb
NThreshold: 0.03 Brute force discord: 3932 SAX word: bcaab
NThreshold: 0.04 Brute force discord: 3934 SAX word: bcaab
NThreshold: 0.05 Brute force discord: 3940 SAX word: ccaab
NThreshold: 0.06 Brute force discord: 3940 SAX word: ccaab
NThreshold: 0.07 Brute force discord: 3940 SAX word: ccaab
NThreshold: 0.08 Brute force discord: 3935 SAX word: bcaab
NThreshold: 0.09 Brute force discord: 3935 SAX word: bcaab
NThreshold: 0.1 Brute force discord: 3935 SAX word: bcaab
NThreshold: 0.11 Brute force discord: 3935 SAX word: bcaab
NThreshold: 0.12 Brute force discord: 3981 SAX word: cbbca
NThreshold: 0.13 Brute force discord: 3940 SAX word: ccaab
NThreshold: 0.14 Brute force discord: 3940 SAX word: ccaab
NThreshold: 0.15 Brute force discord: 3940 SAX word: ccaab
NThreshold: 0.16 Brute force discord: 3982 SAX word: cbbca
NThreshold: 0.17 Brute force discord: 3980 SAX word: cbbca
NThreshold: 0.18 Brute force discord: 3980 SAX word: cbbca
NThreshold: 0.19 Brute force discord: 3980 SAX word: cbbca
NThreshold: 0.2 Brute force discord: 3987 SAX word: cbbca

A supplement to my opinion: in "Finding the most unusual time series subsequence: algorithms and applications", the authors state that the cardinality of the SAX alphabet a and the SAX word size w only affect the efficiency of the algorithm, not the final result, which depends only on the user-supplied length of the discord (first paragraph of Section 4.2). Although NThreshold is not mentioned in that paper, in essence it is also a SAX-related parameter like the previous two. If NThreshold influences the final result, it introduces additional complexity for users when they set the parameters of HOT-SAX.

Here are the results obtained from the modified implementation:
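The point that the discord depends only on the discord length can be illustrated with a brute-force search that takes just the series and the window length as input. This is a minimal sketch under stated assumptions, not the library's brute-force or HOT-SAX code: the names `BruteForceDiscordSketch` and `findDiscord` are hypothetical, z-normalization uses the sample standard deviation with no threshold, and matches that overlap the candidate are skipped as trivial.

```java
/** Illustrative sketch only; names are hypothetical, not the library's API. */
final class BruteForceDiscordSketch {

  /** Plain z-normalization of series[start, start + len); zeros only if exactly constant. */
  static double[] znorm(double[] series, int start, int len) {
    double mean = 0.0;
    for (int k = 0; k < len; k++) {
      mean += series[start + k];
    }
    mean /= len;
    double sumSq = 0.0;
    for (int k = 0; k < len; k++) {
      double d = series[start + k] - mean;
      sumSq += d * d;
    }
    double sd = Math.sqrt(sumSq / (len - 1)); // sample standard deviation (assumption)
    double[] out = new double[len];
    if (sd == 0.0) {
      return out;
    }
    for (int k = 0; k < len; k++) {
      out[k] = (series[start + k] - mean) / sd;
    }
    return out;
  }

  static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return Math.sqrt(sum);
  }

  /**
   * Returns the start index of the subsequence whose nearest non-overlapping
   * neighbour is farthest away, or -1 if no such pair exists.
   */
  static int findDiscord(double[] series, int window) {
    int n = series.length - window + 1;
    if (n <= 0) {
      return -1;
    }
    double[][] subs = new double[n][];
    for (int i = 0; i < n; i++) {
      subs[i] = znorm(series, i, window);
    }
    int bestPos = -1;
    double bestDist = -1.0;
    for (int i = 0; i < n; i++) {
      double nearest = Double.POSITIVE_INFINITY;
      for (int j = 0; j < n; j++) {
        if (Math.abs(i - j) < window) {
          continue; // skip trivial matches that overlap the candidate
        }
        nearest = Math.min(nearest, euclidean(subs[i], subs[j]));
      }
      if (nearest != Double.POSITIVE_INFINITY && nearest > bestDist) {
        bestDist = nearest;
        bestPos = i;
      }
    }
    return bestPos;
  }

  public static void main(String[] args) {
    double[] series = new double[200];
    for (int i = 0; i < series.length; i++) {
      series[i] = Math.sin(2 * Math.PI * i / 20.0);
    }
    for (int i = 100; i < 110; i++) {
      series[i] = 0.0; // flatten half a cycle to create an anomaly
    }
    System.out.println(findDiscord(series, 20));
  }
}
```

Nothing in this search refers to the alphabet size, the word size or a normalization threshold; in HOT-SAX those settings would only change the order in which the two loops visit candidates.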
First sequence:
NThreshold: 0.01 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.02 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.03 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.04 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.05 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.06 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.07 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.08 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.09 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.1 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.11 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.12 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.13 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.14 Brute force discord: 4018 SAX word: cbcab
NThreshold: 0.15 Brute force discord: 4018 SAX word: bbbbb
NThreshold: 0.16 Brute force discord: 4018 SAX word: bbbbb
NThreshold: 0.17 Brute force discord: 4018 SAX word: bbbbb
NThreshold: 0.18 Brute force discord: 4018 SAX word: bbbbb
NThreshold: 0.19 Brute force discord: 4018 SAX word: bbbbb
NThreshold: 0.2 Brute force discord: 4018 SAX word: bbbbb

Second sequence:
NThreshold: 0.01 Brute force discord: 4241 SAX word: bbabb
NThreshold: 0.02 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.03 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.04 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.05 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.06 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.07 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.08 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.09 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.1 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.11 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.12 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.13 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.14 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.15 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.16 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.17 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.18 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.19 Brute force discord: 4241 SAX word: bbbbb
NThreshold: 0.2 Brute force discord: 4241 SAX word: bbbbb

For ann_gun_CentroidA, there are also two sequences in the dataset. I ran your implementation with parameters similar to the previous experiment (discordsNumToReport = 1, windowSize = 200, paaSize = 5, alphabetSize = 3, strategy = NumerosityReductionStrategy.NONE, NThreshold = 0.01 ~ 0.2). This time all the discord locations I got were identical. The reason is the uniqueness of the sequences (the single discord is very prominent within the whole sequence), so the value of NThreshold doesn't influence the final discord result.

I'm not sure whether these are the experiments you wanted. Let me know if there are still aspects that need to be confirmed; I'm willing to help.

@seninp
Member

seninp commented Oct 1, 2016

Thank you for running the experiment, it makes perfect sense. I now recall that the first implementation wasn't based on z-normalized subsequences, but then we changed that for some reason... I'll talk with Jessica this Wednesday to see what she thinks about that and will get back.
