New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
This is the second part of Datawave #716 issue. #2
base: main
Are you sure you want to change the base?
This is the second part of Datawave #716 issue. #2
Conversation
9675112
to
71b3ebd
Compare
@ivakegg, I got this to compile an pass checks by taking the -SNAPSHOT out of my version for the metadata-utils service in the pom.xml.. Unfortunately I do not see how this 1.5 version will get chosen when the datawave-parent project which uses the classes in this version gets compiled. That datawave-parent project gets this error during build due to the dependency issue: |
if (num == 0) { | ||
return new byte[] {(byte) '\u0000', (byte) '\u0000', (byte) '\u0000', (byte) '\u0000'}; | ||
} else if (num < 256) { | ||
return new byte[] {(byte) '\u0000', (byte) '\u0000', (byte) '\u0000', (byte) (num)}; | ||
} else if (num < 65536) { | ||
return new byte[] {(byte) '\u0000', (byte) '\u0000', (byte) (num >>> 8), (byte) num}; | ||
} else if (num < 16777216) { | ||
return new byte[] {(byte) '\u0000', (byte) (num >>> 16), (byte) (num >>> 8), (byte) num}; | ||
} else { // up to 2,147,483,647 | ||
return new byte[] {(byte) (num >>> 24), (byte) (num >>> 16), (byte) (num >>> 8), (byte) num}; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why all the if checks? Just use the statement in the last else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will give that a try @ivakegg. Thanks.
byte[] frequency = Base256Compression.numToBytes(entry.getValue()); | ||
int dateOrdinal = parser.ordinalDayOfYear.getDateOrdinal(); | ||
|
||
if (dateOrdinal == BADORDINAL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
== or >= ?
if (byteArray == null) | ||
return 0; | ||
if (byteArray.length > 4) | ||
return 2_147_483_647; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integer.MAX_VALUE ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivakegg , I am refactoring OrdinalDayOfYear.calculateOrdinal(). Don't review it for structure yet.
if (byteArray.length > 4) | ||
return 2_147_483_647; | ||
|
||
if (byteArray.length == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is overly complicated. How about:
value = ((int) b[3] & 0xff) << 24
| ((int) b[2] & 0xff) << 16
| ((int) b[1] & 0xff) << 8
| ((int) b[0] & 0xff);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried to get the code you put above to work but it doesn't. I will try to condense the method you are talking about some other way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it to work. The indexes were reversed. Thanks a lot.
@ivakegg , You can start reviewing this pull request. I updated my PR on datawave parent that has the FrequencyTransformIterator in it that uses the new code in the datawave-metadata utils. The DataFrequencyValueTest in the metadata-utils project stress tests the serialization/deserialization code with a 10 year dataset. There is a "test" in the DataFrequencyValueTest that creates a script you can 'execfile' in the accumulo shell on the datawave.metadata table. I have done that and grep'ed the logs for ERROR in my classes. None found. The MetadataHelper class actually uses the functionality of all this and I will have to write code to create a client that uses that class to further verify these PRs. |
pom.xml
Outdated
@@ -104,4 +104,16 @@ | |||
<url>https://raw.githubusercontent.com/NationalSecurityAgency/datawave/mvn-repo</url> | |||
</repository> | |||
</repositories> | |||
<build> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already specified in the inherited parent pom.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take this out unless you can tell me it is needed
* frequency values that are aggregated together to by the FrequencyTransformIterator and manipulated in the FrequencyFamilyCounter. The byte array is in | ||
* regular expression format (YEAR(4BYTE-FREQUENCY){366}))* . Explained verbally a four byte representation of Year followed by 366 (Leap year) 4 byte holders | ||
* for frequency. In this version all the arrays have a slot for the Feb 29 leap year frequency whether it is there or not. There aren't any delimiters between | ||
* years and frequencies which adds to the compression. Each Accumulo row for this "aggregated" frequency "map" would be 10 x ( 4 + (366 x * 4) ) bytes long for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each Accumulo row -> Each compressed value
private HashMap<String,Integer> uncompressedDateFrequencies = null; | ||
|
||
public static final int DAYS_IN_LEAP_YEAR = 366; | ||
private static int MAX_YEARS = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why max of 10 years? Perhaps we have a configurable parameter/option on the iterator that will chop data off the back end.
if (entry.getKey() == null || entry.getKey().isEmpty()) | ||
continue; | ||
|
||
log.trace("Serializing the key/value " + entry.getKey() + " value: " + entry.getValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (log.isTraceEnabled()) ....
Do this whenever we are doing string concatenation in the argument.
for (Map.Entry<YearKey,byte[]> entry : compressedDateFrequencies.entrySet()) { | ||
log.trace("Compressing key: " + entry.getKey().key + " value" + entry.getValue() + "Byte array " + entry.getValue()); | ||
try { | ||
theOutputstreamInUse.write(entry.getKey().compressedContent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a performance standpoint, it would be good to avoid doing intermediate byte allocations. Hence if we can write the compressed bytes directly to the output stream instead of allocation the intermediate YearKey.compressedContent arrays. This will require of course sorting the uncompressedDateFrequencies and then iterating through THE entries in the map until you hit the last sorted key in the set. You will need to write 0's in the gaps between entries of course. You can preallocate the buffer in the output stream to be the appropriate size by looking at the first and last entries in the sorted set.
log.trace("Long.parseLong could not parse " + value + " to long for this key " + cleanKey); | ||
log.trace("Trying to use Long.decode"); | ||
parsedLong = Integer.decode(value); | ||
total += parsedLong; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this total used for?
private HashMap<YearKey,byte[]> compressedDateFrequencies = new HashMap<>(); | ||
// The mapping of SimpleDateFormat strings to frequency counts. This is passed in by the | ||
// FrequencyFamilyCounter object to this classes serialize function | ||
private HashMap<String,Integer> uncompressedDateFrequencies = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think everything might work a little smoother and be more readable if this were a TreeMap<YearMonthDay, Frequency> instead of HashMap<String, Integer>. The Frequency object could be mutable allowing aggregation of frequencies which avoid excessive Integer object allocations. The YearMonthDay can store year and ordinal (julian date) values avoid excessive string manipulations.
class YearMonthDay implements Comparable<YearMonthDay> {
int year;
int julian;
YearMonthDay(String value) {
// parse value into year and ordinal
}
int getYear() {
return year;
}
int getJulian()
return julian;;
}
}
class Frequency {
int value
Frequency(value) {
this.value = value;
}
int getValue() {
return value;
}
void addFrequency(int value) {
this.value += value;
}
}
else { | ||
|
||
int lastValue = dateToFrequencyValueMap.get(cleanKey); | ||
dateToFrequencyValueMap.put(cleanKey, lastValue + parsedLong); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where Frequency.addFrequency() would be more efficient
import java.util.regex.Matcher; | ||
import java.util.regex.Pattern; | ||
|
||
public class RunLengthEncoding { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this class actually used anywhere?
} | ||
} | ||
|
||
putFrequencyBytesInByteArray(dateFrequencyEntry, dayFrequencies); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this method write the frequencies directly to the output stream ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a really great idea. My first instinct was to go ahead and implement it like you suggest. What is going in is that datFrequencies array has to be the same 366 * 4 length for all years. It has to be fully populated and initialized before it is written to the byte output stream. It gets written to the byte output stream on line 51 when there is a year change. If we do it the way you suggest I have to put in other logic that outputs 4 zero valued bytes for all the days that are not in the TreeMap. More code and more clutter. The way I am doing it only requires log to process only the days of the year in the TreeMap and the other days are automatically zeroed. The gzip in Accumulo compaction compresses the zero days as we discussed.
I think it is good as it is right now. If you want me to code the extra logic for days that are not in the map then let me know and I will go ahead. I think this is pretty fast the way it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well can we at least avoid reallocating a new array every time at line 60 and instead simply 0 our the array: Arrays.fill(dateFrequencies, (byte)0);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I finally see where you are going here. You are really safeguarding against a memory overflow and I have not had to be this careful in past work or I just got lucky. I will implement your suggestion. Thanks for working with me here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array allocation change made and pushed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think we could write directly to the baos. You have to zero out the array anyway so what is the difference? Just need to write zeros since the last day written when writing the next value.
} | ||
|
||
private String calculateMMDD(int ordinal) { | ||
if (ordinal > 366) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole method could be replaced with:
SimpleDateFormat formatter = new SimpleDateFormat("MMdd");
Calendar c = Calendar.getInstance();
c.set(Calendar.DAY_OF_YEAR, i);
return formatter.format(c.getTime());
If your ordinal is 0 based then you will want to set the day of year to i+1 instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requested changes have been made. Thank for the lesson on Calendar class.
} | ||
|
||
private String calculateMMDD(int ordinal) { | ||
SimpleDateFormat formatter = new SimpleDateFormat("MMdd"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth creating this as a static member and then synchronizing on it when calling the format method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the function a static member, synchronized on formatter object. No longer create the object, just use static method. Pushed code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You say that but I still see a formatter being created every time this method is called.....
private boolean isLeapYear = false; | ||
private GregorianCalendar gregorianCalendar; | ||
|
||
public static final int[] MONTH_LENGTH = new int[] {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these constants still needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take a look and remove possible construction debris.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the statics to the test class where they get used and removed construction debris.
int year, presentYear = 0; | ||
ByteArrayOutputStream baos = new ByteArrayOutputStream(); | ||
byte[] dayFrequencies = new byte[NUM_FREQUENCY_BYTES]; | ||
Arrays.fill(dayFrequencies, (byte) 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to fill with 0 just after the allocation. That is already done for you.
|
||
Value serializedMap; | ||
int year, presentYear = 0; | ||
ByteArrayOutputStream baos = new ByteArrayOutputStream(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could supply the size of the buffer estimated by looking at the year for the first and last entry in the passed in map.
|
||
try { | ||
baos.write(Base256Compression.numToBytes(presentYear)); | ||
Arrays.fill(dayFrequencies, (byte) 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not need to be done the first time through the loop
|
||
private void putFrequencyBytesInByteArray(Map.Entry<YearMonthDay,Frequency> entry, byte[] dayFrequencies) { | ||
// This will always be 4 bytes now even if null it will get compressed later on. | ||
byte[] frequency = Base256Compression.numToBytes(entry.getValue().value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this method can write directly to the array instead of allocating a new 4 byte array every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The array allocation is out of the code now. Will push soon.
*/ | ||
for (int j = NUM_YEAR_BYTES; j < DAYS_IN_LEAP_YEAR * NUM_BYTES_PER_FREQ_VALUE + NUM_YEAR_BYTES; j += NUM_BYTES_PER_FREQ_VALUE) { | ||
int k = i + j; | ||
byte[] encodedfrequencyOnDay = new byte[] {expandedData[k], expandedData[k + 1], expandedData[k + 2], expandedData[k + 3]}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are still allocating these 4 byte arrays. Please create the method Base256Compression.bytesToInteger(expandedData, k) or Base256Compression.bytesToInteger(expandedData[k], expandedData[k+1], expandedData[k+2], expandedData[k+3]).
} | ||
|
||
private String calculateMMDD(int ordinal) { | ||
SimpleDateFormat formatter = new SimpleDateFormat("MMdd"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You say that but I still see a formatter being created every time this method is called.....
The assumption that I made was that this version of the metadata-utils would be running only with the version of the main Datawave project that had the FrequencyColumnIterator in it and running at scan, minc, and majc. That is how the MetadataConfigHelper class in the main project sets that iterator up. If that is the case then the all of the "legacy" frequency data gets compacted with the new compression technique. I can make sure that the MetadataHelper function will work for sure by making calls to the old functions inside of the new functions if no compressed record is found. |
I do understand that. I guess it is ok to leave the metadata utils to only work with the new data format. |
I agree with you. We should make the code bullet proof to support legacy. |
…ents in code for that
…requency column data with using the FrequencyTransformIterator
…ncy column value is not found
…ments. * The frequency entries are now distinct by datatype * The filtering withing the metadata helper is now pushed down to interators and into the deserialization
…e specified date range
…plement Frequency Object overrides.
9bc3e3c
to
ce31178
Compare
The MetadataHelper class will use the new compressed frequency column iterator implemented in the Datawave project. The MetadataHelper needs to process date frequency rows differently now. Datawave uses tag 1.3 in this project so I branched off of the head of that. I put in a .gitignore file but it is not working for some reason. Sorry about that.