Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

This is the second part of Datawave #716 issue. #2

Open
wants to merge 80 commits into
base: main
Choose a base branch
from

Conversation

jzgithub1
Copy link

The MetadataHelper class will use the new compressed frequency column iterator implemented in the Datawave project. The MetadataHelper needs to process date frequency rows differently now. Datawave uses tag 1.3 in this project so I branched off of the head of that. I put in a .gitignore file but it is not working for some reason. Sorry about that.

@jzgithub1
Copy link
Author

@ivakegg, I got this to compile an pass checks by taking the -SNAPSHOT out of my version for the metadata-utils service in the pom.xml.. Unfortunately I do not see how this 1.5 version will get chosen when the datawave-parent project which uses the classes in this version gets compiled. That datawave-parent project gets this error during build due to the dependency issue:
[ERROR] Failed to execute goal on project datawave-core: Could not resolve dependencies for project gov.nsa.datawave:datawave-core:jar:2.10.0-SNAPSHOT: Could not find artifact gov.nsa.datawave.microservice:metadata-utils:jar:1.5 in datawave-github-mvn-repo (https://raw.githubusercontent.com/NationalSecurityAgency/datawave/mvn-repo) -> [Help 1]

Comment on lines 269 to 279
if (num == 0) {
return new byte[] {(byte) '\u0000', (byte) '\u0000', (byte) '\u0000', (byte) '\u0000'};
} else if (num < 256) {
return new byte[] {(byte) '\u0000', (byte) '\u0000', (byte) '\u0000', (byte) (num)};
} else if (num < 65536) {
return new byte[] {(byte) '\u0000', (byte) '\u0000', (byte) (num >>> 8), (byte) num};
} else if (num < 16777216) {
return new byte[] {(byte) '\u0000', (byte) (num >>> 16), (byte) (num >>> 8), (byte) num};
} else { // up to 2,147,483,647
return new byte[] {(byte) (num >>> 24), (byte) (num >>> 16), (byte) (num >>> 8), (byte) num};
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why all the if checks? Just use the statement in the last else.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will give that a try @ivakegg. Thanks.

byte[] frequency = Base256Compression.numToBytes(entry.getValue());
int dateOrdinal = parser.ordinalDayOfYear.getDateOrdinal();

if (dateOrdinal == BADORDINAL) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

== or >= ?

if (byteArray == null)
return 0;
if (byteArray.length > 4)
return 2_147_483_647;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer.MAX_VALUE ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivakegg , I am refactoring OrdinalDayOfYear.calculateOrdinal(). Don't review it for structure yet.

if (byteArray.length > 4)
return 2_147_483_647;

if (byteArray.length == 1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is overly complicated. How about:
value = ((int) b[3] & 0xff) << 24
| ((int) b[2] & 0xff) << 16
| ((int) b[1] & 0xff) << 8
| ((int) b[0] & 0xff);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried to get the code you put above to work but it doesn't. I will try to condense the method you are talking about some other way.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it to work. The indexes were reversed. Thanks a lot.

@jzgithub1
Copy link
Author

@ivakegg , You can start reviewing this pull request. I updated my PR on datawave parent that has the FrequencyTransformIterator in it that uses the new code in the datawave-metadata utils. The DataFrequencyValueTest in the metadata-utils project stress tests the serialization/deserialization code with a 10 year dataset. There is a "test" in the DataFrequencyValueTest that creates a script you can 'execfile' in the accumulo shell on the datawave.metadata table. I have done that and grep'ed the logs for ERROR in my classes. None found. The MetadataHelper class actually uses the functionality of all this and I will have to write code to create a client that uses that class to further verify these PRs.

@jzgithub1 jzgithub1 changed the title WIP : This is the second part of Datawave #716 issue. This is the second part of Datawave #716 issue. Jul 31, 2020
pom.xml Outdated
@@ -104,4 +104,16 @@
<url>https://raw.githubusercontent.com/NationalSecurityAgency/datawave/mvn-repo</url>
</repository>
</repositories>
<build>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already specified in the inherited parent pom.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take this out unless you can tell me it is needed

* frequency values that are aggregated together to by the FrequencyTransformIterator and manipulated in the FrequencyFamilyCounter. The byte array is in
* regular expression format (YEAR(4BYTE-FREQUENCY){366}))* . Explained verbally a four byte representation of Year followed by 366 (Leap year) 4 byte holders
* for frequency. In this version all the arrays have a slot for the Feb 29 leap year frequency whether it is there or not. There aren't any delimiters between
* years and frequencies which adds to the compression. Each Accumulo row for this "aggregated" frequency "map" would be 10 x ( 4 + (366 x * 4) ) bytes long for
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each Accumulo row -> Each compressed value

private HashMap<String,Integer> uncompressedDateFrequencies = null;

public static final int DAYS_IN_LEAP_YEAR = 366;
private static int MAX_YEARS = 10;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why max of 10 years? Perhaps we have a configurable parameter/option on the iterator that will chop data off the back end.

if (entry.getKey() == null || entry.getKey().isEmpty())
continue;

log.trace("Serializing the key/value " + entry.getKey() + " value: " + entry.getValue());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (log.isTraceEnabled()) ....
Do this whenever we are doing string concatenation in the argument.

for (Map.Entry<YearKey,byte[]> entry : compressedDateFrequencies.entrySet()) {
log.trace("Compressing key: " + entry.getKey().key + " value" + entry.getValue() + "Byte array " + entry.getValue());
try {
theOutputstreamInUse.write(entry.getKey().compressedContent);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a performance standpoint, it would be good to avoid doing intermediate byte allocations. Hence if we can write the compressed bytes directly to the output stream instead of allocation the intermediate YearKey.compressedContent arrays. This will require of course sorting the uncompressedDateFrequencies and then iterating through THE entries in the map until you hit the last sorted key in the set. You will need to write 0's in the gaps between entries of course. You can preallocate the buffer in the output stream to be the appropriate size by looking at the first and last entries in the sorted set.

log.trace("Long.parseLong could not parse " + value + " to long for this key " + cleanKey);
log.trace("Trying to use Long.decode");
parsedLong = Integer.decode(value);
total += parsedLong;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this total used for?

private HashMap<YearKey,byte[]> compressedDateFrequencies = new HashMap<>();
// The mapping of SimpleDateFormat strings to frequency counts. This is passed in by the
// FrequencyFamilyCounter object to this classes serialize function
private HashMap<String,Integer> uncompressedDateFrequencies = null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think everything might work a little smoother and be more readable if this were a TreeMap<YearMonthDay, Frequency> instead of HashMap<String, Integer>. The Frequency object could be mutable allowing aggregation of frequencies which avoid excessive Integer object allocations. The YearMonthDay can store year and ordinal (julian date) values avoid excessive string manipulations.

class YearMonthDay implements Comparable<YearMonthDay> {
  int year;
  int julian;
  YearMonthDay(String value) {
     // parse value into year and ordinal
  }
  int getYear() {
     return year;
  }
  int getJulian()
     return julian;;
  }
}
class Frequency {
  int value 
  Frequency(value) {
     this.value = value;
  }
  int getValue()  {
     return value;
  }
  void addFrequency(int value) {
     this.value += value;
  }
}

else {

int lastValue = dateToFrequencyValueMap.get(cleanKey);
dateToFrequencyValueMap.put(cleanKey, lastValue + parsedLong);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where Frequency.addFrequency() would be more efficient

src/main/java/datawave/query/util/OrdinalDayOfYear.java Outdated Show resolved Hide resolved
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunLengthEncoding {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this class actually used anywhere?

}
}

putFrequencyBytesInByteArray(dateFrequencyEntry, dayFrequencies);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this method write the frequencies directly to the output stream ?

Copy link
Author

@jzgithub1 jzgithub1 Aug 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a really great idea. My first instinct was to go ahead and implement it like you suggest. What is going in is that datFrequencies array has to be the same 366 * 4 length for all years. It has to be fully populated and initialized before it is written to the byte output stream. It gets written to the byte output stream on line 51 when there is a year change. If we do it the way you suggest I have to put in other logic that outputs 4 zero valued bytes for all the days that are not in the TreeMap. More code and more clutter. The way I am doing it only requires log to process only the days of the year in the TreeMap and the other days are automatically zeroed. The gzip in Accumulo compaction compresses the zero days as we discussed.
I think it is good as it is right now. If you want me to code the extra logic for days that are not in the map then let me know and I will go ahead. I think this is pretty fast the way it is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well can we at least avoid reallocating a new array every time at line 60 and instead simply 0 our the array: Arrays.fill(dateFrequencies, (byte)0);

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally see where you are going here. You are really safeguarding against a memory overflow and I have not had to be this careful in past work or I just got lucky. I will implement your suggestion. Thanks for working with me here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Array allocation change made and pushed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think we could write directly to the baos. You have to zero out the array anyway so what is the difference? Just need to write zeros since the last day written when writing the next value.

}

private String calculateMMDD(int ordinal) {
if (ordinal > 366)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole method could be replaced with:

        SimpleDateFormat formatter = new SimpleDateFormat("MMdd");
        Calendar c = Calendar.getInstance();
        c.set(Calendar.DAY_OF_YEAR, i);
        return formatter.format(c.getTime());

If your ordinal is 0 based then you will want to set the day of year to i+1 instead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requested changes have been made. Thank for the lesson on Calendar class.

}

private String calculateMMDD(int ordinal) {
SimpleDateFormat formatter = new SimpleDateFormat("MMdd");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth creating this as a static member and then synchronizing on it when calling the format method.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made the function a static member, synchronized on formatter object. No longer create the object, just use static method. Pushed code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You say that but I still see a formatter being created every time this method is called.....

private boolean isLeapYear = false;
private GregorianCalendar gregorianCalendar;

public static final int[] MONTH_LENGTH = new int[] {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these constants still needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look and remove possible construction debris.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the statics to the test class where they get used and removed construction debris.

int year, presentYear = 0;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] dayFrequencies = new byte[NUM_FREQUENCY_BYTES];
Arrays.fill(dayFrequencies, (byte) 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to fill with 0 just after the allocation. That is already done for you.


Value serializedMap;
int year, presentYear = 0;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could supply the size of the buffer estimated by looking at the year for the first and last entry in the passed in map.


try {
baos.write(Base256Compression.numToBytes(presentYear));
Arrays.fill(dayFrequencies, (byte) 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not need to be done the first time through the loop


private void putFrequencyBytesInByteArray(Map.Entry<YearMonthDay,Frequency> entry, byte[] dayFrequencies) {
// This will always be 4 bytes now even if null it will get compressed later on.
byte[] frequency = Base256Compression.numToBytes(entry.getValue().value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this method can write directly to the array instead of allocating a new 4 byte array every time.

Copy link
Author

@jzgithub1 jzgithub1 Aug 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The array allocation is out of the code now. Will push soon.

src/main/java/datawave/query/util/DateFrequencyValue.java Outdated Show resolved Hide resolved
*/
for (int j = NUM_YEAR_BYTES; j < DAYS_IN_LEAP_YEAR * NUM_BYTES_PER_FREQ_VALUE + NUM_YEAR_BYTES; j += NUM_BYTES_PER_FREQ_VALUE) {
int k = i + j;
byte[] encodedfrequencyOnDay = new byte[] {expandedData[k], expandedData[k + 1], expandedData[k + 2], expandedData[k + 3]};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are still allocating these 4 byte arrays. Please create the method Base256Compression.bytesToInteger(expandedData, k) or Base256Compression.bytesToInteger(expandedData[k], expandedData[k+1], expandedData[k+2], expandedData[k+3]).

src/main/java/datawave/query/util/DateFrequencyValue.java Outdated Show resolved Hide resolved
src/main/java/datawave/query/util/DateFrequencyValue.java Outdated Show resolved Hide resolved
src/main/java/datawave/query/util/DateFrequencyValue.java Outdated Show resolved Hide resolved
src/main/java/datawave/query/util/MetadataHelper.java Outdated Show resolved Hide resolved
}

private String calculateMMDD(int ordinal) {
SimpleDateFormat formatter = new SimpleDateFormat("MMdd");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You say that but I still see a formatter being created every time this method is called.....

@jzgithub1
Copy link
Author

The assumption that I made was that this version of the metadata-utils would be running only with the version of the main Datawave project that had the FrequencyColumnIterator in it and running at scan, minc, and majc. That is how the MetadataConfigHelper class in the main project sets that iterator up. If that is the case then the all of the "legacy" frequency data gets compacted with the new compression technique. I can make sure that the MetadataHelper function will work for sure by making calls to the old functions inside of the new functions if no compressed record is found.

@ivakegg
Copy link
Collaborator

ivakegg commented Nov 10, 2020

I do understand that. I guess it is ok to leave the metadata utils to only work with the new data format.

@jzgithub1
Copy link
Author

I do understand that. I guess it is ok to leave the metadata utils to only work with the new data format.

I agree with you. We should make the code bullet proof to support legacy.

jzgithub1 and others added 28 commits March 10, 2021 15:13
…requency column data with using the FrequencyTransformIterator
…ments.

* The frequency entries are now distinct by datatype
* The filtering withing the metadata helper is now pushed down to interators and into the deserialization
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants