This is the second part of Datawave #716 issue. #2

Open
wants to merge 80 commits into
base: main
80 commits
bd56304
Created new class to contain code for serializing and deserializing c…
jzgithub1 Jun 8, 2020
32edc40
Created new class to contain code for serializing and deserializing c…
jzgithub1 Jun 8, 2020
3d0ee3d
Created new class to contain code for serializing and deserializing c…
jzgithub1 Jun 8, 2020
c1fe5f0
Added comments to the FrequencyFamilyCounter
jzgithub1 Jun 9, 2020
ec4e8c7
Restored the .gitignore
jzgithub1 Jun 9, 2020
874d01c
Only using simple date format in compress column record value
jzgithub1 Jun 9, 2020
893fdc7
Modified the getCardinalityForField in the MetadataHelper class
jzgithub1 Jun 9, 2020
9b0b55b
Recusively remove target and idea directories
jzgithub1 Jun 9, 2020
9d806ad
git rm metadata-utils.iml
jzgithub1 Jun 9, 2020
00214be
Changed parent project version back to 1.3 to get rid of Javadoc comp…
jzgithub1 Jun 10, 2020
f4e6aad
Added an empty line to a file to allow a new commit to force rebuild …
jzgithub1 Jun 10, 2020
6084897
Restored parent project to original version in the pom.xml
jzgithub1 Jun 10, 2020
e74600e
Trying to fix issues Jenkins on git is complaining about.
jzgithub1 Jun 10, 2020
6b51391
Added a static constant member variable SIMPLEDATE_LENGTH
jzgithub1 Jun 10, 2020
649be4d
Allowing for hexadecimal values to be added correctly
jzgithub1 Jun 11, 2020
24c889d
Added comments to the DateFrequencyValue class.
jzgithub1 Jun 11, 2020
c0ce97f
Compression implementation for date-frequency pairs is 90% complete
jzgithub1 Jun 12, 2020
c190e83
Working on compression algos
jzgithub1 Jun 15, 2020
1e1aa6c
Added Base256Compression class and a test class
jzgithub1 Jun 18, 2020
aff420d
Working on the compression classes and tests
jzgithub1 Jun 19, 2020
fef05a6
Serialization update
jzgithub1 Jun 19, 2020
6667299
Compression being implemented
jzgithub1 Jun 22, 2020
02f3ad4
Fixed an complilation error
jzgithub1 Jun 22, 2020
f649778
Fixed an indexing bug
jzgithub1 Jun 23, 2020
15b8eba
Modified numToBytes
jzgithub1 Jun 23, 2020
8bc95fe
Added .equals() to YearKey class
jzgithub1 Jun 23, 2020
e140dd8
Coded for the YYYYDDMM format
jzgithub1 Jun 23, 2020
996fd47
Increased size of read buffer in deserialize to the correct size.
jzgithub1 Jun 23, 2020
a5d5f5a
Serialize,Deserialize perfected with OrdinalDayofYear class
jzgithub1 Jun 24, 2020
02cfd95
Made suggested changes to bytesToInteger
jzgithub1 Jun 24, 2020
e311063
Working on perfecting OrdinalDayOfYear class
jzgithub1 Jun 25, 2020
f9d58e0
Improving logic in OrdinalDayOfYear class
jzgithub1 Jun 25, 2020
b06608d
Serialization and Deserialization possibly completely debugged. Need …
jzgithub1 Jun 25, 2020
eae7e1c
10 years of frequencies serialize and deserialize properly
jzgithub1 Jun 26, 2020
d44c0bc
Added a test that creates and Accuumlo shell script with path /var/tm…
jzgithub1 Jun 26, 2020
20cd4c0
Refactored functions in OrdinalDayOfYear class
jzgithub1 Jun 29, 2020
64f6af0
Correctled num days in August from 30 to 31 - It caused an error
jzgithub1 Jun 29, 2020
959854e
Perfected serialization and deserialization for 10 year period with l…
jzgithub1 Jun 30, 2020
8d61868
Fixed a comment
jzgithub1 Jun 30, 2020
c7c8229
Fixed a comment again
jzgithub1 Jun 30, 2020
9d03e8e
Cleaned up DateFrequencyValueTest - it had too much hardcoding
jzgithub1 Jul 2, 2020
6eb67f8
Working on a RunLengthEncoder for null bytes in the byte array that b…
jzgithub1 Jul 2, 2020
2fad13d
Removed Gzip code
jzgithub1 Jul 7, 2020
a8720ee
Don't print out call stack in Info log message
jzgithub1 Jul 9, 2020
e3a5b19
A little code cleanup
jzgithub1 Jul 14, 2020
94e4cc7
modified getCountsByFieldInDayWithTypes function for frequency column…
jzgithub1 Jul 20, 2020
ef6d919
Added script to insert data in the project
jzgithub1 Jul 21, 2020
cbf03a1
Updated test script
jzgithub1 Jul 21, 2020
2c85d6b
Improved MetadataHelper
jzgithub1 Jul 21, 2020
6fd5e09
Removed some info logging
jzgithub1 Jul 31, 2020
6dc6205
Using GregorianCalendar and TreeMaps
jzgithub1 Aug 11, 2020
335f759
Changed comment and removed unuse code
jzgithub1 Aug 11, 2020
f962d6f
Improved calculation of MMDD in OrdinalDayOfYear class
jzgithub1 Aug 13, 2020
bd39584
Avoid unneeded array realocation
jzgithub1 Aug 13, 2020
97211f0
Moved static constants to test class, cleaned up construction debris
jzgithub1 Aug 13, 2020
9d83d4a
Made calculateMMDD a static member and synchronized on calling format…
jzgithub1 Aug 13, 2020
5a4dc4b
Removed array reinitializations that were unneeded
jzgithub1 Aug 17, 2020
c4267b1
No longer preallocating arrays for datefrequencies and making adjustm…
jzgithub1 Aug 18, 2020
633cb53
Added more tests
jzgithub1 Aug 19, 2020
ace078a
Improved code
jzgithub1 Aug 19, 2020
088eb01
Added function to calculate the size of the ByteArrayOutputStream object
jzgithub1 Aug 20, 2020
a98568e
Removed byte array allocation and function that needed them.
jzgithub1 Aug 20, 2020
19e7996
Added javadoc comments and uneeded object copying.
jzgithub1 Aug 20, 2020
8c6ddb4
Added requested comments
jzgithub1 Sep 14, 2020
ea67643
Made MetadataHelper::getCountsByFieldInDayWithTypes backward compatible
jzgithub1 Sep 14, 2020
cb34cb6
Made the formatter a static member object
jzgithub1 Sep 14, 2020
f7b293d
Modified accumulo shell test script
jzgithub1 Sep 15, 2020
6bb7ad2
Fixed cardinality for field function
jzgithub1 Sep 16, 2020
aa6f8ee
Made MetadataHelper.getCountsByFieldInDayWithTypes work with legacy f…
jzgithub1 Sep 16, 2020
3856134
Updated AllFieldMetadataHelper.isIndex to work with compressed index …
jzgithub1 Oct 26, 2020
552d832
Correct trace comments regarding parsing integer count values
jzgithub1 Nov 9, 2020
d7792b3
Correct trace comments regarding parsing integer count values
jzgithub1 Nov 9, 2020
e296160
Added getCardinalityForFieldLegacy to get called if aggregated freque…
jzgithub1 Nov 10, 2020
6db91b3
re #716: Updated to handle datatype correctly and performance enhance…
ivakegg Nov 12, 2020
b9db45c
re #716: Updated to handle edge cases
ivakegg Nov 13, 2020
eb96f47
Removed frequencyColumnCompressionData.txt test file
jzgithub1 Nov 13, 2020
f9c60a2
re #716: Updated the date frequency parsing to only parse those in th…
ivakegg Nov 13, 2020
fb54d4b
re #716: Ensure we are not affected by time zone. Also appropriate i…
ivakegg Nov 16, 2020
ce31178
re #716: reduce test output to console
ivakegg Nov 16, 2020
aabd21a
Created a non-snapshot version
ivakegg Mar 10, 2021
4 changes: 2 additions & 2 deletions pom.xml
@@ -8,7 +8,7 @@
<relativePath />
</parent>
<artifactId>metadata-utils</artifactId>
-<version>1.8-SNAPSHOT</version>
+<version>1.8-zeiberg-716-1.0</version>
<url>https://code.nsa.gov/datawave-metadata-utils</url>
<licenses>
<license>
@@ -19,7 +19,7 @@
<scm>
<connection>scm:git:https://github.com/NationalSecurityAgency/datawave-metadata-utils.git</connection>
<developerConnection>scm:git:git@github.com:NationalSecurityAgency/datawave-metadata-utils.git</developerConnection>
-<tag>HEAD</tag>
+<tag>1.8-zeiberg-716-1.0</tag>
<url>https://github.com/NationalSecurityAgency/datawave-metadata-utils</url>
</scm>
<properties>
268 changes: 268 additions & 0 deletions src/main/java/datawave/query/util/DateFrequencyValue.java
@@ -0,0 +1,268 @@
package datawave.query.util;

import com.google.common.base.Preconditions;
import org.apache.accumulo.core.data.Value;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;

/**
* This class handles the serialization and deserialization of the Accumulo value in a record of the Datawave Metadata table that has a column family of "f" and
* a column qualifier prefixed with the string "compressed-", for example "compressed-csv". It helps compress the date and frequency values that are
* aggregated together by the FrequencyTransformIterator and manipulated in the FrequencyFamilyCounter. The byte array follows the pattern
* ((YEAR)(4BYTE-FREQUENCY){366})* : a four-byte representation of the year followed by 366 four-byte slots (enough for a leap year) holding frequency
* values. The month and day of each frequency value are encoded by its position in the array. There are no delimiters between years and frequencies, which
* adds to the compression. Each year therefore occupies 4 + (366 * 4) = 1468 bytes, so an Accumulo value for this "aggregated" frequency "map" that spans
* 10 years is 10 x 1468 = 14680 bytes long.
*/

public class DateFrequencyValue {

private static final Logger log = LoggerFactory.getLogger(DateFrequencyValue.class);

private static final int DAYS_IN_LEAP_YEAR = YearMonthDay.DAYS_IN_LEAP_YEAR;
private static final int NUM_YEAR_BYTES = 4;
private static final int NUM_BYTES_PER_FREQ_VALUE = 4;
private static final int NUM_FREQUENCY_BYTES = DAYS_IN_LEAP_YEAR * NUM_BYTES_PER_FREQ_VALUE;

public DateFrequencyValue() {}

/**
* @param dateToFrequencyValueMap
* the keys should be dates in yyyyMMdd format
* @return size the size of the ByteArrayOutputStream
*/
private int calculateOutputArraySize(TreeMap<YearMonthDay,Frequency> dateToFrequencyValueMap) {
int firstYear = dateToFrequencyValueMap.firstKey().getYear();
int lastYear = dateToFrequencyValueMap.lastKey().getYear();
return (lastYear - firstYear + 1) * (NUM_YEAR_BYTES + NUM_FREQUENCY_BYTES);
}

/**
*
* Serializes a treemap of dates with associated frequencies to an Accumulo value
*
* @param dateToFrequencyValueMap
* the keys should be dates in yyyyMMdd format
* @return Value the value to store in accumulo
*/
public Value serialize(TreeMap<YearMonthDay,Frequency> dateToFrequencyValueMap) {

Value serializedMap;
int year, presentYear = 0;
ByteArrayOutputStream baos = new ByteArrayOutputStream(calculateOutputArraySize(dateToFrequencyValueMap));
int ordinal, nextOrdinal = 1;

for (Map.Entry<YearMonthDay,Frequency> dateFrequencyEntry : dateToFrequencyValueMap.entrySet()) {
year = dateFrequencyEntry.getKey().getYear();
ordinal = dateFrequencyEntry.getKey().getJulian();

if (year != presentYear) {

if (nextOrdinal > 1 && (nextOrdinal <= DAYS_IN_LEAP_YEAR)) {
// Add zero frequencies for the remaining days of the year that were not in the map
// for the last year processed
do {
Base256Compression.writeToOutputStream(0, baos);
nextOrdinal++;
} while (nextOrdinal <= DAYS_IN_LEAP_YEAR);
}

Base256Compression.writeToOutputStream(year, baos);
nextOrdinal = 1;
presentYear = year;
}

if (ordinal == nextOrdinal) {
Base256Compression.writeToOutputStream(dateFrequencyEntry.getValue().value, baos);
nextOrdinal++;
} else {
do {
Base256Compression.writeToOutputStream(0, baos);
nextOrdinal++;
} while (nextOrdinal < ordinal);

Base256Compression.writeToOutputStream(dateFrequencyEntry.getValue().value, baos);
nextOrdinal++;

}

if (log.isTraceEnabled())
log.trace(dateFrequencyEntry.getKey().toString());

}

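// Pad the final year out to 366 slots. A plain while loop avoids writing an extra slot when the last entry is day 366.
while (nextOrdinal <= DAYS_IN_LEAP_YEAR) {
Base256Compression.writeToOutputStream(0, baos);
nextOrdinal++;
}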

serializedMap = new Value(baos.toByteArray());

return serializedMap;
}

/**
* Deserializes the Accumulo Value object which contains a byte array into a TreeMap of dates to the associated ingest frequencies.
*
* @param oldValue
* @return
*/
public TreeMap<YearMonthDay,Frequency> deserialize(Value oldValue) {
return deserialize(new TreeMap<>(), oldValue, null, false, null, false);
}

/**
* Deserializes the Accumulo Value object which contains a byte array into a TreeMap of dates to the associated ingest frequencies.
*
* @param oldValue
* @param startDate
* yyyyMMdd start date to deserialize, null means no bound
* @param startInclusive
* @param endDate
* yyyyMMdd end date to deserialize, null means no bound
* @param endInclusive
* @return
*/
public TreeMap<YearMonthDay,Frequency> deserialize(Value oldValue, String startDate, boolean startInclusive, String endDate, boolean endInclusive) {
return deserialize(new TreeMap<>(), oldValue, startDate, startInclusive, endDate, endInclusive);
}

/**
* Deserializes and aggregates the Accumulo Value object which contains a byte array into a TreeMap of dates to the associated ingest frequencies.
*
* @param dateFrequencyMap
* @param oldValue
* @param startDate
* yyyyMMdd start date to deserialize, null means no bound
* @param startInclusive
* @param endDate
* yyyyMMdd end date to deserialize, null means no bound
* @param endInclusive
* @return
*/
public TreeMap<YearMonthDay,Frequency> deserialize(TreeMap<YearMonthDay,Frequency> dateFrequencyMap, Value oldValue, String startDate,
boolean startInclusive, String endDate, boolean endInclusive) {
Preconditions.checkNotNull(oldValue);

byte[] expandedData = oldValue.get();

if (expandedData.length == 0) {
return dateFrequencyMap;
}

YearMonthDay.Bounds bounds = new YearMonthDay.Bounds(startDate, startInclusive, endDate, endInclusive);

try {
for (int i = 0; i < expandedData.length; i += (NUM_YEAR_BYTES + NUM_FREQUENCY_BYTES)) {
int decodedYear = Base256Compression.bytesToInteger(expandedData[i], expandedData[i + 1], expandedData[i + 2], expandedData[i + 3]);
String decodedYearStr = String.valueOf(decodedYear);
log.debug("Deserialize decoded the year {}", decodedYearStr);

if (bounds.intersectsYear(decodedYearStr)) {
// determine the range within this year
int startOrdinalInYear = bounds.getStartOrdinal(decodedYearStr);
int endOrdinalInYear = bounds.getEndOrdinal(decodedYearStr);

// Decode the frequencies for each day of the year within the range.
for (int j = NUM_YEAR_BYTES + ((startOrdinalInYear - 1) * NUM_BYTES_PER_FREQ_VALUE); j < endOrdinalInYear * NUM_BYTES_PER_FREQ_VALUE
+ NUM_YEAR_BYTES; j += NUM_BYTES_PER_FREQ_VALUE) {
int k = i + j;
YearMonthDay date = new YearMonthDay(decodedYear, ((j - NUM_YEAR_BYTES) / NUM_BYTES_PER_FREQ_VALUE) + 1);
int decodedFrequencyOnDay = Base256Compression.bytesToInteger(expandedData[k], expandedData[k + 1], expandedData[k + 2],
expandedData[k + 3]);
if (decodedFrequencyOnDay != 0) {
if (dateFrequencyMap.containsKey(date)) {
dateFrequencyMap.get(date).addFrequency(decodedFrequencyOnDay);
} else {
dateFrequencyMap.put(date, new Frequency(decodedFrequencyOnDay));
}
log.debug("put key value pair in SimpleDateFrequency map: {}, {}", date, dateFrequencyMap.get(date));
}
}
}
}
} catch (IndexOutOfBoundsException indexOutOfBoundsException) {
log.error("Error decoding the compressed array of date values. Expanded array length: " + expandedData.length, indexOutOfBoundsException);
throw new IllegalStateException("Error decoding the compressed array of date values. Expanded array length: " + expandedData.length,
indexOutOfBoundsException);
}

return dateFrequencyMap;
}

/**
* Sums the frequencies stored in the byte array of the Accumulo Value object for the dates that fall within the given bounds.
*
* @param oldValue
* @param startDate
* yyyyMMdd start date to deserialize, null means no bound
* @param startInclusive
* @param endDate
* yyyyMMdd end date to deserialize, null means no bound
* @param endInclusive
* @return
*/
public long count(Value oldValue, String startDate, boolean startInclusive, String endDate, boolean endInclusive) {
Preconditions.checkNotNull(oldValue);

byte[] expandedData = oldValue.get();

if (expandedData.length == 0) {
return 0;
}

YearMonthDay.Bounds bounds = new YearMonthDay.Bounds(startDate, startInclusive, endDate, endInclusive);

try {
long count = 0;
for (int i = 0; i < expandedData.length; i += (NUM_YEAR_BYTES + NUM_FREQUENCY_BYTES)) {
int decodedYear = Base256Compression.bytesToInteger(expandedData[i], expandedData[i + 1], expandedData[i + 2], expandedData[i + 3]);
String decodedYearStr = String.valueOf(decodedYear);
log.debug("Count decoded the year {}", decodedYearStr);

if (bounds.intersectsYear(decodedYearStr)) {
// determine the range within this year
int startOrdinalInYear = bounds.getStartOrdinal(decodedYearStr);
int endOrdinalInYear = bounds.getEndOrdinal(decodedYearStr);

// Decode the frequencies for each day of the year within the range.
for (int j = NUM_YEAR_BYTES + ((startOrdinalInYear - 1) * NUM_BYTES_PER_FREQ_VALUE); j < endOrdinalInYear * NUM_BYTES_PER_FREQ_VALUE
+ NUM_YEAR_BYTES; j += NUM_BYTES_PER_FREQ_VALUE) {
int k = i + j;
YearMonthDay date = new YearMonthDay(decodedYear, ((j - NUM_YEAR_BYTES) / NUM_BYTES_PER_FREQ_VALUE) + 1);
count += Base256Compression.bytesToInteger(expandedData[k], expandedData[k + 1], expandedData[k + 2], expandedData[k + 3]);
}
}
}
return count;
} catch (IndexOutOfBoundsException indexOutOfBoundsException) {
log.error("Error decoding the compressed array of date values. Expanded array length: " + expandedData.length, indexOutOfBoundsException);
throw new IllegalStateException("Error decoding the compressed array of date values. Expanded array length: " + expandedData.length,
indexOutOfBoundsException);
}
}

/**
* This is a helper class that writes the low 32 bits of a number as four bytes, most significant byte first, and reads them back as an int. Years and
* frequencies are written back to back without a delimiter.
*/
public static class Base256Compression {

public static void writeToOutputStream(long num, ByteArrayOutputStream baos) {
baos.write((byte) (num >>> 24));
baos.write((byte) (num >>> 16));
baos.write((byte) (num >>> 8));
baos.write((byte) (num));
}

public static int bytesToInteger(byte high, byte nextHigh, byte lowbyte, byte lowest) {
return ((int) high & 0xff) << 24 | ((int) nextHigh & 0xff) << 16 | ((int) lowbyte & 0xff) << 8 | ((int) lowest & 0xff);
}

}

}
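
Review note: the class Javadoc above describes a fixed-width layout (a 4-byte year followed by 366 4-byte frequency slots). The following is a minimal standalone sketch of that layout, not the PR's code. The class name `DateFrequencyLayoutSketch` and its methods `writeInt`, `readInt`, `slotOffset`, and `encodeYear` are hypothetical names introduced only for illustration, and the sketch takes a plain `int[]` indexed by ordinal day instead of the `YearMonthDay` and `Frequency` types used in the PR.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of the year-block layout; not part of the PR.
final class DateFrequencyLayoutSketch {
    static final int YEAR_BYTES = 4;
    static final int FREQ_BYTES = 4;
    static final int SLOTS_PER_YEAR = 366;
    // One year block: 4 + 366 * 4 = 1468 bytes.
    static final int BLOCK_BYTES = YEAR_BYTES + SLOTS_PER_YEAR * FREQ_BYTES;

    /** Writes num as four bytes, most significant byte first. */
    static void writeInt(int num, ByteArrayOutputStream out) {
        out.write(num >>> 24); // write(int) keeps only the low 8 bits
        out.write(num >>> 16);
        out.write(num >>> 8);
        out.write(num);
    }

    /** Reads four bytes starting at offset back into an int. */
    static int readInt(byte[] data, int offset) {
        return (data[offset] & 0xff) << 24 | (data[offset + 1] & 0xff) << 16
                | (data[offset + 2] & 0xff) << 8 | (data[offset + 3] & 0xff);
    }

    /** Byte offset of the slot for a 1-based ordinal day in the given year block. */
    static int slotOffset(int blockIndex, int ordinalDay) {
        return blockIndex * BLOCK_BYTES + YEAR_BYTES + (ordinalDay - 1) * FREQ_BYTES;
    }

    /** Encodes one year block: the year, then all 366 slots (zero where no data). */
    static byte[] encodeYear(int year, int[] frequenciesByOrdinal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(BLOCK_BYTES);
        writeInt(year, out);
        for (int day = 1; day <= SLOTS_PER_YEAR; day++) {
            writeInt(frequenciesByOrdinal[day - 1], out);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        int[] freqs = new int[SLOTS_PER_YEAR];
        freqs[59] = 1234; // ordinal day 60
        byte[] block = encodeYear(2020, freqs);
        System.out.println(block.length);                      // 1468
        System.out.println(readInt(block, 0));                 // 2020
        System.out.println(readInt(block, slotOffset(0, 60))); // 1234
    }
}
```

Because every block has the same width, a reader can jump straight to any day's slot with the offset arithmetic in `slotOffset`, which is the same arithmetic the inner loops of `deserialize` and `count` perform. The cost is that sparse years still occupy the full 1468 bytes.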
36 changes: 36 additions & 0 deletions src/main/java/datawave/query/util/Frequency.java
@@ -0,0 +1,36 @@
package datawave.query.util;

public class Frequency {
int value;

public Frequency(int value) {
this.value = value;
}

public int getValue() {
return value;
}

void addFrequency(int value) {
this.value += value;
}

@Override
public String toString() {
return "Frequency{" + "value=" + value + '}';
}

@Override
public int hashCode() {
return value;
}

@Override
public boolean equals(Object obj) {
if (obj instanceof Frequency) {
if (((Frequency) obj).value == value)
return true;
}
return false;
}
}
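
Review note: `Frequency.addFrequency` accumulates into an `int` with `+=`, which wraps silently on overflow, while `DateFrequencyValue.count` sums into a `long`. The sketch below shows the same add-or-insert aggregation that `deserialize` performs, with overflow made visible. It is an illustration under stated assumptions, not the PR's code: `FrequencyMergeSketch`, `Count`, and `aggregate` are hypothetical names, dates are plain `yyyyMMdd` integers rather than `YearMonthDay` keys, and the use of `Math.addExact` is a suggested alternative rather than what the PR does.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of per-date aggregation; not part of the PR.
final class FrequencyMergeSketch {

    /** Stand-in for Frequency that throws on int overflow instead of wrapping. */
    static final class Count {
        private int value;

        Count(int value) {
            this.value = value;
        }

        int getValue() {
            return value;
        }

        void add(int delta) {
            value = Math.addExact(value, delta); // ArithmeticException on overflow
        }
    }

    /** Adds count to the entry for the date, creating it if absent; zero counts are skipped. */
    static void aggregate(TreeMap<Integer, Count> byDate, int yyyymmdd, int count) {
        if (count == 0) {
            return;
        }
        Count existing = byDate.get(yyyymmdd);
        if (existing == null) {
            byDate.put(yyyymmdd, new Count(count));
        } else {
            existing.add(count);
        }
    }

    public static void main(String[] args) {
        TreeMap<Integer, Count> byDate = new TreeMap<>();
        aggregate(byDate, 20200229, 5);
        aggregate(byDate, 20200229, 7);
        aggregate(byDate, 20200301, 0); // skipped, as in deserialize
        for (Map.Entry<Integer, Count> e : byDate.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().getValue());
        }
    }
}
```

Whether failing fast on overflow is preferable to wrapping depends on how large per-day counts can get in practice; the sketch only makes the trade-off explicit.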