
CDK-928: Utility to generate events to existing table. #23

Open

wants to merge 6 commits into master

Conversation

DennisDawson

With an eye toward modularization, I've repurposed CreateEvents.java from the Spark example and placed it in org/kitesdk/examples/data. This lets the customer create the events dataset using the CLI, then populate it with a substantial number of records using the Java utility. The same dataset can be used for both the Flume and Spark examples, without having to delete it after running each job.

In GenerateEvents, I essentially swapped the CreateEvents create() method with load(). I added the Avro plugin to pom.xml, copied the avro folder containing standard_event.avsc into the main directory, and copied BaseEventsTool.java to org/kitesdk/examples/data.
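In other words, the dataset-creation call goes away and the utility just loads what the CLI already created. A rough sketch of the swap (the URI and descriptor details here are illustrative, not the exact PR code):

```java
// CreateEvents (Spark example) builds the dataset in code:
// Dataset<StandardEvent> events = Datasets.create(
//     "dataset:hive:events", descriptor, StandardEvent.class);

// GenerateEvents loads the dataset that was already created with the Kite CLI:
View<StandardEvent> events = Datasets.load("dataset:hive:events", StandardEvent.class);
```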

In my environment, it compiles, runs, and populates the events table as expected.

**Update**

The random records were a little too random: if the user_id, session_id, and ip are different for every record, the Crunch utility has no sessions to aggregate when it runs. I revised the run method to generate the user_id, session_id, and ip first, then use a for loop to generate 1-25 random events for that session (see the sketch below). I also modified the randomTimestamp method to increase the base length of time and add random padding, which creates more realistic session durations.
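Roughly what that revision might look like. This is a sketch only: the class name, event values, and time constants are illustrative, and it assumes the StandardEvent class generated from standard_event.avsc (package org.kitesdk.data.event) with its usual fields.

```java
import java.util.Random;
import java.util.UUID;
import org.kitesdk.data.DatasetWriter;
import org.kitesdk.data.View;
import org.kitesdk.data.event.StandardEvent;

public class GenerateEventsSketch {

  private final Random random = new Random();
  private final long baseTimestamp = System.currentTimeMillis();

  public void generateSession(View<StandardEvent> events) {
    // One user_id, session_id, and ip for the whole batch, so the Crunch
    // job has something to aggregate into a session.
    long userId = random.nextInt(10);
    String sessionId = UUID.randomUUID().toString();
    String ip = "192.168." + random.nextInt(256) + "." + random.nextInt(256);

    DatasetWriter<StandardEvent> writer = events.newWriter();
    try {
      int eventCount = 1 + random.nextInt(25);  // 1-25 events per session
      for (int i = 0; i < eventCount; i++) {
        writer.write(StandardEvent.newBuilder()
            .setEventInitiator("client_user")
            .setEventName("generated:event")
            .setUserId(userId)
            .setSessionId(sessionId)
            .setIp(ip)
            .setTimestamp(randomTimestamp())
            .build());
      }
    } finally {
      writer.close();
    }
  }

  private long randomTimestamp() {
    // A longer base interval plus random padding gives each session a
    // more realistic duration.
    long offset = random.nextInt(60 * 60 * 1000);   // up to one hour after the base
    long padding = random.nextInt(5 * 60 * 1000);   // up to five minutes of extra padding
    return baseTimestamp + offset + padding;
  }
}
```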

I'm happy to incorporate any changes that make the code more elegant; my changes just make it work.

@DennisDawson changed the title from "Utility to generate events to existing table." to "CDK-928: Utility to generate events to existing table." on Feb 19, 2015
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

public abstract class BaseEventsTool extends Configured implements Tool {
Contributor

It doesn't look like any of the code in this class is used, so it would be better to remove it and make GenerateEvents implement Tool directly.
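For illustration, a minimal sketch of what implementing Tool directly could look like (the run() body is elided; this is not the actual PR code):

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GenerateEvents extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // ... load the events dataset and write the generated records ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic Hadoop options and sets the Configuration.
    int rc = ToolRunner.run(new GenerateEvents(), args);
    System.exit(rc);
  }
}
```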

Author

Excellent. Done.

baseTimestamp = System.currentTimeMillis();

View<StandardEvent> events = Datasets.load(
(args.length==1 ? args[0] : "dataset:hive:events"), StandardEvent.class);
Contributor

I noted this elsewhere, but I think it would be better to use a variable rather than the inline test here.
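A sketch of the suggested refactor, pulling the URI into a variable before the load call (the variable name is illustrative):

```java
// Resolve the dataset URI first, then load it; same behavior as the
// inline ternary, but easier to read and to extend with validation.
String datasetUri = (args.length == 1) ? args[0] : "dataset:hive:events";
View<StandardEvent> events = Datasets.load(datasetUri, StandardEvent.class);
```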

Author

Is this wrong, or just different? Are you suggesting that the test should set the variable before the load method? If the argument is invalid, does it change the result by setting it outside the load method? If the code must change before publication, please provide the acceptable alternate code, rather than have me guess at what I should do.
