
Corb2 needs to be restartable #65

Open
bbandlamudi opened this issue Jun 29, 2017 · 6 comments
@bbandlamudi
Contributor

We need to figure out a way to make corb restartable. Today we achieve this with workarounds using control documents, but very few batch operations rely on control docs. We could potentially write the processed URIs to a local file and filter them out when the job is restarted.
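To make the idea concrete, here is a minimal sketch of that approach: append each processed URI to a local file, and on restart filter previously processed URIs out of the new URI list. The class and method names are hypothetical, not actual corb code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UriTracker {
    private final Path trackingFile;

    public UriTracker(Path trackingFile) {
        this.trackingFile = trackingFile;
    }

    // Append a URI to the tracking file as it completes.
    public void markProcessed(String uri) throws IOException {
        Files.write(trackingFile, (uri + System.lineSeparator()).getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // On restart, load previously processed URIs and filter them out,
    // preserving the order of the remaining URIs.
    public List<String> filterUnprocessed(List<String> allUris) throws IOException {
        Set<String> done = Files.exists(trackingFile)
                ? new HashSet<>(Files.readAllLines(trackingFile))
                : Collections.emptySet();
        List<String> remaining = new ArrayList<>();
        for (String uri : allUris) {
            if (!done.contains(uri)) {
                remaining.add(uri);
            }
        }
        return remaining;
    }
}
```

For very large URI sets the in-memory `HashSet` would need the disk-spilling discussed below, but the contract stays the same.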

@bbandlamudi bbandlamudi self-assigned this Jun 29, 2017
@hansenmc
Member

hansenmc commented Jun 30, 2017

It's a good idea. Other workarounds, such as adding docs to a corb-job-specific collection, add extra overhead. It would be nice to have a client-only mechanism that is easy to enable for jobs.

Things to consider:

  • If we track the processed URIs, the tracking should probably spill over to disk, in order to avoid memory issues.
  • Options to specify a particular path and/or filename:
    • Whether to just configure the directory and compute a filename (maybe from a hash of the options).
    • If we simply enabled the "restartable" feature with a boolean flag, we could use auto-magic defaults (Java temp dir and a computed options hash), but also allow specific overrides of the directory and/or filename.
  • How to efficiently scan the file and do the existence check:
    • Maybe use the DiskQueue to ensure the top X URIs are read in from disk and popped off the queue when a URI matches.
    • To avoid contention and issues managing one file for both filtering and tracking new URIs: when the job starts, if the URI tracking file exists (from a previous run), it is copied to a new file, and the copy is used for filtering.
    • As new URIs are processed, they can simply be appended to the original tracking file.
  • The file would also be a convenient report of processed URIs. Maybe add a flag to control whether it is retained or removed when the job completes?
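The "auto-magic default" filename idea could look something like this: derive the tracking-file name from a hash of the job options, so the same options always map to the same file across runs. This is a sketch under that assumption; the naming scheme and class are illustrative, not actual corb behavior.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

public class TrackingFileNamer {
    // Derive a stable tracking-file path from the job options: same options,
    // same file, so a restarted job automatically finds its predecessor's file.
    public static Path trackingFileFor(Map<String, String> options) throws Exception {
        // Sort the options so the hash is independent of map iteration order.
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(options).entrySet()) {
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(sb.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        // Default to the Java temp dir; a directory/filename override
        // option could replace either piece.
        String dir = System.getProperty("java.io.tmpdir");
        return Paths.get(dir, "corb-" + hex.substring(0, 16) + ".uris");
    }
}
```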

@bbandlamudi
Contributor Author

Good points. I am edging towards expanding on the disk-queue option and making it a requirement if the job needs to be restartable (unless we are using the file loader, streaming loader, etc.). We probably need a simple marker file to keep track of the indexes of URIs already processed, or the indexes of URIs that were in process when the job was killed, whichever is more efficient. We may be able to use this information to filter out what is already processed and determine what remains to be processed if the job is restarted. If the job starts with a tracking file alongside a temp file (if not a URIS-FILE), then we can assume it is a restart.

So, instead of a temp file with delete-on-exit, we may be able to change to delete on 'clean' exit. We also need a way to report back (as errors) if these files are left undeleted when a job was killed but not restarted.

This is not an easy problem to solve, and any batch op that does this has to track the processing information in a file or a db.
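The delete-on-'clean'-exit idea can be sketched as follows (hypothetical helper, not the actual corb job lifecycle): the tracking file is removed only when the job finishes normally, so a killed or failed job leaves it behind as the restart marker.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CleanExitTracker {
    // Run the job; delete the tracking file only on a clean exit. If the job
    // throws (or the JVM is killed before this method returns), the file
    // survives and signals to the next run that this is a restart.
    public static boolean runJob(Path trackingFile, Runnable job) throws IOException {
        try {
            job.run();
        } catch (RuntimeException e) {
            // Unclean exit: leave the tracking file in place for the restart.
            return false;
        }
        Files.deleteIfExists(trackingFile); // clean exit: no restart needed
        return true;
    }
}
```

A surviving file at startup, when the job is not being restarted, is exactly the condition that should be reported back as an error.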

@bbandlamudi
Contributor Author

I am thinking of using a parameter similar to URIS-COMPLETED-FILE (or something a bit more obvious), to which completed URIs will be written by the Monitor class, which is where we track completed URIs.

The challenge I have is to make this transparent to users, i.e., if the parameter is specified, the job should always be restartable without user intervention during the restart (as restarts can be, and often are, done by a scheduler): not forcing the user to rename files (the user can do so if they want to), update parameters, or use a combination of parameters. I am not sure how to do this yet (looking for ideas?), but I am thinking: if the completed-URIs file exists, use it for filtering at the start of the 'restart' job, and move the previous file aside so that the file name can be used to track newly completed URIs. Maybe we can append a timestamp to the previous completed-URIs file name to avoid it being overwritten.
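That move-aside-and-filter step might look like the sketch below (names hypothetical; the timestamp is passed in so the caller controls the suffix):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class RestartHelper {
    // If a completed-URIs file from a previous run exists, load it for
    // filtering and move it aside (timestamp suffix), so the original name
    // is free to track the new run's completions.
    public static Set<String> prepareRestart(Path completedFile, long timestamp)
            throws IOException {
        if (!Files.exists(completedFile)) {
            return Collections.emptySet(); // fresh start, nothing to filter
        }
        Set<String> alreadyDone = new HashSet<>(Files.readAllLines(completedFile));
        Path aside = completedFile.resolveSibling(
                completedFile.getFileName() + "." + timestamp);
        Files.move(completedFile, aside, StandardCopyOption.REPLACE_EXISTING);
        return alreadyDone;
    }
}
```

Because the decision is driven purely by the file's existence, the scheduler can re-run the exact same command line and get restart behavior for free.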

@bbandlamudi
Contributor Author

bbandlamudi commented Nov 16, 2017

@hansenmc @vjsaradhi - Please comment if you have any ideas or find mistakes in my approach. We also need to think about how to do this for the new loaders we added for v2.4.0.

Is there a way to diff very large files in Java? In our case, we need to find which of the URIs from the original URIs file are missing from the completed-URIs file. I couldn't find any reliable open-source implementation.

I am wondering if we could write it ourselves. This only works because both files are sorted and the completed-URIs file is always a subset of the original URIs file. I will need to experiment with this on very large files.

  1. Sort both the original-URIs and completed-URIs files, using the external-sorting library that we have.
  2. We only need to move a pointer forward (i.e., read-next) on each of the files.
  3. For line 1 in the original file, check whether it matches line 1 in the completed file. If it matches, we are good. If not, write it to the diff file and keep the completed-URIs pointer at line 1 for the next iteration.
  4. a) If line 1 is not present in the completed file, line 2 of the original file is checked against line 1 of the completed file. b) If line 1 is present in the completed file, line 2 of the original file is checked against line 2 of the completed file.
  5. Steps 3 and 4 are repeated for every line in the original file.
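The steps above can be sketched as a two-pointer walk over the two files. This assumes, as stated, that both files are sorted with the same ordering and the completed file is a subset of the original; the class name is illustrative.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SortedUriDiff {
    // Write to 'out' every line of 'original' that is missing from 'completed'.
    // Both pointers only ever move forward, so memory use is constant and
    // each file is read exactly once, regardless of size.
    public static void diff(Path original, Path completed, Path out) throws IOException {
        try (BufferedReader orig = Files.newBufferedReader(original);
             BufferedReader done = Files.newBufferedReader(completed);
             BufferedWriter diff = Files.newBufferedWriter(out)) {
            String doneLine = done.readLine();
            String origLine;
            while ((origLine = orig.readLine()) != null) {
                if (origLine.equals(doneLine)) {
                    doneLine = done.readLine(); // matched: advance both pointers
                } else {
                    diff.write(origLine);       // missing: record it, keep the
                    diff.newLine();             // completed pointer where it is
                }
            }
        }
    }
}
```

Once the completed file is exhausted, `doneLine` is null and every remaining original line is written to the diff, which is the desired behavior.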

Question: How can we implement restartability in the new loaders, i.e., the streaming, zip, and directory loaders? I am hoping these will not be difficult, but I will leave their implementation towards the end.

Note: Once the diff is done, we can restart corb with the missing URIs as the URIS-FILE. My big concern is how to get restartability working seamlessly without much user intervention.

@mikeburl

bump

I have a customer who would be interested in using this feature. For now they're just restarting the entire job. This would be a very useful feature for users who have large and complicated collection processes.

@bbandlamudi
Contributor Author

@mikeburl - Sorry for the delay. This has been a little tricky to implement, though the basic idea is simple, i.e., keep track of URIs already processed and skip them if restarted. I will try to get to it in the near future. For the time being, we have been pursuing alternate approaches in our current project:

  1. For update jobs, we either tag processed docs with a collection or update a field, so the next time we run the job it won't pick up already-updated docs.
  2. For read-only jobs, we split corb into two parts, i.e., use the module executor to run the selector and dump URIs to a file, and then use the URIS-FILE option to run the transform, which writes processed URIs to a file. If the job needs to be restarted, we can do a delta between the two files to figure out which URIs haven't yet been processed. This could be automated via a shell script.

For a longer-term solution, we could probably build this second option into corb itself.
