Worker Node updates queue files too slowly #3

Open
nylonoxygen77 opened this issue May 17, 2012 · 7 comments

@nylonoxygen77

The other day I tried to restore an entire project's worth of clips: 664 files, roughly 250-300GB of data. I have the restore queue interval set to 5 minutes. What happened was that Worker Node took too long to step through all 664 files - it got through about 35-45 of the clips, then aw-queue.pl would run, submit those 35-45 files to PresStore, and continue. By the time all the clips were submitted to PresStore, there were about 21 restore jobs running at the same time, each with between 35 and 45 files to restore. All the clips were on the same volume in PresStore, so PresStore seemed to spend most of its time on "waiting for volume xxxx to become available" requests. After about 4 hours, only 7 of those jobs had completed, so I told PresStore to cancel the remaining jobs, which also took a while. When it was all said and done, only about 222 of the 664 files were restored.

In an ideal world, producers and editors would plan and schedule in advance, and when they needed to edit an archived project, it would be ready because they would have unarchived it with plenty of time. But that doesn't always happen, and usually they need it back more quickly. One of the reasons we went with PresStore was that we knew we could archive and restore relatively quickly.

So the short-term answer to this problem would be to increase the restore queue interval to 10, 15, 20 minutes, or greater. That would help, but we'd still be likely to see more jobs submitted to PresStore than we want (ideally just one large restore job), and we'd have to wait out that interval before PresStore even started working.

So I did an experiment. I went into the CatDV catalog, made a new view that showed me only the media path, did a grouping of all offline files in this project (remaining 442 files), copied the media path column, and pasted it into the restore-queue.txt file. When aw-queue.pl ran, it submitted the remaining 442 files as one job. That job took only 50 minutes to complete.

So it seems like PresStore prefers having small numbers of large jobs, as opposed to large numbers of smaller jobs. To accommodate this, we could use the WN command line located at /Applications/CatDV Worker Node/catdv: do a query for the PresStore field (when set to Restore), output the media path, and redirect the output to the queue file, then run a second query on the same field to reset it back to blank. This sends all the media paths to the queue file and resets the archive action in a matter of seconds. Here's the command I used (my User 30 field is the trigger field for Castor/PresStore):

catdv -query U30=Restore -print1 MF >> /usr/local/Castor/queues/restore-queue.txt

and to reset the archive action field:

catdv -query U30=Restore -set U30=

I tried accomplishing this in one query command, but it didn't work - it does either the print1 or the set, depending on which one comes first, and ignores the second one. So two lines it is.

So then, instead of using WN to monitor for Restore jobs, we would have to have a launchdaemon periodically run this command.

Now, for archive jobs, everything works great - I have the archive time set in org.provideotech.aw-queue-restore.plist to 2am. By that time even large projects have everything added to the queue.

Also, there doesn't seem to be a quick way to export a CatDV XML file using the WN command line. You can use

catdv -query U30=Archive -print MF,CREF,U11 -format xml

But that will print out an XML batch file, which would have to be split into individual XML files, and you have to make sure you include all the fields you want in the -print option. So using the WN GUI seems to be the better way to go for archives.

But for faster restores, the command line paired with launchd might be the quicker route.
Thanks for your time!

@nylonoxygen77
Author

Sorry, I meant this...
Now, for archive jobs, everything works great - I have the archive time set in org.provideotech.aw-queue-archive.plist to 2am.

@szumlins
Owner

This makes perfect sense. The best way to work around this is likely to either move away from launchd (not really problematic) or more intelligently label and migrate queue files when a job starts. I think the latter will be more universally usable (as aw-queue.pl is sort of a standalone app outside of Castor).

I think the best way to accomplish this is to build some logic that datestamps a queue file at the start of an aw-queue call and cleans the file BEFORE the job runs. This way, if another restore call gets made before the first job is finished, the next call of aw-queue is going to reference the now empty (or possibly small) queue file.
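Just to sketch what I mean (none of this is implemented yet - the datestamped filename is an illustration, and I'm assuming aw-queue.pl keeps its current arguments), the start of an aw-queue run could do something like:

#!/bin/bash
# Sketch only: snapshot and clear the live queue before processing, so clips
# queued during a long restore land in a fresh file instead of the running job.

QUEUE=/usr/local/Castor/queues/restore-queue.txt
STAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT="${QUEUE%.txt}-${STAMP}.txt"

# move the current queue aside and immediately recreate an empty one
mv "$QUEUE" "$SNAPSHOT"
touch "$QUEUE"

# hand the frozen snapshot to aw-queue.pl; new entries keep accumulating in $QUEUE
/usr/local/Castor/aw-queue.pl "$SNAPSHOT" restore archive_plan archive_index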

Another way to work around this is to build a quick little widget on the desktop (I used Applescript) to call aw-queue with the proper variables on double click. This way, you could completely eliminate the need for launchd.

The final option would be to add another metadata field in CatDV called "submit restore" so that you could build an archive queue and then monitor that field to call your aw-queue script.

All great ideas here - now to find some time to implement them.

@nylonoxygen77
Author

Good ideas. Here's another one -

Write a simple bash script using the CLI to output the queue file, reset the PresStore field, and call aw-queue:

#!/bin/bash

# dump the media path of every clip marked for restore into the queue file
/Applications/CatDV\ Worker/catdv -query U30=Restore -print1 MF >> /usr/local/Castor/queues/restore-queue.txt
# reset the PresStore trigger field so the same clips aren't picked up again
/Applications/CatDV\ Worker/catdv -query U30=Restore -set U30=
# submit the queue to PresStore
/usr/local/Castor/aw-queue.pl /usr/local/Castor/queues/restore-queue.txt restore archive_plan archive_index

To trigger this, you could do it a few different ways:

  1. In the restore action in WN, instead of running catdv2aw on each clip, trigger this script in the Parameters tab under "Batch completion: Execute Command 4." Also remove the Post-processing step that sets the field back to blank, or it won't work. With that change, all WN is doing for the files selected for restore is running the batch completion command. So WN will go through each clip and do nothing (taking about 8 seconds per clip)… but when it's done doing nothing, the script runs, generates the queue file, sets the PresStore field for all clips back to blank in a matter of seconds, and triggers aw-queue to run the job.

  2. Add another way to submit the restore - like adding a checkbox or something. This is a bit awkward, because the user would have to make sure they only select it on one clip, or WN would do it for however many clips the user selected. A complaint of mine about WN: there's not really a good way to make it do something that isn't tied to a clip. I can't even say "Hey, WN, archive or restore this whole catalog, NOW!" I have to go clip-by-clip.

  3. (my favorite) Change the launchdaemon you have checking the restore queue to run this script instead of aw-queue directly. Forget having WN do it… if I want to restore my media ASAP, WN is just going to take too long to step through each clip. This is another problem I have with WN: it really is pretty slow, even though it does a great job at what it does. There's just no good way to process a batch of clips the same way, in an instant, like you can in the browser or with the CLI.

Most of the time, in an ideal world, it would be fine for WN to take its sweet time stepping through each clip, outputting each media path, and changing each PresStore field, one by one until it's done. But I'm also thinking of the situations where an editor or producer forgets to check that their media is online, they are booked for a block of 4 hours, and they need their media - but it's in the tape library. So now they have to restore it... and how long will it take? An hour, or four?

@nylonoxygen77
Author

Mike,

The third method I suggested above works like a charm. A restore queue file is generated immediately and the field is reset moments later. This could drastically improve performance on restores.

Here's the script I used (saved as /usr/local/Castor/aw-restore.sh):

#!/bin/bash

if [[ ! "$1" || ! "$2" || ! "$3" || ! "$4" || ! "$5" ]]; then
printf "\nusage: aw-restore.sh queue_file archive|restore archive_plan archive_index user_action_field\n\n";
printf "queue_file: full path to queue file.\n";
printf "archive|restore: select whether to archive or restore from archive the source_file.\n";
printf "archive_plan: which archive plan you want PresSTORE to use.\n";
printf "archive_index: this is the PresSTORE index we will look up our files from.\n\n";
printf "user_action_field: The field ID of the user field in CatDV that triggers adding the files to the queue.\n\n";
exit
fi

/Applications/CatDV\ Worker/catdv -query "$5"=Restore -print1 MF >> "$1"
/Applications/CatDV\ Worker/catdv -query "$5"=Restore -set "$5"=
/usr/local/Castor/aw-queue.pl "$1" "$2" "$3" "$4"
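
For reference, the launchdaemon (or a by-hand test) would call it like this - archive_plan and archive_index are the same placeholders as above, and U30 is my trigger field:

/usr/local/Castor/aw-restore.sh /usr/local/Castor/queues/restore-queue.txt restore archive_plan archive_index U30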

Ok... that said, I had to make some modifications to org.provideotech.aw-queue-restore.plist:

  1. The launchdaemon calls aw-restore.sh instead of aw-queue.pl.
  2. I added an argument that's passed to this script for the field ID of the user field in CatDV (e.g. U30).
  3. You must include the UserName key and a user that has permission to run the catdv command. When the daemon runs as root, no user is specified and the catdv command returns an error that it needs to be configured. This can probably be any user on the system, but I used the local administrator. (A rough sketch of the resulting plist is below.)
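
Roughly, the modified plist ends up looking something like this - I'm leaving out anything else configure.pl normally writes, "localadmin" and the 300-second interval (my 5-minute restore setting) are just placeholders, and archive_plan/archive_index are the same placeholders as above:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>org.provideotech.aw-queue-restore</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/Castor/aw-restore.sh</string>
        <string>/usr/local/Castor/queues/restore-queue.txt</string>
        <string>restore</string>
        <string>archive_plan</string>
        <string>archive_index</string>
        <string>U30</string>
    </array>
    <key>UserName</key>
    <string>localadmin</string>
    <key>StartInterval</key>
    <integer>300</integer>
</dict>
</plist>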

@szumlins
Owner

I'm hoping to have time to test this out in the next few days and possibly incorporate it into the master branch. If all works well I will update the configure.pl script to reference the worker batch commands instead of doing the poll that slows everything down.

@nylonoxygen77
Author

Cool! The only downside I see to this is that aw-queue.pl has to be on the same machine as Worker Node, but that is probably the case in a lot of installations. In my case, I had to completely change my configuration to make this work. I have Worker Node on one server, CatDV Server on another server, and PresStore on a third. I was running Castor on the PresStore server and had modified it so that the queue and temp files were being written to and read from our Xsan. Now Castor is running on the same server as Worker Node and I am using nsdchat -s awsock:/.

Which brings me to another issue... I will submit it separately though.

@nylonoxygen77
Author

Another discovery today on this issue. I was having a problem where, when I marked a large quantity of assets for archive in the evening, PresStore would get a whole bunch of separate archive jobs with only 1-2 files each. The first one had a significant number of files, but the rest were these little jobs. After examining the log, I discovered that aw-queue would run at 2AM, and then run again every minute until 3AM. During that time, WN was still adding files to archive-queue, about 1-2 per minute, so PresStore got flooded with small 1-2 file jobs every minute until 3AM. The solution, I believe, is to add a Minute key to the org.provideotech.aw-queue-archive.plist that's generated by the configure script. Will test this evening.
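
If I'm reading launchd's behavior right, a StartCalendarInterval with only an Hour key treats the missing Minute as a wildcard, which would explain the every-minute runs between 2AM and 3AM. Pinning it to a single run would just mean something like:

<key>StartCalendarInterval</key>
<dict>
    <key>Hour</key>
    <integer>2</integer>
    <key>Minute</key>
    <integer>0</integer>
</dict>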
