What happened?:
We have implemented a data architecture based on Pachyderm. Our first pipeline (called textblocks), which has already processed a large number of files (27402 files), becomes completely blocked when we try to add a new file to be processed.
More specifically, the strange behaviour we observe is as follows:
When we start the textblocks pipeline (pachctl start pipeline textblocks), all the pods are created and the pipeline goes into running status.
pachctl list pipeline
NAME        VERSION  INPUT              CREATED      STATE / LAST JOB   DESCRIPTION
textblocks  1        dump_file:/**.pdf  5 weeks ago  running / success  description
kubectl get pods
NAME                          READY  STATUS   RESTARTS  AGE
etcd-0                        1/1    Running  0         26h
pachd-55f54bb966-ntfhk        1/1    Running  0         26h
pg-bouncer-7b855cb797-zzj4q   1/1    Running  0         26h
pipeline-textblocks-v1-98tjl  2/2    Running  0         126m
postgres-0                    1/1    Running  0         26h
The pipeline, having already processed the files that were added a while ago, automatically starts running and is quickly put into success status, since there are no new files to process; so far, so good.
With the pipeline still running, we add just one new file to process, with a port-forward enabled. The pipeline then goes back into running status and, after a few moments, goes into failure status. Looking at the logs of the available pods, we see a rather specific error, which seems to have caused the processing to fail: "error":"file / directory path collision (/PARIS/2019/PARIS_2019_02_5fOdx.pdf)".
Although I have tried deleting the file cited in the error logs (/PARIS/2019/PARIS_2019_02_5fOdx.pdf), this does not unblock the pipeline: a new log entry appears, similar to the previous one, but naming a different file.
We have checked, however, that no duplicate files were added to the architecture, so in principle no path collision errors should appear.
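For what it's worth, in file systems like Pachyderm's, a "file / directory path collision" does not require two identical paths: it can also arise when one path exists as a file while also being the directory prefix of another path. The following is only an illustration of that idea (not Pachyderm's actual implementation), using the path from the error log:

```python
def find_path_collisions(paths):
    """Return paths that exist both as a file and as a directory prefix
    of another path (a file/directory collision)."""
    path_set = set(paths)
    collisions = set()
    for p in paths:
        # Walk up the parent chain; any ancestor that is itself a
        # committed file collides with this deeper path.
        parts = p.strip("/").split("/")
        for i in range(1, len(parts)):
            prefix = "/" + "/".join(parts[:i])
            if prefix in path_set:
                collisions.add(prefix)
    return sorted(collisions)

# /PARIS/2019 committed both as a file and as a directory of the PDF:
print(find_path_collisions([
    "/PARIS/2019/PARIS_2019_02_5fOdx.pdf",
    "/PARIS/2019",
]))
# → ['/PARIS/2019']
```

So even with no duplicate file names, a collision could be triggered if some path segment was once committed as a plain file.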
What you expected to happen?:
That the pipeline runs smoothly and processes the newly added file without any problems, and switches to success status.
How to reproduce it (as minimally and precisely as possible)?:
No idea ... really sorry about that.
Thank you @amandinesoub for opening this issue. I see you are currently on an unsupported version of Pachyderm (2.0.6); as a first step, can you please upgrade to the latest version (2.4.0)?
You should not have to run pachctl start pipeline either. When you put new data into the input repo, the pipeline will start automatically.
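For reference, a minimal sketch of that workflow, assuming the input repo is dump_file on branch master (as suggested by the input dump_file:/**.pdf shown in pachctl list pipeline; the file names here are placeholders):

```shell
# Add one new file to the input repo; no "start pipeline" needed.
# (dump_file@master and the paths are assumptions; adjust to your setup.)
pachctl put file dump_file@master:/PARIS/2019/new_report.pdf -f ./new_report.pdf

# The new commit triggers a job automatically; watch its state here:
pachctl list job
```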
Environment?:
Kubernetes version (kubectl version):
Pachyderm version (pachctl version):
Cloud provider (e.g. aws, azure, gke) or local deployment (e.g. minikube vs dockerized k8s): GKE
If you deployed with helm, the values you used (helm get values pachyderm):
OS: Debian
Your help would be greatly appreciated. Thank you in advance!