
Restore fails if a value in the deployment has colon #370

Closed
AndrewSav opened this issue Jun 4, 2021 · 14 comments · Fixed by #404
Assignees
Labels
component:operator type:bug Something isn't working

Comments

@AndrewSav

AndrewSav commented Jun 4, 2021

ISSUE TYPE
  • Bug Report
SUMMARY

In the AWX object spec I have ingress_annotations: 'cert-manager.io/cluster-issuer: letsencrypt', a value that contains a colon. In this case the AWX object does not serialize correctly and the restore fails with "unknown playbook failure".

ENVIRONMENT
  • AWX version: 19.2.0
  • Operator version: 0.10.0
  • Kubernetes version: v1.21.0+k3s1
  • AWX install method: kubernetes
STEPS TO REPRODUCE

Backup:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-manual-2021-06-04-01
spec:
  deployment_name: awx
  postgres_label_selector: app.kubernetes.io/instance=postgres-awx
  backup_pvc: manual-claim

Restore:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: awxrestore-init
spec:
  deployment_name: awx
  backup_pvc: manual-claim
  backup_dir: '/backups/tower-openshift-backup-2021-06-04-11:42:15'
  backup_pvc_namespace: default
EXPECTED RESULTS

Restore should succeed

ACTUAL RESULTS

Restore fails with "unknown playbook failure".

ADDITIONAL INFORMATION

This is what persisted awx_object looks like:

{admin_user: admin, api_version: awx.ansible.com/v1beta1, create_preload_data: True, deployment_type: awx, garbage_collect_secrets: False, hostname: nevermind.domain.tld, image_pull_policy: IfNotPresent, ingress_annotations: cert-manager.io/cluster-issuer: letsencrypt, ingress_tls_secret: awx-le-certificate, ingress_type: Ingress, kind: AWX, loadbalancer_port: 80, loadbalancer_protocol: http, projects_persistence: False, projects_storage_access_mode: ReadWriteMany, projects_storage_size: 8Gi, replicas: 1, route_tls_termination_mechanism: Edge, task_privileged: False}

Note how there are no quotes around cert-manager.io/cluster-issuer: letsencrypt. The reason is that this line does not quite work the way the author intended.

Suggested fix:

bash -c 'echo "$0" > {{ backup_dir }}/awx_object' {{ awx_spec|quote }}
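The suggested fix can be sketched in Python (a hypothetical reproduction, not the operator's actual code): subprocess and sh stand in for the Ansible shell step, and shlex.quote applies the same POSIX quoting as Ansible's quote filter, so the spec survives the shell verbatim while the unquoted echo strips the inner quotes.

```python
# Hypothetical sketch of the bug and the suggested fix; shlex.quote is a
# stand-in for Ansible's `quote` Jinja2 filter.
import shlex
import subprocess

awx_spec = "{ingress_annotations: 'cert-manager.io/cluster-issuer: letsencrypt'}"

# Unquoted (the buggy behavior): the shell performs word splitting and
# quote removal, so the inner single quotes are stripped from the value.
mangled = subprocess.run(
    f"echo {awx_spec}", shell=True, capture_output=True, text=True
).stdout.strip()

# Quoted (the suggested fix): the spec is passed as a single argument and
# echoed back verbatim via "$0", so the quotes around the colon survive.
fixed = subprocess.run(
    f"sh -c 'echo \"$0\"' {shlex.quote(awx_spec)}",
    shell=True, capture_output=True, text=True,
).stdout.strip()
```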
AWX-OPERATOR LOGS

Can be found here: https://gist.github.com/AndrewSav/e9e73d9b4ab19341bf4926707ec52540

@rooftopcellist rooftopcellist added priority:medium type:bug Something isn't working and removed priority:medium labels Jun 8, 2021
@WeetA34

WeetA34 commented Jun 9, 2021

Hello,
It also fails if you define extra environment variables in task_extra_env and/or web_extra_env.

The following AWX definition:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  admin_email: xxx.yyy@domain.tld
  admin_user: admin
  hostname: host.domain.tld
  ingress_tls_secret: xxx-tls
  ingress_type: Ingress
  task_extra_env: |
    - name: http_proxy
      value: "http://proxy.domain.tld:3128"
    - name: https_proxy
      value: "http://proxy.domain.tld:3128"
    - name: no_proxy
      value: "localhost,127.0.0.1,.domain.tld,.cluster.local"
  web_extra_env: |
    - name: http_proxy
      value: "http://proxy.domain.tld:3128"
    - name: https_proxy
      value: "http://proxy.domain.tld:3128"
    - name: no_proxy
      value: "localhost,127.0.0.1,.domain.tld,.cluster.local"

results in the following awx_object in the backup:

{admin_email: xxx.yyy@domain.tld, admin_user: admin, api_version: awx.ansible.com/v1beta1, create_preload_data: True, deployment_type: awx, garbage_collect_secrets: False, hostname: host.domain.tld, image_pull_policy: IfNotPresent, ingress_tls_secret: xxx-tls, ingress_type: Ingress, kind: AWX, loadbalancer_port: 80, loadbalancer_protocol: http, projects_persistence: False, projects_storage_access_mode: ReadWriteMany, projects_storage_size: 8Gi, replicas: 1, route_tls_termination_mechanism: Edge, task_extra_env: - name: http_proxyn value: http://proxy.domain.tld:3128/n- name: https_proxyn value: http://proxy.domain.tld:3128n- name: no_proxyn value: localhost,127.0.0.1,.domain.tld,.cluster.localn, task_privileged: False, web_extra_env: - name: http_proxyn value: http://proxy.domain.tld:3128/n- name: https_proxyn value: http://proxy.domain.tld:3128/n- name: no_proxyn value: localhost,127.0.0.1,.domain.tld,cluster.localn}

As you can see, task_extra_env and web_extra_env are not properly serialized:

  • the backslash of each \n newline escape has been removed
  • the YAML indentation is lost

Regards

@AndrewSav
Author

Another issue is that templating a dictionary like this does not necessarily result in valid YAML. E.g.:

- set_fact:
    bla:
      extra_volumes: |
        - name: myvolume
          persistentVolumeClaim:
            claimName: manual-claim
- template:
    src: test.yml
    dest: /full/path/to/written.txt

where test.yml has {{ bla }}

This will template into 'extra_volumes': '- name: myvolume\n persistentVolumeClaim:\n claimName: manual-claim\n', which is not equivalent YAML, so it will fail.
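A minimal illustration of this (assuming Jinja2's default behavior of stringifying a dict via Python's repr):

```python
# Sketch: templating `{{ bla }}` stringifies the dict with Python's repr,
# which escapes real newlines as the two characters `\` and `n` and uses
# repr quoting rather than YAML quoting.
bla = {
    "extra_volumes": (
        "- name: myvolume\n"
        "  persistentVolumeClaim:\n"
        "    claimName: manual-claim\n"
    )
}
rendered = str(bla)  # roughly what `{{ bla }}` writes to the file

# The multi-line YAML value collapses onto one line with literal
# backslash-n sequences, so it no longer parses back to the original spec.
```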

Additionally, this line can also cause problems: the documentation says "If you need variable interpolation in copied files, use the ansible.builtin.template module. Using a variable in the content field will result in unpredictable output." And indeed, if you try to write {'a':'b'} to a file with content from a variable, {"a": "b"} gets written instead. I'm not sure this particular behavior is causing problems right now, but this use of content is best avoided to make sure we do not run into problems going forward.
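The quote swap can be reproduced in isolation (a sketch only: ast.literal_eval and json.dumps are stand-ins for Ansible's YAML flow-mapping parse and its JSON serialization on write, not its actual code path):

```python
# Sketch of the quote swap described above: YAML-style {'a':'b'} is parsed
# into a dict, and writing the dict back out serializes it as JSON.
import ast
import json

content = "{'a':'b'}"              # what the playbook author wrote
parsed = ast.literal_eval(content)  # stand-in for the flow-mapping parse
written = json.dumps(parsed)        # stand-in for the serialization on write
# `written` now uses double quotes: {"a": "b"}, not the original text.
```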

@AndrewSav
Author

AndrewSav commented Jun 17, 2021

That does not fix the second scenario, where the spec contains \n, as described above. Should I open a separate issue? @Spredzy

@Spredzy
Copy link
Collaborator

Spredzy commented Jun 17, 2021

Sorry, I probably missed it; I was focused on the top message. Please do open one. I'll work on it.

@AndrewSav
Author

I cannot open the issue right now, because with the latest release there is a new issue that is preventing me from testing reliably.

Please let me know if anything in this issue is unclear; I'll add more details if that's the case.

@Spredzy
Collaborator

Spredzy commented Jun 18, 2021

@AndrewSav a fix has landed for the above issue (the extra space). If you pull devel you should be good to go.

@rooftopcellist
Member

I can reproduce the newline issue you mentioned. I will look into it and report my findings (or PR) here.

@AndrewSav
Author

@Zokormazo Why was it closed?

@Zokormazo
Member

Sorry, I should have explained this while closing:

The colon issue was fixed in #404.
The extra space on postgres was fixed in #411.
The lost newlines were fixed in #427.

I tested with a backup including TLS secrets (newlines) and a custom private EE image (indentation and colons), and everything worked for me.

Do you have any other issues with this process? Feel free to reopen this issue or open a new one.

@AndrewSav
Author

Do you have any other issue on this process?

Nope, it's just because without the explanation above it was not clear why it was closed. Now it is, thank you.

@Zokormazo
Member

Yeah, sorry about that :)

@rooftopcellist
Member

@Zokormazo I ran through that one more time with this in the spec; I suspect there is still an issue here.

spec:
    extra_volumes: |
      - name: myvolume
        persistentVolumeClaim:
          claimName: manual-claim

The extra_volumes passed via the spec gets added to the deployment (using a PVC I created manually in this namespace).
This is what the applied spec looks like on the original AWX deployment:

  spec:
    admin_email: admin@localhost
    admin_user: admin
    create_preload_data: true
    development_mode: false
    extra_volumes: |
      - name: myvolume
        persistentVolumeClaim:
          claimName: manual-claim

But it does fail at restore time:

yaml.scanner.ScannerError: mapping values are not allowed here\n  in \"<unicode string>\", line 217, column 66:\n     ... stentVolumeClaim:\\n    claimName: manual-claim\\n\n                                         ^\n", "module_stdout": "", "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error", "rc": 1}[0m

This is what the spec looks like on the restored deployment:

  spec:
    admin_email: admin@localhost
    admin_password_secret: awx-admin-password
    admin_user: admin
    broadcast_websocket_secret: awx-broadcast-websocket
    create_preload_data: true
    development_mode: false
    extra_volumes: '- name: myvolume\n  persistentVolumeClaim:\n    claimName: manual-claim\n'

@kdelee
Member

kdelee commented Jun 28, 2021

Sounds like @Zokormazo and @Spredzy consider this a release blocker: even if we fixed it in 4.0.1, backups from 4.0.0 would be corrupted. It is generally a good idea to take a backup before upgrading to 4.0.1, so if the upgrade failed and the backup was unusable, the user would be out of luck.

@Zokormazo
Member

Tested with some extra_envs, and now it works properly.
