Make LVM snapshot default when no issues get reported #1986

Open
szaimen opened this issue May 22, 2021 · 51 comments · Fixed by #2510

Comments

@szaimen
Collaborator

szaimen commented May 22, 2021

This is just a reminder so that we don't forget to make the LVM snapshot the default when no issues get reported.

check_command lvcreate --size 5G --name "NcVM-installation" ubuntu-vg

After we do this, everyone will be able to use the built-in backup solution.

@enoch85
Member

enoch85 commented May 22, 2021

I'd say once development on that part has stopped for some time because it's rock solid, then maybe. :)

@szaimen
Collaborator Author

szaimen commented May 22, 2021

I'd say it is already pretty stable but yeah

@enoch85
Member

enoch85 commented Jul 31, 2021

We could do this for Ubuntu 22.04 making the OS disk 45 GB in size, or extend the drive so there's only 5 GB left and keep it 40 GB in total size.

Would that work?

@enoch85
Member

enoch85 commented Jul 31, 2021

cc @small1

@szaimen
Collaborator Author

szaimen commented Aug 2, 2021

We could do this for Ubuntu 22.04 making the OS disk 45 GB in size, or extend the drive so there's only 5 GB left and keep it 40 GB in total size.

Would that work?

From my side, yes 👍

@enoch85
Member

enoch85 commented Oct 12, 2021

Just tested on one of my prod instances. I don't think this is stable enough:

Last login: Thu Oct  7 12:49:11 2021 from blablabla
root@cloud:~# bash /var/scripts/update.sh 
Posting notification to users that are admins, this might take a while...
Posting 'Update script started!' to: enoch85
Warning: Stopping docker.service, but it can still be activated by:
  docker.socket
Maintenance mode enabled
  Logical volume ubuntu-vg/NcVM-snapshot is used by another device.
Maintenance mode disabled
Starting docker...
Posting notification to users that are admins, this might take a while...
Posting 'Update failed!' to: enoch85

@szaimen
Collaborator Author

szaimen commented Oct 12, 2021

Logical volume ubuntu-vg/NcVM-snapshot is used by another device.

Honestly, I've never seen this issue. What did you do before this issue appeared?
Did you reinstall Ubuntu from scratch and choose to add the partition in the install script?

@szaimen
Collaborator Author

szaimen commented Oct 12, 2021

Did you reinstall Ubuntu from scratch and choose to add the partition in the install script?

I've been running this setup this way for half a year or longer without any issues...

@szaimen
Collaborator Author

szaimen commented Oct 12, 2021

Or in other words: what are the steps to reproduce this issue?

@enoch85
Member

enoch85 commented Oct 13, 2021

Did you reinstall Ubuntu from scratch and choose to add the partition in the install script?

Yes, since the company was sold, we moved the whole thing to a new server with a fresh install and an export/import of the DB and so on. So it's installed by the book, "your way".

Or in other words: what are the steps to reproduce this issue?

I don't know. I just ran an update yesterday and it happened. No automatic updates either.

@szaimen
Collaborator Author

szaimen commented Oct 13, 2021

Thanks! So then I will try to investigate how this could happen :)

@szaimen
Collaborator Author

szaimen commented Oct 31, 2021

Does it happen every time you run the update script?

@szaimen
Collaborator Author

szaimen commented Oct 31, 2021

Could be a bug with lvm...
https://blog.roberthallam.org/2017/12/solved-logical-volume-is-used-by-another-device/

Could you please try the following commands and post the output of those here (if it should still happen)?

lvremove -v /dev/ubuntu-vg/NcVM-snapshot

dmsetup info -c | grep NcVM | grep snapshot

# more to come when we have more info based on the guide linked above

@szaimen
Collaborator Author

szaimen commented Nov 18, 2021

@enoch85 do you have some feedback here? It is hard to debug without a way to reproduce this issue...

@enoch85
Member

enoch85 commented Nov 19, 2021

As it's not in the released version yet, please add a PR with the fix you proposed, and I'll run one of the auto update VMs with the new setup.

@szaimen
Collaborator Author

szaimen commented Nov 19, 2021

I can try. But after reading through the code, did you try to reboot the affected server once after you got the notification that the update failed because of the failed lvremove?
image

@enoch85
Member

enoch85 commented Nov 19, 2021

I've only seen this once, and I'm not sure if the server was rebooted or not.

If you think it can be improved, then do so, else leave it for now.

@szaimen
Collaborator Author

szaimen commented Nov 19, 2021

Thanks for the feedback!
Honestly, since I still think that this is a bug in LVM itself, I don't think I can improve the logic/code. I could try to work around the symptoms but not solve the issue itself. So a reboot is probably still the best option in this case.
Since you only saw this once, I think it's fine, though. Do you agree?

@enoch85
Member

enoch85 commented Nov 19, 2021

I'm still not convinced it should be the default for the VM. It's one more thing that could break, and we want to keep those events limited.

@enoch85
Member

enoch85 commented Jan 29, 2022

It happened again.

  1. Run menu.sh --> minor
  2. It finished as expected
  3. Run menu.sh --> update again

image

image

@enoch85
Member

enoch85 commented Jan 29, 2022

Some debug output:

Posting 'Update script started!' to: enoch85
++ hostname -f
+ nextcloud_occ_no_check notification:generate -l 'The update script in the Nextcloud VM has been executed.
You will be notified when the update is done.
Please don'\''t shutdown or restart your server until then.' enoch85 'cloud.hanssonit.se: Update script started!'
+ sudo -u www-data php /var/www/nextcloud/occ notification:generate -l 'The update script in the Nextcloud VM has been executed.
You will be notified when the update is done.
Please don'\''t shutdown or restart your server until then.' enoch85 'cloud.hanssonit.se: Update script started!'
+ check_free_space
+ vgs
++ vgs
++ grep ubuntu-vg
++ awk '{print $7}'
++ grep -oP '[0-9]+\.[0-9]'
++ sed 's|\.||'
++ grep g
+ FREE_SPACE=
+ '[' -z '' ']'
+ FREE_SPACE=0
+ '[' -f /var/scripts/nextcloud-startup-script.sh ']'
+ does_snapshot_exist NcVM-startup
+ local SNAPSHOTS
+ local snapshot
+ lvs
++ lvs
++ grep ubuntu-vg
++ awk '{print $1}'
++ grep -v ubuntu-lv
+ SNAPSHOTS=NcVM-snapshot
+ '[' -z NcVM-snapshot ']'
+ mapfile -t SNAPSHOTS
+ for snapshot in "${SNAPSHOTS[@]}"
+ '[' NcVM-snapshot = NcVM-startup ']'
+ return 1
+ does_snapshot_exist NcVM-snapshot
+ local SNAPSHOTS
+ local snapshot
+ lvs
++ lvs
++ grep ubuntu-vg
++ awk '{print $1}'
++ grep -v ubuntu-lv
+ SNAPSHOTS=NcVM-snapshot
+ '[' -z NcVM-snapshot ']'
+ mapfile -t SNAPSHOTS
+ for snapshot in "${SNAPSHOTS[@]}"
+ '[' NcVM-snapshot = NcVM-snapshot ']'
+ return 0
+ '[' -f /var/scripts/daily-borg-backup.sh ']'
+ crontab -u root -l
+ grep -v 'lvrename /dev/ubuntu-vg/NcVM-snapshot-pending'
+ crontab -u root -
+ crontab -u root -l
+ cat
+ crontab -u root -
+ echo '@reboot /usr/sbin/lvrename /dev/ubuntu-vg/NcVM-snapshot-pending /dev/ubuntu-vg/NcVM-snapshot &>/dev/null'
+ SNAPSHOT_EXISTS=1
+ is_docker_running
+ docker ps -a
+ check_command systemctl stop docker
+ systemctl stop docker
Warning: Stopping docker.service, but it can still be activated by:
  docker.socket
+ nextcloud_occ maintenance:mode --on
+ check_command sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on
+ sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on
Maintenance mode enabled
+ does_snapshot_exist NcVM-startup
+ local SNAPSHOTS
+ local snapshot
+ lvs
++ lvs
++ grep ubuntu-vg
++ awk '{print $1}'
++ grep -v ubuntu-lv
+ SNAPSHOTS=NcVM-snapshot
+ '[' -z NcVM-snapshot ']'
+ mapfile -t SNAPSHOTS
+ for snapshot in "${SNAPSHOTS[@]}"
+ '[' NcVM-snapshot = NcVM-startup ']'
+ return 1
+ does_snapshot_exist NcVM-snapshot
+ local SNAPSHOTS
+ local snapshot
+ lvs
++ lvs
++ grep ubuntu-vg
++ awk '{print $1}'
++ grep -v ubuntu-lv
+ SNAPSHOTS=NcVM-snapshot
+ '[' -z NcVM-snapshot ']'
+ mapfile -t SNAPSHOTS
+ for snapshot in "${SNAPSHOTS[@]}"
+ '[' NcVM-snapshot = NcVM-snapshot ']'
+ return 0
+ lvremove /dev/ubuntu-vg/NcVM-snapshot -y
  Logical volume ubuntu-vg/NcVM-snapshot is used by another device.
+ nextcloud_occ maintenance:mode --off
+ check_command sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off
+ sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off
Maintenance mode disabled
+ start_if_stopped docker
+ pgrep docker
+ print_text_in_color '\e[0;96m' 'Starting docker...'
+ printf '%b%s%b\n' '\e[0;96m' 'Starting docker...' '\e[0m'
Starting docker...
+ systemctl start docker.service
++ date +%T
+ notify_admin_gui 'Update failed!' 'Could not remove NcVM-snapshot - Please reboot your server! 13:29:33'
+ local NC_USERS
+ local user
+ local admin
+ is_app_enabled notifications
+ sed '/Disabled/,$d'
+ awk '{print$2}'
+ nextcloud_occ app:list
+ check_command sudo -u www-data php /var/www/nextcloud/occ app:list
+ sudo -u www-data php /var/www/nextcloud/occ app:list
+ sed '/^$/d'
+ grep -q '^notifications$'
+ tr -d :
+ return 0
+ print_text_in_color '\e[0;96m' 'Posting notification to users that are admins, this might take a while...'
+ printf '%b%s%b\n' '\e[0;96m' 'Posting notification to users that are admins, this might take a while...' '\e[0m'
Posting notification to users that are admins, this might take a while...
+ send_mail 'Update failed!' 'Could not remove NcVM-snapshot - Please reboot your server! 13:29:33'
+ local RECIPIENT
+ '[' -f /etc/msmtprc ']'
+ return 1
+ '[' -z enoch85 ']'
+ for admin in "${NC_ADMIN_USER[@]}"
+ print_text_in_color '\e[0;92m' 'Posting '\''Update failed!'\'' to: enoch85'
+ printf '%b%s%b\n' '\e[0;92m' 'Posting '\''Update failed!'\'' to: enoch85' '\e[0m'
Posting 'Update failed!' to: enoch85
++ hostname -f
+ nextcloud_occ_no_check notification:generate -l 'Could not remove NcVM-snapshot - Please reboot your server! 13:29:33' enoch85 'cloud.hanssonit.se: Update failed!'
+ sudo -u www-data php /var/www/nextcloud/occ notification:generate -l 'Could not remove NcVM-snapshot - Please reboot your server! 13:29:33' enoch85 'cloud.hanssonit.se: Update failed!'
+ msg_box 'It seems like the old snapshot could not get removed.
This should work again after a reboot of your server.'
+ '[' -n '' ']'
+ whiptail --title 'Nextcloud VM - 2022 - Nextcloud Update Script' --msgbox 'It seems like the old snapshot could not get removed.
This should work again after a reboot of your server.' '' ''
+ exit 1
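
For readers following the trace: `does_snapshot_exist` appears to list the logical volumes in `ubuntu-vg`, drop `ubuntu-lv`, and compare each remaining name with its argument. The sketch below is a reconstruction from the trace only, not the function from the repository. It reads `lvs`-style text from stdin so that it can be tried without LVM, and the sample rows are illustrative rather than taken from the affected server.

```shell
#!/usr/bin/env bash
# Reconstruction from the trace above (an assumption, not the repository code).
# Reads `lvs`-style output on stdin; returns 0 if an LV named $1 is listed.
does_snapshot_exist_sketch() {
    local wanted="$1"
    local names snapshot
    local -a snapshots
    # Keep the LV name column for ubuntu-vg and ignore the root volume.
    names="$(grep ubuntu-vg | awk '{print $1}' | grep -v ubuntu-lv || true)"
    if [ -z "$names" ]
    then
        return 1
    fi
    mapfile -t snapshots <<< "$names"
    for snapshot in "${snapshots[@]}"
    do
        if [ "$snapshot" = "$wanted" ]
        then
            return 0
        fi
    done
    return 1
}

# Illustrative input; on a real system this would be the output of `lvs`.
sample_lvs='  NcVM-snapshot ubuntu-vg swi-a-s--- 5.00g ubuntu-lv 0.01
  ubuntu-lv     ubuntu-vg owi-aos--- 31.00g'

if does_snapshot_exist_sketch NcVM-snapshot <<< "$sample_lvs"
then
    echo "NcVM-snapshot exists"
fi
```

Because the check is purely name-based, the same name can double as a lock: as long as `NcVM-snapshot` is listed, the update script takes the removal path seen at the end of the trace.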

@szaimen
Collaborator Author

szaimen commented Jan 29, 2022

Thanks for the verbose output! Please try the following and report back:

lvremove -v /dev/ubuntu-vg/NcVM-snapshot

dmsetup info -c | grep NcVM | grep snapshot

# more to come when we have more info based on the guide linked above

@enoch85
Member

enoch85 commented Jan 29, 2022

Already rebooted ;/

@szaimen
Collaborator Author

szaimen commented Jan 29, 2022

Already rebooted ;/

hm :/

@szaimen
Collaborator Author

szaimen commented Jan 29, 2022

Once I have the output, it should take only one command to remove the blocking device, and afterwards lvremove should finally work :)
That would be a better fix than rebooting, and one we could automate in case lvremove fails :)

@enoch85
Member

enoch85 commented Jan 29, 2022

root@cloud:~# ls -la /sys/dev/block/253\:3/holders
total 0
drwxr-xr-x 2 root root 0 jan 29 13:33 .
drwxr-xr-x 9 root root 0 jan 29 13:33 ..
root@cloud:~# ls -la /sys/dev/block/253\:2/holders
total 0
drwxr-xr-x 2 root root 0 jan 29 13:33 .
drwxr-xr-x 9 root root 0 jan 29 13:33 ..
lrwxrwxrwx 1 root root 0 jan 29 14:29 dm-3 -> ../../dm-3
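
The `holders` directories in sysfs are where the kernel exposes which devices sit on top of a block device. In the listing above, 253:2 is held by dm-3, while 253:3 has no holders. A small helper along these lines could print the holders for a given major:minor pair. The function name and the optional sysfs-root parameter are made up for illustration; the second parameter only exists so the helper can be tried against a scratch directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the devices that hold a block device,
# based on the sysfs layout shown in the listing above.
# $1 = major:minor (e.g. 253:2)
# $2 = sysfs root (defaults to /sys; overridable for testing)
list_holders() {
    local majmin="$1"
    local sys_root="${2:-/sys}"
    local dir="$sys_root/dev/block/$majmin/holders"
    local entry
    [ -d "$dir" ] || return 1
    for entry in "$dir"/*
    do
        # An empty directory leaves the glob unexpanded, so check existence.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        basename "$entry"
    done
}

# Assumed usage on the affected server:
# list_holders 253:2
```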

@szaimen
Collaborator Author

szaimen commented Jan 29, 2022

Thanks! After running the following commands, the removal should work. Please report back!

dmsetup remove /dev/dm-3
lvremove -v /dev/ubuntu-vg/NcVM-snapshot

@szaimen
Collaborator Author

szaimen commented Jan 29, 2022

If that works, I will try to come up with a PR that fixes this once and for all :)

@enoch85
Member

enoch85 commented Jan 29, 2022

root@cloud:~# dmsetup remove /dev/dm-3
device-mapper: remove ioctl on ubuntu--vg-NcVM--snapshot  failed: Device or resource busy
Command failed.
root@cloud:~# lvremove -v /dev/ubuntu-vg/NcVM-snapshot
  Logical volume ubuntu-vg/NcVM-snapshot in use.

Thanks for looking into this!

@enoch85
Member

enoch85 commented Jan 29, 2022

Still same issue! :(

@szaimen
Collaborator Author

szaimen commented Jan 29, 2022

I really hoped that this would solve the problem. (I followed the steps from the guide that I linked above.) Seems like there is unfortunately still no way around a restart then :(

@enoch85
Member

enoch85 commented Jan 29, 2022

Don't know how "safe" this is, but it works:

root@cloud:~# dmsetup remove /dev/dm-3
device-mapper: remove ioctl on ubuntu--vg-NcVM--snapshot  failed: Device or resource busy
Command failed.
root@cloud:~# dmsetup info -c
Name                          Maj Min Stat Open Targ Event  UUID                                                                     
ubuntu--vg-NcVM--snapshot     253   3 L--w    1    1      2 LVM-k9Rc3WOCi8FftbHl00Er0pzO7k7Kpttkwe5oq1zHuHZW7Ia6auXkP4fS59G1HaSX     
ubuntu--vg-NcVM--snapshot-cow 253   2 L--w    1    1      2 LVM-k9Rc3WOCi8FftbHl00Er0pzO7k7Kpttkwe5oq1zHuHZW7Ia6auXkP4fS59G1HaSX-cow 
ubuntu--vg-ubuntu--lv         253   1 L--w    1    1      0 LVM-k9Rc3WOCi8FftbHl00Er0pzO7k7KpttknN5hoWjgMNGj3HexKoMt4aoQbfRfCVu7     
ubuntu--vg-ubuntu--lv-real    253   0 L--w    2    2      0 LVM-k9Rc3WOCi8FftbHl00Er0pzO7k7KpttknN5hoWjgMNGj3HexKoMt4aoQbfRfCVu7-real
root@cloud:~# fuser -m ubuntu--vg-NcVM--snapshot
Specified filename ubuntu--vg-NcVM--snapshot does not exist.
root@cloud:~# fuser -m /dev/ubuntu-vg/NcVM-snapshot
/dev/dm-3:            6113
root@cloud:~# kill -9 6113
root@cloud:~# lvremove -v /dev/ubuntu-vg/NcVM-snapshot
Do you really want to remove and DISCARD active logical volume ubuntu-vg/NcVM-snapshot? [y/n]: y
  Accepted input: [y]
  Archiving volume group "ubuntu-vg" metadata (seqno 37).
  Removing snapshot volume ubuntu-vg/NcVM-snapshot.
  Loading table for ubuntu--vg-ubuntu--lv (253:1).
  Loading table for ubuntu--vg-NcVM--snapshot (253:3).
  Not monitoring ubuntu-vg/NcVM-snapshot with libdevmapper-event-lvm2snapshot.so
  Unmonitored LVM-k9Rc3WOCi8FftbHl00Er0pzO7k7Kpttkwe5oq1zHuHZW7Ia6auXkP4fS59G1HaSX for events
  Suspending ubuntu--vg-ubuntu--lv (253:1) with device flush
  Suspending ubuntu--vg-NcVM--snapshot (253:3) with device flush
  Suspending ubuntu--vg-ubuntu--lv-real (253:0) with device flush
  Suspending ubuntu--vg-NcVM--snapshot-cow (253:2) with device flush
  activation/volume_list configuration setting not defined: Checking only host tags for ubuntu-vg/NcVM-snapshot.
  Resuming ubuntu--vg-NcVM--snapshot-cow (253:2).
  Resuming ubuntu--vg-ubuntu--lv-real (253:0).
  Resuming ubuntu--vg-NcVM--snapshot (253:3).
  Resuming ubuntu--vg-ubuntu--lv (253:1).
  Removing ubuntu--vg-ubuntu--lv-real (253:0)
  Removing ubuntu--vg-NcVM--snapshot (253:3)
  Removing ubuntu--vg-NcVM--snapshot-cow (253:2)
  Releasing logical volume "NcVM-snapshot"
  Creating volume group backup "/etc/lvm/backup/ubuntu-vg" (seqno 39).
  Logical volume "NcVM-snapshot" successfully removed
root@cloud:~# 

https://antnix07.blogspot.com/2018/02/lvmdevice-mapper-remove-ioctl-on-failed.html

@enoch85
Member

enoch85 commented Jan 29, 2022

Resulted in this error: df: /var/lib/os-prober/mount: Transport endpoint is not connected and no snapshot was made, so not safe...
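
Since killing the process that held the snapshot open had side effects here, a more cautious variant might only report whether the device is still open and leave the decision to the admin. The sketch below reads the `Open` column (the fifth column in the `dmsetup info -c` output earlier in this thread). The function name is hypothetical, and nothing is removed or killed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Open count of a device-mapper device.
# Reads `dmsetup info -c` style output on stdin; $1 = device-mapper name.
dm_open_count() {
    awk -v name="$1" '$1 == name { print $5 }'
}

# Rows copied from the output earlier in this thread (UUID column shortened).
sample='Name                          Maj Min Stat Open Targ Event  UUID
ubuntu--vg-NcVM--snapshot     253   3 L--w    1    1      2 LVM-...
ubuntu--vg-NcVM--snapshot-cow 253   2 L--w    1    1      2 LVM-...-cow'

count="$(dm_open_count ubuntu--vg-NcVM--snapshot <<< "$sample")"
if [ "${count:-0}" -gt 0 ]
then
    echo "snapshot is still open ($count), skipping lvremove"
fi
```

On a real system the input would come from `dmsetup info -c | dm_open_count ubuntu--vg-NcVM--snapshot`. Whether a non-zero count always means that a reboot is needed is not something this thread establishes.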

@enoch85
Member

enoch85 commented Jan 29, 2022

Hehe, that command removed the whole snapshot thing. It never does snapshots anymore after that. :D

@szaimen
Collaborator Author

szaimen commented Jan 29, 2022

Yeah, so I still think that this is some kind of issue in LVM. I hope that this will be solved with Ubuntu 22.04, but I guess we'll see.
For now I am preparing a PR that highlights the steps to increase the root partition manually in order to be able to use the backup script...

@szaimen
Collaborator Author

szaimen commented Jan 29, 2022

It never does snapshots anymore after that. :D

Nice to know, if that's desired ;)

@enoch85
Member

enoch85 commented Mar 26, 2022

@szaimen Maybe if we name the snapshot to something random like NC_snapshot_XYZ123? Then it wouldn't conflict with other snapshots on the system.

@szaimen
Collaborator Author

szaimen commented Mar 26, 2022

@szaimen Maybe if we name the snapshot to something random like NC_snapshot_XYZ123? Then it wouldn't conflict with other snapshots on the system.

The problem is that we use that name for locking, so that no further backup or update, for example, gets started while that snapshot exists. Changing it would probably need some (or a lot of) refactoring. Also, this is clearly a bug in LVM if snapshot removal is not working, so I'd rather wait for a new Ubuntu version and hope that it is fixed there.

@enoch85
Member

enoch85 commented Apr 2, 2023

Ubuntu 22.04 is out, same thing still...

How should we proceed? Leaning towards closing and removing the function altogether. Or, add a notification that you need to reboot.

@szaimen
Collaborator Author

szaimen commented Apr 10, 2023

Leaning towards closing and removing the function altogether.

That would break all backup scripts so please don't do this.

Or, add a notification that you need to reboot.

Sounds good but there is one already here?

vm/nextcloud_update.sh

Lines 156 to 164 in f1ff45f

if ! lvremove /dev/ubuntu-vg/NcVM-snapshot -y
then
    nextcloud_occ maintenance:mode --off
    start_if_stopped docker
    notify_admin_gui "Update failed!" \
    "Could not remove NcVM-snapshot - Please reboot your server! $(date +%T)"
    msg_box "It seems like the old snapshot could not get removed.
This should work again after a reboot of your server."
    exit 1

@szaimen
Collaborator Author

szaimen commented Apr 10, 2023

Should we maybe improve the notification somehow?

@enoch85
Member

enoch85 commented Apr 10, 2023

I vote for improving the notification, and forcing a reboot if LVM snapshots are made.

@szaimen
Collaborator Author

szaimen commented Apr 11, 2023

I vote for improving the notification

Do you have a suggestion for how to improve it?

and forcing a reboot if LVM snapshots are made.

So directly after the update script then or when should the reboot happen?

@enoch85
Member

enoch85 commented Apr 11, 2023

So directly after the update script then or when should the reboot happen?

Yes, if LVM snapshot is enabled and a snapshot exists.

Do you have a suggestion for how to improve it?

Just add another msg_box before installation that "warns" the user that updates will be forced. Make it clear. 👍

@thatstheplace

Hello,

I'm getting this on a new install on Ubuntu 22.04.2 LTS when choosing yes for LVM snapshots, right at the second question.

nextcloud_install_production.sh: line 113: [: 18,4: integer expression expected
Could not create volume because of insufficient space...

nextcloud_install_production.sh: line 113: [: 50,4: integer expression expected
Could not create volume because of insufficient space...

@szaimen
Collaborator Author

szaimen commented Jun 22, 2023

Hi, can you post the output of sudo vgs | grep ubuntu-vg?

@thatstheplace

thatstheplace commented Jun 22, 2023

I'm getting ubuntu-vg 1 1 0 wz--n- <36,95g 18,47g
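
The `vgs` output above uses a decimal comma (`18,47g`), while the error message shows a value like `18,4` reaching an integer comparison. One way to make such a check independent of the decimal separator is to normalize the value before comparing. The sketch below is only an illustration: the function name is hypothetical, and it is not necessarily what the fix in #2510 does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a VFree value such as "18,47g" or "<36.95g"
# into whole gigabytes. Values without a "g" suffix yield 0 (an assumption
# made here to keep the sketch short).
free_space_to_gb() {
    local raw="$1"
    case "$raw" in
        *[gG]) ;;
        *) echo 0; return 0 ;;
    esac
    raw="${raw#<}"       # LVM prefixes rounded values with "<"
    raw="${raw%[gG]}"    # drop the unit suffix
    raw="${raw/,/.}"     # accept a decimal comma as well as a decimal point
    raw="${raw%%.*}"     # keep the integer part only
    echo "${raw:-0}"
}

# VFree value taken from the output above.
FREE_SPACE="$(free_space_to_gb "18,47g")"
if [ "$FREE_SPACE" -ge 5 ]
then
    echo "enough free space for a 5G snapshot"
fi
```

Forcing the C locale and asking LVM for a plain number, for example `LC_ALL=C vgs --noheadings --units g --nosuffix -o vg_free ubuntu-vg`, might avoid the parsing step altogether; that has not been tested in this thread.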

@szaimen
Collaborator Author

szaimen commented Jun 22, 2023

Fix is in #2510

@szaimen szaimen reopened this Jun 23, 2023
@thatstheplace

thatstheplace commented Aug 23, 2023

Hi,

I know this option isn't the default yet, but when installing the VM with your scripts, you first ask whether all disk space should be used, and afterwards you try to create the snapshots. That fails because of insufficient disk space if the answer to the first question was yes.


3 participants