
I/O issues using flashcache, broken system files #193

andrey-minsky opened this issue Nov 15, 2014 · 10 comments

@andrey-minsky

Hello, we are using a 2x480GB SSD RAID0 as a cache for our VPS nodes, but with the latest versions we are seeing a lot of issues. Some example log entries:

[ 42.722313] Buffer I/O error on device xvda1, logical block 309394
[ 42.722317] lost page write due to I/O error on xvda1
[ 42.730988] end_request: I/O error, dev xvda1, sector 2475344
[ 42.739014] end_request: I/O error, dev xvda1, sector 2475600

The issue can be worked around by rebooting the node.

Also, Windows VPSes on a node with flashcache end up with broken system files after some time in use and no longer boot correctly - this is a very big problem for us.

@mohans
Contributor

mohans commented Nov 16, 2014

Are there any flashcache messages in the messages file around the time these issues occur?

What version of flashcache are you running? Are you running master, top of the tree?



@andrey-minsky
Author

cat /proc/flashcache/md0+scsi-3600605b0057db8401a4b346b2d4d3838-part2/flashcache_stats
reads=148535851 writes=107025354
read_hits=33246752 read_hit_percent=22 write_hits=77837038 write_hit_percent=72 replacement=22503167 write_replacement=8191551 write_invalidates=95064 read_invalidates=162421 pending_enqueues=152 pending_inval=152 no_room=0 disk_reads=115292063 disk_writes=78626210 ssd_reads=33247169 ssd_writes=219983065 uncached_reads=1543878 uncached_writes=787582 uncached_IO_requeue=0 uncached_sequential_reads=0 uncached_sequential_writes=0 pid_adds=0 pid_dels=0 pid_drops=0 pid_expiry=0

I am referring to the write_invalidates/read_invalidates counters as errors. We are using the master version, flashcache-3.1.1 (/lib/modules/3.16.1/extra/flashcache/flashcache.ko).

@mohans
Contributor

mohans commented Nov 16, 2014

Are there flashcache messages in /var/log/messages that show I/O errors from either the flash or the disk? Flashcache is returning EIO (or at least the application is reporting EIO). Is either the underlying disk or the flash returning EIO?

Can you paste the output from dmsetup status and dmsetup table?

Write Invalidate is not an error.
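
For reference, a minimal sketch of how the requested diagnostics could be collected (the specific flashcache device name is just an example):

# Device-mapper status and table for all mapped devices; a specific device
# can also be named, e.g. "dmsetup status cachedev".
dmsetup status
dmsetup table

# Check whether flashcache or the underlying disk/SSD logged I/O errors
# around the time of the failure.
grep -i flashcache /var/log/messages
dmesg | grep -i 'I/O error'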



@Kaydannik

We had the same errors on a setup whose HDD storage used a 4k block size.
The storage was an 18 TB array with 4k sectors; the SSD used the usual 512-byte sectors. As a result we kept getting errors like
end_request: I/O error, dev sda1, sector XXXXX and data loss.

We tried different settings at cache creation, different block sizes, etc., but the problem was only solved by moving back to 512-byte sectors on the HDD. (A quick way to compare the sector sizes is sketched below.)
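
For anyone checking their own setup, the logical and physical sector sizes of the HDD and the SSD can be compared with something like the following (device names are just examples):

# Logical sector size (--getss) and physical block size (--getpbsz)
blockdev --getss --getpbsz /dev/sda    # backing HDD/array
blockdev --getss --getpbsz /dev/sdb    # caching SSD

# Equivalent checks via sysfs or lsblk
cat /sys/block/sda/queue/logical_block_size /sys/block/sda/queue/physical_block_size
lsblk -o NAME,LOG-SEC,PHY-SEC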

@andrey-minsky
Author

Kaydannik, useful information, thanks, but our node is already in production and we cannot change the block size. Also, in older versions everything worked fine. Could this issue perhaps be fixed in a new version of flashcache? Or could we format the SSDs with the same block size as the HDDs?

@andrey-minsky
Author

We checked once again: our block sizes match and are both 512.

We do not have any flashcache errors in the logs. The requested output is attached:
http://185.4.64.3/dmsetup-status.txt
http://185.4.64.3/dmsetup-table.txt

@JQuags

JQuags commented Mar 3, 2015

I have a similar setup with flashcache running under KVM with Windows templates. With Linux templates the same settings work fine. The issue is that system files randomly become corrupted in write-through mode. Simply turning off cache_all solves the issue. Flashcache reports no errors. I am including the details below.

I will do a more detailed test soon with virtio on/off and different options in KVM/libvirt. But the same KVM settings do work on Linux guests.

version flashcache-3.1.1
kernel 2.6.32-504.8.1.el6.x86_64
mode writethrough

status
cachedev_vz: 0 3730019968 flashcache stats:
reads(123395096), writes(5656486)
read hits(934953), read hit percent(0)
write hits(13189) write hit percent(0)
replacement(11), write replacement(0)
write invalidates(39108), read invalidates(15214)
pending enqueues(30544), pending inval(30496)
no room(0)
disk reads(122466306), disk writes(5603184) ssd reads(934953) ssd writes(351888)
uncached reads(122180911), uncached writes(5589995), uncached IO requeue(6163)
disk read errors(0), disk write errors(0) ssd read errors(0) ssd write errors(0)
uncached sequential reads(118727551), uncached sequential writes(4630111)
pid_adds(0), pid_dels(0), pid_drops(0) pid_expiry(0)
lru hot blocks(26312960), lru warm blocks(26312960)
lru promotions(234809), lru demotions(0)

table
cachedev_vz: 0 3730019968 flashcache conf:
ssd dev (/dev/md131), disk dev (/dev/md127) cache mode(WRITE_THROUGH)
capacity(205570M), associativity(512), data block size(4K)
disk assoc(256K)
skip sequential thresh(64K)
total blocks(52625920), cached blocks(260033), cache percent(0)
nr_queued(0)
Size Hist: 512:23946 1024:125609 1536:39395 2048:43893 2560:36999 3072:121873 3584:17439 4096:128642416

stats
reads=123395136 writes=5656487
read_hits=934983 read_hit_percent=0 write_hits=13189 write_hit_percent=0 replacement=11 write_replacement=0 write_invalidates=39108 read_invalidates=15214 pending_enqueues=30544 pending_inval=30496 no_room=0 disk_reads=122466316 disk_writes=5603185 ssd_reads=934983 ssd_writes=351888 uncached_reads=122180921 uncached_writes=5589996 uncached_IO_requeue=6163 uncached_sequential_reads=118727551 uncached_sequential_writes=4630111 pid_adds=0 pid_dels=0 pid_drops=0 pid_expiry=0

disk_read_errors=0 disk_write_errors=0 ssd_read_errors=0 ssd_write_errors=0 memory_alloc_errors=0

@JQuags

JQuags commented Mar 4, 2015

A little update on my post above.

I can consistently reproduce the errors on Windows with flashcache enabled, but I also appear to have a solution. Setting a password or running Windows updates results in errors like this: http://i.is.cc/1yPQJu2r.png (it goes through on the second or third attempt).

My normal setup for Linux and Windows servers in the libvirt XML is

<driver name='qemu' type='raw' cache='none'/>

If I change this to

<driver name='qemu' type='raw' cache='writeback' threads='native'/>

I no longer see any errors in Windows guests with flashcache.
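
For anyone wanting to try the same change, a rough sketch of applying it with virsh (GUESTNAME is a placeholder for the actual domain):

# Edit the guest definition and change the <driver .../> line of the disk
# backed by the flashcache device to cache='writeback' as shown above.
virsh edit GUESTNAME

# A full stop/start is needed for the new cache mode to take effect.
virsh shutdown GUESTNAME
virsh start GUESTNAME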

@ghost

ghost commented Aug 4, 2015

Thank you for reporting this issue, and we appreciate your patience. We've notified the core team for an update on this issue. We're looking for a response within the next 30 days, or the issue may be closed.

@pubyun
Contributor

pubyun commented Oct 26, 2015

This is the same as issue #133.

It can be reproduced easily.
