Discussion:
Webrev: Issue 2024 - ZFS dedup=on should not panic the system when two blocks with same checksum exist in DDT
Jim Klimov
2012-04-20 00:16:01 UTC
Permalink
NB: Sorry to those who see this message twice, I initially
posted it to the wrong list, as Alasdair was kind to notice.
Text unchanged, but there was sadly a typo in the webrev so
it was re-uploaded =\
Assuming you have keys in redmine already (so that I don't have to
i.e. webrev -t rsync://***@alpha.sysmgr.org:issue2024 -O -U

Heh, it works and is quite simple - must be magic...
gotta get this into Wiki when the solution is polished ;)

So here goes my first webrev attempt:
http://alpha.sysmgr.org/~webrev/jimklimov/issue2024/

I've reported on the issue in detail on the zfs-discuss list
this winter, and in illumos bugtracker as issue #2024
ZFS dedup=on should not panic the system when two blocks
with same checksum exist in DDT
https://www.illumos.org/issues/2024

In short, I had some corruption on a pool with deduped data.
DDT entries had a checksum, same as in BP, but the actual
data block was corrupted and mismatched the checksum.

If a block identical to the original (same checksum) is written
with dedup=verify, then ZFS detects that the newly written
block differs from the on-disk data, and makes another DDT
entry. If the original block is later rewritten with dedup=on
(no verification), ZFS panics the kernel.

Stacktrace research led to some routines which can validly
return a NULL pointer during DDT lookups, but other code
assumes that a pointer is always valid - thus my kernel got
NULL pointer dereferences and panicked.

This panic also happens soon after pool reimport (after
reboot), I guess some processing like deferred-free kicks in.

The proposed fix adds non-NULL checks to the routine in
question and to some routines near it in the same source file.
I added kernel logging for the case where the lookup returns
NULL; it may be correct to emit the message only on a DEBUG
kernel... or maybe users should always know there is something
foul with their pools?

It is possible that some of these checks are not needed,
or that I missed some locations where similar checks are due.
It is also possible that the solution may lead to block leaks.
As an interim solution, I preferred to potentially lose a few
MBs of disk space rather than lose access to the whole pool.

ZFS code gurus are kindly asked to review the solution here:
http://alpha.sysmgr.org/~webrev/jimklimov/issue2024/
https://www.illumos.org/issues/2024

Thanks,
//Jim Klimov


-------------------------------------------
illumos-developer
Archives: https://www.listbox.com/member/archive/182179/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182179/21175072-c42b328b
Modify Your Subscription: https://www.listbox.com/member/?member_id=21175072&id_secret=21175072-605db409
Powered by Listbox: http://www.listbox.com
George Wilson
2012-04-20 14:25:06 UTC
Permalink
Jim,

Wow, so many issues here. Thanks for digging into them and braving the
dangerous waters of dedup. :-)

First, I don't think the fix as you have it is safe. It will lead to
leaked blocks and allow on-disk corruption to just be ignored. A better
solution would be to leverage 'zfs_recover' in zio_ddt_free() (I don't
think you need any of the changes in ddt.c). This is what I would propose:

freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
if (zfs_recover && dde == NULL) {
        char *blkbuf = kmem_alloc(BP_SPRINTF_LEN, KM_SLEEP);
        sprintf_blkptr(blkbuf, bp);
        cmn_err(CE_WARN, "Unable to free bp %s\n", blkbuf);
        kmem_free(blkbuf, BP_SPRINTF_LEN);
        ddt_exit(ddt);
        return (ZIO_PIPELINE_CONTINUE);
}

Given that you're hitting the panic during zio_ddt_free(), it would
indicate that somehow the block being freed did not make it into
the DDT, yet the dedup bit is set in its blkptr. I looked quickly at
the code and I think I may have found the source of this bug, but would
like to enlist your help in reproducing it. Do you still have your
environment set up where you can reproduce this?

Thanks,
George
Jim Klimov
2012-04-20 16:34:39 UTC
Permalink
Hello George, and thanks for the comments!
Post by George Wilson
Jim,
Wow, so many issues here. Thanks for digging into them and braving the
dangerous waters of dedup. :-)
First, I don't think the fix as you have it is safe. It will lead to
leaked blocks and allow on-disk corruption to just be ignored.
I figured there would be something like this, so I asked ;)
Post by George Wilson
A better solution would be to leverage 'zfs_recover' in zio_ddt_free()
(I don't think you need any of the changes in ddt.c).
Maybe so... still, as I wrote, the core problem leading to the
kernel panic was that the system could follow a valid algorithm
and produce a NULL pointer, and then used that value without
checking whether it was NULL. That's what I fixed in the webrev ;)

There are many programmers who prefer never to trust parameters
and other data provided by users (or API callers), at least not
blindly - especially in client-server programming, to speak in
networking terms. It is the server's (API's) job to always check
input sanity. At least in this case I think the price in CPU
cycles is not prohibitive, and DDT-block-freeing does not
happen every millisecond 24/7 ;)

While the repairs to the calling block (client) in zio.c do
make the code more readable and valid, I think that API
functions which start using their parameter as a pointer
(to a structure) right away should test that the pointer
is not NULL. We can fix the zio.c call now, but the same
mistake can exist (or appear later) elsewhere in "client" code.
Post by George Wilson
freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
if (zfs_recover && dde == NULL) {
char *blkbuf = kmem_alloc(BP_SPRINTF_LEN, KM_SLEEP);
sprintf_blkptr(blkbuf, bp);
cmn_err(CE_WARN, "Unable to free bp %s\n", blkbuf);
kmem_free(blkbuf, BP_SPRINTF_LEN);
ddt_exit(ddt);
return (ZIO_PIPELINE_CONTINUE);
}
What is the context? If the "if" clause does not fire, should the
original code kick in? Namely:

        ddp = ddt_phys_select(dde, bp);
        ddt_phys_decref(ddp);
        ddt_exit(ddt);
        return (ZIO_PIPELINE_CONTINUE);

IIRC the problem was with "ddp" being NULL from ddt_phys_select().
Apparently "dde" is non-NULL when it gets there, but does not lead
to a valid (expected) lookup.

I am not yet acquainted with the zfs_recover variable, I think...
What if the flag is not set (and I don't see a way for it to
get set in the kernel SPA) - should the kernel use the NULL
pointer and panic?

I searched the source, but only found the variable's definition and
usage in spa_misc.c, and it is *set* to nonzero only in zdb.c.

http://src.illumos.org/source/search?q=zfs_recover&project=illumos-gate&project=illumos-userland&defs=&refs=&path=&hist=

I do have the variable mentioned in my /etc/system, though:
set aok = 1
set zfs:zfs_recover = 1

So I guess it somehow gets set, and might work for users like me.
And kernel-panic by default for others ;)
Post by George Wilson
Given that you're hitting the panic during zio_ddt_free() it would
indicate that somehow the block that is being freed did not make it into
the ddt but yet the dedup bit is set in its blkptr. I looked quickly at
the code and I think I may have found the source for this bug but would
like to enlist your help in reproducing it. Do you still have your
environment setup where you can reproduce this?
Sort of, a few files were still not recovered for such
test purposes ;)

I think it should be relatively easy to reproduce the bug on
a test system (perhaps a VM), at least on a simple test pool
with a single vdev, where BP offsets map directly to offsets
in the pool's physical storage device or file:
* write some known data (file) into a dataset with DDT
* use ZDB to determine file blocks' on-disk location
* use DD to overwrite on-disk userdata blocks with garbage,
so that BP and DDT entries remain intact but the checksum
no longer matches the userdata block
* change dedup settings on the dataset (off, on, verify) to
test different scenarios
* copy known-good data back into the dataset (I used cp and
rsync for full-file recovery, and dd over the corrupt block
itself)

NOTE that I have not yet artificially created the testbed
described above, but this is most likely what happened
to my system in due course (possibly flaky HW components
or unknown bugs in other parts of zfs/kernel, which led to
such on-disk corruption of previously existing and valid
data; raidz2 did not fully save the day, beyond detecting the
problem - good already, as I had much of the data elsewhere).

Depending on dedup settings, I saw these behaviors:
* dedup=off - noop regarding DDT, creates a new block
* dedup=on with only the corrupted block on-disk -
a checksum is calculated on the newly written block,
then a matching checksum is found in DDT, counter is
increased (on-disk data remains corrupt)
* dedup=verify - a checksum is calculated on the newly
written block, then a matching checksum is found in DDT,
then the on-disk data block is read in and compared to
the userdata block, and is found to be different.
Then the new block is written to disk, a second DDT entry
and a new L0 blockpointer are created, and the indirect
tree is rewritten. On-disk data (file) becomes valid.
I guess at this moment the old on-disk block which is
no longer referenced becomes a candidate for (deferred)
freeing.
* If we return dedup=on and try to write the block, maybe
the system gets puzzled as it might only expect one DDT
entry to match? Or it finds an entry which points to an
unallocated BP and userdata block?

In neither case does the system detect/report/log the
checksum mismatch, the way scrub or even a mere cat of the
corrupted file does. I posted another bug to track that:
https://www.illumos.org/issues/2015

Even with dedup=verify, the userdata block contents are
matched upon write, but the checksum of the on-disk data
is neither recalculated "just for kicks", nor does its
mismatch increase the pool's CKSUM counters.
I wonder HOW the block is read so as to bypass BP checksumming?
Directly from the DVA? How, lacking the checksum match, did it
try to reconstruct the (corrupt) block from raidz2?

Those are related, but different questions. Hopefully
there would be discussion and fix in that as well ;)

Thanks,
//Jim


George Wilson
2012-04-22 17:16:45 UTC
Permalink
Post by Jim Klimov
Hello George, and thanks for the comments!
Post by George Wilson
Jim,
Wow, so many issues here. Thanks for digging into them and braving the
dangerous waters of dedup. :-)
First, I don't think the fix as you have it is safe. It will lead to
leaked blocks and allow on-disk corruption to just be ignored.
I figured there would be something like this, so I asked ;)
Post by George Wilson
A better solution would be to leverage 'zfs_recover' in zio_ddt_free()
(I don't think you need any of the changes in ddt.c).
Maybe so... still, as I wrote, the core problem leading to
kernel panic was that the system could follow a valid algorithm
and make a NULL pointer, and then used the provided value without
checking if it is NULL or not. That's what I fixed in webrev ;)
There are many programmers who prefer to never trust parameters
and other data provided by users (or API callers), at least not
blindly, especially in client-server programming, to speak in
networking terms. It is the server's (API's) job to always check
input sanity. At least in this case I think the price in CPU
cycles is not prohibitive, and the action of DDT-block-freeing
does not happen every millisecond 24/7 ;)
While the repairs of the calling block (client) in zio.c do
make the code more readable and valid, I think that those
API functions which start using their parameter as a pointer
(to structure) right away, should test that the pointer
is not NULL. We can fix the zio.c call now, but the same
mistake can be (or appear later) elsewhere in "client" code.
The problem with this approach is that it requires the client to do the
proper thing on error. Take for example, your proposed change to
ddt_phys_addref():

void
ddt_phys_addref(ddt_phys_t *ddp)
{
        ASSERT(ddp != NULL);
        if (ddp) {
                ddp->ddp_refcnt++;
        }
}

If this change is integrated into the gate, it just propagates
corruption. In order to make this change you need to ensure that all
consumers of ddt_phys_addref() know what to do if ddp is NULL, and
that's a big project to tackle. :-)

To further this example, assume a non-debug system running with this
change, and you're writing a new block to a deduped filesystem. If ddp
were NULL, then when ddt_phys_addref() is called from zio_ddt_write() we
would simply drive on without the DDT reference being updated for this
block. What has just happened is that you've allowed DDT corruption to
go undetected until some time in the future. When someone frees this
block, ZFS thinks it's freeing the last reference, so it actually
returns the block to the free pool. But wait! There was a reference
that we ignored sometime in the past, and now you have a file
somewhere that references a block that doesn't exist. Tracking that
down would be a nightmare!

I grant you that the DDT logic should be more robust, but the
approach it has tried to take is to panic (i.e. via a NULL pointer
dereference) earlier rather than later.
Post by Jim Klimov
Post by George Wilson
freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
if (zfs_recover && dde == NULL) {
        char *blkbuf = kmem_alloc(BP_SPRINTF_LEN, KM_SLEEP);
        sprintf_blkptr(blkbuf, bp);
        cmn_err(CE_WARN, "Unable to free bp %s\n", blkbuf);
        kmem_free(blkbuf, BP_SPRINTF_LEN);
        ddt_exit(ddt);
        return (ZIO_PIPELINE_CONTINUE);
}
What is the context? If the "if" clause does not fire, original
ddp = ddt_phys_select(dde, bp);
ddt_phys_decref(ddp);
ddt_exit(ddt);
return (ZIO_PIPELINE_CONTINUE);
The proposed code above was meant as a substitute for your changes in
zio.c. The idea is that if you know you have corruption (i.e. you've
already panicked) you can set zfs_recover and bypass the free which
is causing the panic (there are other places where zfs_recover does a
similar thing). Of course you end up leaking whatever blocks bypass the
normal zio_ddt_free() logic.
Post by Jim Klimov
IIRC the problem was with "ddp" being NULL from ddt_phys_select().
Apparently "dde" is non-NULL when it gets there, but does not lead
to a valid (expected) lookup.
That would make sense based on your scenario.
Post by Jim Klimov
I am not yet acquainted with zfs_recover variable, I think...
What if the flag is not set (and I don't see a way for it to
get set in the kernel SPA) - the kernel should use the NULL
pointer and panic?
I groked the src, but only found the variable's definition and
usage in spa_misc.c, and it is *set* to nonzero only in zdb.c.
http://src.illumos.org/source/search?q=zfs_recover&project=illumos-gate&project=illumos-userland&defs=&refs=&path=&hist=
set aok = 1
set zfs:zfs_recover = 1
So I guess it somehow gets set, and might work for users like me.
And kernel-panic by default for others ;)
Setting aok and zfs_recover would seem to imply that you had to work
past some previous corruption. I would recommend against always running
your system with these values set, as it can mask other issues that
could be plaguing your pool.
Post by Jim Klimov
Post by George Wilson
Given that you're hitting the panic during zio_ddt_free() it would
indicate that somehow the block that is being freed did not make it into
the ddt but yet the dedup bit is set in its blkptr. I looked quickly at
the code and I think I may have found the source for this bug but would
like to enlist your help in reproducing it. Do you still have your
environment setup where you can reproduce this?
Sort of, a few files were still not recovered for such
test purposes ;)
I think it should be relatively easy to reproduce the bug on
a test system (perhaps VM), at least on a simple test pool
with a single vdev where BP offsets would be linked to offset
* write some known data (file) into a dataset with DDT
* use ZDB to determine file blocks' on-disk location
* use DD to overwrite on-disk userdata blocks with garbage,
so that BP and DDT entries remain intact but the checksum
no longer matches the userdata block
* change dedup settings on the dataset (off, on, verify) to
test different scenarios
* copy known-good data back into the dataset (I used cp and
rsync for full-file recovery, and dd over the corrupt block
itself)
NOTE that I did not yet artificially create the testbed
like described above, but this is most likely what happened
to my system in its due course (possibly flaky HW components
or unknown bugs in other parts of zfs/kernel which led to
such on-disk corruptions of previously existing and valid
data; raidz2 did not fully save the day, except detecting the
problem - good already, as I had much of the data elsewhere).
* dedup=off - noop regarding DDT, creates a new block
* dedup=on with only the corrupted block on-disk -
a checksum is calculated on the newly written block,
then a matching checksum is found in DDT, counter is
increased (on-disk data remains corrupt)
* dedup=verify - a checksum is calculated on the newly
written block, then a matching checksum is found in DDT,
then the on-disk data block is read in and compared to
the userdata block, and is found to be different.
Then the new block is written to disk, a second DDT entry
and a new L0 blockpointer are created, and the indirect
tree is rewritten. On-disk data (file) becomes valid.
I guess at this moment the old on-disk block which is
no longer referenced becomes a candidate for (deferred)
freeing.
I believe I have found the bug in zio_ddt_write() that affects the
dedup=verify logic you mention above. Here's what I think is happening:

With the verify bit set, we calculate the checksum and find that a bp
with that checksum already exists in the DDT, so we compare the data and
find that the data is actually different. What is supposed to happen
next is that we disable dedup on the new block and write it out as an
ordinary block. The problem is that we never reset the dedup bit in the
block pointer. As a result we have just written a block that is not in
the DDT but thinks it should be. I've filed a bug and here's the fix:

http://code.delphix.com/webrev-2649/

- George
Post by Jim Klimov
* If we return dedup=on and try to write the block, maybe
the system gets puzzled as it might only expect one DDT
entry to match? Or it finds an entry which points to an
unallocated BP and userdata block?
In neither case does the system detect/report/log the
checksum mismatch, like scrub or even mere cat of the
https://www.illumos.org/issues/2015
Even with dedup=verify, the userdata block contents are
being matched upon write, but checksum of on-disk data
is neither recalculated "just for kicks", nor does its
mismatch cause an increase in pool's CKSUM counters.
I wonder HOW it is read to bypass the BP checksumming?
Directly from the DVA? How, lacking the checksum match,
did it try to reconstruct the (corrupt) block from raidz2?
Those are related, but different questions. Hopefully
there would be discussion and fix in that as well ;)
Thanks,
//Jim
Jim Klimov
2012-04-22 18:58:00 UTC
Permalink
Comments/questions inline below...
Post by George Wilson
Post by Jim Klimov
API functions which start using their parameter as a pointer
(to structure) right away, should test that the pointer
is not NULL. We can fix the zio.c call now, but the same
mistake can be (or appear later) elsewhere in "client" code.
The problem with this approach is that it requires the client to do the
proper thing on error.
...
I grant you that the DDT logic should be more robust but what its tried
to do is take the approach to panic (ie. null pointer dereference)
earlier rather than later.
...
The proposed code above was meant as a substitute to your changes in
zio.c. The idea is that if you know you have corruption (ie. you've
already panicked) you can can set zfs_recover and bypass the free which
is causing the panic (there are other places where zfs_recover does a
similar thing). Of course you end up leaking whatever blocks bypass this
normal zio_ddt_free() logic.
To check that I understood correctly: code which faces a bad
external problem (such as my inconsistent on-disk data) SHOULD
panic in the API "server" by default, so that for future code
builds and written datasets this error condition becomes known
and properly avoided (in "client" code)?

This does make sense, though it is of little consolation to those
users who already have old code and errors in their on-disk data
structures, and get panic after panic ;)
Post by George Wilson
...
If this change is made into the gate then it just propagates corruption.
In order to make this change you need to ensure that all consumers of
ddt_phys_addref() know what to do if ddp is NULL and that's a big
project to tackle. :-)
So we'd rather do this one caller at a time, on a per-incident basis? ;)
Well, that hopefully gives us reproducible use-cases to test the
solutions against, so the sarcasm is only partial.
Post by George Wilson
Post by Jim Klimov
IIRC the problem was with "ddp" being NULL from ddt_phys_select().
Apparently "dde" is non-NULL when it gets there, but does not lead
to a valid (expected) lookup.
That would make sense based on your scenario.
So your suggested block with zfs_recover should also be cloned to test
ddp for NULLness, right?
Post by George Wilson
I believe I have found the bug in the zio_ddt_write() that affects the
With the verify bit set we calculate the checksum and find that a bp
with that checksum already exists in the DDT so we compare the data and
find that the data is actually different. What is supposed to happen
next is that we disable dedup on the new block and write it out as an
ordinary block. The problem is that we never reset dedup bit in the
block pointer. As a result we have just written a block that is not in
http://code.delphix.com/webrev-2649/
Sounds simple enough to be true ;)

So, what happened in my case was that when I retried writes of the
fixed block, the pool was trying to free my first, correct block,
which had the dedup bit set but no corresponding entry in the DDT,
and so it panicked? Is that the correct assessment?

I did not realize that in case of detected collisions there should
NOT appear a second DDT entry with the same checksum and a different
BP (i.e. on further verified writes the system would have to check
both locations, like with other hash-indexed databases).

In my case, the proper algorithm would lead to the second (noncorrupt)
block being stored as non-deduped, and further verified writes of it
would not be deduped either, and would clone up identical blocks,
even if/when/after I detect and free the corrupt block by deleting
the file (or receiving just the one block, as I did) - right?

Should or shouldn't this solution also take partial care of bug
#2015 (quoted below)?
For example: in case of a collision on write, also check whether the
existing on-disk block still matches the checksum; if the block
is corrupted and useless anyway, have it replaced by the new
block which does match the same checksum (write the block, save
its BP into the DDT entry, leaving other fields intact).

This way, manual repair of one broken deduped file (by copying it
over from backup or another source) would automatically propagate
to all other files which contain this block. And such unfixable
corruption of on-disk userdata seems orders of magnitude more
probable than the 10^-77 sha256 hash-collision chance ;)
If I'm wrong - where?
Post by George Wilson
Post by Jim Klimov
In neither case does the system detect/report/log the
checksum mismatch, like scrub or even mere "cat" of the
https://www.illumos.org/issues/2015
Even with dedup=verify, the userdata block contents are
being matched upon write, but checksum of on-disk data
is neither recalculated "just for kicks", nor does its
mismatch cause an increase in pool's CKSUM counters.
I wonder HOW it is read to bypass the BP checksumming?
Directly from the DVA? How, lacking the checksum match,
did it try to reconstruct the (corrupt) block from raidz2?
Those are related, but different questions. Hopefully
there would be discussion and fix in that as well ;)
Thanks,
//Jim Klimov


George Wilson
2012-04-24 14:22:27 UTC
Permalink
Post by Jim Klimov
Comments/questions inline below...
Post by George Wilson
Post by Jim Klimov
API functions which start using their parameter as a pointer
(to structure) right away, should test that the pointer
is not NULL. We can fix the zio.c call now, but the same
mistake can be (or appear later) elsewhere in "client" code.
The problem with this approach is that it requires the client to do the
proper thing on error.
...
I grant you that the DDT logic should be more robust but what its tried
to do is take the approach to panic (ie. null pointer dereference)
earlier rather than later.
...
The proposed code above was meant as a substitute to your changes in
zio.c. The idea is that if you know you have corruption (ie. you've
already panicked) you can can set zfs_recover and bypass the free which
is causing the panic (there are other places where zfs_recover does a
similar thing). Of course you end up leaking whatever blocks bypass this
normal zio_ddt_free() logic.
To check that I understood correctly: the code which faces a bad
external problem (such as my inconsistent on-disk data) SHOULD
panic in the API "server" by default so that for future code
builds and written datasets this error condition would be known
and properly avoided (in "client" code)?
This does make sense, though is of little consolence to those
users who already have old code and errors in the on-disk data
structures, and get panic after panic ;)
That's where the change I proposed comes in:

zio_ddt_free()
<snip>

freedde = dde = ddt_lookup(ddt, bp, B_TRUE);
if (zfs_recover && dde == NULL) {
        char *blkbuf = kmem_alloc(BP_SPRINTF_LEN, KM_SLEEP);
        sprintf_blkptr(blkbuf, bp);
        cmn_err(CE_WARN, "Unable to free bp %s\n", blkbuf);
        kmem_free(blkbuf, BP_SPRINTF_LEN);
        ddt_exit(ddt);
        return (ZIO_PIPELINE_CONTINUE);
}

<snip>

Now users who hit this panic would be able to set zfs_recover, leak a
few blocks, and be back up and running again. Unfortunately this doesn't
fix the busted DDT entry.
Post by Jim Klimov
Post by George Wilson
...
If this change is made into the gate then it just propagates
corruption.
Post by George Wilson
In order to make this change you need to ensure that all consumers of
ddt_phys_addref() know what to do if ddp is NULL and that's a big
project to tackle. :-)
So we'd rather do this one caller at a time, on a per-incident basis? ;)
Well, that hopefully gives us reproducible use-cases to test the
solutions, so sarcasm is partial.
Doing this per-incident allows us to fully understand the problem
(hopefully) and supply the best solution.
Post by Jim Klimov
Post by George Wilson
Post by Jim Klimov
IIRC the problem was with "ddp" being NULL from ddt_phys_select().
Apparently "dde" is non-NULL when it gets there, but does not lead
to a valid (expected) lookup.
That would make sense based on your scenario.
So your suggested block with zfs_recover should also be cloned to test
ddp NULLness, right?
Post by George Wilson
I believe I have found the bug in the zio_ddt_write() that affects the
With the verify bit set we calculate the checksum and find that a bp
with that checksum already exists in the DDT so we compare the data and
find that the data is actually different. What is supposed to happen
next is that we disable dedup on the new block and write it out as an
ordinary block. The problem is that we never reset dedup bit in the
block pointer. As a result we have just written a block that is not in
http://code.delphix.com/webrev-2649/
Sounds simple enough to be true ;)
So, what happened in my case was that when I retried writes of the
fixed block, the pool was trying to free my first (correct) block,
which had the dedup bit set but no corresponding entry in the DDT,
and so it panicked? Is that the correct assessment?
Yes, that's what I believed happened.
Post by Jim Klimov
I did not realize that in case of detected collisions, there should
NOT appear a second DDT entry with same checksum and different BP
(i.e. on further verified writes the system should check both
locations now, like with other hash-indexed databases).
In my case, the proper algorithm would cause the second (non-corrupt)
block's write to be stored as non-deduped, and further verified writes
of it would not be deduped either, piling up identical blocks,
even if/when/after I detect and free the corrupt block by deleting
the file (or receiving just the one block, as I did) - right?
If you're able to remove the last reference to the original block then
that should remove the DDT entry. Then you could replace the file and
the DDT should be happy again.
Post by Jim Klimov
Should or should not this solution also take partial care of bug
#2015 (quoted below)?
For example - in case of collision on write, also check if the
existing on-disk block still matches the checksum; if the block
is corrupted and useless anyway - have it replaced by the new
block which does match the same checksum (write the block, save
its BP into the DDT entry leaving other fields intact).
This way manual repair of one broken deduped file (by copying it
over from backup or another source), would automatically propagate
to all other files which contain this block. And such unfixable
corruption of on-disk userdata seems orders of magnitude more
probable than the sha256 10^-77 hash collision chance ;)
If I'm wrong - where?
My change does not address this. I suspect what happened in your case is
that we go into zio_ddt_collision() and attempt an arc_read() of the
original block. That should have failed with a checksum error, with the
result that we detect this as a collision. Given the unlikely event
of a sha256 collision, we could have rewritten the block as you
suggested, thus fixing the DDT. This is a good RFE: fix up busted DDT
entries.

- George
Post by Jim Klimov
Post by George Wilson
Post by Jim Klimov
In neither case does the system detect/report/log the
checksum mismatch, the way a scrub or even a mere "cat" of the
file would:
https://www.illumos.org/issues/2015
Even with dedup=verify, the userdata block contents are
being matched upon write, but checksum of on-disk data
is neither recalculated "just for kicks", nor does its
mismatch cause an increase in pool's CKSUM counters.
I wonder HOW the block is read so as to bypass the BP checksumming?
Directly from the DVA? And, lacking a checksum match, how
did it try to reconstruct the (corrupt) block from raidz2?
Those are related, but different questions. Hopefully
there would be discussion and fix in that as well ;)
Thanks,
//Jim Klimov
-------------------------------------------
illumos-developer
Archives: https://www.listbox.com/member/archive/182179/=now
https://www.listbox.com/member/archive/rss/182179/21175078-ef5c1f3a
https://www.listbox.com/member/?&
Powered by Listbox: http://www.listbox.com
-------------------------------------------
illumos-developer
Archives: https://www.listbox.com/member/archive/182179/=now
RSS Feed: https://www.listbox.com/member/archive/rss/182179/21175072-c42b328b
Modify Your Subscription: https://www.listbox.com/member/?member_id=21175072&id_secret=21175072-605db409
Powered by Listbox: http://www.listbox.com

Jim Klimov
2012-04-22 19:26:37 UTC
Permalink
More comments/questions inline below...
Post by George Wilson
Setting aok and zfs_recover would seem to imply that you had to work
past some previous corruption. I would recommend against running your
system with these values always set, as it can mask other issues that
could be plaguing your pool.
Ok, I might try to scrub the system with these disabled. These are the
flags I found in some blogs and mailing-list entries regarding ZFS
crashes; I copied them over and have mostly avoided ZFS panics since
then. I believe that the logical errors which can optionally cause
panics are still logged? This did not catch the NULL-pointer-dereference
panic anyway...

One physical problem is that my computer is a PC located in another
country and has no remote management, except for a friend who can press
reset on request. And its PC-class hardware may contribute to a
bountiful generation of non-synthetic errors and test-cases ;)
Post by George Wilson
void
ddt_phys_addref(ddt_phys_t *ddp)
{
	ASSERT(ddp != NULL);
	if (ddp != NULL)
		ddp->ddp_refcnt++;
}
...
To further this example, assume a non-debug system running with this
change, and you're writing a new block to a dedup-ed filesystem.
If ddp were NULL then, when you called ddt_phys_addref() from
zio_ddt_write(), we would simply drive on without the DDT reference
being updated for this block.
I think I misunderstand how dedup works, which is easy given the
lack of up-to-date docs ;)

I did not realize that in case of detected collisions, there should
NOT appear a second DDT entry with same checksum and different BP
(i.e. on further verified writes the system should check both
locations now, like with other hash-indexed databases).

I *assumed* that when a brand-new entry is written, the userdata
block is added to disk (written to a DVA), a blkptr_t is created
and written as one of many L0 blkptr_t's in the file's indirect
BP tree, and a DDT entry is created pointing to the same DVA.
When a second duplicate is written as part of another file,
I thought that file still has its own BP tree and its L0 blkptr_t
with the same DVA, while the DDT entry is updated with an increased
counter. Likewise, when the file is removed, the block is found
by checksum in DDT, and the DDT entry count is decreased.

IF this logic is correct, then the problem with ddp==NULL is not
very pronounced (I can of course be wrong). I assumed that:
1) upon adding userdata blocks we don't detect a DDT entry and
either add a unique block, or add another DDT entry with
same checksum and current refcount=1 (as I wrote above, this
is what I expected as a normal hash-collision scenario).
2) When userdata blocks (with the L0 bp's dedup bit set) are deleted
and the DDT entry is not found - as we can now detect - there is
nothing to decrement. A changed/deleted file's L0 bp is removed
and it no longer references the DVA. That's what my fix does...
3) When userdata blocks are deleted and the DDT entry is found,
decrease the counter and ultimately free the DDT entry.
However, there are other files' L0 blkptr_t's pointing
to the same DVA so (I thought) the on-disk block is not
leaked. Sooner or later all files with this DVA in their
L0 bp's would be deleted, and then there would be no
references to the block, so it can be freed. Until then
this might seem like a crossover between two or more
files - open for interpretation as either an error or
a "like a snapshot/clone" situation.
Post by George Wilson
What has just happened is that you've allowed DDT corruption to
go undetected until some time in the future. When someone frees
this block, ZFS would think that it's freeing the last reference
so it actually returns the block into the free pool.
I am still a bit vague on how the metaslabs reference their
allocated and free disk sectors (and ultimately blocks),
so the scenario you outlined is still quite possible if
it is a bitmap or a tree/linked list of byte ranges =)
Gotta read up on that - it was in the spec text, but
"I am not at my best anymore :("...
Post by George Wilson
But wait! There was a reference that we ignored sometime in
the past and now you have a file somewhere that is referencing
a block that doesn't exist. Tracking that down would be a
nightmare!
Agreed... So I should not have bluntly and blindly added
the NULL-check to *every* function, if we ultimately hope
for callers to check their variables and the return values
of the functions they call :-)

Is some scenario like that possible for ddt_phys_decref(),
my original troublemaker? So far I have guessed that a leak
is possible in that the metaslab would somehow consider the
byte range busy while there is no dnode blkptr_t in
the pool referencing that byte range.
ZDB has ways to detect leaks; I guess scrubs can/should too?
If so, I think it is not fatal to avoid panicking when we
cannot find the matching DDT entry while trying to free
the block. It might ultimately be freed later during a
regular scrub. Since we've detected an "imperfect situation",
we might give the system or sysadmin a hint to run a
scrub ASAP, if that would find and deal with leaked blocks.

For example, during a scrub we could find overlapping
files and see if same blocks (DVA, checksum) exist in
DDT and fix the DDT entry.

Likewise, we can detect blocks marked busy in allocation
tables but not referenced by anyone and mark them up for
deletion (or attach them to some pool/lost+found dataset)...

Overall, it is still possible that the on-disk error would
happen even with your fix in place - say, due to another
coding error; due to older pools like mine which have
already got into this condition; due to bit flips in
non-ECC RAM on the box causing the BP's dedup bit to be
set while there is no matching DDT entry (or the DDT entry
getting corrupted); or somehow else. Now that we know such
a problem is possible, there should be ways to fix it.

Would any of this make sense?

Thanks,
//Jim Klimov

