
Deserialization layer seems too permissive with regards to checking the actual types received #3868

Open
Ten0 opened this issue Dec 7, 2023 · 2 comments

@Ten0 (Member) commented Dec 7, 2023

When the database sends a type that doesn't match the one declared in the schema, we don't detect the mismatch and may attempt to deserialize the wire representation of a different type, possibly returning incorrect data without generating an error.

I encountered this scenario while changing the type of a column from Int4 to Int8. I expected to see mismatch errors during the window between the migration finishing and my new service deploying, which I was fine with. (Or at least that it would deserialize the appropriate thing because everybody uses little endian, right?)

What actually happened, however, is that the value went through here:

```rust
impl<T, ST, DB> FromStaticSqlRow<ST, DB> for T
where
    DB: Backend,
    T: FromSql<ST, DB>,
    ST: SingleValue,
{
    fn build_from_row<'a>(row: &impl Row<'a, DB>) -> Result<Self> {
        use crate::row::Field;
        let field = row.get(0).ok_or(crate::result::UnexpectedEndOfRow)?;
        T::from_nullable_sql(field.value())
    }
}
```

Then through here:
```rust
#[cfg(feature = "postgres_backend")]
impl FromSql<sql_types::Integer, Pg> for i32 {
    fn from_sql(value: PgValue<'_>) -> deserialize::Result<Self> {
        let mut bytes = value.as_bytes();
        debug_assert!(
            bytes.len() <= 4,
            "Received more than 4 bytes decoding i32. \
             Was a BigInt expression accidentally identified as Integer?"
        );
        debug_assert!(
            bytes.len() >= 4,
            "Received fewer than 4 bytes decoding i32. \
             Was a SmallInt expression accidentally identified as Integer?"
        );
        bytes
            .read_i32::<NetworkEndian>()
            .map_err(|e| Box::new(e) as Box<_>)
    }
}
```

Note that in this second part:

  • the length checks are debug_assert! (I wouldn't want it to panic anyway, I would want a deserialization error instead)
  • NetworkEndian is actually BigEndian (I just imagined that PG uses little endian since that's the one actually used on all machines so that's one more reason why I didn't expect this issue ^^')

Anyway, so of course, with big endian and the size mismatch being ignored, diesel ended up deserializing only the leading zeroes of my Int8 into an Int4, causing it to hand a bunch of zeroes to my application code.
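To illustrate why this silently yields zeroes rather than an error, here is a minimal self-contained sketch (plain std, no byteorder crate): an Int8 arrives as 8 big-endian bytes, and reading only the first 4 of them as an i32 returns the leading zeroes of the value:

```rust
fn main() {
    // An Int8 (i64) value as Postgres sends it on the wire: 8 bytes, big endian.
    let wire: [u8; 8] = 42i64.to_be_bytes();
    assert_eq!(wire, [0, 0, 0, 0, 0, 0, 0, 42]);

    // Deserializing it as an Int4 (i32) consumes only the first 4 bytes,
    // which are the leading zeroes of the big-endian representation.
    let truncated = i32::from_be_bytes(wire[..4].try_into().unwrap());
    assert_eq!(truncated, 0); // not 42: the value is silently lost
}
```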

That doesn't seem like the ideal behavior. Instead, one would expect that if the Rust schema.rs of the currently deployed service doesn't match the actual database schema, the corresponding queries fail with an error.

It seems that we could prevent that by checking the consistency between the database-provided OID:

```rust
pub fn get_oid(&self) -> NonZeroU32
```

and our known OID for the type:
```rust
#[diesel(postgres_type(oid = 23, array_oid = 1007))]
#[diesel(sqlite_type(name = "Integer"))]
```

There may be several ways to make that safety check cost nothing in performance; one of them seems to be to do it only once when building the LoadIter (or on the first row, or within the connection implementation using QueryMetadata), since every row has the same set of OIDs.

```rust
#[allow(missing_debug_implementations)]
pub struct LoadIter<U, C, ST, DB> {
    pub(super) cursor: C,
    pub(super) _marker: std::marker::PhantomData<(ST, U, DB)>,
}

impl<'a, C, U, ST, DB, R> LoadIter<U, C, ST, DB>
where
    DB: Backend,
    C: Iterator<Item = QueryResult<R>>,
    R: crate::row::Row<'a, DB>,
    U: FromSqlRow<ST, DB>,
{
    pub(super) fn map_row(row: Option<QueryResult<R>>) -> Option<QueryResult<U>> {
        match row? {
            Ok(row) => Some(
                U::build_from_row(&row).map_err(crate::result::Error::DeserializationError),
            ),
            Err(e) => Some(Err(e)),
        }
    }
}
```
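One hypothetical shape for such a one-time check (the names below are invented for illustration; this is not existing diesel API): compare the OIDs the schema expects, obtained once e.g. via QueryMetadata, with the OIDs the database reported for the result set, and fail before any row is handed to FromSqlRow:

```rust
// Hypothetical sketch: compare the OIDs diesel expects (from the schema)
// with the OIDs the database reported for the result set. In diesel this
// would run once, e.g. when constructing LoadIter, not once per row.
fn check_row_oids(expected: &[u32], received: &[u32]) -> Result<(), String> {
    if expected.len() != received.len() {
        return Err(format!(
            "column count mismatch: expected {}, got {}",
            expected.len(),
            received.len()
        ));
    }
    for (i, (e, r)) in expected.iter().zip(received).enumerate() {
        if e != r {
            return Err(format!(
                "type mismatch in column {i}: expected oid {e}, database sent oid {r}"
            ));
        }
    }
    Ok(())
}

fn main() {
    // Schema says Int4 (oid 23), but after the migration the database
    // actually sends Int8 (oid 20): the check fails up front instead of
    // deserializing garbage.
    assert!(check_row_oids(&[23], &[23]).is_ok());
    assert!(check_row_oids(&[23], &[20]).is_err());
}
```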

Side note: could there be users that rely on a mismatch between their schema.rs and the actual OIDs provided by the database? (Like weird custom text types with case insensitivity & co. being declared as just Text in schema.rs?) If so (or even in any case), we should at least check the size of the things we attempt to deserialize. An additional single `if value.len() != 4 { return <some deserialization error> }` would cost exactly nothing here, since there has to be bounds checking going on in `bytes.read_i32` already, which we could then avoid; or we could just check that `bytes` is empty after reading, which would only be two instructions and so probably wouldn't show up in any benchmark...
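As a standalone sketch of that size check (a plain function rather than the actual FromSql impl): replacing the debug_assert!s with a real length check turns the mismatch into a deserialization error at the cost of a single comparison:

```rust
// Sketch of a length-checked i32 deserialization, mirroring what the
// FromSql<Integer, Pg> impl could do instead of debug_assert!.
fn i32_from_pg_bytes(bytes: &[u8]) -> Result<i32, String> {
    let array: [u8; 4] = bytes.try_into().map_err(|_| {
        format!(
            "received {} bytes while decoding an i32, expected exactly 4; \
             does the schema type match the database type?",
            bytes.len()
        )
    })?;
    // Postgres sends integers in network byte order (big endian).
    Ok(i32::from_be_bytes(array))
}

fn main() {
    assert_eq!(i32_from_pg_bytes(&42i32.to_be_bytes()), Ok(42));
    // An Int8 wire value is rejected instead of being read as 0.
    assert!(i32_from_pg_bytes(&42i64.to_be_bytes()).is_err());
}
```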

@weiznich (Member) commented Dec 8, 2023

This is a somewhat known issue; it just happens rarely enough that no one has really cared about improving this case so far. There is quite a bit to note here:

The first thing is that PgValue already includes the type oid: https://docs.diesel.rs/2.1.x/diesel/pg/struct.PgValue.html#method.get_oid. So we could just check that we get the right value as part of all the FromSql implementations. We decided not to do that back when we introduced that abstraction, because we might call other FromSql implementations internally for composite types/arrays/…. See #1833 for more details, and here for an example of such a case:

```rust
T::from_sql(PgValue::new_internal(elem_bytes, &value))
```

Now we could go and perform a check at a higher level as you suggested. I feel that might be problematic as well. You are right that we get the right OIDs for the result set at basically no cost. The problematic part is getting the expected OIDs from our side of the type system. That might require a database query for custom types, which is not cheap. (We would basically need to call this function: https://docs.diesel.rs/2.1.x/diesel/expression/trait.QueryMetadata.html#tymethod.row_metadata)

> Side note: could there be users that rely on a mismatch between their schema.rs and the actual OIDs provided by the database? (Like weird custom text types with case insensitive & co. being declared as just Text in schema.rs?)

I'm aware of at least the following cases:

  • Calling inner FromSql functions as part of FromSql implementations of complex types (array, range, composites, …)
  • Several text types (diesel treats TEXT and VARCHAR as equivalent)

> An additional single `if value.len() != 4 { return <some deserialization error> }` would cost exactly nothing here since there has to be bounds checking going on in `bytes.read_i32` already, which we could then avoid, or we could just check that `bytes` is empty after reading which would only be two instructions anyway so probably wouldn't show in any benchmark...

That's a good idea. We should do that.

Overall it's unclear to me how exactly to fix the complete problem. Maybe we could approach it the other way around and provide a check function that lets users verify the existing schema.rs file against the currently connected database? That would still be somewhat optional, but it would allow performing that check once.

@Ten0 (Member, Author) commented Mar 29, 2024

Side note: if we implement OID checking, we may still want to allow hacks where we put wrong types in the schema for types that can reasonably get coalesced: #3973 (comment)

(Or not: if the database schema changes during a migration but non-tz-aware wasn't UTC, it is better to throw an error...)

(The over-engineered solution would be to have OID-overriding SQL types to make the hack a feature.)
