Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Queries on indexed char(n) column in PostgreSQL use sequential scan #1912

Open
2 tasks done
ivan opened this issue Nov 10, 2018 · 4 comments
Open
2 tasks done

Queries on indexed char(n) column in PostgreSQL use sequential scan #1912

ivan opened this issue Nov 10, 2018 · 4 comments

Comments

@ivan
Copy link

ivan commented Nov 10, 2018

Setup

Versions

  • Rust: rustc 1.30.1 (1433507eb 2018-11-07)
  • Diesel: 1.3.3
  • Database: postgresql-9.6 9.6.10-0+deb9u1 OR postgresl-11 11.1-1.pgdg90+1 (similar results)
  • Operating System Debian 9.5

Feature Flags

  • diesel: postgres

Problem Description

My table has a primary key of type char(11). When querying this table, Diesel sends over a parameterized query and a value of type text instead of bpchar. This forces a slow sequential scan instead of doing an index-only scan. When running a similar query from psql, PostgreSQL uses the index.

What are you trying to accomplish?

Find an indexed char(n) quickly

What is the expected output?

results in tens of milliseconds

What is the actual output?

results in 1+ second

Are you seeing any additional errors?

No

Steps to reproduce

I think all you need to observe this is an indexed char(n) column, a .find( or .filter(id.eq( in Diesel, and auto_explain on the server side to confirm sequential scans. But I have a complete test case set up to demonstrate this:

For the sake of completeness, I did this on a new Debian 9.5:

apt-get update
apt-get install --no-install-recommends libpq-dev curl git gcc libc-dev
adduser user
su user
curl https://sh.rustup.rs -sSf | sh
cargo install diesel_cli --no-default-features --features postgres

then

su postgres
createuser -S -D -R -P diesel_bpchar
createdb -O diesel_bpchar diesel_bpchar
git clone https://github.com/ludios/diesel_bpchar
cd diesel_bpchar
cargo build

echo "DATABASE_URL=postgres://diesel_bpchar:PASS@localhost:5432/diesel_bpchar" > .env
diesel setup
diesel migration run

for i in {00000000001..0005000000}; do echo $i >> /tmp/ids; done
psql postgres://diesel_bpchar:PASS@localhost:5432/diesel_bpchar -c "COPY videos FROM STDIN" < /tmp/ids

What should be a quick index lookup instead takes 1 second:

# time ./target/debug/find_video 00005000000
Found 1 videos
./target/debug/find_video 00005000000  0.01s user 0.00s system 1% cpu 0.998 total

With these postgresql server settings, you can see what is happening on the server side:

shared_preload_libraries = 'auto_explain'       # (change requires restart)

auto_explain.log_min_duration = 0
auto_explain.log_analyze = true
auto_explain.log_verbose = true

log_statement = 'all'           # none, ddl, mod, all

(note: auto_explain is in the postgresql-contrib package on Debian.)

When Diesel sends over its parameterized query, PostgreSQL logs a sequential scan:

2018-11-10 17:35:00.414 UTC [29344] diesel_bpchar@diesel_bpchar LOG:  execute __diesel_stmt_0: SELECT "videos"."id" FROM "videos" WHERE "videos"."id" = $1
2018-11-10 17:35:00.414 UTC [29344] diesel_bpchar@diesel_bpchar DETAIL:  parameters: $1 = '00005000000'
2018-11-10 17:35:01.390 UTC [29344] diesel_bpchar@diesel_bpchar LOG:  duration: 976.167 ms  plan:
    Query Text: SELECT "videos"."id" FROM "videos" WHERE "videos"."id" = $1
    Seq Scan on public.videos  (cost=0.00..102028.00 rows=25000 width=12) (actual time=976.154..976.154 rows=1 loops=1)
      Output: id
      Filter: ((videos.id)::text = '00005000000'::text)
      Rows Removed by Filter: 4999999

Run a similar query from psql and PostgreSQL logs an Index Only Scan:

# time psql postgres://diesel_bpchar:PASS@localhost:5432/diesel_bpchar -c "SELECT videos.id FROM videos WHERE videos.id = '00005000000';"
     id
-------------
 00005000000
(1 row)

psql postgres://diesel_bpchar:PASS@localhost:5432/diesel_bpchar   0.03s user 0.01s system 91% cpu 0.041 total
2018-11-10 17:39:55.150 UTC [31473] diesel_bpchar@diesel_bpchar LOG:  statement: SELECT videos.id FROM videos WHERE videos.id = '00005000000';
2018-11-10 17:39:55.150 UTC [31473] diesel_bpchar@diesel_bpchar LOG:  duration: 0.033 ms  plan:
        Query Text: SELECT videos.id FROM videos WHERE videos.id = '00005000000';
        Index Only Scan using videos_pkey on public.videos  (cost=0.43..8.45 rows=1 width=12) (actual time=0.028..0.028 rows=1 loops=1)
          Output: id
          Index Cond: (videos.id = '00005000000'::bpchar)
          Heap Fetches: 1

Checklist

  • I have already looked over the issue tracker for similar issues.
  • This issue can be reproduced on Rust's stable channel. (Your issue will be
    closed if this is not the case)
@weiznich
Copy link
Member

weiznich commented Nov 12, 2018

Possible workarounds:

  • Change the type of your column to Text if possible
  • Define your own SqlType that maps to Char(n). See this testcase for an example (Just replace that enum by string, also you probably want to use #[postgres(oid = "18", array_oid = "1002")] instead of #[postgres(type_name = "char")]).

@ivan
Copy link
Author

ivan commented Nov 12, 2018

Interesting, thanks for that SqlType workaround. I was using .filter(sql(&format!("id = '{}'", id))), which required me to deal with SQL injection.

@bnewbold
Copy link

FWIW, I ran in to this exact problem, and an additional warning was added to the PostgreSQL wiki's "Don't Do This" list: https://wiki.postgresql.org/wiki/Don%27t_Do_This#Why_not.3F_12

I switched to TEXT with a constraint. I suppose it's still the right thing for diesel to send bpchar in this case.

@sgrif
Copy link
Member

sgrif commented Jan 4, 2019

There's a few deeper issues here. The history behind all of this is a bit scattered, so let me give some recap/context here.

A very long time ago, Diesel had Varchar and Text as separate types. This was painful for a variety of reasons. While they are treated as separate types in PG (though varchar actually isn't anymore, if you explain any query that involves varchar, you'll see it's always unconditionally cast to text), PG has implicit coercions that make them mostly interchangeable.

However, we can't really model those coercions easily in Diesel, due to a combination of limitations of Rust, and our architecture. There's no easy way for us to say impl<T: AsExpression<Text>> AsExpression<VarChar> for T. So this would lead to really funky things like being unable to write text_col.eq(varchar_col), having functions like lower only work with one type but not the other, and a ton of duplication in our own codebase.

We fixed this in #288 by just making varchar a type alias. This was a workaround, not a fix for the root problem. We've since encountered similar issues that cannot use this workaround. Numeric types are a major case where SQL coerces, but we cannot (it's much less common for this to be painful in practice though). Timestamp vs timestamptz is another case where the types coerce between each other, but do not have identical semantics, and our "just use a type alias" solution would not work.

However, strings are special. This really does (mostly) work for pretty much all string types. These types all have identical representation (contrary to what some folks believe, using char, varchar, bpchar, etc have no benefit over using text. They're all just bytea under the hood). Additionally, all these types have the same set of Rust types that correspond to them. String and str.

However, this has come back to bite us in a few ways. The first one we encountered was that these coercions do not apply to arrays. You cannot use text[] where a varchar[] is expected. This is why we warn in diesel print-schema and friends if we encounter a column of that type.

The second issue we encountered is that text always "wins" in these coercions. When PG encounters some_string_col = ''::text or ''::text = some_string_col, it will cast some_string_col to text, rather than the text to whatever type that column is. We first ran into this when we attempted to use this same type alias workaround to citext. When we sent a bind parameter as text, it would do a case sensitive comparison rather than an insensitive one. The case here is similar.

With all that said, I still don't really know how we can fix the "root" problem here. The best I've ever come up with is adding explicit casts for this, which at least gives us an out, but would still make splitting up the string types extremely painful.

That said, I do still believe that strings are "special", and I think we might be able to improve the situation here for specifically strings. You don't actually have to know the OID of the type you're transmitting in every case, you can also send 0, which means "interpret this as an untyped string literal". That might actually be exactly what we want for bind parameters sent for a string type. (Note: an implementation doing this will not be as simple as just changing the OID of Text to 0. There are still cases where we need the actual OID, particularly when sending it as part of an array or composite type)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants