Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core dumps generated in libclntsh when timeouts are configured #165

Open
quidstone opened this issue Oct 15, 2021 · 10 comments
Open

core dumps generated in libclntsh when timeouts are configured #165

quidstone opened this issue Oct 15, 2021 · 10 comments
Labels

Comments

@quidstone
Copy link

  1. What version of ODPI-C are you using (see dpi.h)?
    the version is 4.2.1
  2. What OS (and version) is your application on?
    Red Hat Enterprise Linux Server release 7.9
  3. What compiler (and version) did you use?
    gcc (GCC) 4.8.5
  4. What is your version of the Oracle Client (e.g. Instant Client)? How was it installed? Where it is installed?
    19.11, I installed it as an rpm. installed location is /usr/lib/oracle/19.11/client64
  5. What is your Oracle Database version?
    11.2
  6. What is the PATH environment variable (on Windows) or LD_LIBRARY_PATH (on Linux) set to?
    LD_LIBRARY_PATH is set to the odpi-c lib location
  7. What environment variables did you set? How exactly did you set them?
    application is running as a service.
  8. What problem or error(s) you are seeing?
    We have added timeouts is sqlnet.ora as we face db space issue for a partitioned table quite a lot. After adding that whenever the issue occurs, every time we get few core dumps. I am sharing both core dump backtrace and sqlnet.ora file entry.
    backtrace:(some of the stack is omitted)
Program terminated with signal 11, Segmentation fault.
#0  0x00007fe09b68bd00 in pthread_mutex_lock () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-324.el7_9.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 numactl-libs-2.0.12-5.el7.x86_64 systemd-libs-219-78.el7_9.3.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x00007fe09b68bd00 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fe097d59323 in kpucopyuserhndl () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#2  0x00007fe097df8142 in kpuspextend () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#3  0x00007fe097dfa5ac in kpuspgetfreesession () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#4  0x00007fe097df6f6f in kpuspgetpooledsession () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#5  0x00007fe097df304d in kpuspsessionget () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#6  0x00007fe097b33d01 in OCISessionGet () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#7  0x00007fe09b4623a4 in dpiOci__sessionGet (envHandle=0x23bd320, handle=handle@entry=0x7fdfbc00d680, authInfo=authInfo@entry=0x7fe0300391f0,
connectString=connectString@entry=0x23f2088 "OCI:SP:zlx4B3WlO3fgU6hJZQrcuC", connectStringLength=connectStringLength@entry=29, tag=<optimized out>,
tagLength=0, outTag=outTag@entry=0x7fe0707d78c8, outTagLength=outTagLength@entry=0x7fe0707d78d0, found=found@entry=0x7fe0707d78d4, mode=mode@entry=1,
error=error@entry=0x7fe0707d7850) at src/dpiOci.c:3136
#8  0x00007fe09b446f1c in dpiConn__getSession (error=0x7fe0707d7850, authInfo=0x7fe0300391f0, params=0x7fe0707d7870, connectStringLength=29,
connectString=0x23f2088 "OCI:SP:zlx4B3WlO3fgU6hJZQrcuC", mode=1, conn=0x7fdfbc00d660) at src/dpiConn.c:861
#9  dpiConn__get (error=0x7fe0707d7850, pool=0x23bd180, createParams=0x7fe0707d7870, connectStringLength=29,
connectString=0x23f2088 "OCI:SP:zlx4B3WlO3fgU6hJZQrcuC", passwordLength=<optimized out>, password=<optimized out>, userNameLength=<optimized out>,
userName=<optimized out>, conn=0x7fdfbc00d660) at src/dpiConn.c:636
#10 dpiConn__create (conn=0x7fdfbc00d660, context=<optimized out>, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
password=password@entry=0x0, passwordLength=passwordLength@entry=0, connectString=0x23f2088 "OCI:SP:zlx4B3WlO3fgU6hJZQrcuC", connectStringLength=29,
pool=pool@entry=0x23bd180, commonParams=commonParams@entry=0x0, createParams=createParams@entry=0x7fe0707d7870, error=error@entry=0x7fe0707d7850)
at src/dpiConn.c:414
#11 0x00007fe09b450d71 in dpiPool__acquireConnection (pool=pool@entry=0x23bd180, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
password=password@entry=0x0, passwordLength=0, params=params@entry=0x7fe0707d7870, conn=conn@entry=0x7fe0707d7958, error=error@entry=0x7fe0707d7850)
at src/dpiPool.c:36
#12 0x00007fe09b450f1d in dpiPool_acquireConnection (pool=pool@entry=0x23bd180, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
password=password@entry=0x0, passwordLength=passwordLength@entry=0, params=0x7fe0707d7870, params@entry=0x0, conn=conn@entry=0x7fe0707d7958)
at src/dpiPool.c:341

sqlnet.ora file contents:

SQLNET.RECV_TIMEOUT=200ms
SQLNET.OUTBOUND_CONNECT_TIMEOUT=300ms
DIAG_ADR_ENABLED=OFF
DIAG_SIGHANDLER_ENABLED=FALSE
DIAG_DDE_ENABLED=FALSE

  1. Do you have a runnable code snippet to describe the problem?
    no
@quidstone quidstone added the bug label Oct 15, 2021
@cjbj
Copy link
Member

cjbj commented Oct 15, 2021

  • Describe what the program's doing.
  • Try adding the line disable_oob=on to a sqlnet.ora file on the machine where your app is running.
  • Try remove the DIAG_* lines from the sqlnet.ora file.
  • Try with a recent version of the DB.

@quidstone
Copy link
Author

  • Our program is calling a procedure which has an insert command inside. We are using a connection string like this:
host:port/service?CONNECT_TIMEOUT=300ms&RECV_TIMEOUT=200ms&SEND_TIMEOUT=60000ms

We used pooled connection where during startup of the application, it would create some connections with the following pool creation params:

poolCreateParams.maxSessions = maxSessions;
poolCreateParams.minSessions = maxSessions;
poolCreateParams.sessionIncrement = 0;
poolCreateParams.pingTimeout = 1000; //millisec
poolCreateParams.pingInterval = 120;//seconds
poolCreateParams.timeout = 0;

We needed constant amount of connection available at all times. Then we added the pool wait time to operate in a NON-BLOCKING mode . If there is a timeout then it would result in a failure which is okay sometimes. It rarely happens though in our scenario. We have seen we were getting more than 300k request per hour where ~70 reqs were not getting any connection. This is okay at out current deployment.

dpiPool_setGetMode(*pool, DPI_MODE_POOL_GET_TIMEDWAIT); 
dpiPool_setWaitTimeout(*pool, timeout);

But when there is a space issue in DB, then the procedure execution would fail after 2s per connection and give the error unable to extend table. We needed the procedure to finish either successfully or fail within a shorter time period. Usually for successful cases it takes 5ms avg so we tried to add a RECV_TIMEOUT of 200ms for the fail cases to expedite any return from procedure execution. Application behaves mostly the way it should but every time during the issue it generates core dump from a segmentation fault deep inside libclntsh which disrupts the production env and congestion occurs in our network.

  • disable_oob=on will be added
  • We added sqlnet.ora to remove huge files created for diagnostic purposes and added those DIAG_* lines and I was confused that the timeouts in the connection string may be overridden by the sqlnet.ora and added those too to the file. Are diagnostics needed for debugging? We have a sqlnet.log file created where the application is running. It creates mainly the following lines,
***********************************************************************
Fatal NI connect error 12520, connecting to:
 (DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=)(CID=(PROGRAM=)(HOST=)(USER=root)))(CONNECT_TIMEOUT=300ms)(RECV_TIMEOUT=200ms)(SEND_TIMEOUT=60000ms)(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=))(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=))(ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=)))

  VERSION INFORMATION:
        TNS for Linux: Version 19.0.0.0.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production
  Version 19.11.0.0.0
  Time: 15-OCT-2021 09:28:29
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12564
    TNS-12564: Message 12564 not found; No message file for product=network, facility=TNS
    ns secondary err code: 12560
    nt main err code: 524
    TNS-00524: Message 524 not found; No message file for product=network, facility=TNS
    nt secondary err code: 115
    nt OS err code: 0


***********************************************************************
Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 19.0.0.0.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 19.0.0.0.0 - Production
  Version 19.11.0.0.0
  Time: 15-OCT-2021 14:13:35
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
    TNS-12535: Message 12535 not found; No message file for product=network, facility=TNS
    ns secondary err code: 12609
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=))
  • Our production is running with 11.2 and we don't have any other options right now.

@quidstone
Copy link
Author

now we are getting this,

#0  0x00007f496269b4fb in raise () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-324.el7_9.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 numactl-libs-2.0.12-5.el7.x86_64 systemd-libs-219-78.el7_9.3.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x00007f496269b4fb in raise () from /lib64/libpthread.so.0
#1  0x00007f495fe7e1bf in skgesigOSCrash () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#2  0x00007f4960572e3d in kpeDbgSignalHandler () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#3  0x00007f495fe7e492 in skgesig_sigactionHandler () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#4  <signal handler called>
#5  0x00007f495ed62849 in kpucopyuserhndl () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#6  0x00007f495ee02142 in kpuspextend () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#7  0x00007f495ee045ac in kpuspgetfreesession () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#8  0x00007f495ee00f6f in kpuspgetpooledsession () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#9  0x00007f495edfd04d in kpuspsessionget () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#10 0x00007f495eb3dd01 in OCISessionGet () from /usr/lib/oracle/19.11/client64/lib/libclntsh.so
#11 0x00007f496246c3a4 in dpiOci__sessionGet (envHandle=0x1685d00, handle=handle@entry=0x7f48c400d680, authInfo=authInfo@entry=0x7f4865fd7618,
    connectString=connectString@entry=0x16baa68 "OCI:SP:zpuyRFdHlCPgU6hJZQr8DC", connectStringLength=connectStringLength@entry=29, tag=<optimized out>,
    tagLength=0, outTag=outTag@entry=0x7f493f7e58c8, outTagLength=outTagLength@entry=0x7f493f7e58d0, found=found@entry=0x7f493f7e58d4, mode=mode@entry=1,
    error=error@entry=0x7f493f7e5850) at src/dpiOci.c:3136
#12 0x00007f4962450f1c in dpiConn__getSession (error=0x7f493f7e5850, authInfo=0x7f4865fd7618, params=0x7f493f7e5870, connectStringLength=29,
    connectString=0x16baa68 "OCI:SP:zpuyRFdHlCPgU6hJZQr8DC", mode=1, conn=0x7f48c400d660) at src/dpiConn.c:861
#13 dpiConn__get (error=0x7f493f7e5850, pool=0x1685b60, createParams=0x7f493f7e5870, connectStringLength=29,
    connectString=0x16baa68 "OCI:SP:zpuyRFdHlCPgU6hJZQr8DC", passwordLength=<optimized out>, password=<optimized out>, userNameLength=<optimized out>,
    userName=<optimized out>, conn=0x7f48c400d660) at src/dpiConn.c:636
#14 dpiConn__create (conn=0x7f48c400d660, context=<optimized out>, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
    password=password@entry=0x0, passwordLength=passwordLength@entry=0, connectString=0x16baa68 "OCI:SP:zpuyRFdHlCPgU6hJZQr8DC", connectStringLength=29,
    pool=pool@entry=0x1685b60, commonParams=commonParams@entry=0x0, createParams=createParams@entry=0x7f493f7e5870, error=error@entry=0x7f493f7e5850)
    at src/dpiConn.c:414
#15 0x00007f496245ad71 in dpiPool__acquireConnection (pool=pool@entry=0x1685b60, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
    password=password@entry=0x0, passwordLength=0, params=params@entry=0x7f493f7e5870, conn=conn@entry=0x7f493f7e5958, error=error@entry=0x7f493f7e5850)
    at src/dpiPool.c:36
#16 0x00007f496245af1d in dpiPool_acquireConnection (pool=pool@entry=0x1685b60, userName=userName@entry=0x0, userNameLength=userNameLength@entry=0,
    password=password@entry=0x0, passwordLength=passwordLength@entry=0, params=0x7f493f7e5870, params@entry=0x0, conn=conn@entry=0x7f493f7e5958)
    at src/dpiPool.c:341
#17 0x0000000000406a5e in dpi__getConnFromPool (pool=0x1685b60) at db_util.c:197

sqlnet.log

SQLNET.RECV_TIMEOUT=200ms
SQLNET.OUTBOUND_CONNECT_TIMEOUT=300ms
DISABLE_OOB=on
ADR_BASE=/data01/oracle_diag

We had exactly this issue before adding DIAG_* params in sqlnet.ora file. Here is a stackoverflow thread which suggested DIAG_*'s addition.

@tgulacsi
Copy link

Maybe

poolCreateParams.sessionIncrement = 1;
poolCreateParams.timeout = 100;

Would be more real-world (what libclntsh writers awaited for)?

@quidstone
Copy link
Author

quidstone commented Oct 18, 2021

from doc here

dpiPoolCreateParams.timeout

Specifies the length of time (in seconds) after which idle sessions in the pool are terminated. Note that termination only occurs when the pool is accessed. The default value is 0 which means that no idle sessions are terminated. This value can be set after the pool has been created using the function dpiPool_setTimeout() and acquired using the function dpiPool_getTimeout().

dpiPoolCreateParams.sessionIncrement

Specifies the number of sessions that will be created by the session pool when more sessions are required and the number of sessions is less than the maximum allowed. This value is ignored if the dpiPoolCreateParams.homogeneous member has a value of 0. This value added to the dpiPoolCreateParams.minSessions member value must not exceed the dpiPoolCreateParams.maxSessions member value. The default value is 0.

I need a constant number of connections available at all time. so I added,

poolCreateParams.maxSessions = maxSessions;
poolCreateParams.minSessions = maxSessions;

Both have the same value assigned. I don't want to terminate idle connections ,so there will be no cost of creating one when needed. It can fail if there is none available within the pool timeout time. So I decided to assign session increment to zero. As there is no termination, max and min has same value assigned to the maxSessions param.

@quidstone
Copy link
Author

@cjbj, Should we consider updating instant client to 19.12? Where should we go from here? Changing DB version is not an option right now.

I am getting the feeling that these tools work better with relatively new versions of oracle DB. But to be honest, this timeout feature is too basic to fail this way. So, I am not sure why it is not working as it should. Is there any other info needed for you guys to understand it better?

@tgulacsi
Copy link

I cannot unsee the similarity with godror/godror#192 (comment) - we se SIGSEGVs when the pool is full (timeouts set), and acquiring new connections concurrently and fast enough.

Limiting the number of sessions on the call side (Go) evades this.

@cjbj
Copy link
Member

cjbj commented Oct 21, 2021

The Go issue could be unrelated, so let's treat the issues separately for now.

I just finished a discussion with the OCI developers. They want to know what's going but, but as expected, a testcase would help diagnose the problem.

We could look at trace. Overall it would be best if you could create an Oracle support request for the problem so "sensitive" files like traces can be shared; I don't know whether you can do this since you have 11.2 DB? It would be channelled through the OCI support team.

@cjbj
Copy link
Member

cjbj commented Oct 28, 2021

@quidstone if you're unable to createa support request (I wonder what your support status of 11.2 is?), I am happy to have a call with you to gather information. Send me email (see my profile).

@quidstone
Copy link
Author

hi @cjbj , I have already created a support request through our db-admin at out company. I admit that things are moving slow. This is more of our problem than anyone else's. We have to provide oracle support team traces from both servers(DB, client) when there is an issue. This is taking more time than it should. Unless I share those traces, I can't really tell what's next. We can talk about it a bit though. I will send you an email. thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants