
Assert in HSH_Lookup with Varnish 7.2.1 #3879

Open
jocel1 opened this issue Dec 7, 2022 · 14 comments
jocel1 commented Dec 7, 2022

Hi,

I'm using varnish with a malloc allocator and a file allocator for static files.
Settings are the following:

ExecStart=/usr/sbin/varnishd -a :80 -p workspace_session=1024 -p workspace_thread=4096 -p workspace_client=131072 -p workspace_backend=131072 -p listen_depth=16383 -p http_resp_hdr_len=65536 -p http_resp_size=98304 -p workspace_backend=131072 -p thread_pool_min=200 -p thread_pool_max=5000 -p feature=+http2 -P %t/%N/varnishd.pid -T localhost:6082 -f /etc/varnish/default.vcl -s malloc,12288m -s static=file,/tmp/varnish_storage.bin,100G

Expected Behavior

No crash

Current Behavior

I noticed varnish was complaining in the systemd journal.
Active: active (running) since lun. 2022-12-05 20:39:57 CET; 1 day 18h ago

déc. 06 21:25:33 varnishd[2813]: Child (8130) Started
déc. 06 21:25:33 varnishd[2813]: Child (8130) said Child starts
déc. 06 21:25:33 varnishd[2813]: Child (8130) said : Error in munmap(): Invalid argument
déc. 06 21:25:33 varnishd[2813]: Child (8130) said SMF.static mmap'ed 107374182400 bytes of 107374182400
déc. 07 07:59:26 varnishd[2813]: Child (8130) died signal=6
déc. 07 07:59:26 varnishd[2813]: Child (8130) Panic at: Wed, 07 Dec 2022 06:59:26 GMT
Assert error in HSH_Lookup(), cache/cache_hash.c line 426:
Condition((oc)->magic == 0x4d301302) not true....

So the first warning happened a day after starting the server. Then a few hours later, we can see the assert.

This is happening on an ARM architecture (AWS t4g servers) running Amazon Linux 2 with kernel 5.15.75-48.135.
I'm using 4 VMODs: directors, header, std & xkey.
I'm fetching the binaries from https://packagecloud.io/install/repositories/varnishcache/varnish72/script.rpm.sh & https://github.com/varnish/varnish-modules/releases/download/0.21.0/varnish-modules-0.21.0.tar.gz

My vcl_hash is not that complicated:

sub vcl_hash {
  # Called after vcl_recv to create a hash value for the request. This is used as a key
  # to look up the object in Varnish.

  # Some generic URL manipulation, useful for all templates that follow
  # First remove the Google Analytics added parameters, useless for our backend
  if (req.url ~ "(\?|&)(mc_[a-z]+|utm_[a-z]+|eff_[a-z0-9]+|fbclid|gclid|cx|ie|cof|siteurl|ranMID|ranEAID|ranSiteID)=") {
    set req.http.hash_url = regsuball(req.url, "(\?|&)(mc_[a-z]+|utm_[a-z]+|eff_[a-z0-9]+|fbclid|gclid|cx|ie|cof|siteurl|ranMID|ranEAID|ranSiteID)=[^&]+", "\1");
    set req.http.hash_url = regsub(req.http.hash_url, "&+$", "");
    set req.http.hash_url = regsub(req.http.hash_url, "&+", "&");
    set req.http.hash_url = regsub(req.http.hash_url, "\?&", "?");
    set req.http.hash_url = regsub(req.http.hash_url, "\?$", "");
  } else {
    set req.http.hash_url = req.url;
  }

  hash_data(req.http.hash_url);

  if (req.http.host) {
    hash_data(req.http.host);
  } else {
    hash_data(server.ip);
  }

  # keep different cache between https and http to avoid infinite loop
  if (req.http.X-Forwarded-Proto) {
     hash_data(req.http.X-Forwarded-Proto);
  }

  if (req.http.X-Requested-With) {
     hash_data(req.http.X-Requested-With);
  }

  return (lookup);
}
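For illustration only, here is the same normalization expressed in Python (the `hash_url` function name is mine; the parameter list mirrors the VCL above; this is a sketch of the regex logic, not part of the VCL):

```python
import re

# Tracking parameters stripped before hashing, mirroring the VCL above.
TRACKING = r"(mc_[a-z]+|utm_[a-z]+|eff_[a-z0-9]+|fbclid|gclid|cx|ie|cof|siteurl|ranMID|ranEAID|ranSiteID)"

def hash_url(url: str) -> str:
    """Strip tracking query parameters so URLs differing only in them
    share a single cache object. Note: Python's re.sub replaces all
    matches, while VCL's regsub only replaces the first, so the cleanup
    steps are approximate."""
    if re.search(r"(\?|&)" + TRACKING + "=", url):
        url = re.sub(r"(\?|&)" + TRACKING + r"=[^&]+", r"\1", url)  # drop the parameter, keep its delimiter
        url = re.sub(r"&+$", "", url)   # trailing "&"
        url = re.sub(r"&+", "&", url)   # collapse runs of "&"
        url = re.sub(r"\?&", "?", url)  # "?&" left behind by a removal
        url = re.sub(r"\?$", "", url)   # bare trailing "?"
    return url
```

For example, hash_url("/p?utm_source=x&id=1") yields "/p?id=1", and hash_url("/a?gclid=abc") yields "/a".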

Any idea what could be wrong?

Best regards,
Jocelyn Fournier

dridi commented Dec 7, 2022

No idea what's going on here, but file storage seems to be highly unreliable these days on aarch64 servers.

Could you try replacing your file storage with a 100GB malloc storage instead? Why did you add another 12GB malloc storage?

And how is /tmp mounted on this system?

jocel1 commented Dec 7, 2022

Unfortunately I "only" have 16GB of RAM on this machine. The idea was to split the data in two:

  • one storage for all the images (> 1TB), where performance is not critical (there's a CDN in front of Varnish), to avoid hitting the backend too much
  • one storage for all the HTML pages, in memory (16GB should be more than enough for that kind of page)

/tmp is not on tmpfs; it's on the standard disk partition.
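For reference, a split like this is usually wired up in vcl_backend_response by pointing beresp.storage at the named stevedore. A minimal sketch (the storage name "static" matches the -s flag in the ExecStart above; selecting on Content-Type is an assumption about how the routing is done here):

```vcl
sub vcl_backend_response {
  # Route images to the large "static" store; everything else
  # stays in the default malloc storage.
  if (beresp.http.Content-Type ~ "^image/") {
    set beresp.storage = storage.static;
  }
}
```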

jocel1 commented Dec 8, 2022

Well I'm now running without the file storage, using -s malloc,12288m -s static=malloc,1024m

However, it seems I still get random panics:

déc. 07 20:13:03 varnishd[2789]: Child launched OK
déc. 07 20:13:03 varnishd[2790]: Child (2802) said Child starts
déc. 07 20:13:03 varnishd[2790]: Child (2802) said <jemalloc>: Error in munmap(): Invalid argument
déc. 07 20:13:03 systemd[1]: Started Varnish Cache, a high-performance HTTP accelerator.
déc. 07 20:52:10 varnishd[2790]: Child (2802) died signal=6
déc. 07 20:52:10 varnishd[2790]: Child (2802) Panic at: Wed, 07 Dec 2022 19:52:10 GMT
                                                                             Wrong turn at cache/cache_main.c:323:
                                                                             Signal 11 (Segmentation fault) received at 0xffff92ffe000 si_code 2...
déc. 07 20:52:10 varnishd[2790]: Child cleanup complete
déc. 07 20:52:10 varnishd[2790]: Child (3704) Started
déc. 07 20:52:10 varnishd[2790]: Child (3704) said Child starts
déc. 07 20:52:10 varnishd[2790]: Child (3704) said <jemalloc>: Error in munmap(): Invalid argument

jocel1 commented Dec 8, 2022

Extra backtrace info:

Child (2802) Panic at: Wed, 07 Dec 2022 19:52:10 GMT
Wrong turn at cache/cache_main.c:323:
Signal 11 (Segmentation fault) received at 0xffff92ffe000 si_code 2
version = varnish-7.2.1 revision 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, vrt api = 16.0
ident = Linux,5.15.75-48.135.amzn2.aarch64,aarch64,-junix,-smalloc,-smalloc,-sdefault,-hcritbit,epoll
now = 2399.653874 (mono), 1670442730.667755 (real)
Backtrace:
  0x449230: /usr/sbin/varnishd() [0x449230]
  0x4c41dc: /usr/sbin/varnishd(VAS_Fail+0x58) [0x4c41dc]
  0x443d48: /usr/sbin/varnishd() [0x443d48]
  0xffffa67df820: linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffffa67df820]
  0xffffa6472d88: /lib64/libc.so.6(+0x85d88) [0xffffa6472d88]
  0x47fd3c: /usr/sbin/varnishd() [0x47fd3c]
  0x480494: /usr/sbin/varnishd() [0x480494]
  0x437150: /usr/sbin/varnishd(VFP_Suck+0xa4) [0x437150]
  0x432ca4: /usr/sbin/varnishd() [0x432ca4]
  0x435b60: /usr/sbin/varnishd() [0x435b60]
argv = {
  [0] = \"/usr/sbin/varnishd\",
  [1] = \"-a\",
  [2] = \":80\",
  [3] = \"-p\",
  [4] = \"workspace_session=1024\",
  [5] = \"-p\",
  [6] = \"workspace_thread=4096\",
  [7] = \"-p\",
  [8] = \"workspace_client=131072\",
  [9] = \"-p\",
  [10] = \"workspace_backend=131072\",
  [11] = \"-p\",
  [12] = \"listen_depth=16383\",
  [13] = \"-p\",
  [14] = \"http_resp_hdr_len=65536\",
  [15] = \"-p\",
  [16] = \"http_resp_size=98304\",
  [17] = \"-p\",
  [18] = \"workspace_backend=131072\",
  [19] = \"-p\",
  [20] = \"thread_pool_min=200\",
  [21] = \"-p\",
  [22] = \"thread_pool_max=5000\",
  [23] = \"-p\",
  [24] = \"feature=+http2\",
  [25] = \"-P\",
  [26] = \"/run/varnish/varnishd.pid\",
  [27] = \"-T\",
  [28] = \"localhost:6082\",
  [29] = \"-f\",
  [30] = \"/etc/varnish/default.vcl\",
  [31] = \"-s\",
  [32] = \"malloc,12288m\",
  [33] = \"-s\",
  [34] = \"static=malloc,1024m\",
}
pthread.self = 0xffff974671b0
pthread.name = (cache-worker)
pthread.attr = {
  guard = 4096,
  stack_bottom = 0xffff97448000,
  stack_top = 0xffff97468000,
  stack_size = 131072,
}
thr.req = NULL
thr.busyobj = 0xffff16fb0020 {
  end = 0xffff16fd0000,
  retries = 0,
  sess = 0xffff33ffc420 {
    fd = 88, vxid = 2689893,
    t_open = 1670442707.678030,
    t_idle = 1670442730.614282,
    ws = 0xffff33ffc460 {
      id = \"ses\",
      {s, f, r, e} = {0xffff33ffc4b8, +96, (nil), +832},
    },
    transport = HTTP/1 {
      state = HTTP1::Proc
    }
    client = x 50916 :80,
    local.endpoint = :80,
    local.socket = a0,
    local.ip = x:80,
    remote.ip = x:50916,
    server.ip = x:80,
    client.ip = x:50916,
  },
  worker = 0xffff974666c8 {
    ws = 0xffff97466750 {
      id = \"wrk\",
      {s, f, r, e} = {0xffff974655d0, +0, (nil), +4088},
    },
    VCL::method = BACKEND_RESPONSE,
    VCL::methods = {BACKEND_FETCH, BACKEND_RESPONSE},
  },
  vfc = 0xffff16fb4f38 {
    failed = 0,
    req = 0xffff16fb0648,
    resp = 0xffff16fb0ac0,
    wrk = 0xffff974666c8,
    oc = 0xffff6b24c340,
    filters = {
      V1F_STRAIGHT = 0xffff16fcd4b0 {
        priv1 = 0xffff16fb5448,
        priv2 = 117346,
        closed = 0
      },
    },
    obj_flags = 0x0,
  },
  filter_list = \"\",
  ws = 0xffff16fb0060 {
    id = \"bo\",
    {s, f, r, e} = {0xffff16fb4f80, +99984, (nil), +110712},
  },
  ws_bo = 0xffff16fb5430,
  http[bereq] = 0xffff16fb0648 {
    ws = 0xffff16fb0060 {
      [Already dumped, see above]
    },
    hdrs {
      \"GET\",
      \"/2436994-large_default/vintage-multicolor-floral-murano-glass-sommerso-vase-italy-1970s.jpg\",
      \"HTTP/1.1\",
      \"X-Forwarded-Proto: https\",
      \"X-Forwarded-Port: 443\",
      \"X-Amzn-Trace-Id: Root=1-6390eeea-3f4ea97408e44db20ed02120\",
      \"x-default-isocode: US\",
      \"x-default-language: en\",
      \"cf-ray: 775fccd8297cacf8-ATL\",
      \"cf-visitor: {\"scheme\":\"https\"}\",
      \"accept: image/webp,image/png,image/svg+xml,image/*;q=0.8,video/*;q=0.8,*/*;q=0.5\",
      \"user-agent: Mozilla/5.0 (iPhone; CPU iPhone OS 15_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/238.1.487893381 Mobile/15E148 Safari/604.1\",
      \"accept-language: en-US,en;q=0.9\",
      \"referer: https://www.design-market.us/248098-vintage-multicolor-floral-murano-glass-sommerso-vase-italy-1970s.html?redirected=true\",
      \"cf-connecting-ip: x\",
      \"cf-pseudo-ipv4: x\",
      \"cf-ipcountry: US\",
      \"cdn-loop: cloudflare\",
      \"X-Forwarded-For: x, x, x\",
      \"Via: 1.1 ip-x.eu-west-1.compute.internal (Varnish/7.2)\",
      \"host: www.design-market.us\",
      \"X-ETag: noxetag\",
      \"Accept-Encoding: gzip\",
      \"hash_url: /2436994-large_default/vintage-multicolor-floral-murano-glass-sommerso-vase-italy-1970s.jpg\",
      \"X-Varnish: 1249115\",
    },
  },
  http[beresp] = 0xffff16fb0ac0 {
    ws = 0xffff16fb0060 {
      [Already dumped, see above]
    },
    hdrs {
      \"HTTP/1.1\",
      \"200\",
      \"OK\",
      \"Server: nginx/1.22.0\",
      \"Date: Wed, 07 Dec 2022 19:52:10 GMT\",
      \"Content-Type: image/jpeg\",
      \"Content-Length: 117346\",
      \"Connection: keep-alive\",
      \"Strict-Transport-Security: max-age=15552001; includeSubDomains;\",
      \"Last-Modified: Wed, 12 Oct 2022 18:45:39 GMT\",
      \"Accept-Ranges: bytes\",
      \"Expires: Fri, 06 Jan 2023 19:52:10 GMT\",
      \"X-Content-Type-Options: nosniff\",
      \"X-XSS-Protection: 1; mode=block\",
      \"X-Frame-Options: sameorigin\",
      \"x-storage: static\",
      \"cache-control: public, max-age=2592000\",
      \"static-url: /2436994-large_default/vintage-multicolor-floral-murano-glass-sommerso-vase-italy-1970s.jpg\",
      \"X-UA-Device: \",
      \"X-Device-Type: \",
      \"X-Country-Code: \",
      \"X-ETag: noxetag\",
      \"Accept-Language: en-US,en;q=0.9\",
    },
  },
  objcore[fetch] = 0xffff6b24c340 {
    refcnt = 3,
    flags = {},
    exp_flags = {refd},
    boc = 0xffff4058cf40 {
      refcnt = 3,
      state = stream,
      vary = (nil),
      stevedore_priv = (nil),
    },
    exp = {1670442730.667573, 2592000.000000, 120.000000, 0.000000},
    objhead = 0xffff1205b060,
    stevedore = 0xffffa5bf0160 (malloc static) {
      Simple = 0xffff15d61f00,
      Obj = 0xffff5d156918 {priv=0xffff5d156910, ptr=0xffff15d61f00, len=744, space=744},
      LEN = 0x0...0,
      VXID = 0x00130f5b,
      FLAGS = 0x08,
      GZIPBITS = 0x0...0,
      LASTMODIFIED = 0x41d8d1c2d4c00000,
      VARY = {len=0, ptr=(nil)},
      HEADERS = {len=624, ptr=0xffff15d61f78},
      Body = 0xffff653920e8 {priv=0xffff653920e0, ptr=0xffff92ff0000, len=0, space=58673},
    },
  },
  http_conn = 0xffff16fb5448 {
    fd = 120 (@0xffff46837944),
    doclose = null(Not Closing)
    ws = 0xffff16fb0060 {
      [Already dumped, see above]
    },
    {rxbuf_b, rxbuf_e} = {0xffff16fb54b0, 0xffff16fb5807},
    {pipeline_b, pipeline_e} = {0xffff16fb5807, 0xffff16fcd4b0},
    content_length = 117346,
    body_status = length,
    first_byte_timeout = 300.000000,
    between_bytes_timeout = 300.000000,
  },
  flags = {do_stream},
  director_req = 0xffffa5a40af0 {
    cli_name = boot.front,
    admin_health = probe, changed = 1670440383.896565,
    type = round-robin {
    },
  },
  director_resp = 0xffffa5a40ac0 {
    cli_name = boot.nlb_ip1,
    admin_health = healthy, changed = 1670440383.895000,
    type = backend {
      conn_pool = 0xffffa5af0600 {
        ident = 0xfd480fbdd018b14c3223a30af2cadc2e300897dfdf1263c3d76a6b7fd3d2069c,
        vrt_endpoint = 0xffffa5a62ea0 {
          ipv4 = 127.0.0.1, port = 8080,
        },
      },
      hosthdr = localhost,
      n_conn = 21,
    },
  },
  vcl[vcl] = 0xffffa5af0380 {
    name = \"boot\",
    busy = 103,
    discard = 0,
    state = auto,
    temp = warm,
    conf = 0xffff95b7e650 {
      syntax = \"40\",
      srcname = {
        [0] = \"/etc/varnish/default.vcl\",
        [1] = \"/etc/varnish/backend.vcl\",
        [2] = \"/etc/varnish/devicedetect.vcl\",
        [3] = \"<builtin>\",
      },
      instances = {
        \"front\" = 0xffffa5a72e00,
      },
    },
  },
  vpi = 0xffff974655c0 {
    handling (VCL::return) = 0x0 (none),
    ref = 161,
    vpi_ref = 0xffff95b7dc90 {
      source = 0 (\"/etc/varnish/default.vcl\"),
      offset = 21167,
      line = 525,
      pos = 3,
      src = \"set beresp.http.Accept-Language = bereq.[...]\"
    },
  },
},
thr.worker = 0xffff974666c8 {
  [Already dumped, see above]
},
vmods = {
  std = {0xffffa5be01d0, Varnish 7.2.1 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, 0.0},
  directors = {0xffffa5be0240, Varnish 7.2.1 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, 0.0},
  header = {0xffffa5be02b0, Varnish 7.2.1 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, 16.0},
  xkey = {0xffffa5be0320, Varnish 7.2.1 399fa7ed46d68f4c4f42a8fd2748339750d99a8b, 0.0},
},
pools = {
  pool = 0xffff9de40100 {
    nidle = 173,
    nthr = 200,
    lqueue = 0
  },
  pool = 0xffff9de40740 {
    nidle = 176,
    nthr = 200,
    lqueue = 0
  },
},

jocel1 commented Dec 8, 2022

FYI I've just upgraded the jemalloc lib (from 3.6.0 to 5.3.0); I'm monitoring whether it has any impact on this issue (it should at least fix the "Error in munmap(): Invalid argument" error).

dridi added a commit to dridi/varnish-cache that referenced this issue Dec 8, 2022
This removes the automatic check made only on Linux systems for jemalloc
now that operating systems we care about ship a much more recent version
than the 3.6 that is known to work really well with Varnish on x86_64
hardware.

Of the platforms we provide packages for, only Ubuntu 18.04 (bionic) and
RHEL 7 (via EPEL) ship jemalloc 3.6.0 and the rest already moved on to a
5.x series.

It appears that jemalloc 3.6 on aarch64 results in unstable Varnish
workloads, while jemalloc 5 is known to generate a lot of waste with its
default configuration under a highly threaded workload like Varnish's,
based on feedback we have seen over the years.

From now on, jemalloc is found via pkg-config, only if it is explicitly
requested at configure time. With explicit opt-in, the Linux-only check
is gone and we solely rely on pkg-config to find the library. This can
of course be overridden as usual at configure time:

    ./configure --with-jemalloc JEMALLOC_CFLAGS=... JEMALLOC_LIBS=...

Or, if you don't like long command lines, via environment variables.

Refs varnishcache#3867
Refs varnishcache#3879
dridi commented Dec 8, 2022

I would also be interested to learn how it fares without jemalloc at all, see #3881.

jocel1 commented Dec 8, 2022

Thx, I'll also try without jemalloc at all!

jocel1 commented Dec 11, 2022

FYI Varnish has been running with the original -s malloc,12288m -s static=file,/tmp/varnish_storage.bin,100G and jemalloc 5.3.0 without any issue so far!

dridi commented Dec 19, 2022

Could you please try another build with jemalloc 3.6?

./configure --with-jemalloc --with-unwind --enable-debugging-symbols

And then run varnishd with this jemalloc configuration in your environment:

export MALLOC_CONF="abort:true,redzone:true,junk:true"

jocel1 commented Dec 22, 2022

Hi @dridi

Just to be sure, release 7.2.1 should be fine for this test, no need to use develop?

jocel1 commented Dec 25, 2022

Hi @dridi

FYI it's not possible to start varnish with jemalloc 3.6 & the debug config on ARM because of the munmap issue:

déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Version: varnish-7.2.1 revision NOGIT
déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Platform: Linux,5.10.157-139.675.amzn2.aarch64,aarch64,-junix,-smalloc,-sfile,-sdefault,-hcritbit
déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Child (22295) Started
déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Child (-1) said Child starts
déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Child (-1) said : Error in munmap(): Invalid argument
déc. 25 21:42:30 ip-172-29-97-79.eu-west-1.compute.internal varnishd[22282]: Child failed on launch
déc

With jemalloc 5.3.0 it starts properly (but there are other issues after a while, with instability / high CPU usage).

jocel1 commented Jan 3, 2023

Hi @dridi

I'm now evaluating the 7.2.1 version compiled without jemalloc (+ varnish-modules) on an ARM AWS Graviton2 processor (compiled with gcc10 and the -mcpu=neoverse-n1 flag).
I'll let you know if there's any instability.

jocel1 commented Jan 5, 2023

Hi @dridi

FYI 7.2.1 without jemalloc, using the patch from #3881, seems to work pretty well on AWS ARM. I see a few spikes in CPU usage during aggressive Google crawls that I didn't have before (on x86_64 instances), but response times are good & stable, so it's not really an issue (here it's an x2gd.large with 2 cores and 32GB of RAM).

jocel1 commented Jan 10, 2023

I confirm after more than 1 week of use everything is still really stable on ARM without jemalloc.
