This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

snapshot action applied during policy fails #453

Open
nateww opened this issue Jun 4, 2021 · 0 comments
Labels
bug Something isn't working

Comments


nateww commented Jun 4, 2021

Describe the bug
If a snapshot action is applied via a policy, that action fails when it runs. The same error occurs whether using a local FS or S3, but it's easiest to reproduce by running locally in docker-compose and snapshotting to a local FS.

To Reproduce
A simple docker-compose file to create a two-node cluster:

x-es: &es-defaults
    image: amazon/opendistro-for-elasticsearch:1.11.0
    networks:
      - es-net
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536

version: '3'
services:
  es-node1:
    <<: *es-defaults
    container_name: es-node1
    ports:
      - 9200:9200
      - 9600:9600
    environment:
      - node.name=es-node1
      - cluster.initial_master_nodes=es-node1,es-node2
      - cluster.name=es-cluster
      - discovery.seed_hosts=es-node1,es-node2
      - path.repo=/mnt/snapshots
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    volumes:
      - es-data1:/usr/share/elasticsearch/data
      - ./snapshots:/mnt/snapshots

  es-node2:
    <<: *es-defaults
    container_name: es-node2
    environment:
      - node.name=es-node2
      - cluster.initial_master_nodes=es-node1,es-node2
      - cluster.name=es-cluster
      - discovery.seed_hosts=es-node1,es-node2
      - path.repo=/mnt/snapshots
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    volumes:
      - es-data2:/usr/share/elasticsearch/data
      - ./snapshots:/mnt/snapshots

  kibana:
    image: amazon/opendistro-for-elasticsearch-kibana:1.11.0
    container_name: es-kibana
    ports:
      - 5601:5601
    expose:
      - "5601"
    networks:
      - es-net
    environment:
      ELASTICSEARCH_URL: https://es-node1:9200
      ELASTICSEARCH_HOSTS: https://es-node1:9200

volumes:
  es-data1:
  es-data2:

networks:
  es-net:
1. Run docker-compose up -d, then wait for the cluster to start up.
2. Log in to the cluster and generate something to back up with Sample Data (I chose Flight Data, but it shouldn't matter).
3. Select Management -> Dev Tools, and run the following command:
PUT _snapshot/local_repository
{
  "type": "fs",
  "settings": {
    "location": "/mnt/snapshots"
  }
}

(It should return { "acknowledged" : true}.)

4. Run a manual snapshot to verify the setup is working properly, by running the following command in the console:
PUT _snapshot/local_repository/test

(It should return { "accepted" : true}.)
5. Verify the contents of the snapshots folder in the directory where docker-compose was run:

$ ls snapshots
index-0                                 meta-n4cIosy2RgW-1hZUaIYGew.dat
index.latest                            snap-n4cIosy2RgW-1hZUaIYGew.dat
indices/
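For reference, the index.latest blob seen in the listing above tracks the repository's current "index-N" generation, stored as an 8-byte big-endian long. This is relevant to the "expected current generation [3]" error later in the log. A minimal sketch of that format (illustrative only; the exact on-disk details belong to Elasticsearch's BlobStoreRepository and may vary by version):

```python
import os
import struct
import tempfile

# Sketch: index.latest holds the current index-N generation as an
# 8-byte big-endian signed long.
def write_index_latest(path: str, generation: int) -> None:
    with open(path, "wb") as f:
        f.write(struct.pack(">q", generation))

def read_index_latest(path: str) -> int:
    with open(path, "rb") as f:
        return struct.unpack(">q", f.read(8))[0]

repo_dir = tempfile.mkdtemp()
latest = os.path.join(repo_dir, "index.latest")
write_index_latest(latest, 0)  # generation 0 pairs with the index-0 blob
assert read_index_latest(latest) == 0
```

Deleting the repository contents removes both index.latest and the index-N blobs, while the cluster still remembers the old generation.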
6. Blow away the contents of the manual snapshot:
$ rm -rf snapshots/*
7. Create a new ISM policy, similar to the following:
{
    "policy": {
        "description": "An index policy to slowly reduce load on server and eventually delete after a set period of time",
        "schema_version": 1,
        "default_state": "snapshot",
        "states": [
            {
                "name": "snapshot",
                "actions": [
                    {
                        "snapshot": {
                            "repository": "local_repository",
                            "snapshot": "snapshot"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "hot",
                        "conditions": {
                            "min_doc_count": 1
                        }
                    }
                ]
            },
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {
                        "state_name": "cold",
                        "conditions": {
                            "min_index_age": "1d"
                        }
                    }
                ]
            },
            {
                "name": "cold",
                "actions": [
                    {
                        "force_merge": {
                            "max_num_segments": 1
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "2d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "delete": {}
                    }
                ],
                "transitions": []
            }
        ]
    }
}
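As a quick sanity check before pasting a policy like the one above, it can help to verify that every transition's state_name refers to a state the policy actually defines. The helper below is my own sketch, not part of ISM, run here against a trimmed stand-in for the policy above (snapshot -> hot -> cold -> delete):

```python
# Hypothetical validator: returns (state, missing_target) pairs for any
# transition that points at an undefined state.
def check_transitions(policy: dict) -> list:
    states = {s["name"] for s in policy["policy"]["states"]}
    missing = []
    for s in policy["policy"]["states"]:
        for t in s.get("transitions", []):
            if t["state_name"] not in states:
                missing.append((s["name"], t["state_name"]))
    return missing

# Trimmed stand-in for the policy above.
policy = {"policy": {"default_state": "snapshot", "states": [
    {"name": "snapshot", "transitions": [{"state_name": "hot"}]},
    {"name": "hot", "transitions": [{"state_name": "cold"}]},
    {"name": "cold", "transitions": [{"state_name": "delete"}]},
    {"name": "delete", "transitions": []},
]}}

assert check_transitions(policy) == []  # all transition targets defined
```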

Select Management -> Index Management -> Index Policies, paste the above policy in, and name it something like test_policy.
8. Apply the above policy to the existing test index.
Select Management -> Index Management -> Indices, select the sample index (kibana_sample_data_flights), click Apply Policy, and in the dropdown select the policy you created and named in the prior step.
9. Wait for the policy to be applied to the index; the snapshot action will run and then fail.

Expected behavior
The snapshot should complete and not error out.

The log output from the snapshot failure is:

es-node2    | [2021-06-04T21:05:01,767][INFO ][c.a.o.i.i.ManagedIndexRunner] [es-node2] Executing attempt_snapshot for kibana_sample_data_flights
es-node2    | [2021-06-04T21:05:01,790][INFO ][o.e.s.SnapshotsService   ] [es-node2] snapshot [local_repository:snapshot-2021.06.04-21:05:01.768/uKIFe5apSQuPWv6tgvJDHA] started
es-node2    | [2021-06-04T21:05:01,790][INFO ][c.a.o.i.i.ManagedIndexRunner] [es-node2] Finished executing attempt_snapshot for kibana_sample_data_flights
es-node2    | [2021-06-04T21:05:01,811][WARN ][o.e.s.SnapshotShardsService] [es-node2] [[kibana_sample_data_flights][0]][local_repository:snapshot-2021.06.04-21:05:01.768/uKIFe5apSQuPWv6tgvJDHA] failed to snapshot shard
es-node2    | java.nio.file.NoSuchFileException: /mnt/snapshots/indices/Bw2it5LsRzmdIXZs20ZvYw/0/index-PyEmzKxkQlqc62OsbfTmYA
es-node2    |    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
es-node2    |    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
es-node2    |    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
es-node2    |    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:219) ~[?:?]
es-node2    |    at java.nio.file.Files.newByteChannel(Files.java:375) ~[?:?]
es-node2    |    at java.nio.file.Files.newByteChannel(Files.java:426) ~[?:?]
es-node2    |    at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:420) ~[?:?]
es-node2    |    at java.nio.file.Files.newInputStream(Files.java:160) ~[?:?]
es-node2    |    at org.elasticsearch.common.blobstore.fs.FsBlobContainer.readBlob(FsBlobContainer.java:155) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:131) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:117) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:2280) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1811) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:340) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:256) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
es-node2    |    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
es-node2    |    at java.lang.Thread.run(Thread.java:832) [?:?]
es-node2    | [2021-06-04T21:05:01,942][WARN ][o.e.s.SnapshotsService   ] [es-node2] [local_repository:snapshot-2021.06.04-21:05:01.768/uKIFe5apSQuPWv6tgvJDHA] failed to finalize snapshot
es-node2    | org.elasticsearch.snapshots.SnapshotException: [local_repository:snapshot-2021.06.04-21:05:01.768/uKIFe5apSQuPWv6tgvJDHA] failed to update snapshot in repository
es-node2    |    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$finalizeSnapshot$36(BlobStoreRepository.java:1019) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$8.onResponse(BlobStoreRepository.java:1663) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$8.onResponse(BlobStoreRepository.java:1660) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$4.clusterStateProcessed(BlobStoreRepository.java:1416) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:534) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:421) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at java.util.ArrayList.forEach(ArrayList.java:1510) [?:?]
es-node2    |    at org.elasticsearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:421) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:281) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:273) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:250) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:651) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
es-node2    |    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
es-node2    |    at java.lang.Thread.run(Thread.java:832) [?:?]
es-node2    | Caused by: org.elasticsearch.repositories.RepositoryException: [local_repository] concurrent modification of the index-N file, expected current generation [3] but it was not found in the repository
es-node2    |    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.ensureSafeGenerationExists(BlobStoreRepository.java:1657) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$writeIndexGen$54(BlobStoreRepository.java:1575) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) [elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:226) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:98) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at java.util.ArrayList.forEach(ArrayList.java:1510) ~[?:?]
es-node2    |    at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:98) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:127) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.action.StepListener.innerOnResponse(StepListener.java:62) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:40) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$writeIndexGen$52(BlobStoreRepository.java:1564) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.9.1.jar:7.9.1]
es-node2    |    ... 3 more

And if you look in the folder, the file does not exist. This feels like a race between when the filename is generated and when the contents are written, but I'm just guessing.
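One way to picture the "expected current generation [3] but it was not found" failure in the log: the check only passes if the index-N blob matching the remembered generation still exists in the repository. The toy model below is my own sketch (not Elasticsearch code) of such a check, showing how deleting repository contents out from under a cluster that remembers generation 3 produces this kind of error:

```python
import os
import tempfile

class RepositoryException(Exception):
    pass

# Toy model of a safe-generation check: verify the index-N blob for the
# remembered generation is still present before finalizing a snapshot.
def ensure_safe_generation_exists(repo_dir: str, expected_gen: int) -> None:
    if not os.path.exists(os.path.join(repo_dir, f"index-{expected_gen}")):
        raise RepositoryException(
            f"expected current generation [{expected_gen}] "
            "but it was not found in the repository")

repo = tempfile.mkdtemp()
open(os.path.join(repo, "index-3"), "w").close()
ensure_safe_generation_exists(repo, 3)    # passes: blob is present

os.remove(os.path.join(repo, "index-3"))  # like `rm -rf snapshots/*`
try:
    ensure_safe_generation_exists(repo, 3)
    raised = False
except RepositoryException:
    raised = True
assert raised  # stale remembered generation -> failure
```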

Note: as mentioned, nearly identical errors are produced when the same steps are applied to an S3 repository. Manual snapshots work fine, but snapshots created via a policy action fail every time.

Desktop (please complete the following information):

  • OS: macOS (although we're running in Docker)
  • Browser: Chrome (shouldn't matter)
  • Version: 1.11.0 (latest version deployed in AWS)

Additional context
The docker-compose file is based on the examples provided, and the steps used are copied almost verbatim from those examples.

nateww added the bug (Something isn't working) label on Jun 4, 2021