
Job disk resource limits are not honoured #3653

Closed
simonwo opened this issue Mar 19, 2024 · 5 comments · Fixed by #4006
Labels
good first issue (Good for newcomers), type/bug (Type: Something is not working as expected)

Comments

simonwo (Contributor) commented Mar 19, 2024

Constraints: []
Count: 1
Labels: {}
Name: cd1071fc-3533-4d6b-8cd1-17fb75731096
Namespace: d701e80998588caaa25edf4c2ac1b7269fd73a3bd7eb130ea4cb5158387f8f56
Priority: 0
Revision: 3
Tasks:
  - Engine:
      Params:
        Entrypoint: [ls]
        EnvironmentVariables: []
        Image: ubuntu
        Parameters:
        - "-lah"
        - /data.sqlite
        WorkingDirectory: ""
      Type: docker
    Name: main
    Network:
      Type: None
    Publisher:
      Type: local
    Resources:
      Disk: 100Kb
    InputSources:
    - Source:
        Type: urldownload
        Params:
          URL: https://data.api.trade.gov.uk/v1/datasets/uk-tariff-2021-01-01/versions/v4.0.176/data?format=sqlite&download
      Target: /data.sqlite
    ResultPaths:
    - Name: outputs
      Path: /outputs
    Timeouts:
      ExecutionTimeout: 1800
Type: batch
Version: 1

This job should fail because the requested URL download is far larger than 100Kb, but it succeeds:

total 2.3G
drwxr-xr-x 2 root root 4.0K Mar 19 05:51 .
drwxr-xr-x 1 root root 4.0K Mar 19 05:51 ..
-rw-r--r-- 1 root root 2.3G Mar 19 05:51 data
simonwo added the type/bug label on Mar 19, 2024
simonwo (Contributor, Author) commented Mar 19, 2024

Also, the Target path is ignored.

wdbaruni added the good first issue label on Apr 16, 2024
udsamani (Collaborator) commented May 2, 2024

I did a bit of digging for this.

When an AskBidRequest is received by a compute node, we run bidding for the job. The interesting thing is that when computing the disk usage in DiskUsageCalculator, we do not take the parsedResources into consideration.

Ideally, the totalDiskRequirements (obtained by computing the size via the StorageProvider) should fit within the parsedUsage resources. If they don't, the compute node should reject the job request.

@wdbaruni, what do you think?
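A minimal sketch of the comparison described above, with hypothetical names and signatures (this is not the actual bacalhau API, just the shape of the missing check):

```go
package main

import "fmt"

// shouldBid sketches the missing check: compare the disk size computed
// from the storage provider against the limit declared in the job spec.
// Name, signature, and byte counts are illustrative only.
func shouldBid(calculatedDiskBytes, requestedDiskBytes uint64) bool {
	return calculatedDiskBytes <= requestedDiskBytes
}

func main() {
	// A ~2.3 GiB input against the 100Kb limit from the job spec above.
	fmt.Println(shouldBid(2_469_606_195, 100_000)) // false: reject the bid
}
```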

frrist (Member) commented May 2, 2024

I'd recommend exploring the StorageProvider implementation used for fetching URL storage sources: https://github.com/bacalhau-project/bacalhau/blob/main/pkg/storage/url/urldownload/storage.go#L79.

It's worth noting that the size of the content may not always be readily available or accurately reported by the server - as the comment in the above link suggests.

One way we can (try to) mitigate this is by aborting downloads that exceed the content length specified by the server.

udsamani (Collaborator) commented May 2, 2024

Indeed, I completely agree with you.

However, what I was pointing out is that there are no such checks at all. Say you replace the urldownload source type with s3: even in that case the job would pass, right? Ideally there should be a comparison between the actual input size and the one provided in the job spec.

I may be wrong about this, but I don't see any such checks. If you know of any, could you please point me to them? Note that for s3 we do successfully fetch the size in all scenarios:

func (s *StorageProvider) GetVolumeSize(ctx context.Context, volume models.InputSource) (uint64, error) {
	ctx, cancel := context.WithTimeout(ctx, config.GetVolumeSizeRequestTimeout())
	defer cancel()

	source, err := s3helper.DecodeSourceSpec(volume.Source)
	if err != nil {
		return 0, err
	}

	client := s.clientProvider.GetClient(source.Endpoint, source.Region)
	objects, err := s.explodeKey(ctx, client, source)
	if err != nil {
		return 0, err
	}
	size := uint64(0)
	for _, object := range objects {
		size += uint64(object.size)
	}
	return size, nil
}

udsamani (Collaborator) commented May 2, 2024

Thank you for clarifying offline. As we discussed, the checks are happening via the ShouldBidBasedOnUsage interface. So the problem at hand is to update the urldownload storage provider to return the size of the download, which may not always be readily available.

Will update once I have an approach for it.

udsamani added a commit to udsamani/bacalhau that referenced this issue May 6, 2024
udsamani added a commit to udsamani/bacalhau that referenced this issue May 7, 2024
udsamani added a commit to udsamani/bacalhau that referenced this issue May 13, 2024
udsamani linked a pull request May 16, 2024 that will close this issue