
ExtModel reads some forcings multiple times #354

Open
veenstrajelmer opened this issue Oct 3, 2022 · 4 comments · May be fixed by #641
Assignees
Labels
priority: high type: bug Something isn't working

Comments

@veenstrajelmer
Collaborator

veenstrajelmer commented Oct 3, 2022

Describe the bug
When reading in an ExtModel, I see some duplicate forcings.

The ext file contains:

[boundary]
quantity=dischargebnd
locationfile=../../../boundary_conditions/Lek.pli
forcingfile=../../../boundary_conditions/flow/rmm_rivdis_meas_20070101_20140101_MET.bc

[boundary]
quantity=dischargebnd
locationfile=../../../boundary_conditions/Waal.pli
forcingfile=../../../boundary_conditions/flow/rmm_rivdis_meas_20070101_20140101_MET.bc

[boundary]
quantity=dischargebnd
locationfile=../../../boundary_conditions/Maas.pli
forcingfile=../../../boundary_conditions/flow/rmm_rivdis_meas_20070101_20140101_MET.bc

[boundary]
quantity=salinitybnd
locationfile=../../../boundary_conditions/rmm_zeerand_v3.pli
forcingfile=../../../boundary_conditions/transport/rmm_zeesal_30years_v3.bc

[boundary]
quantity=temperaturebnd
locationfile=../../../boundary_conditions/rmm_zeerand_v3.pli
forcingfile=../../../boundary_conditions/transport/rmm_zeetemp_astronomic_v2.bc

The ExtModel has five boundaries, but the first three bc models each contain three forcing objects, and these are the same for all three bc models:

boundary 1 of 5: ..\..\..\boundary_conditions\flow\rmm_rivdis_meas_20070101_20140101_MET.bc
forcing 1 of 3: Lek_0001 (timeseries) (dischargebnd)
forcing 2 of 3: Waal_0001 (timeseries) (dischargebnd)
forcing 3 of 3: Maas_0001 (timeseries) (dischargebnd)
boundary 2 of 5: ..\..\..\boundary_conditions\flow\rmm_rivdis_meas_20070101_20140101_MET.bc
forcing 1 of 3: Lek_0001 (timeseries) (dischargebnd)
forcing 2 of 3: Waal_0001 (timeseries) (dischargebnd)
forcing 3 of 3: Maas_0001 (timeseries) (dischargebnd)
boundary 3 of 5: ..\..\..\boundary_conditions\flow\rmm_rivdis_meas_20070101_20140101_MET.bc
forcing 1 of 3: Lek_0001 (timeseries) (dischargebnd)
forcing 2 of 3: Waal_0001 (timeseries) (dischargebnd)
forcing 3 of 3: Maas_0001 (timeseries) (dischargebnd)
boundary 4 of 5: ..\..\..\boundary_conditions\transport\rmm_zeesal_30years_v3.bc
forcing 1 of 6: rmm_zeerand_0001 (timeseries) (salinitybnd)
forcing 2 of 6: rmm_zeerand_0032 (timeseries) (salinitybnd)
forcing 3 of 6: rmm_zeerand_0033 (timeseries) (salinitybnd)
forcing 4 of 6: rmm_zeerand_0110 (timeseries) (salinitybnd)
forcing 5 of 6: rmm_zeerand_0111 (timeseries) (salinitybnd)
forcing 6 of 6: rmm_zeerand_0147 (timeseries) (salinitybnd)
boundary 5 of 5: ..\..\..\boundary_conditions\transport\rmm_zeetemp_astronomic_v2.bc
forcing 1 of 1: rmm_zeerand_0001 (astronomic) (temperaturebnd amplitude)

I guess this happens because the bc file contains data for three rivers, which are coupled to one pli each in three separate blocks. However, I would say this is not expected behaviour.

To Reproduce
The dependency dfm_tools is optional but it helps with plotting.

from pathlib import Path
from hydrolib.core.dflowfm.mdu.models import ExtModel
import matplotlib.pyplot as plt
plt.close('all')

use_dfm_tools = False
if use_dfm_tools:
    from dfm_tools.hydrolib_helpers import forcingobject_to_dataframe

file_extnew = Path(r'p:\archivedprojects\11206813-006-kpp2021_rmm-2d\C_Work\31_RMM_FMmodel\computations\model_setup\run_206_HYDROLIB\RMM_bnd.ext') # was RMM_bnd_5bnds.ext
ext = ExtModel(file_extnew)

ext_boundaries = ext.boundary
for iEB, extbnd in enumerate(ext_boundaries):
    extbnd_filepath = extbnd.forcingfile.filepath
    print(f'boundary {iEB+1} of {len(ext_boundaries)}: {extbnd_filepath}')
    extbnd_forcings = extbnd.forcingfile.forcing
    fig,ax = plt.subplots(figsize=(12,6))
    leglabels_new = []
    for iEBF, forcing in enumerate(extbnd_forcings):
        print(f'forcing {iEBF+1} of {len(extbnd_forcings)}: {forcing.name} ({forcing.function}) ({forcing.quantityunitpair[1].quantity})')
        if use_dfm_tools:
            forcing_pd = forcingobject_to_dataframe(forcing)
            ax.set_title(f'{extbnd_filepath}')
            pc = forcing_pd.plot(ax=ax) #TODO: see CMEMS_interpolate_example.py for pcolormesh in case of verticalpositions
            leglabels = pc.get_legend_handles_labels()[1]
            leglabels_new.append(f'{forcing.name} ({forcing.function}) {leglabels[-1]}')
    ax.legend(leglabels_new)
    fig.tight_layout()

Expected behavior
I would expect the forcing data to be present once instead of, in this case, three times. I think that in the kernel, the pli name decides which forcing block is read from the bc file.

Additional information

  • Please note, in case caching is considered, that some bc files are very large, and memory overflows have to be avoided in that case. An example is p:\archivedprojects\11209278-004-dcsm-fm\models\dflowfm3d-waddenzee_200m-j22_6-v1a\boundary_conditions\hist\2017\flow\DWSM_Sea_uxuyadvection.bc

Version info (please complete the following information):

  • OS: Windows 10 and Spyder
  • Version: hydrolib-core main branch
@veenstrajelmer
Collaborator Author

A comment was made that the bc file is probably read only once and used multiple times. However, if I run this script on an ext file with one forcing block (e.g. only Waal) it takes 4.8 seconds, while it takes 13-15 seconds with three blocks (Waal/Lek/Maas). Therefore I expect that the file is in fact read multiple times.
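A quick way to back up such timing comparisons, without relying on a single run, is a small timeit harness. This is a generic sketch, not tied to hydrolib-core; the ext file paths in the usage comment are placeholders:

```python
import timeit

def best_time(fn, repeat: int = 3) -> float:
    """Best wall-clock time of fn() over `repeat` single runs.

    The minimum is the usual choice: it is least affected by other
    processes interfering with the measurement.
    """
    return min(timeit.repeat(fn, number=1, repeat=repeat))

# Usage sketch (paths are placeholders):
# t1 = best_time(lambda: ExtModel(Path("RMM_bnd_1bnd.ext")))
# t3 = best_time(lambda: ExtModel(Path("RMM_bnd_3bnds.ext")))
```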

@priscavdsluis priscavdsluis added this to To do in HYDROLIB-core via automation Nov 8, 2022
@priscavdsluis priscavdsluis added type: bug Something isn't working priority: high labels Apr 9, 2024
@MRVermeulenDeltares MRVermeulenDeltares moved this from To do to In progress in HYDROLIB-core Apr 29, 2024
@MRVermeulenDeltares MRVermeulenDeltares self-assigned this Apr 29, 2024
@MRVermeulenDeltares
Contributor

I verified that the file is indeed loaded three separate times.
We could introduce caching or lazy loading for these specific files, or for all files in general.
We do need to look into edge cases for this, e.g. how caching would affect a user who adjusts a file and then reloads it in the same script.
Will discuss with @rhutten.

@MRVermeulenDeltares
Contributor

MRVermeulenDeltares commented May 1, 2024

Edge cases for caching --> problems with bigger files:

  • What to do when a file is bigger than x MB or GB (t.b.d.)? (memory issues --> segmentation fault)
  • Can an existing toolbox be used for this?

To do:

  • Research how big files are handled with caching; this can be tested.
  • Research whether there are external packages that can be used for this.

Other possibility, should the issue become too complex:

  • If caching/lazy loading for ASCII files is too complex to implement, it could also be considered not to implement it. Alternatively, users will be able to use netCDF files for boundary conditions in the near future.

@MRVermeulenDeltares
Contributor

MRVermeulenDeltares commented May 1, 2024

Prototype

I made a simple prototype that caches the data from a file into a dictionary keyed on the file path.
It checks whether the file has changed based on a checksum and re-reads it if so; otherwise it returns the cached data.

An object to hold the checksum and data, and to verify whether the checksum has changed:

from pathlib import Path
from typing import Dict


class CachedPathData:
    """Holds the parsed file data and a checksum to detect file changes."""

    def __init__(self, filepath: Path, file_data: Dict) -> None:
        self._file_data = file_data
        self._checksum = FileChecksumCalculator.calculate_checksum(filepath)

    @property
    def file_data(self) -> Dict:
        return self._file_data

    @property
    def checksum(self) -> str:
        return self._checksum

    def checksum_changed(self, filepath: Path) -> bool:
        return self.checksum != FileChecksumCalculator.calculate_checksum(filepath)

A base class to make the cacher a singleton:

class Singleton:
    _instance = None

    def __new__(cls, *args, **kwargs):
        if not cls._instance:
            # object.__new__ takes no extra arguments, so do not forward them.
            cls._instance = super().__new__(cls)
        return cls._instance

The cache class:

from pathlib import Path
from typing import Callable, Dict, Optional


class FileDataCacheLoader(Singleton):
    _loaded_files: Dict[Path, CachedPathData] = {}

    def load(self, filepath: Path, parse: Callable[[Path], Dict]) -> Dict:
        cached_data: Optional[CachedPathData] = self._loaded_files.get(filepath)

        if cached_data is None or cached_data.checksum_changed(filepath):
            file_data = parse(filepath)
            cached_data = CachedPathData(filepath, file_data)
            self._loaded_files[filepath] = cached_data

        return cached_data.file_data

    def clear(self) -> None:
        self._loaded_files.clear()
        # Reset on the class, not the instance, so the singleton is recreated.
        type(self)._instance = None

A class to calculate the checksum:

from hashlib import sha256


class FileChecksumCalculator:
    @staticmethod
    def calculate_checksum(filepath: Path) -> str:
        sha256_hash = sha256()
        with open(filepath, "rb") as file:
            for chunk in iter(lambda: file.read(4096), b""):
                sha256_hash.update(chunk)
        return sha256_hash.hexdigest()

ParsableFileModel has its _load updated like this:

    _file_data_cache_loader = FileDataCacheLoader()

    def _load(self, filepath: Path) -> Dict:
        # TODO Make this lazy in some cases so it doesn't become slow
        if filepath.is_file():
            return self._file_data_cache_loader.load(filepath, self._parse)
        else:
            raise ValueError(f"File: `{filepath}` not found, skipped parsing.")

External packages which can be used

Caching

Python packages that can be used for caching:

  • cachetools
  • diskcache

Tools for profiling the caching:

  • memory_profiler
  • timeit
  • cProfile & profile

Profiling

Profiling without prototype

memory_profiler

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     92.6 MiB     92.6 MiB           1   @profile
     7                                         def method():
     8     92.6 MiB      0.0 MiB           1       plt.close('all')
     9
    10     92.6 MiB      0.0 MiB           1       use_dfm_tools = False
    11     92.6 MiB      0.0 MiB           1       if use_dfm_tools:
    12                                                 from dfm_tools.hydrolib_helpers import forcingobject_to_dataframe
    13
    14     92.6 MiB      0.0 MiB           1       file_extnew = Path(r'C:\Users\Verme_mn\OneDrive - Stichting Deltares\Documents\HYDROLIB\issue_354\computations\model_setup\run_206_HYDROLIB\RMM_bnd.ext') # was RMM_bnd_5bnds.ext
    15   2447.6 MiB   2355.0 MiB           1       ext = ExtModel(file_extnew)
    16
    17   2447.6 MiB      0.0 MiB           1       ext_boundaries = ext.boundary
    18   2576.3 MiB      0.0 MiB          13       for iEB, extbnd in enumerate(ext_boundaries):
    19   2566.0 MiB      0.0 MiB          12           extbnd_filepath = extbnd.forcingfile.filepath
    20   2566.0 MiB      0.0 MiB          12           print(f'boundary {iEB+1} of {len(ext_boundaries)}: {extbnd_filepath}')
    21   2566.0 MiB      0.0 MiB          12           extbnd_forcings = extbnd.forcingfile.forcing
    22   2573.5 MiB     93.1 MiB          12           fig,ax = plt.subplots(figsize=(12,6))
    23   2573.5 MiB      0.0 MiB          12           leglabels_new = []
    24   2573.5 MiB      0.0 MiB         193           for iEBF, forcing in enumerate(extbnd_forcings):
    25   2573.5 MiB      0.0 MiB         181               print(f'forcing {iEBF+1} of {len(extbnd_forcings)}: {forcing.name} ({forcing.function}) ({forcing.quantityunitpair[1].quantity})')
    26   2573.5 MiB      0.0 MiB         181               if use_dfm_tools:
    27                                                         forcing_pd = forcingobject_to_dataframe(forcing)
    28                                                         ax.set_title(f'{extbnd_filepath}')
    29                                                         pc = forcing_pd.plot(ax=ax) #TODO: see CMEMS_interpolate_example.py for pcolormesh in case of verticalpositions
    30                                                         leglabels = pc.get_legend_handles_labels()[1]
    31                                                         leglabels_new.append(f'{forcing.name} ({forcing.function}) {leglabels[-1]}')
    32   2573.5 MiB      0.0 MiB          12           ax.legend(leglabels_new)
    33   2576.3 MiB     35.7 MiB          12           fig.tight_layout()

Timeit

Execution time: 160.72932409999885 seconds

Profiling with prototype

memory_profiler

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     92.4 MiB     92.4 MiB           1   @profile
     7                                         def method():
     8     92.4 MiB      0.0 MiB           1       plt.close('all')
     9
    10     92.4 MiB      0.0 MiB           1       use_dfm_tools = False
    11     92.4 MiB      0.0 MiB           1       if use_dfm_tools:
    12                                                 from dfm_tools.hydrolib_helpers import forcingobject_to_dataframe
    13
    14     92.5 MiB      0.0 MiB           1       file_extnew = Path(r'C:\Users\Verme_mn\OneDrive - Stichting Deltares\Documents\HYDROLIB\issue_354\computations\model_setup\run_206_HYDROLIB\RMM_bnd.ext') # was RMM_bnd_5bnds.ext
    15   4098.9 MiB   4006.5 MiB           1       ext = ExtModel(file_extnew)
    17   4098.9 MiB      0.0 MiB           1       ext_boundaries = ext.boundary
    18   4235.7 MiB      0.0 MiB          13       for iEB, extbnd in enumerate(ext_boundaries):
    19   4224.9 MiB      0.0 MiB          12           extbnd_filepath = extbnd.forcingfile.filepath
    20   4224.9 MiB      0.0 MiB          12           print(f'boundary {iEB+1} of {len(ext_boundaries)}: {extbnd_filepath}')
    21   4224.9 MiB      0.0 MiB          12           extbnd_forcings = extbnd.forcingfile.forcing
    22   4232.8 MiB     99.4 MiB          12           fig,ax = plt.subplots(figsize=(12,6))
    23   4232.8 MiB      0.0 MiB          12           leglabels_new = []
    24   4232.8 MiB      0.0 MiB         193           for iEBF, forcing in enumerate(extbnd_forcings):
    25   4232.8 MiB      0.0 MiB         181               print(f'forcing {iEBF+1} of {len(extbnd_forcings)}: {forcing.name} ({forcing.function}) ({forcing.quantityunitpair[1].quantity})')
    26   4232.8 MiB      0.0 MiB         181               if use_dfm_tools:
    27                                                         forcing_pd = forcingobject_to_dataframe(forcing)
    28                                                         ax.set_title(f'{extbnd_filepath}')
    29                                                         pc = forcing_pd.plot(ax=ax) #TODO: see CMEMS_interpolate_example.py for pcolormesh in case of verticalpositions
    30                                                         leglabels = pc.get_legend_handles_labels()[1]
    31                                                         leglabels_new.append(f'{forcing.name} ({forcing.function}) {leglabels[-1]}')
    32   4232.8 MiB      0.1 MiB          12           ax.legend(leglabels_new)
    33   4235.7 MiB     37.2 MiB          12           fig.tight_layout()

timeit

Execution time: 153.78531450000082 seconds

Conclusion Profiling

The profilings above are compared, and where possible a conclusion is drawn from them.

memory_profiler

The caching prototype uses more memory, since the data from the files is cached.
The memory usage could be tackled with external packages that offer specialised caches,
e.g. Least Frequently Used or Least Recently Used caches, which evict entries based on how the objects are used.
More runtime information is needed (the timeit profile difference) to get a more conclusive outcome from the data, e.g. whether the extra memory usage is worth the runtime difference.

timeit

Approx. 7 seconds saved for approx. 1.5 GB of extra memory usage in this specific case.

Other notable findings

In the prototype I implemented the caching at the ParsableFileModel level, in its load.
I found that some caching is already done one level higher, in FileModel.
In the FileModel constructor the following piece of code is executed:
            data = self._load(loading_path)
            context.register_model(filepath, self)
            data["filepath"] = filepath
            kwargs.update(data)

            context.register_model(filepath, self)

self._load(loading_path) performs the actual loading of the file.
context.register_model(filepath, self) caches the model keyed on the file path, similar to the prototype.
The load could be skipped here when the file path already has a model available in the cache.
One peculiar thing is that context.register_model(filepath, self) is called twice; I do not see why this is done.

⚠️ With this information, caching should not be done at the load level, since we would then be caching twice: it is already done at the FileModel level.
Instead, the FileModel-level caching should be updated to not load the file if it is already cached.
Updates to that caching can still be considered, and should still be tested to verify the functionality and usability already explored in this issue.

I updated the caching in FileModel in the following way:

            if (data := context.retrieve_model(filepath)) is None:
                data = self._load(loading_path)
                context.register_model(filepath, self)
                data["filepath"] = filepath
            kwargs.update(data)

            #context.register_model(filepath, self)

timeit gave the following runtime:
Execution time: 135.82439739999973 seconds

That is a difference of approx. 25 seconds compared to the run without this change.

Profiling with caching

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     84.1 MiB     84.1 MiB           1   @profile
     7                                         def method():
     8     84.1 MiB      0.0 MiB           1       file_extnew = Path(r'C:\Users\Verme_mn\OneDrive - Stichting Deltares\Documents\HYDROLIB\issue_354\computations\model_setup\run_206_HYDROLIB\RMM_bnd.ext') # was RMM_bnd_5bnds.ext
     9   3842.6 MiB   3758.5 MiB           1       ext = ExtModel(file_extnew)


Execution time: 4327.978529399999 seconds

Profiling without caching

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     6     83.4 MiB     83.4 MiB           1   @profile
     7                                         def method():
     8     83.4 MiB      0.0 MiB           1       file_extnew = Path(r'C:\Users\Verme_mn\OneDrive - Stichting Deltares\Documents\HYDROLIB\issue_354\computations\model_setup\run_206_HYDROLIB\RMM_bnd.ext') # was RMM_bnd_5bnds.ext
     9   3422.8 MiB   3339.4 MiB           1       ext = ExtModel(file_extnew)


Execution time: 4504.056828699999 seconds

Noteworthy testing outcome

I tried loading an ext file that references the same file 11 times.
This file is ~5 MB.
With caching enabled, loading takes approx. 3 seconds and uses ~150 MB of memory.

With caching disabled, loading takes approx. 33 seconds and uses ~660 MB of memory.

I also tried a few self-made bc files with sizes from 300 MB to 1.4 GB.
⚠️ I cancelled these tests prematurely, since the Python run started to use up to 10 GB of memory.
This might be something to look into in the future.

Update on testing:

After implementation I tested again on the PR and on master.
Previously I checked the memory usage in the task manager.
I updated the script to also use the memory_usage method from memory_profiler, to retrieve the maximum memory used by the executed method.
This gave the following data:

Main:
Highest memory usage during one run: 291.11328125 MB
Execution time of one run: 27.909129899999243 seconds

PR:
Highest memory usage during one run: 240.171875 MB
Execution time of one run: 2.5429680999986886 seconds

Differences between main and PR:
Max memory usage: ~50 MB lower
Execution time: ~25 seconds faster

The files used for this test might not be the same as in the previous tests.
In the previous tests the task manager was used to estimate the max memory usage, which may have included other processes.

Test files:
Test_case.zip

Points for discussion in the next refinement

  • The caching should be updated in FileModel instead of ParsableFileModel.

  • Should the existing caching be replaced with a caching package, e.g. cachetools?

  • How should the caching handle changes to files which are cached: should it check this with a checksum, handled by a package if possible?

  • How should the caching handle big models: can this be handled by a package, e.g. with Least Frequently Used or Least Recently Used caches?

  • When changes are implemented, which cases should be tested?

    • e.g. big files, from >1 GB up to 20 GB.
    • Making changes to a cached file re-reads the file when the model is accessed again.
    • Other?
  • Check the time impact of the final implementation to see the loading improvement and memory load.

Requirements

As discussed with @veenstrajelmer, the following requirements should be tackled in this issue; other points can be thought about and implemented in a follow-up issue when needed.

  • The existing caching should be expanded to also be used when reading during initialization, instead of only when constructing new objects.
  • The existing caching should be expanded to verify whether a file has changed before returning the cached data.
    • When the file has changed, a new cache entry for the file should be made.

Test case to take into account:
Read a file --> update and change the model it was read into --> save the file --> read the same file again; the changed data is read, since the file was updated while the application is running.
(i.e. this tests that the cache is refreshed when the file has changed)
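The round-trip test case above can be sketched against a minimal checksum-checked cache (a stand-in, not the hydrolib-core FileModel API):

```python
import tempfile
from hashlib import sha256
from pathlib import Path

_cache: dict = {}

def read_model(path: Path) -> str:
    """Return the file content, re-reading only when the checksum changed."""
    digest = sha256(path.read_bytes()).hexdigest()
    cached = _cache.get(path)
    if cached is None or cached[0] != digest:
        _cache[path] = (digest, path.read_text())
    return _cache[path][1]

with tempfile.TemporaryDirectory() as d:
    bc = Path(d) / "model.bc"
    bc.write_text("original data")
    first = read_model(bc)            # file read and cached
    bc.write_text("updated data")     # "save" while the application runs
    second = read_model(bc)           # checksum differs, file is re-read
```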

@MRVermeulenDeltares MRVermeulenDeltares moved this from In progress to Review/test follow up in HYDROLIB-core May 2, 2024
@MRVermeulenDeltares MRVermeulenDeltares moved this from Review/test follow up to In progress in HYDROLIB-core May 6, 2024
@MRVermeulenDeltares MRVermeulenDeltares moved this from In progress to Review/test follow up in HYDROLIB-core May 6, 2024
@MRVermeulenDeltares MRVermeulenDeltares moved this from Review/test follow up to In progress in HYDROLIB-core May 8, 2024
@MRVermeulenDeltares MRVermeulenDeltares linked a pull request May 8, 2024 that will close this issue
3 tasks
@MRVermeulenDeltares MRVermeulenDeltares moved this from In progress to Ready to review in HYDROLIB-core May 8, 2024
@tim-vd-aardweg tim-vd-aardweg moved this from Ready to review to Review/test follow up in HYDROLIB-core May 13, 2024