AAN dataset crashing when loading .tsv file #53

Open
exnx opened this issue Feb 26, 2023 · 4 comments

exnx commented Feb 26, 2023

Did anyone else have issues loading the AAN dataset into memory? In particular, when I load the .tsv file into memory, it crashes :/ I tried several different Google Cloud instances with varying amounts of memory, up to 170 GB and 24 CPUs, but it still crashed. I feel like I am missing something. Here's the snippet of code that crashes the instance every time.

from datasets import DatasetDict, Value, load_dataset
...

        dataset = load_dataset(
            "csv",
            data_files={
                "train": str(self.data_dir / "new_aan_pairs.train.tsv"),  # 8G file
                "val": str(self.data_dir / "new_aan_pairs.eval.tsv"),
                "test": str(self.data_dir / "new_aan_pairs.test.tsv"),
            },
            delimiter="\t",
            column_names=["label", "input1_id", "input2_id", "text1", "text2"],
            keep_in_memory=True,
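
For readers hitting the same crash, one thing that may be worth checking is whether the file can be read lazily instead of being materialized all at once. The sketch below is not from anyone in this thread and does not use the `datasets` API; it relies only on Python's standard `csv` module to stream a tab-separated file one row at a time, reusing the column names from the snippet above. The function name `iter_tsv_rows`, the `QUOTE_NONE` setting, and the choice to skip malformed lines are all assumptions made for illustration.

```python
import csv
from typing import Dict, Iterator

# Column names taken from the snippet above.
COLUMNS = ["label", "input1_id", "input2_id", "text1", "text2"]


def iter_tsv_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield one row at a time so only a single row is held in memory."""
    # Long text fields can exceed the csv module's default field size limit,
    # so raise it to a large value that is safe on all platforms.
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE treats quote characters as literal text (assumption:
        # the file is plain tab-separated text without CSV-style quoting).
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if len(fields) != len(COLUMNS):
                # Assumption: malformed lines can be dropped for this check.
                continue
            yield dict(zip(COLUMNS, fields))
```

Counting rows with `sum(1 for _ in iter_tsv_rows(path))` is a cheap way to confirm the file parses end to end without holding it in memory. Within `datasets` itself, it may also help to drop `keep_in_memory=True` or pass `streaming=True` to `load_dataset`, though neither was confirmed as a fix in this thread.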

jmycsu commented Jul 11, 2023

@exnx Hello! Sorry to bother you. I ran into problems downloading the AAN dataset from the link [http://aan.how/download/]. Could you please tell me the right way to download the AAN dataset, or share a working link to it?

@WonderSeven

Hi there,

The provided download URL no longer works. Could anyone share the data? Many thanks!

@sneerajmohan

Have you figured out a way to download it?

@WonderSeven

No, I could not find anywhere to download the AAN dataset.
