DataLoader
(2024-05-29)
Visualizing how the PyTorch DataLoader works; collate_fn
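A minimal sketch of a custom collate_fn, assuming a toy dataset that yields variable-length 1-D tensors which must be padded into a batch; PairDataset and pad_collate are made-up names for illustration:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PairDataset(Dataset):
    """Toy dataset: yields (variable-length tensor, label) pairs."""
    def __init__(self, lengths):
        self.lengths = lengths

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, idx):
        return torch.ones(self.lengths[idx]), idx % 2

def pad_collate(batch):
    # `batch` is a list of samples returned by __getitem__;
    # pad every sequence in the batch to the longest one
    seqs, labels = zip(*batch)
    padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)
    return padded, torch.tensor(labels)

loader = DataLoader(PairDataset([3, 5, 2, 4]), batch_size=2, collate_fn=pad_collate)
for x, y in loader:
    print(x.shape, y)  # first batch: torch.Size([2, 5]) tensor([0, 1])
```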
torch.utils.data.IterableDataset
PyTorch forum (2020-02-26); Docs
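A small sketch, not from the linked thread, of an IterableDataset that shards its work across DataLoader workers with get_worker_info(); RangeStream and the bounds are invented for illustration:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeStream(IterableDataset):
    """Streams integers in [start, end), splitting the range across workers."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:                 # single-process data loading
            lo, hi = self.start, self.end
        else:                            # each worker takes a disjoint slice
            per = -(-(self.end - self.start) // info.num_workers)  # ceil div
            lo = self.start + info.id * per
            hi = min(lo + per, self.end)
        return iter(range(lo, hi))

if __name__ == "__main__":
    # without the sharding above, each worker would yield the full range
    loader = DataLoader(RangeStream(0, 10), batch_size=4, num_workers=2)
    for batch in loader:
        print(batch)
```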
ConcatDataset
- Example 1: Pytorch DataLoader multiple data source - SO
```python
import os

import torch.utils.data as data

class SingleJsonDataset(data.Dataset):
    # implement a single-JSON dataset here...
    ...

list_of_datasets = []
for j in os.listdir(root_dir):  # root_dir holds the .json files
    if not j.endswith('.json'):
        continue  # skip non-json files
    list_of_datasets.append(
        SingleJsonDataset(json_file=j, root_dir=root_dir, transform=None))

# once all single-JSON datasets are created, concat them into one:
multiple_json_dataset = data.ConcatDataset(list_of_datasets)
```
- Example 2: PyTorch forum - Praateek
```python
import csv
import linecache
import os
import subprocess

from torch.utils.data import ConcatDataset, Dataset

class LazyTextDataset(Dataset):
    def __init__(self, filename):
        self._filename = filename
        # count lines without reading the file into memory
        self._total_data = int(
            subprocess.check_output("wc -l " + filename, shell=True).split()[0]) - 1

    def __getitem__(self, idx):
        # linecache reads (and caches) individual lines on demand
        line = linecache.getline(self._filename, idx + 1)
        csv_line = csv.reader([line])
        return next(csv_line)

    def __len__(self):
        return self._total_data

path = "/where_csv_files_are_dumped/"
files = [path + f for f in os.listdir(path) if f.endswith("csv")]
datasets = [LazyTextDataset(f) for f in files]
dataset = ConcatDataset(datasets)
```
Comment from Thomas Ahle:
"The problem with ConcatDataset is that it doesn't work with multiprocessing. It calls len(ds) on each dataset in its initializer, so you end up loading every dataset in the main process."
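A runnable toy illustration of that pitfall (the dataset and sizes are made up): ConcatDataset builds its cumulative-size table in __init__, so any __len__ that touches the underlying data runs in the main process, before any worker is spawned.

```python
import torch
from torch.utils.data import ConcatDataset, Dataset

class EagerLenDataset(Dataset):
    def __init__(self, n):
        self.n = n
        self._data = None

    def _load(self):
        if self._data is None:
            print("expensive load")            # runs in the main process
            self._data = torch.arange(self.n)  # stand-in for reading a big file
        return self._data

    def __len__(self):
        return len(self._load())  # ConcatDataset calls this in its __init__

    def __getitem__(self, idx):
        return self._load()[idx]

# prints "expensive load" twice immediately, before any worker exists;
# the fix is a cheap __len__, e.g. a precomputed count like `wc -l` above
ds = ConcatDataset([EagerLenDataset(5), EagerLenDataset(7)])
```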
np.load(path, mmap_mode='r')
Load multiple .npy files (size > 10GB) in pytorch - SO
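One way to combine the two ideas, sketched under the assumption of several row-major .npy shards on disk (MmapNpyDataset and the shard names are hypothetical): memory-map each file so that construction and __len__ stay cheap, and copy single rows out of the mmap in __getitem__.

```python
import bisect

import numpy as np
import torch
from torch.utils.data import Dataset

class MmapNpyDataset(Dataset):
    """Concatenates several large .npy files without loading them into RAM."""
    def __init__(self, paths):
        # mmap_mode='r' only maps the files; pages are fetched on access
        self.arrays = [np.load(p, mmap_mode='r') for p in paths]
        self.cum = np.cumsum([len(a) for a in self.arrays]).tolist()

    def __len__(self):
        return self.cum[-1]  # cheap: shapes come from the .npy headers

    def __getitem__(self, idx):
        file_idx = bisect.bisect_right(self.cum, idx)
        offset = idx - (self.cum[file_idx - 1] if file_idx else 0)
        # np.array(...) copies the row out of the mmap before conversion
        return torch.from_numpy(np.array(self.arrays[file_idx][offset]))

# ds = MmapNpyDataset(["part0.npy", "part1.npy"])  # hypothetical shard names
```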