How can I download data from just one of the DVC repositories?


I have a project that uses several databases. To avoid versioning huge files in Git, I used DVC to store them on Google Drive.

I followed these steps in DVC:

Start DVC (dvc init)

dvc add #dataset zip#

dvc remote add --default #drive_name# gdrive://#Folder ID#

dvc push

I repeated these steps for each dataset. But when I try to download a single dataset with

dvc pull --remote #drive_name#

it downloads all the files to my machine, not just the ones I specified. I have already run dvc remote list, and I can see in Google Drive that the files are stored separately. Why can't I get them individually?

Shcheklein On

If you need to store some parts of a DVC project in one remote and other parts in a different remote, there are two ways to do this (or a mix of the two).

  1. (Recommended.) Set the remote: field on individual outputs in the .dvc files or in dvc.yaml. For example, in dvc.yaml:
stages:
  transpose:
    ...
    outs:
      - columns.txt:
          remote: myremote

or, in a .dvc file:

outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml
    desc: Cats and dogs dataset
    remote: myremote

In this case, you don't have to pass --remote to dvc pull or dvc push. DVC knows which remote to use for each dataset, model, or any other output.

  2. You can indeed use --remote. But in this case (and this is probably where your issue comes from), you have to be careful with every dvc push so that you don't accidentally push all data to the default remote. Always run dvc push --remote <dataset>, where <dataset> is the name of the remote meant for that dataset. Better still, don't use --default at all, so that no default remote is set. As you can see, this can get a bit tedious.

With either option, I would avoid setting a default remote, unless you have some objects that should always go to one particular storage. And yes, you still need to run dvc remote add ... to create each of these named remotes.
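The two options above can be sketched as a few small shell helpers. This is only an illustration, not an official DVC recipe: the function names, the remote name images_remote, and the file name images.zip.dvc are hypothetical placeholders. It relies on dvc push and dvc pull accepting optional targets; without a target, both commands act on everything tracked in the workspace, which may be why a bare dvc pull --remote ... fetched all datasets.

```shell
#!/bin/sh
# Sketch only: remote names, folder IDs, and .dvc file names are placeholders.

# Register a named remote WITHOUT --default, so that a bare `dvc push`
# has no default storage to send everything to.
add_named_remote() {
  # $1 = remote name, $2 = Google Drive folder ID
  dvc remote add "$1" "gdrive://$2"
}

# Option 1: the output has a `remote:` field, so only the target is needed.
pull_dataset() {
  # $1 = tracked path or .dvc file
  dvc pull "$1"
}

# Option 2: no `remote:` field, so name both the remote and the target.
push_dataset_to() {
  # $1 = remote name, $2 = tracked path or .dvc file
  dvc push --remote "$1" "$2"
}

pull_dataset_from() {
  # $1 = remote name, $2 = tracked path or .dvc file
  dvc pull --remote "$1" "$2"
}

# Example calls (hypothetical names):
#   add_named_remote images_remote <folder-id>
#   pull_dataset images.zip.dvc
#   pull_dataset_from images_remote images.zip.dvc
```

Passing the target explicitly is what limits the transfer to one dataset. The --remote flag only selects which storage to talk to.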