HuggingFace: Model Download Methods by Category

Installation
pip install huggingface_hub

Usage

from huggingface_hub import snapshot_download
snapshot_download(repo_id="HuggingFaceH4/zephyr-7b-alpha", allow_patterns=["*.md", "*.json", "*.safetensors"], ignore_patterns=["*.bin"], local_dir="C:\\Users\\some\\some\\zephyr-7b-alpha", local_dir_use_symlinks=False)

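Single-file download: click the download button on the file's detail page and save the file.
Download the entire repository with Git LFS.
Note: this downloads the whole repository. Progress cannot be viewed while large files download, and the result includes files you may not need (.bin/.safetensors).
Download with the methods provided by the huggingface_hub library.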

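Downloading files with the `huggingface_hub` library

See the official documentation, "Download files from the Hub".

Download a single file: hf_hub_download()
Download an entire repository: snapshot_download()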

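Download a single file: `hf_hub_download()`

It downloads a remote file, caches it on disk (in a version-aware way), and returns its local file path.
The returned file path is a pointer to the HF local cache. It is therefore important not to modify the file, to avoid corrupting the cache.
Main parameters: repo_id, repo_type, filename.
The local_dir parameter sets the local save path.
The local_dir_use_symlinks parameter (here local_dir_use_symlinks=False) defines how files are saved in the local folder.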

from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json")
'/root/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade/config.json'

# Download from a dataset
hf_hub_download(repo_id="google/fleurs", filename="fleurs.py", repo_type="dataset")
'/root/.cache/huggingface/hub/datasets--google--fleurs/snapshots/199e4ae37915137c555b1765c01477c216287d34/fleurs.py'

# Set the save path and store regular files there instead of symlinks
hf_hub_download(repo_id="HuggingFaceH4/zephyr-7b-alpha", filename="model-00001-of-00008.safetensors", local_dir="C:\\Users\\some\\some\\zephyr-7b-alpha",local_dir_use_symlinks=False)
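
Because the returned path points into the cache, one way to edit a downloaded file safely is to copy it somewhere else first. The sketch below uses only the standard library. The helper name `copy_for_editing` and the working-directory layout are hypothetical and not part of huggingface_hub.

```python
import shutil
from pathlib import Path


def copy_for_editing(cached_path: str, work_dir: str) -> Path:
    """Copy a cached file into work_dir and return the path of the copy.

    shutil.copy follows symlinks, so the copy is a regular file even when
    the cached path is itself a symlink inside the cache.
    """
    destination_dir = Path(work_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    destination = destination_dir / Path(cached_path).name
    shutil.copy(cached_path, destination)
    return destination
```

A typical use would be to pass the path returned by hf_hub_download() as `cached_path` and then edit only the returned copy, leaving the cached original untouched.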

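By default, the latest version from the main branch is downloaded.
In some cases, however, you want a file at a particular version (for example from a specific branch, PR, tag, or commit hash).
To do so, use the revision parameter: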

# Download from the `v1.0` tag
hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="v1.0")

# Download from the `test-branch` branch
hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="test-branch")

# Download from Pull Request #3
hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="refs/pr/3")

# Download from a specific commit hash
hf_hub_download(repo_id="lysandre/arxiv-nlp", filename="config.json", revision="877b84a8f93f2d619faa2a6e514a32beef88ab0a")

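Note: when using a commit hash, it must be the full-length hash, not a 7-character commit hash.
If you want to construct the URL used to download a file from a repository, you can use hf_hub_url(), which returns the URL. Note that hf_hub_download() uses it internally.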
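
The full-length requirement can be checked before a download is attempted. The sketch below assumes commit hashes are 40-character lowercase hexadecimal Git SHA-1 strings, as in the example above. The helper names `is_full_commit_hash` and `file_url` are hypothetical. `file_url` only wraps hf_hub_url() and imports the library lazily.

```python
import re

# Assumption: a full commit hash is a 40-character lowercase hex string.
_FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def is_full_commit_hash(revision: str) -> bool:
    """Return True if revision looks like a full-length commit hash."""
    return _FULL_SHA1.fullmatch(revision) is not None


def file_url(repo_id: str, filename: str, revision: str) -> str:
    """Build the download URL for a file at a given revision."""
    # Imported here so the check above works without the library installed.
    from huggingface_hub import hf_hub_url

    return hf_hub_url(repo_id=repo_id, filename=filename, revision=revision)
```

Branch names, tags, and PR references such as "refs/pr/3" are also valid revisions, so a False result only means the string is not a full commit hash.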

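Download an entire repository: `snapshot_download()`

snapshot_download() downloads an entire repository at a given revision.
It uses hf_hub_download() internally, which means all downloaded files are also cached on the local disk. Downloads run concurrently to speed up the process.
To download an entire repository, just pass repo_id and repo_type.
The local_dir parameter sets the local save path.
The local_dir_use_symlinks parameter (for example local_dir_use_symlinks=False) defines how files are saved in the local folder.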

from huggingface_hub import snapshot_download
snapshot_download(repo_id="lysandre/arxiv-nlp")
'/home/lysandre/.cache/huggingface/hub/models--lysandre--arxiv-nlp/snapshots/894a9adde21d9a3e3843e6d5aeaaf01875c7fade'

# Or from a dataset
snapshot_download(repo_id="google/fleurs", repo_type="dataset")
'/home/lysandre/.cache/huggingface/hub/datasets--google--fleurs/snapshots/199e4ae37915137c555b1765c01477c216287d34'

By default, snapshot_download() downloads the latest revision. If you need a specific repository revision, use the revision parameter:

from huggingface_hub import snapshot_download
snapshot_download(repo_id="lysandre/arxiv-nlp", revision="refs/pr/1")

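Filtering the files to download

snapshot_download() provides an easy way to download a repository. However, you do not always want to download the entire content of a repository. Use the allow_patterns and ignore_patterns parameters to select the files you need.

Both parameters accept either a single pattern or a list of patterns. Patterns are standard wildcards (globbing patterns), and pattern matching is based on fnmatch.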

from huggingface_hub import snapshot_download
snapshot_download(repo_id="lysandre/arxiv-nlp", allow_patterns="*.json")

from huggingface_hub import snapshot_download
snapshot_download(repo_id="lysandre/arxiv-nlp", ignore_patterns=["*.msgpack", "*.h5"])

from huggingface_hub import snapshot_download
snapshot_download(repo_id="gpt2", allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json")
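
The combined effect of the two parameters can be approximated with the standard library's fnmatch module. This is a sketch of the matching rule and not the library's own implementation. It assumes a file is kept when it matches at least one allow pattern (if any are given) and no ignore pattern. The function name `filter_files` and the file list are hypothetical.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Optional, Union

Patterns = Optional[Union[str, List[str]]]


def _as_list(patterns: Patterns) -> Optional[List[str]]:
    """Normalize a single pattern or a list of patterns to a list."""
    if patterns is None:
        return None
    return [patterns] if isinstance(patterns, str) else list(patterns)


def filter_files(
    filenames: Iterable[str],
    allow_patterns: Patterns = None,
    ignore_patterns: Patterns = None,
) -> List[str]:
    """Keep names matching any allow pattern and no ignore pattern."""
    allow = _as_list(allow_patterns)
    ignore = _as_list(ignore_patterns)
    kept = []
    for name in filenames:
        if allow is not None and not any(fnmatch(name, p) for p in allow):
            continue
        if ignore is not None and any(fnmatch(name, p) for p in ignore):
            continue
        kept.append(name)
    return kept


# Hypothetical file listing, filtered like the last example above.
files = ["README.md", "config.json", "vocab.json", "model.safetensors"]
print(filter_files(files, allow_patterns=["*.md", "*.json"], ignore_patterns="vocab.json"))
# prints ['README.md', 'config.json']
```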

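Downloading files to a custom local folder

The local_dir parameter sets the local save path.
The local_dir_use_symlinks parameter defines how files must be saved in the local folder.

With the default, local_dir_use_symlinks="auto", small files (<5MB) are copied to the local folder and larger files are symlinked, to optimize bandwidth and disk usage. However, manually editing a symlinked file may corrupt the cache. The 5MB threshold can be configured with an environment variable.
With local_dir_use_symlinks=True, all files are symlinked for optimal disk-space usage. This is useful when downloading large datasets with thousands of small files.
Finally, if you do not want symlinks at all, disable them with local_dir_use_symlinks=False. The cache directory is still used to check whether a file is already cached. If it is, the file is copied from the cache (saving bandwidth but increasing disk usage). If it is not, the file is downloaded and moved directly to the local directory, which means it will be downloaded again if you later need it somewhere else.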

# Set the save path and store regular files there instead of symlinks
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="HuggingFaceH4/zephyr-7b-alpha", filename="model-00001-of-00008.safetensors", local_dir="C:\\Users\\some\\some\\zephyr-7b-alpha",local_dir_use_symlinks=False)
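
The three local_dir_use_symlinks settings described in this section can be summarized as a small decision function. This is only a model of the documented behavior, not library code. The function name `placement` is hypothetical, and treating 5MB as 5 * 1024 * 1024 bytes is an assumption.

```python
from typing import Union

# Assumption: the "5MB" threshold is 5 * 1024 * 1024 bytes.
AUTO_THRESHOLD_BYTES = 5 * 1024 * 1024


def placement(
    local_dir_use_symlinks: Union[bool, str],
    size_bytes: int,
    threshold: int = AUTO_THRESHOLD_BYTES,
) -> str:
    """Return how a file of size_bytes would end up in local_dir."""
    if local_dir_use_symlinks is True:
        return "symlink"  # every file is symlinked to the cache
    if local_dir_use_symlinks is False:
        return "regular file"  # copied from the cache or downloaded directly
    if local_dir_use_symlinks == "auto":
        # Small files are copied, larger ones are symlinked.
        return "regular file" if size_bytes < threshold else "symlink"
    raise ValueError(f"unsupported value: {local_dir_use_symlinks!r}")
```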

Downloading with `huggingface-cli`

Log in first with huggingface-cli login; alternatively, pass a token to individual commands with --token.

huggingface-cli download --help

huggingface-cli download gpt2 config.json --token=hf_****
/home/wauplin/.cache/huggingface/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/config.json

huggingface-cli download gpt2 config.json model.safetensors
/home/wauplin/.cache/huggingface/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10

huggingface-cli download gpt2 config.json --cache-dir=./cache
./cache/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/config.json

huggingface-cli download gpt2 config.json --local-dir=./models/gpt2
./models/gpt2/config.json

huggingface-cli download bigcode/the-stack --repo-type=dataset --revision=v1.2 --include="data/python/*" --exclude="*.json" --exclude="*.zip"
Fetching 206 files:   100%|████████████████████████████████████████████| 206/206 [02:31<2:31, ?it/s]
/home/wauplin/.cache/huggingface/hub/datasets--bigcode--the-stack/snapshots/9ca8fa6acdbc8ce920a0cb58adcdafc495818ae7
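
For comparison, the last command above corresponds roughly to the following snapshot_download() call. The sketch assumes that --include and --exclude map to allow_patterns and ignore_patterns. The names `STACK_PYTHON_SUBSET` and `download_stack_python_subset` are hypothetical. The library is imported lazily, so nothing is downloaded until the function is called.

```python
# Keyword arguments mirroring the CLI flags of the command above.
STACK_PYTHON_SUBSET = {
    "repo_id": "bigcode/the-stack",
    "repo_type": "dataset",            # --repo-type=dataset
    "revision": "v1.2",                # --revision=v1.2
    "allow_patterns": "data/python/*",      # --include
    "ignore_patterns": ["*.json", "*.zip"],  # --exclude (given twice)
}


def download_stack_python_subset() -> str:
    """Download the filtered snapshot and return its local folder path."""
    from huggingface_hub import snapshot_download

    return snapshot_download(**STACK_PYTHON_SUBSET)
```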

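Speeding up downloads with `hf_transfer`

If you are running on a machine with high bandwidth, you can use hf_transfer to increase download speed.
It is a Rust-based library designed to speed up file transfers with the Hub. To enable it, install the package and set HF_HUB_ENABLE_HF_TRANSFER=1 as an environment variable.
pip install hf_transfer

hf_transfer is a power-user tool! It is tested and production-ready, but it lacks user-friendly features such as progress bars or advanced error handling.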
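
The variable can also be set from Python instead of the shell. The sketch below assumes that huggingface_hub reads HF_HUB_ENABLE_HF_TRANSFER when it is first imported, so the variable is set before the import. It also assumes the hf_transfer package is installed. The helper name `download_repo` is hypothetical.

```python
import os

# Set the flag before huggingface_hub is imported anywhere in the process.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"


def download_repo(repo_id: str) -> str:
    """Download a repository snapshot and return its local folder path."""
    # Imported only now, after the environment variable has been set.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)
```

Calling download_repo("gpt2"), for example, would then fetch the snapshot with hf_transfer enabled, provided the assumptions above hold.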

Oct 20, 2023
