AI Pretraining 02

Installation

# Note: use a proxy if needed
pip install transformers --proxy http://127.0.0.1:8080

By default, all models and datasets are downloaded to: C:\Users\xxx\.cache\huggingface\hub
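
If you want downloads to land somewhere else, here is a minimal sketch; the path is just an example, not a recommendation. HF_HOME moves the whole cache root, and most from_pretrained calls also accept an explicit cache_dir argument.

import os

# Override the default Hugging Face cache root (hypothetical path)
os.environ["HF_HOME"] = "D:/hf-cache"

from transformers import AutoTokenizer

# Or pass cache_dir per call
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", cache_dir="D:/hf-cache")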

Usage

With a proxy configured, the pipeline downloads and caches the default model on first use.

import os
from transformers import pipeline

proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}
# Set the proxy environment variables used by requests
os.environ["HTTP_PROXY"] = proxies["http"]
os.environ["HTTPS_PROXY"] = proxies["https"]

classifier = pipeline("sentiment-analysis")
result = classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)
print("result", result)

Without a proxy, use a domestic mirror instead.

import os
from transformers import pipeline

# Point the Hub client at a mirror endpoint
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

The Pipeline class handles the entire inference flow for a model: you only provide the input data and get the final result back.

Reading the source, the pipeline() function returns a Pipeline object. Its main parameters include the following (only a few noted for now):

  • task: str
    The task defining which pipeline will be returned. The source-code docstring gives many examples, e.g. "audio-classification" will return an [AudioClassificationPipeline].
  • model: a model name (str), or a [PreTrainedModel] (for PyTorch) or [TFPreTrainedModel] (for TensorFlow)
  • tokenizer: the tokenizer
    According to the official docs, tokenization can be word-based, character-based, subword-based, and so on. A sketch of passing model and tokenizer explicitly follows this list.
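
Putting those parameters together, here is a minimal sketch of constructing a pipeline with an explicit model and tokenizer instead of relying on the defaults. The checkpoint name is an assumption (it happens to be the default sentiment-analysis checkpoint, but any compatible one works):

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# task, model, and tokenizer passed explicitly
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
print(classifier("I've been waiting for a HuggingFace course my whole life."))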

The AutoTokenizer class automatically loads the appropriate tokenizer.

Automatic loading

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
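
To see what the tokenizer actually produces, a quick sketch reusing the tokenizer loaded just above (the exact token ids depend on the vocabulary):

encoded = tokenizer("Using a Transformer network is simple", return_tensors="pt")
print(encoded.keys())                                               # input_ids, token_type_ids, attention_mask for BERT
print(tokenizer.tokenize("Using a Transformer network is simple"))  # the subword pieces
print(tokenizer.decode(encoded["input_ids"][0]))                    # decoding shows the added [CLS] ... [SEP] tokens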

Manual loading

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

The AutoModel class loads a pretrained model.

from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

AutoModel only loads the model; you write the surrounding code to use it yourself. That is the difference from the Pipeline class.
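
For contrast with pipeline(), a minimal sketch of what "using the model yourself" looks like: the bare AutoModel returns hidden states, and any post-processing (or a task head, e.g. via AutoModelForSequenceClassification) is up to you. Same checkpoint as above:

import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer(["I hate this so much!"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)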

The source code defines many task-specific wrapper classes for loading models:

# Automatically loads any type of pretrained model
class AutoModel(_BaseAutoModelClass):
    _model_mapping = MODEL_MAPPING

# Loads a model suited to sequence-classification tasks
class AutoModelForSequenceClassification(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING

# Loads a model suited to causal language modeling (e.g. GPT models; GPT stands for "Generative Pre-trained Transformer")
class AutoModelForCausalLM(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_CAUSAL_LM_MAPPING
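
As an example of the causal-LM variant listed above, a small sketch using gpt2 as the checkpoint (any causal-LM checkpoint would do):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hugging Face is", return_tensors="pt")
# Autoregressively generate a short continuation
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))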

The Trainer class is used to train models.

Trainer(
    model,                                          # the model
    training_args,                                  # training arguments
    train_dataset=tokenized_datasets["train"],      # training dataset
    eval_dataset=tokenized_datasets["validation"],  # validation dataset
    data_collator=data_collator,                    # data collator: dynamically pads and collates each batch so inputs share the same length
    tokenizer=tokenizer,                            # the tokenizer
)

Following the official tutorial

Step 1
import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# New part
batch["labels"] = torch.tensor([1, 1])
optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()  # zero the gradients; the official tutorial omits this step, but without it the final print did not run for me
print('end')
  • batch["labels"] = torch.tensor([1, 1]) adds labels to the input batch, which are needed to train or evaluate a sequence-classification model.
  • The label tensor holds the class of each input sequence.
  • The model uses these labels to learn how to classify sequences correctly.
  • The AdamW class warns that it will be deprecated; use PyTorch's torch.optim.AdamW optimizer instead, or suppress the deprecation warning. See the sketch after this list.
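
A minimal sketch of that swap, reusing model and batch from the Step 1 snippet above (lr=5e-5 is just an example value):

import torch

# PyTorch's own AdamW instead of the deprecated transformers.AdamW
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
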
Step 2: load a dataset from the Model Hub instead of hand-building labels with torch.tensor([1, 1]).
import os
from datasets import load_dataset

# Remember to set either the mirror or a proxy
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
raw_datasets = load_dataset("glue", "mrpc")
print(raw_datasets)
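
To inspect what was loaded, a small sketch reusing raw_datasets from the snippet above (the field names match the DatasetDict output shown further down):

# One example from the training split: sentence1, sentence2, label, idx
print(raw_datasets["train"][0])
# The label column should be a ClassLabel describing the two classes
print(raw_datasets["train"].features)
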
Step 3: use the Trainer class. Full code:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments

# Specify a local cache directory
cache_dir = "C:/Users/12986/.cache/huggingface/hub"  # replace with your actual path

# Load the dataset into that cache directory
raw_datasets = load_dataset("glue", "mrpc", cache_dir=cache_dir)
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


# Tokenize raw_datasets with map(): tokenize_function does the tokenization, and batched=True processes examples in batches
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

from transformers import Trainer
# The final Trainer call
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

trainer.train()

predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
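
predictions.predictions holds raw logits with one row per validation example. A hedged sketch of the usual next step, mirroring the official course's metric computation and reusing predictions from above (evaluate is a separate package, installed with pip install evaluate):

import numpy as np
import evaluate

# Turn the logits into class predictions
preds = np.argmax(predictions.predictions, axis=-1)

# Score against the gold labels with the GLUE/MRPC metric (accuracy and F1)
metric = evaluate.load("glue", "mrpc")
print(metric.compute(predictions=preds, references=predictions.label_ids))
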
# Printed result of print(raw_datasets)
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

This is a DatasetDict object, which contains the training, validation, and test splits.

