Installation

```bash
pip install transformers --proxy http://127.0.0.1:8080
```
All models and datasets are downloaded to: C:\Users\xxx\.cache\huggingface\hub
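If you want the cache somewhere else, a minimal sketch, assuming the HF_HOME environment variable is honored by your transformers version; the path here is an example, not from the original notes:

```python
import os

# Must be set before transformers / huggingface_hub are imported.
os.environ["HF_HOME"] = "D:/hf-cache"  # example path
```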
Usage

With a proxy set, the model is downloaded to the cache by default:
```python
import os
from transformers import pipeline

proxies = {
    "http": "http://127.0.0.1:7890",
    "https": "http://127.0.0.1:7890",
}
os.environ["HTTP_PROXY"] = proxies["http"]
os.environ["HTTPS_PROXY"] = proxies["https"]

classifier = pipeline("sentiment-analysis")
result = classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)
print("result", result)
```
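Each element of the result is a dict with a predicted label and a confidence score, roughly of the form {'label': 'NEGATIVE', 'score': 0.99} (exact scores vary by model and version).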
Without a proxy, use the domestic mirror instead:
```python
import os

# HF_ENDPOINT must be set before transformers is imported,
# otherwise the mirror is not picked up.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)
```
The Pipeline class

It is responsible for the whole model-usage workflow: you only supply the input data and get the final result. Reading the source, the pipeline() function returns a Pipeline object, whose main parameters include (partial notes):
- task (str): The task defining which pipeline will be returned. The source docstring gives many examples; for instance, "audio-classification" will return an AudioClassificationPipeline.
- model: A model name (str), or a PreTrainedModel instance (for PyTorch) or TFPreTrainedModel instance (for TensorFlow); a sketch follows this list.
- tokenizer: The tokenizer. According to the official docs, there are several schemes, such as word-based tokenization, character-based tokenization, and subword tokenization.
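To tie the task and model parameters together, a minimal sketch; the checkpoint name here is an example, any compatible Hub model works:

```python
from transformers import pipeline

classifier = pipeline(
    task="sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # example checkpoint
)
print(classifier("I hate this so much!"))
```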
The AutoTokenizer class, for loading a tokenizer automatically

Automatic loading:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```
Manual loading:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
```
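To see the subword behavior mentioned above, a quick sketch with the loaded tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# WordPiece marks subword continuations with '##';
# the exact split depends on the vocabulary.
print(tokenizer.tokenize("Tokenization splits rare words into subwords"))
```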
The AutoModel class, for loading pretrained models

```python
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
```
AutoModel is only responsible for loading the model; you write the code that uses it yourself. That is the difference from the Pipeline class, as the sketch below shows.
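A minimal sketch of that difference: AutoModel returns raw hidden states, and turning them into a prediction is your job (checkpoint as in the snippet above):

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer("I hate this so much!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Raw hidden states, no classification head applied.
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```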
The source code defines many task-specific wrapper classes for loading models:
```python
class AutoModel(_BaseAutoModelClass):
    _model_mapping = MODEL_MAPPING


class AutoModelForSequenceClassification(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING


class AutoModelForCausalLM(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_CAUSAL_LM_MAPPING
```
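By contrast with plain AutoModel, these task-specific classes attach a head; a sketch with AutoModelForSequenceClassification, which outputs class logits directly:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer("I hate this so much!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))  # probability per class
```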
The Trainer class, for training models

```python
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
```
Following the official tutorial

Step 1:

```python
import torch
from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# New part
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()
# Zero the gradients. The official tutorial omits this step,
# but without it "end" was never printed in my run.
optimizer.zero_grad()
print("end")
```
The line batch["labels"] = torch.tensor([1, 1]) adds labels to the input batch, which are used to train or evaluate the sequence-classification model. The label tensor holds the class of each input sequence, and the model uses these labels to learn to classify sequences correctly.
The AdamW class warns that it is about to be deprecated; you should use PyTorch's torch.optim.AdamW optimizer instead, or suppress the deprecation warning.
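A minimal sketch of the swap; lr=5e-5 is an assumed example value, not from the original notes:

```python
from torch.optim import AdamW  # PyTorch's implementation, no deprecation warning

optimizer = AdamW(model.parameters(), lr=5e-5)  # assumed learning rate
```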
Step 2: load a dataset from the Model Hub instead of building labels by hand with torch.tensor([1, 1]):

```python
import os

# Set the mirror before importing datasets so it takes effect.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
print(raw_datasets)
```
Step 3: try the Trainer class. Full code:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

cache_dir = "C:/Users/12986/.cache/huggingface/hub"
raw_datasets = load_dataset("glue", "mrpc", cache_dir=cache_dir)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()

predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)
```
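The predicted values are logits; a common follow-up (not in the code above) converts them to class ids with argmax:

```python
import numpy as np

# predictions.predictions has shape (408, 2) on MRPC's validation split.
preds = np.argmax(predictions.predictions, axis=-1)
```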
The print(raw_datasets) call from step 2 outputs:

```
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})
```
This is a DatasetDict object containing the training, validation, and test sets.
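To inspect a single example with the fields listed above:

```python
print(raw_datasets["train"][0])  # one row: sentence1, sentence2, label, idx
```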