Whisper Large-v3 is a multitask speech recognition and translation model developed by OpenAI. The following are detailed installation and usage steps.
Install the dependencies
First, make sure Python 3.8 or later is installed. Then install the required libraries as follows.
Install the Python packages
pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate datasets[audio]
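To confirm the packages installed correctly, you can print the installed transformers version (a quick sanity check, not part of the install itself):
python -c "import transformers; print(transformers.__version__)"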
Install ffmpeg
The Whisper model relies on ffmpeg to process audio files. You can install it with one of the following commands:
Ubuntu or Debian: sudo apt update && sudo apt install ffmpeg
Windows (with Chocolatey): choco install ffmpeg
macOS (with Homebrew): brew install ffmpeg
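You can then verify that ffmpeg is available on your PATH:
ffmpeg -version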
Download and load the model
1. Import the required libraries:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
2. Set the device (CPU or GPU):
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "openai/whisper-large-v3"
3. Load the model and processor:
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)
processor = AutoProcessor.from_pretrained(model_id)
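On a GPU, loading the weights in half precision roughly halves memory usage. Here is a minimal variant of the loading step, assuming you want float16 inference (optional, not required for the steps below):
# Optional: use float16 on CUDA devices to cut memory use
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
).to(device)
# If you do this, cast the input features to the same dtype before calling generate:
# input_features = input_features.to(device, dtype=torch_dtype)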
Processing and inference
1. Load a dataset:
Use the Hugging Face datasets library to load a sample dataset (to transcribe your own audio file instead, see the sketch after these steps):
from datasets import load_dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
2. Preprocess the input audio:
Convert the audio sample into input features the model can accept:
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
input_features = input_features.to(device)
3. Generate the prediction:
Use the model to generate the speech recognition result:
gen_kwargs = {"max_new_tokens": 128, "num_beams": 1, "return_timestamps": False}
pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True)
print(pred_text)
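The dummy dataset above only serves as a demonstration. To transcribe your own recording, load the file yourself; a minimal sketch, assuming librosa is installed and audio.wav is a placeholder for your file:
import librosa

# Whisper's feature extractor expects 16 kHz audio
array, sampling_rate = librosa.load("audio.wav", sr=16000)  # hypothetical path

input_features = processor(
    array, sampling_rate=sampling_rate, return_tensors="pt"
).input_features.to(device)

# In recent transformers versions, language and task can be set explicitly
# instead of relying on auto-detection
pred_ids = model.generate(input_features, language="english", task="transcribe")
print(processor.batch_decode(pred_ids, skip_special_tokens=True))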
Complete example code
The following example puts all of the steps above together:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import load_dataset
# Set the device (CPU or GPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "openai/whisper-large-v3"
# Load the model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id).to(device)
processor = AutoProcessor.from_pretrained(model_id)
# Load the dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
# Preprocess the input audio
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
input_features = input_features.to(device)
# Generate the prediction
gen_kwargs = {"max_new_tokens": 128, "num_beams": 1, "return_timestamps": False}
pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True)
# Print the result
print(pred_text)
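Note that Whisper processes audio in 30-second windows, so the generate call above is only suitable for short clips. For longer recordings, the transformers pipeline API can chunk the audio automatically; a minimal sketch that reuses the model and processor loaded above (audio.wav is a placeholder path):
from transformers import pipeline

# The pipeline splits long audio into 30-second chunks and stitches the results
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    device=device,
)
result = pipe("audio.wav", return_timestamps=True)  # hypothetical path
print(result["text"])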