[Transformer AI] Running Models on GPU/MPS for Higher Efficiency - Eidolon's Growth Notes of a Systems Integration R&D Engineer

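I have been spending a lot of time exploring this technology lately, so updates have been a little slower than usual.
In the first article of this series, we looked at how to find a model on the Hugging Face platform and build our own offline translation engine.
If you would like a refresher, see that article: [Transformer AI] Building an Offline Text Translation Service
Getting the whole service running was probably exciting! But you will soon run into a problem: the model is convenient, yet under heavy use, running on the CPU alone is rather slow. That is what prompted this article: can the model run on my graphics card, and how can the code be adapted quickly?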

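Most online tutorials on GPU acceleration focus on Nvidia graphics cards, that is, the CUDA platform.
In recent years, however, the Unified Memory Architecture (UMA) used by Apple's M-series chips has made waves in the AI field, and it is especially good value given how much VRAM large language models (LLMs) demand. Because VRAM and RAM share the same memory space, simply loading the model into memory is much less likely to be the biggest obstacle. The technology is no cure-all, of course: the core count still falls short of Nvidia's massive hardware. But at least the model can be loaded onto the graphics unit for computation, and in theory, given enough time, it will still produce its output.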

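So, back to the main topic: let's record how to adapt the existing code so that the model runs on the graphics unit of an Apple M-series chip.
First, the original code: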

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("./opus-mt-zh-en")
model = AutoModelForSeq2SeqLM.from_pretrained("./opus-mt-zh-en")

text = '今天是聖誕節，祝大家聖誕快樂！'

# Use the tokenizer to convert the text into the format the model expects
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Run the model to translate
translation = model.generate(**inputs)

# Decode the generated token IDs back into text
result = tokenizer.decode(translation[0], skip_special_tokens=True)

# Print the translation
print(result)  # It's Christmas. Merry Christmas to you all!

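The change is fairly easy. Let's outline the flow before writing the code:

First, check whether the machine has a graphics acceleration unit. If it does, set the compute device to "mps": because we are accelerating on the GPU of an M-series Mac, we use Apple's Metal Performance Shaders (MPS) as the device rather than the usual "cuda". If no acceleration unit is available, we still want the model to run, so the device falls back to "cpu".
Once the device is defined, push the model and its inputs to that device so that the subsequent inference runs there.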

Next, we turn the flow above into code:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("./opus-mt-zh-en")
model = AutoModelForSeq2SeqLM.from_pretrained("./opus-mt-zh-en")

# check whether MPS is available; fall back to the CPU otherwise
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')

# report which device was selected
if device.type == 'mps':
    print("MPS is enabled!")
else:
    print("MPS is not enabled. Please check your environment and settings.")

text = '今天是聖誕節，祝大家聖誕快樂！'

# Use the tokenizer to convert the text into the format the model expects
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# send the inputs to device
inputs = {key: inputs[key].to(device) for key in inputs}

# send the model to device
model.to(device)

with torch.no_grad():
    # Run the model to translate
    translation = model.generate(**inputs)

    # Decode the generated token IDs back into text
    result = tokenizer.decode(translation[0], skip_special_tokens=True)

    # Print the translation
    print(result)  # It's Christmas. Merry Christmas to you all!
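
To check whether moving to the GPU actually pays off on a given machine, a rough timing comparison can help. The following is only a minimal sketch and not part of the original script: the helper names `sync` and `time_matmul` are hypothetical, a plain matrix multiplication stands in for the translation model, and the call to `torch.mps.synchronize()` assumes PyTorch 2.0 or newer.

```python
import time
import torch


def sync(device: torch.device) -> None:
    """Wait for queued GPU work to finish so the timing is meaningful."""
    if device.type == "cuda":
        torch.cuda.synchronize()
    elif device.type == "mps":
        torch.mps.synchronize()  # assumption: PyTorch 2.0 or newer


def time_matmul(device: torch.device, size: int = 512, repeats: int = 5) -> float:
    """Return the average seconds per (size x size) matrix multiplication."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    torch.matmul(a, b)  # warm-up run, excluded from the timing
    sync(device)
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    sync(device)
    return (time.perf_counter() - start) / repeats


cpu_time = time_matmul(torch.device("cpu"))
print(f"cpu: {cpu_time:.6f} s per matmul")

# Time the accelerator only if one is actually present.
if torch.cuda.is_available():
    accel = torch.device("cuda")
elif torch.backends.mps.is_available():
    accel = torch.device("mps")
else:
    accel = None

if accel is not None:
    accel_time = time_matmul(accel)
    print(f"{accel.type}: {accel_time:.6f} s per matmul")
```

GPU work is queued asynchronously, so the explicit synchronization before reading the clock matters. For small workloads the CPU may well come out ahead because of transfer and launch overhead, so it is worth measuring with inputs close to your real usage.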

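The whole procedure is much like ordinary PyTorch: a few lines of code and the computation is accelerated! If you are using CUDA instead, it is just as simple: change mps to cuda in the code, and change the availability check to torch.cuda.is_available().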

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("./opus-mt-zh-en")
model = AutoModelForSeq2SeqLM.from_pretrained("./opus-mt-zh-en")

# check whether CUDA is available; fall back to the CPU otherwise
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# report which device was selected
if device.type == 'cuda':
    print("CUDA is enabled!")
else:
    print("CUDA is not enabled. Please check your environment and settings.")

text = '今天是聖誕節，祝大家聖誕快樂！'

# Use the tokenizer to convert the text into the format the model expects
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# send the inputs to device
inputs = {key: inputs[key].to(device) for key in inputs}

# send the model to device
model.to(device)

with torch.no_grad():
    # Run the model to translate
    translation = model.generate(**inputs)

    # Decode the generated token IDs back into text
    result = tokenizer.decode(translation[0], skip_special_tokens=True)

    # Print the translation
    print(result)  # It's Christmas. Merry Christmas to you all!
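
The two scripts differ only in how the device is chosen, so the check could be folded into one helper that works on either platform. This is a minimal sketch under that assumption: `pick_device` is a hypothetical name, not something provided by PyTorch or transformers, and the order of preference (CUDA, then MPS, then CPU) is just one reasonable choice.

```python
import torch


def pick_device() -> torch.device:
    """Prefer CUDA, then Apple's MPS backend, and fall back to the CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device()
print(f"Running on: {device.type}")

# Tensors (and models) are then moved exactly as in the scripts above.
x = torch.ones(2, 3).to(device)
print(x.sum().item())
```

With a helper like this, the rest of the translation script stays the same on a Mac, an Nvidia machine, or a CPU-only machine: only `device` changes.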
