
Building a Llama-2-7b Q&A Service with a Private Volume

In the quickstart, you already created a simple application. In this example, we use a private volume to store the Llama-2(7B) model files and build an online text-to-text Q&A service based on the Llama-2(7B) model.

Log in to the EverAI client

First, create a directory for your application. After entering the application directory, log in with the token you obtained from EverAI.

everai login --token <your token>

Create secrets

This step is optional: skip it if you have already created a secret for the image registry you need to access, or if your image registry is publicly accessible.

In this example, we create a secret for quay.io.

everai secret create your-quay-io-secret-name \
--from-literal username=<your username> \
--from-literal password=<your password>

Tip

quay.io is a well-known public image registry; other well-known registries include Docker Hub, GitHub Container Registry, and Google Container Registry.

In addition, you need to create a secret for the token used to access Hugging Face.

everai secret create your-huggingface-secret-name \
--from-literal token-key-as-your-wish=<your huggingface token>

Create a configmap

This step is optional. If you configure a configmap, you can use it after deploying the image to adjust your autoscaling policy.

everai configmap create llama2-configmap \
--from-literal min_workers=1 \
--from-literal max_workers=5 \
--from-literal min_free_workers=1 \
--from-literal scale_up_step=1 \
--from-literal max_idle_time=60

Write your code

Basic setup

Here is sample code for app.py.

First, import the necessary EverAI Python classes. Then define the names of the variables you will need, including the volume, the secret for accessing the image registry, and the files stored in the volume. Create an image instance with the Image.from_registry static method, and define an app instance with the App class.

Note that you need to configure GPU resources for your application. Here the configured GPU model is "A100 40G" and the number of GPUs is 1.

from everai.app import App, context, VolumeRequest
from everai_autoscaler.builtin import FreeWorkerAutoScaler
from everai.image import Image, BasicAuth
from everai.resource_requests import ResourceRequests, CPUConstraints
from everai.placeholder import Placeholder
from image_builder import IMAGE

APP_NAME = '<your app name>'
VOLUME_NAME = 'models--meta-llama--llama-2-7b-chat-hf'
MODEL_NAME = 'meta-llama/Llama-2-7b-chat-hf'
HUGGINGFACE_SECRET_NAME = 'your-huggingface-secret-name'
QUAY_IO_SECRET_NAME = 'your-quay-io-secret-name'
CONFIGMAP_NAME = 'llama2-configmap'

image = Image.from_registry(IMAGE, auth=BasicAuth(
    username=Placeholder(QUAY_IO_SECRET_NAME, 'username', kind='Secret'),
    password=Placeholder(QUAY_IO_SECRET_NAME, 'password', kind='Secret'),
))

app = App(
    APP_NAME,
    image=image,
    volume_requests=[
        VolumeRequest(name=VOLUME_NAME),
    ],
    secret_requests=[
        HUGGINGFACE_SECRET_NAME,
        QUAY_IO_SECRET_NAME,
    ],
    configmap_requests=[CONFIGMAP_NAME],
    autoscaler=FreeWorkerAutoScaler(
        # keep workers running even when there are no requests,
        # so that new requests are handled immediately
        min_workers=Placeholder(kind='ConfigMap', name=CONFIGMAP_NAME, key='min_workers'),
        # the maximum number of workers, protecting your application from
        # excessive cost during an attack or a sudden traffic spike
        max_workers=Placeholder(kind='ConfigMap', name=CONFIGMAP_NAME, key='max_workers'),
        # this factor controls how the autoscaler scales up your app
        min_free_workers=Placeholder(kind='ConfigMap', name=CONFIGMAP_NAME, key='min_free_workers'),
        # this factor controls how the autoscaler scales down your app
        max_idle_time=Placeholder(kind='ConfigMap', name=CONFIGMAP_NAME, key='max_idle_time'),
        # this factor controls how many workers the autoscaler adds in one scale-up step
        scale_up_step=Placeholder(kind='ConfigMap', name=CONFIGMAP_NAME, key='scale_up_step'),
    ),
    resource_requests=ResourceRequests(
        cpu_num=2,
        memory_mb=20480,
        gpu_num=1,
        gpu_constraints=[
            "A100 40G"
        ],
        cpu_constraints=CPUConstraints(
            platforms=['amd64', 'arm64']
        ),
        cuda_version_constraints=">=12.4"
    ),
)

Preload the model

If your local environment does not have the model files, you can pass MODEL_NAME to the LlamaForCausalLM.from_pretrained method to pull them from Hugging Face. By setting cache_dir, the model files are cached in the private volume models--meta-llama--llama-2-7b-chat-hf.

You can get the local path of the volume models--meta-llama--llama-2-7b-chat-hf with the everai volume get command. After entering the volume's local path, you can see the model files that have been cached.

everai volume get models--meta-llama--llama-2-7b-chat-hf
<Volume: id: UoswTkyjkPZU3qb27s4B9E, name: models--meta-llama--llama-2-7b-chat-hf, revision: 000001-821, files: 21, size: 25.11 GiB>
path: /root/.cache/everai/volumes/UoswTkyjkPZU3qb27s4B9E
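
For example, you can list the local path reported above to confirm that the cache has been populated. Your volume id will differ, and the exact files depend on what from_pretrained has already downloaded (the layout follows the Hugging Face cache format):

ls /root/.cache/everai/volumes/UoswTkyjkPZU3qb27s4B9E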

When you debug the sample code locally with everai app run, is_prepare_mode is False, so local files are not pushed to the cloud. If your local debugging environment has GPU resources, the system will execute model.cuda(0) successfully.

Once your code passes local debugging, run the everai app prepare command. This command executes every method annotated with @app.prepare, and is_prepare_mode is True while it runs. In this example, the model files stored locally in the volume models--meta-llama--llama-2-7b-chat-hf are pushed to the cloud when this command is executed.
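
For reference, the two commands mentioned above are typically run from your application directory: everai app run for local debugging, then everai app prepare once debugging passes.

everai app run
everai app prepare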

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, PreTrainedTokenizerBase, TextIteratorStreamer

@app.prepare()
def prepare_model():
    volume = context.get_volume(VOLUME_NAME)
    assert volume is not None and volume.ready

    secret = context.get_secret(HUGGINGFACE_SECRET_NAME)
    assert secret is not None
    huggingface_token = secret.get('token-key-as-your-wish')

    model_dir = volume.path

    global model
    global tokenizer

    # from_pretrained uses the locally cached files if they exist,
    # so if the volume is constructed correctly, workers do not need to pull any extra data
    model = LlamaForCausalLM.from_pretrained(MODEL_NAME,
                                             token=huggingface_token,
                                             cache_dir=model_dir,
                                             torch_dtype=torch.float16,
                                             local_files_only=context.is_in_cloud)

    tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME,
                                               token=huggingface_token,
                                               cache_dir=model_dir,
                                               local_files_only=context.is_in_cloud)

    # push the volume only in prepare mode,
    # to save GPU time (avoids redundant sha256 checks)
    if context.is_prepare_mode:
        context.volume_manager.push(VOLUME_NAME)
    else:
        if torch.cuda.is_available():
            model.cuda(0)

Implement the inference service

After the Llama-2(7B) model is loaded, the code here uses flask to implement an online text-to-text inference service.

import flask
import typing

tokenizer: typing.Optional[PreTrainedTokenizerBase] = None
model = None

# service entrypoint
# the api service url looks like https://everai.expvent.com/api/routes/v1/default/llama2-7b-chat/chat
# for local testing the url is http://127.0.0.1/chat
@app.service.route('/chat', methods=['GET', 'POST'])
def chat():
    if flask.request.method == 'POST':
        data = flask.request.json
        prompt = data['prompt']
    else:
        prompt = flask.request.args["prompt"]

    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    input_ids = input_ids.to('cuda:0')
    output = model.generate(input_ids, max_length=256, num_beams=4, no_repeat_ngram_size=2)
    response = tokenizer.decode(output[0], skip_special_tokens=True)

    text = f'{response}'

    # return plain text with some extra information, or any other structure you need
    resp = flask.Response(text, mimetype='text/plain', headers={'x-prompt-hash': 'xxxx'})
    return resp
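
While debugging locally with everai app run, you can exercise this route directly using the local URL noted in the comment above. The sketch below assumes no Authorization header is required for the local test; adjust the host or port if your local setup differs.

curl -X POST -d '{"prompt": "who are you"}' -H 'Content-Type: application/json' http://127.0.0.1/chat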

Build the image

This step uses a Dockerfile and image_builder.py to build the container image for your application.

Here is sample code for image_builder.py.

image_builder.py中,你需要配置你的镜像地址信息。

In this example, we use quay.io as the public image registry to store the application image. You can also use a similar well-known registry such as Docker Hub, GitHub Container Registry, or Google Container Registry. If you have a self-hosted registry whose images can be accessed from the internet, that works as well.

from everai.image import Builder

IMAGE = 'quay.io/<username>/<repo>:<tag>'

First, make sure your docker environment is logged in and that the docker buildx plugin is installed.

docker login quay.io
docker buildx version

Then run the following command to build the image and push it to the registry you specified.

everai image build

Deploy

The final step is to deploy your application to EverAI and keep it running.

everai app create

After running everai app list (shown below), you will see output similar to the following. CREATED_AT is shown in UTC.

If your application's status is DEPLOYED and the number of ready workers equals the expected number of workers, i.e. 1/1, your application has been deployed successfully.
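
everai app list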

NAME              NAMESPACE    STATUS      WORKERS    CREATED_AT
----------------  -----------  ----------  ---------  ------------------------
llama2-7b-chat    default      DEPLOYED    1/1        2024-06-19T08:07:24+0000

You can test the deployed code by making the following request with curl. In the console you can see the answer that the Llama-2(7B) model gives to the question, as shown below.

curl -X POST -d '{"prompt": "who are you"}' -H 'Content-Type: application/json' -H 'Authorization: Bearer <your_token>' https://everai.expvent.com/api/routes/v1/<your namespace>/<your app name>/chat
who are you?

I am a machine learning engineer with a passion for creating intelligent systems that can learn and adapt. I have a background in computer science and have worked on a variety of projects involving natural language processing, image recognition, and predictive modeling.
When I'm not working, I enjoy hiking and exploring the outdoors, as well as reading and learning about new technologies and trends in the field of artificial intelligence.I believe that AI has the potential to revolutionize many industries and improve the way we live and work, but it's important to approach this technology with caution and respect for ethical considerations.