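Build a Llama-2-7b question-answering service with a manifest YAML file

In this example, defining a single manifest YAML file is enough to create an online text-to-text question-answering service based on the Llama-2 (7B) model on the EverAI platform. With this approach you can deploy your application to EverAI without adding any EverAI SDK code to your existing business code.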

Log in with the EverAI client

First, create a directory for your application and change into it. Then log in with the token you obtained from EverAI.

everai login --token <your token>

Prepare storage

Create a volume to hold the model files.

everai volume create models--meta-llama--llama-2-7b-chat-hf

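Use the everai volume get command to find the local path of the volume models--meta-llama--llama-2-7b-chat-hf. Change into that path and copy your model files there.

Then use the everai volume push command to push the model files in the local copy of the volume models--meta-llama--llama-2-7b-chat-hf to the cloud.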

everai volume push models--meta-llama--llama-2-7b-chat-hf

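In the current application directory, create a symbolic link named volume that points to the local path of the volume models--meta-llama--llama-2-7b-chat-hf. The path in the command is an example; use the path reported by everai volume get.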

ln -s /root/.cache/everai/volumes/UoswTkyjkPZU3qb27s4B9E volume

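Write your code

Load the model

Here is example code for app.py.

When you run the example code with Python in your local environment to debug model loading, the model files are read from the local copy of the private volume models--meta-llama--llama-2-7b-chat-hf. If your local debugging environment has a GPU, model.cuda(0) moves the model onto it; otherwise that step is skipped.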

import os

import torch
from flask import Flask
from transformers import LlamaForCausalLM, LlamaTokenizer, PreTrainedTokenizerBase

MODEL_NAME = 'meta-llama/Llama-2-7b-chat-hf'
# Hugging Face access token, read from the HF_TOKEN environment variable (None if unset)
HUGGINGFACE_SECRET_NAME = os.environ.get("HF_TOKEN")

app = Flask(__name__)

# Symbolic link in the application directory that points to the local volume path
VOLUME_NAME = 'volume'

def prepare_model():
    volume = VOLUME_NAME
    huggingface_token = HUGGINGFACE_SECRET_NAME

    model_dir = volume

    global model
    global tokenizer

    # local_files_only=True: read the files from the volume, never download them
    model = LlamaForCausalLM.from_pretrained(MODEL_NAME,
                                             token=huggingface_token,
                                             cache_dir=model_dir,
                                             torch_dtype=torch.float16,
                                             local_files_only=True)

    tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME,
                                               token=huggingface_token,
                                               cache_dir=model_dir,
                                               local_files_only=True)

    # Move the model to the first GPU when one is available
    if torch.cuda.is_available():
        model.cuda(0)
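
Because from_pretrained is called with cache_dir and local_files_only=True, the model files must already be present under the volume directory before the service starts. The sketch below is one way to check this up front. It assumes the files in the volume follow the Hugging Face hub cache layout (models--<org>--<name>/snapshots/<revision>/), which is an assumption about how the files were copied into the volume. find_snapshots is a hypothetical helper, not part of EverAI or transformers.

```python
from pathlib import Path


def find_snapshots(cache_dir: str, repo_id: str) -> list[Path]:
    """Return non-empty snapshot directories for repo_id under cache_dir.

    Assumed layout (Hugging Face hub cache):
    <cache_dir>/models--<org>--<name>/snapshots/<revision>/
    """
    repo_dir = Path(cache_dir) / ("models--" + repo_id.replace("/", "--"))
    snapshots_dir = repo_dir / "snapshots"
    if not snapshots_dir.is_dir():
        return []
    # Keep only revision directories that contain at least one entry.
    return sorted(
        p for p in snapshots_dir.iterdir() if p.is_dir() and any(p.iterdir())
    )


if __name__ == "__main__":
    # "volume" is the symbolic link created in the application directory.
    found = find_snapshots("volume", "meta-llama/Llama-2-7b-chat-hf")
    if found:
        print("model snapshots:", [p.name for p in found])
    else:
        print("no local snapshot found under ./volume")
```

Running a check like this before prepare_model() can turn a long stack trace from transformers into a short, readable message when the volume is empty or the link is wrong.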

Implement the inference service

With the Llama-2 (7B) model loaded, the code below uses Flask to implement an online text-to-text inference service.

import flask
import typing

tokenizer: typing.Optional[PreTrainedTokenizerBase] = None
model = None

# Service entrypoint.
# The API service URL looks like https://everai.expvent.com/api/routes/v1/default/llama2-7b-chat-manifest-private/chat
# For local testing the URL is http://127.0.0.1:8866/chat
@app.route('/chat', methods=['GET','POST'])
def chat():
    # Read the prompt from the JSON body (POST) or the query string (GET)
    if flask.request.method == 'POST':
        data = flask.request.json
        prompt = data['prompt']
    else:
        prompt = flask.request.args["prompt"]

    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # Put the inputs on the same device as the model (GPU if available, else CPU)
    input_ids = input_ids.to(model.device)
    output = model.generate(input_ids, max_length=256, num_beams=4, no_repeat_ngram_size=2)
    response = tokenizer.decode(output[0], skip_special_tokens=True)

    text = response

    # Return the text along with some extra information in the headers
    resp = flask.Response(text, mimetype='text/plain', headers={'x-prompt-hash': 'xxxx'})
    return resp

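Implement the readiness probe

If a readiness probe is configured, no client requests are routed to the worker container until the probe reports ready (HTTP status code 200). Without one, the EverAI platform routes client requests to the worker container as soon as the container is available, even if the model files are still being loaded into GPU memory.

Currently, the probe supports only the HTTP GET and POST methods.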

@app.route('/healthy-check', methods=['GET'])
def healthy():
    resp = 'service is ready'
    return resp

if __name__ == '__main__':
    prepare_model()
    app.run(host="0.0.0.0", debug=False, port=8866)
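
To observe the readiness behavior from the client side during local testing, the sketch below polls the /healthy-check endpoint until it answers with status 200. probe_once and wait_until_ready are hypothetical helper names, and the local URL and port are taken from the example above. This only illustrates polling a readiness endpoint; it is not how the EverAI platform implements its probe.

```python
import time
import urllib.error
import urllib.request


def probe_once(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, attempts=30, delay=1.0, probe=probe_once, sleep=time.sleep):
    """Poll url up to `attempts` times, sleeping `delay` seconds between failed tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_ready("http://127.0.0.1:8866/healthy-check", attempts=3)
    print("ready" if ready else "not ready")
```

In the example app, prepare_model() runs before app.run(), so the endpoint starts answering only after the model has been loaded; until then the poll above sees a refused connection.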

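Build the image

This step uses a Dockerfile to build a container image for your application.

This is example code for a Dockerfile.

You can store the application image in a public image registry such as quay.io, Docker Hub, GitHub Container Registry, or Google Container Registry. A self-hosted registry also works, as long as the image can be pulled over the internet.

First make sure your Docker environment is logged in to the registry. We recommend building a multi-platform Docker image with docker buildx and pushing the built image to the registry you chose.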

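Create secrets

Secrets provide a secure way to add credentials and other sensitive information to your application containers.

Whether you need to create a secret depends on whether downloading the model files or pulling the Docker image requires authentication.

You can create and edit secrets in EverAI, or manage them by writing Python code.

In this example, we create a secret for quay.io.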

everai secret create your-quay-io-secret-name \
  --from-literal username=<your username> \
  --from-literal password=<your password>

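Define the manifest file

The manifest file defines the information needed to create an EverAI application, including the application name, image name, secrets, and volumes.

This is example code for app.yaml.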

Deploy

The final step is to deploy your application to EverAI and keep it running.

everai app create --from-file app.yaml

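After running everai app list, you will see output similar to the following. CREATED_AT is shown in UTC.

If the application status is DEPLOYED and the number of ready worker containers equals the desired number of worker containers (that is, 1/1), your application has been deployed successfully.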

NAME                               NAMESPACE    STATUS    WORKERS    CREATED_AT
---------------------------------  -----------  --------  ---------  ------------------------
llama2-7b-chat-manifest-private    default      DEPLOYED  1/1        2024-07-22T04:18:16+0000
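
The success condition described above (status DEPLOYED and ready workers equal to desired workers) can also be checked in a script. The sketch below parses text in the shape of the sample output. parse_app_list and is_deployed are hypothetical helpers, and the parsing assumes whitespace-separated columns with no spaces inside values, which holds for the sample but is not a documented guarantee of the CLI.

```python
SAMPLE = """\
NAME                               NAMESPACE    STATUS    WORKERS    CREATED_AT
---------------------------------  -----------  --------  ---------  ------------------------
llama2-7b-chat-manifest-private    default      DEPLOYED  1/1        2024-07-22T04:18:16+0000
"""


def parse_app_list(output: str) -> list[dict]:
    """Turn the table text into one dict per application row."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        if set(ln.strip()) <= {"-", " "}:
            continue  # skip the dashed separator line
        rows.append(dict(zip(header, ln.split())))
    return rows


def is_deployed(row: dict) -> bool:
    """True when STATUS is DEPLOYED and ready workers equal desired workers."""
    ready, desired = (int(n) for n in row["WORKERS"].split("/"))
    return row["STATUS"] == "DEPLOYED" and ready == desired


if __name__ == "__main__":
    for row in parse_app_list(SAMPLE):
        print(row["NAME"], "deployed" if is_deployed(row) else "not ready")
```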

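You can test the deployed code by sending the request below with curl. The console shows the answer that the Llama-2 (7B) model gives to the question, similar to the following.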

curl -X POST -d '{"prompt": "who are you"}' -H 'Content-Type: application/json' -H'Authorization: Bearer <your_token>' https://everai.expvent.com/api/routes/v1/<your namespace>/<your app name>/chat
who are you?

I am a machine learning engineer with a passion for creating intelligent systems that can learn and adapt. I have a background in computer science and have worked on a variety of projects involving natural language processing, image recognition, and predictive modeling.
When I'm not working, I enjoy hiking and exploring the outdoors, as well as reading and learning about new technologies and trends in the field of artificial intelligence.I believe that AI has the potential to revolutionize many industries and improve the way we live and work, but it's important to approach this technology with caution and respect for ethical considerations.
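
The same request can be sent from Python. The sketch below mirrors the curl command using only the standard library; build_chat_request and ask are hypothetical helper names, and the namespace, application name, and token are placeholders that you supply.

```python
import json
import urllib.request

# Route prefix taken from the curl example above.
BASE_URL = "https://everai.expvent.com/api/routes/v1"


def build_chat_request(namespace: str, app_name: str, token: str, prompt: str):
    """Build the POST request for the /chat endpoint without sending it."""
    url = f"{BASE_URL}/{namespace}/{app_name}/chat"
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def ask(namespace: str, app_name: str, token: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the plain-text answer."""
    req = build_chat_request(namespace, app_name, token, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A call would look like ask("<your namespace>", "<your app name>", "<your_token>", "who are you"). The generous default timeout is a guess, chosen because beam-search generation of up to 256 tokens can take a while.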
