Llama-2(7B) Chat with public volume
In the quickstart, you created a simple application. In this example, we use the Llama-2(7B)
model files in a public volume to implement an AIGC (AI-generated content) online question-and-answer service.
Log in to the EverAI CLI
First, create a directory for your app. In that directory, log in with the token you obtained from EverAI.
everai login --token <your token>
Create secrets
This step is optional: skip it if you have already created a secret for the registry you need to access, or if your registry is publicly accessible.
In this example, we create one secret for quay.io.
everai secret create your-quay-io-secret-name \
--from-literal username=<your username> \
--from-literal password=<your password>
NOTE
quay.io is a well-known public image registry. Well-known image registries similar to quay.io include Docker Hub, GitHub Container Registry, Google Container Registry, etc.
You should create a secret to access Hugging Face as well.
everai secret create your-huggingface-secret-name \
--from-literal token-key-as-your-wish=<your huggingface token>
Create configmap
This step is also optional, but a configmap lets you adjust the autoscaling policy after the image is deployed.
everai configmap create llama2-configmap \
--from-literal min_workers=1 \
--from-literal max_workers=5 \
--from-literal min_free_workers=1 \
--from-literal scale_up_step=1 \
--from-literal max_idle_time=60
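These keys map onto the FreeWorkerAutoScaler parameters referenced later in app.py. As a rough, illustrative model of what the scale-up knobs control (this is a sketch for intuition, not EverAI's actual implementation):

```python
def desired_workers(current: int, free: int,
                    min_workers: int, max_workers: int,
                    min_free_workers: int, scale_up_step: int) -> int:
    """Illustrative scale-up rule: add scale_up_step workers whenever
    fewer than min_free_workers are idle, clamped to the configured bounds.
    (max_idle_time, not modeled here, governs when idle workers are retired.)
    """
    target = current
    if free < min_free_workers:
        target = current + scale_up_step
    return max(min_workers, min(max_workers, target))

# With the values above: 1 busy worker and 0 free workers triggers a scale-up.
print(desired_workers(current=1, free=0, min_workers=1, max_workers=5,
                      min_free_workers=1, scale_up_step=1))  # → 2
```

Because max_workers caps the result, a traffic spike can never push the app beyond 5 workers with this configmap.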
Write your app code in python
Basic setup
Example code is provided in app.py.
First, import the required EverAI Python class library. Then define the variables you will need, including the volume, the secret used to access the image registry, and the files stored in the volume. Use the Image.from_registry
static method to create an image instance, then create and configure an app instance through the App class.
Note that you must configure GPU resources for your application. The GPU model configured here is "A100 40G", and the number of GPU cards is 1.
from everai.app import App, context, VolumeRequest
from everai_autoscaler.builtin import FreeWorkerAutoScaler
from everai.image import Image, BasicAuth
from everai.resource_requests import ResourceRequests, CPUConstraints
from everai.placeholder import Placeholder
from image_builder import IMAGE
APP_NAME = '<your app name>'
VOLUME_NAME = 'everai/models--meta-llama--llama-2-7b-chat-hf'
MODEL_NAME = 'meta-llama/Llama-2-7b-chat-hf'
HUGGINGFACE_SECRET_NAME = 'your-huggingface-secret-name'
QUAY_IO_SECRET_NAME = 'your-quay-io-secret-name'
CONFIGMAP_NAME = 'llama2-configmap'
image = Image.from_registry(IMAGE, auth=BasicAuth(
    username=Placeholder(QUAY_IO_SECRET_NAME, 'username', kind='Secret'),
    password=Placeholder(QUAY_IO_SECRET_NAME, 'password', kind='Secret'),
))
app = App(
    APP_NAME,
    image=image,
    volume_requests=[
        VolumeRequest(name=VOLUME_NAME),
    ],
    secret_requests=[
        HUGGINGFACE_SECRET_NAME,
        QUAY_IO_SECRET_NAME,
    ],
    configmap_requests=[CONFIGMAP_NAME],
    autoscaler=FreeWorkerAutoScaler(
        # keep workers running even when there are no requests,
        # so new requests are served immediately
        min_workers=Placeholder(kind='ConfigMap', name=CONFIGMAP_NAME, key='min_workers'),
        # the upper bound on workers, protecting you from runaway costs
        # during an attack or a sudden traffic spike
        max_workers=Placeholder(kind='ConfigMap', name=CONFIGMAP_NAME, key='max_workers'),
        # controls when the autoscaler scales your app up
        min_free_workers=Placeholder(kind='ConfigMap', name=CONFIGMAP_NAME, key='min_free_workers'),
        # controls when the autoscaler scales your app down
        max_idle_time=Placeholder(kind='ConfigMap', name=CONFIGMAP_NAME, key='max_idle_time'),
        # how many workers the autoscaler adds in each scale-up step
        scale_up_step=Placeholder(kind='ConfigMap', name=CONFIGMAP_NAME, key='scale_up_step'),
    ),
    resource_requests=ResourceRequests(
        cpu_num=2,
        memory_mb=20480,
        gpu_num=1,
        gpu_constraints=[
            "A100 40G",
        ],
        cpu_constraints=CPUConstraints(
            platforms=['amd64', 'arm64'],
        ),
        cuda_version_constraints=">=12.4",
    ),
)
Load model
You can load the model from the model files in the public volume everai/models--meta-llama--llama-2-7b-chat-hf
that we provide.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer, PreTrainedTokenizerBase, TextIteratorStreamer
@app.prepare()
def prepare_model():
    volume = context.get_volume(VOLUME_NAME)
    assert volume is not None and volume.ready

    secret = context.get_secret(HUGGINGFACE_SECRET_NAME)
    assert secret is not None
    huggingface_token = secret.get('token-key-as-your-wish')

    model_dir = volume.path

    global model
    global tokenizer

    model = LlamaForCausalLM.from_pretrained(MODEL_NAME,
                                             token=huggingface_token,
                                             cache_dir=model_dir,
                                             torch_dtype=torch.float16,
                                             local_files_only=True)
    tokenizer = LlamaTokenizer.from_pretrained(MODEL_NAME,
                                               token=huggingface_token,
                                               cache_dir=model_dir,
                                               local_files_only=True)

    if torch.cuda.is_available():
        model.cuda(0)
If you want to debug this example locally with everai app run,
your local debugging environment needs GPU resources, and you should use the everai volume pull
command to pull the model files from the cloud to your local environment before debugging the code.
everai volume pull everai/models--meta-llama--llama-2-7b-chat-hf
Generate inference service
After loading the Llama-2(7B)
model, you can write the Python code that uses flask
to implement the online AIGC (AI-generated content) inference service.
import flask
import typing

tokenizer: typing.Optional[PreTrainedTokenizerBase] = None
model = None

# service entrypoint
# the api service url looks like https://everai.expvent.com/api/routes/v1/default/llama2-7b-chat/chat
# for local testing, the url is http://127.0.0.1/chat
@app.service.route('/chat', methods=['GET', 'POST'])
def chat():
    if flask.request.method == 'POST':
        data = flask.request.json
        prompt = data['prompt']
    else:
        prompt = flask.request.args["prompt"]

    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    input_ids = input_ids.to('cuda:0')

    output = model.generate(input_ids, max_length=256, num_beams=4, no_repeat_ngram_size=2)
    response = tokenizer.decode(output[0], skip_special_tokens=True)

    # return the text as-is, with extra information, or in any structure you need
    resp = flask.Response(response, mimetype='text/plain', headers={'x-prompt-hash': 'xxxx'})
    return resp
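While the service is running locally under everai app run, you can exercise the /chat route without curl. A minimal sketch using only the Python standard library (the local URL comes from the comment in the handler above; the build_chat_request helper is an illustrative name, and the Bearer token is only needed against the deployed endpoint):

```python
import json
import typing
import urllib.request

LOCAL_URL = 'http://127.0.0.1/chat'

def build_chat_request(prompt: str, url: str = LOCAL_URL,
                       token: typing.Optional[str] = None) -> urllib.request.Request:
    # POST variant: the /chat handler reads the prompt from the JSON body.
    headers = {'Content-Type': 'application/json'}
    if token is not None:
        headers['Authorization'] = f'Bearer {token}'
    body = json.dumps({'prompt': prompt}).encode('utf-8')
    return urllib.request.Request(url, data=body, headers=headers, method='POST')

# Send it while the app is running locally:
# with urllib.request.urlopen(build_chat_request('who are you')) as resp:
#     print(resp.read().decode())
```

A GET variant also works, since the handler falls back to the query string: http://127.0.0.1/chat?prompt=who%20are%20you.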
Build image
This step builds the container image, using two very simple files: Dockerfile
and image_builder.py.
Example code is provided in image_builder.py.
In image_builder.py
, you should set your image repository.
In this example, we use quay.io as the public image registry to store application images. You can also use well-known image registries similar to quay.io, such as Docker Hub, GitHub Container Registry, Google Container Registry, etc. If you have a self-built image registry and the image can be accessed over the Internet, you can use that as well.
from everai.image import Builder
IMAGE = 'quay.io/<username>/<repo>:<tag>'
This step depends on docker and buildx being installed on your machine; if they are missing, you will see further prompts to help you install them.
docker login quay.io
docker buildx version
Then run the following command to build the image and push it to the registry you specified.
everai image build
Deploy image
The final step is to deploy your app to EverAI and keep it running.
everai app create
After running everai app list
, you should see output similar to the following. Note that CREATED_AT
is displayed in UTC.
If your app's status is DEPLOYED
and the number of ready worker containers equals the expected number, shown as 1/1
, your app is deployed successfully.
NAME NAMESPACE STATUS WORKERS CREATED_AT
--------------------- ----------- -------- --------- ------------------------
llama2-7b-chat default DEPLOYED 1/1 2024-06-19T08:07:24+0000
When your app is deployed, you can use curl
to execute the following request to test your deployed code. The Llama-2(7B)
model's answer to the question is printed on the console, as shown below.
curl -X POST -d '{"prompt": "who are you"}' -H 'Content-Type: application/json' -H'Authorization: Bearer <your_token>' https://everai.expvent.com/api/routes/v1/<your namespace>/<your app name>/chat
who are you?
I am a machine learning engineer with a passion for creating intelligent systems that can learn and adapt. I have a background in computer science and have worked on a variety of projects involving natural language processing, image recognition, and predictive modeling.
When I'm not working, I enjoy hiking and exploring the outdoors, as well as reading and learning about new technologies and trends in the field of artificial intelligence.I believe that AI has the potential to revolutionize many industries and improve the way we live and work, but it's important to approach this technology with caution and respect for ethical considerations.