Deploying the Qwen2.5-1.5B-Instruct Large Model with vLLM on Cloud Studio
P.S. I originally intended to deploy the QwQ-32B model, but even on the higher-tier HAI server it did not work (insufficient VRAM and memory, every attempt ended in failure). In the end I settled on deploying the Qwen2.5-1.5B-Instruct model for testing.
Here we use a Cloud Studio high-performance space, basic tier.
First, upgrade pip:
python -m pip install --upgrade pip
This may take a while, so please be patient.
Then install vLLM:
pip install vllm
After the installation succeeds, run the vllm command as a quick sanity check that everything works.
Next, install the remaining dependencies:
pip install modelscope
pip install openai
pip install tqdm
pip install transformers
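To confirm that everything installed cleanly, you can run a quick import check (a minimal sketch; the printed versions will differ on your machine):

import vllm
import transformers

# If both imports succeed, the core packages are in place
print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)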
The section between the dividing lines below can be skipped; there is no need to run it.
----------------- Dividing Line Start -----------------
1. Create a tmp folder in the current directory with mkdir tmp, or create it by hand.
Create model_download_32b.py with the following code:
from modelscope import snapshot_download
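# Download QwQ-32B from ModelScope into ./tmp; revision='master' tracks the default branch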
model_dir = snapshot_download('Qwen/QwQ-32B', cache_dir='./tmp', revision='master')
2. Run model_download_32b.py; it will download the QwQ-32B model. Since my machine is in the Singapore region, the download is relatively slow.
python model_download_32b.py
Need to wait a bit, U_U ~~
----------------- Dividing Line End -----------------
Because the machine sits in the Singapore data center, downloading models from the ModelScope community in mainland China is relatively slow.
You can use git lfs clone to fetch the model files from Hugging Face instead, which is faster.
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt-get install git-lfs
git lfs clone https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
This pulls the Qwen2.5-1.5B-Instruct model from Hugging Face.
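Alternatively, if you prefer to stay in Python, the huggingface_hub package offers an equivalent download (an extra assumption on my part, it is not among the packages installed above, so run pip install huggingface_hub first):

from huggingface_hub import snapshot_download

# Pull Qwen2.5-1.5B-Instruct from Hugging Face into ./Qwen2.5-1.5B-Instruct
model_dir = snapshot_download('Qwen/Qwen2.5-1.5B-Instruct', local_dir='./Qwen2.5-1.5B-Instruct')
print(model_dir)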
Due to VRAM and memory limits, every larger model I tried ended in failure, so I chose Qwen2.5-1.5B-Instruct for testing. This post was originally written against a machine deployed with HAI; unfortunately, the earlier models were not supported there, so that effort was wasted.
Okay, let's continue and wait for the pull to complete.
Next, create a server compatible with the OpenAI API interface.
Detailed vLLM usage can be found in the official documentation.
python -m vllm.entrypoints.openai.api_server \
--model ./Qwen2.5-1.5B-Instruct \
--served-model-name Qwen2.5-1.5B \
--max-model-len=2048 \
--dtype=half
If you see this startup output, the deployment succeeded.
Open https://ohaxxx.ap-singapore.cloudstudio.work/proxy/8000/version to check that the server can be reached.
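Besides the browser check, you can also query the OpenAI-compatible endpoint directly with the openai package installed earlier. A minimal sketch, assuming the server is listening locally on port 8000 and was started with --served-model-name Qwen2.5-1.5B:

from openai import OpenAI

# vLLM does not check the API key, so any placeholder string works
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')

response = client.chat.completions.create(
    model='Qwen2.5-1.5B',  # must match --served-model-name
    messages=[{'role': 'user', 'content': 'Hello, please introduce yourself.'}],
    max_tokens=128,
)
print(response.choices[0].message.content)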
Next, configure the client. This would normally be routine, but no matter how I configured it against the Cloud Studio proxy URL it refused to work, so the only option was to tunnel around it.
Fall back to the old trick: ssh srv.us -R 1:localhost:8000
If this errors out, create a key as the prompt suggests:
ssh-keygen -t ed25519
Press Enter to accept all the defaults, then rerun ssh srv.us -R 1:localhost:8000.
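Once the tunnel is up, you can verify it from any machine with a quick request before configuring the client (a sketch using the requests package, an extra dependency; replace the placeholder with the public URL that srv.us prints):

import requests

BASE_URL = 'https://your-tunnel.srv.us'  # hypothetical placeholder, use your own tunnel URL

# vLLM's OpenAI-compatible server lists the loaded models at /v1/models
resp = requests.get(f'{BASE_URL}/v1/models', timeout=10)
print(resp.status_code, resp.json())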
Configure the chatx client.
Make sure the base URL does not end with a slash.
Then send a test message in the chat dialog.
And with that, the vLLM deployment of the large model is complete. This can serve as a general deployment tutorial: as long as memory and VRAM are sufficient, it should in theory work for any large model on Hugging Face. The tutorial ends here.
Nearly 6 hours all told, almost 3 of them on HAI at about 3.5 per hour... The main thing is that it finally got written. In the end I did not even use the HAI custom machine, stepped into countless pitfalls, and finally finished.
U_U ~_~ D_D
Final shout, can you reimburse me for the costs incurred using hai!!!
Can you reimburse me for the costs incurred using hai!!!
Reimburse the costs!!!