I tried gemma-2b-it-gpu-int4 and gemma-2b-it-cpu-int4 on my phone. I'd like to test a gemma-7b-it-gpu-int4 as well: the 2b models were extremely snappy, and since MLC-LLM could handle a 7b Llama 2 on this phone, I assume Gemma 7b will fit too.
https://developers.google.com/mediapipe/solutions/genai/llm_inference#models offers four 2b Gemma variants out of the box, downloadable from Kaggle:
- gemma-2b-it-cpu-int4: Gemma 4-bit model with CPU compatibility.
- gemma-2b-it-cpu-int8: Gemma 8-bit model with CPU compatibility.
- gemma-2b-it-gpu-int4: Gemma 4-bit model with GPU compatibility.
- gemma-2b-it-gpu-int8: Gemma 8-bit model with GPU compatibility.
https://developers.google.com/mediapipe/solutions/genai/llm_inference#convert-model documents a converter, but its model_type parameter only accepts {"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"}. There is no 7b option.
What do I do for 7b Gemma? It looks like I could start from the PyTorch Gemma 7b checkpoint, but then what should the converter parameters be, especially model_type, so that I end up with a GPU or CPU model with 4-bit quantized weights and 16-bit floating-point precision?
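For reference, this is roughly what the documented conversion call looks like for the supported 2b case (a sketch based on the convert-model section linked above; the paths, backend choice, and checkpoint format are my assumptions, not something I have verified for 7b):

```python
from mediapipe.tasks.python.genai import converter

# Sketch of the documented converter invocation for the 2b checkpoint.
# Paths are placeholders; model_type currently has no GEMMA_7B value,
# which is exactly the gap I'm asking about.
config = converter.ConversionConfig(
    input_ckpt="/path/to/gemma-2b-it/",        # directory with the downloaded checkpoint (assumed safetensors)
    ckpt_format="safetensors",
    model_type="GEMMA_2B",                     # only the 2b variant is listed
    backend="gpu",                             # or "cpu"
    output_dir="/tmp/gemma-2b-it-intermediate/",
    combine_file_only=False,
    vocab_model_file="/path/to/gemma-2b-it/",  # directory containing the tokenizer files
    output_tflite_file="/tmp/gemma-2b-it-gpu-int4.bin",
)
converter.convert_checkpoint(config)
```

Is there an equivalent set of parameters (or a different model_type) that would make this work for the 7b checkpoint?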