I'm trying to use the `ort` crate to run the ONNX exports of CLIP hosted here:
https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-B-32/visual.onnx
https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-B-32/textual.onnx
(Which I found in https://github.com/Lednik7/CLIP-ONNX)
I can get the model to work, but only if I provide it with exactly 77 tokens. I'm hoping to figure out how to make it work with an arbitrary number of tokens.
Here's code that works, but only because I've made the input string exactly 77 tokens long:
```rust
use instant_clip_tokenizer::{Token, Tokenizer};
use ndarray::{Array1, Axis};
use ort::{inputs, GraphOptimizationLevel, Session};

pub fn load_text_model() -> ort::Result<()> {
    // The ONNX file has already been saved to models/textual.onnx:
    // `wget https://clip-as-service.s3.us-east-2.amazonaws.com/models/onnx/ViT-B-32/textual.onnx -O models/textual.onnx`
    let text_model = Session::builder()?
        .with_optimization_level(GraphOptimizationLevel::Level3)?
        .with_intra_threads(1)?
        .with_model_from_file("models/textual.onnx")?;

    // The tokenizer comes from
    // https://docs.rs/instant-clip-tokenizer/0.1.0/instant_clip_tokenizer
    let tokenizer = Tokenizer::new();

    // See `tokenize(...)` below. The string given here is just a dummy piece of
    // text that happens to be exactly 77 tokens long.
    let tokens = tokenize(tokenizer, "Hi there my name is john and I like to walk in the park with my son and daughter. when we go walking in the sun I like to feel it warm my neck and I like to hold their hands as they tell me about their day. sometimes they have had a poor day and it makes me sad to hear about their poor day but other times I hear about")
        .iter()
        .map(|tk| *tk as i64)
        .collect::<Vec<_>>();
    let tokens = Array1::from_iter(tokens);

    // Reshape the tokens to (1, 77) by adding a batch axis
    let array = tokens.view().insert_axis(Axis(0));
    let inputs = inputs!["input" => array]?;

    // Pass the inputs through the model
    let model_output = text_model.run(inputs)?;

    // Extract the embedding from the model output
    let outputs = model_output["output"].extract_tensor::<f32>()?;

    // This tensor is correct; I've verified it against a Python CLIP model
    println!("Output Tensor: {:?}", outputs);
    Ok(())
}

fn tokenize(tokenizer: Tokenizer, text: &str) -> Vec<u16> {
    let mut tokens = vec![tokenizer.start_of_text()];
    tokenizer.encode(text, &mut tokens);
    tokens.push(tokenizer.end_of_text());
    tokens.into_iter().map(Token::to_u16).collect()
}
```
If I change the string to be a bit shorter or a bit longer, e.g.:

```rust
// ...
let tokens = tokenize(tokenizer, "short string")
// ...
```
Then I get this error:
```
called `Result::unwrap()` on an `Err` value: SessionRun(Msg("
Got invalid dimensions for input: input for the following indices
index: 1 Got: 4 Expected: 77
Please fix either the inputs or the model."))
```
I suspect I'm missing some "padding" tokens to bring the input up to 77: "short string" tokenizes to just 4 tokens (start-of-text, two word tokens, end-of-text), which matches the `Got: 4` in the error. But I'm not sure how I'd go about adding that padding. I'm also not sure what to do with a piece of text that's longer than 77 tokens.
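For reference, here's the kind of fix I've been imagining, modeled on how OpenAI's Python tokenizer appears to zero-pad to a fixed context length of 77 and truncate longer inputs while keeping the end-of-text token in the last slot. The `pad_or_truncate` helper and `CONTEXT_LENGTH` constant are my own names, and using `0` as the padding id is a guess I haven't verified against this ONNX export:

```rust
/// Fixed context length of the CLIP text encoder.
const CONTEXT_LENGTH: usize = 77;

/// Hypothetical helper: zero-pad short token sequences up to 77, and
/// truncate longer ones while keeping the final token (assumed to be
/// end-of-text) in the last position. The pad id of 0 is an assumption.
fn pad_or_truncate(mut tokens: Vec<i64>) -> Vec<i64> {
    if tokens.len() > CONTEXT_LENGTH {
        // Preserve the last token (end-of-text) after truncating.
        let eot = *tokens.last().expect("token sequence is non-empty");
        tokens.truncate(CONTEXT_LENGTH);
        tokens[CONTEXT_LENGTH - 1] = eot;
    } else {
        // Zero-pad up to the fixed length.
        tokens.resize(CONTEXT_LENGTH, 0);
    }
    tokens
}

fn main() {
    // A short sequence gets padded out to 77 with zeros.
    let short = pad_or_truncate(vec![49406, 1, 2, 49407]);
    assert_eq!(short.len(), 77);
    assert_eq!(short[4], 0);

    // A long sequence gets cut to 77, keeping its final token last.
    let truncated = pad_or_truncate((0..100).collect());
    assert_eq!(truncated.len(), 77);
    assert_eq!(truncated[76], 99);
}
```

If that's roughly right, I'd call `pad_or_truncate` on the token vec before building the `Array1` — but I'd appreciate confirmation that zero is actually the correct padding id for this model.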