SAM2 (Segment Anything 2) is a new model by Meta that aims to segment anything in an image without being limited to specific classes or domains. What makes this model unique is the scale of the data on which it was trained: 11 million images and 1.1 billion masks. This extensive training makes SAM2 a powerful starting point for training on new image segmentation tasks.
The question you might ask is: if SAM can segment anything, why do we even need to retrain it? The answer is that SAM is very good at common objects but can perform rather poorly on rare or domain-specific tasks.
However, even in cases where SAM gives insufficient results, it is still possible to significantly improve the model’s ability by fine-tuning it on new data. In many cases, this will take less training data and give better results than training a model from scratch.
This tutorial demonstrates how to fine-tune SAM2 on new data in just 60 lines of code (excluding comments and imports).
The full training script can be found at:
The main way SAM works is by taking an image and a point in the image and predicting the mask of the segment that contains the point. This approach enables full image segmentation without human intervention and with no limits on the classes or types of segments (as discussed in a previous post).
The procedure for using SAM for full image segmentation:
- Select a set of points in the image
- Use SAM to predict the segment containing each point
- Combine the resulting segments into a single map
While SAM can also utilize other inputs like masks or bounding boxes, these are mainly relevant for interactive segmentation involving human input. For this tutorial, we’ll focus on fully automatic segmentation and will only consider single-point input.
More details on the model are available on the project website.
SAM2 can be downloaded from:
If you don’t want to copy the training code, you can also download my forked version that already contains the TRAIN.py script.
Follow the installation instructions on the GitHub repository.
In general, you need Python >=3.11 and PyTorch.
In addition, we will use OpenCV, which can be installed using:
pip install opencv-python
Downloading pre-trained model
You also need to download the pre-trained model from:
https://github.com/facebookresearch/segment-anything-2?tab=readme-ov-file#download-checkpoints
There are several models to choose from, all compatible with this tutorial. I recommend using the small model, which is the fastest to train.
The next step is to download the dataset that will be used to fine-tune the model. For this tutorial, we will use the LabPics1 dataset for segmenting materials and liquids. You can download the dataset from this URL:
https://zenodo.org/records/3697452/files/LabPicsV1.zip?download=1
The first thing we need to write is the data reader. It will read and prepare the data for the net.
The data reader needs to produce:
- An image
- Masks of all the segments in the image
- A random point inside each mask
Let’s start by loading dependencies:
import numpy as np
import torch
import cv2
import os
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
Next we list all the images in the dataset:
data_dir=r"LabPicsV1//" # Path to LabPics1 dataset folder
data=[] # list of files in dataset
for ff, name in enumerate(os.listdir(data_dir+"Simple/Train/Image/")): # go over all images in the folder
    data.append({"image":data_dir+"Simple/Train/Image/"+name,"annotation":data_dir+"Simple/Train/Instance/"+name[:-4]+".png"})
Now for the main function that will load the training batch. The training batch includes one random image, all the segmentation masks that belong to this image, and a random point in each mask:
def read_batch(data): # read random image and its annotation from the dataset (LabPics)
    # select image
    ent = data[np.random.randint(len(data))] # choose random entry
    Img = cv2.imread(ent["image"])[...,::-1] # read image
    ann_map = cv2.imread(ent["annotation"]) # read annotation
    # resize image
    r = np.min([1024 / Img.shape[1], 1024 / Img.shape[0]]) # scaling factor
    Img = cv2.resize(Img, (int(Img.shape[1] * r), int(Img.shape[0] * r)))
    ann_map = cv2.resize(ann_map, (int(ann_map.shape[1] * r), int(ann_map.shape[0] * r)),interpolation=cv2.INTER_NEAREST)
    # merge vessels and materials annotations
    mat_map = ann_map[:,:,0] # material annotation map
    ves_map = ann_map[:,:,2] # vessel annotation map
    mat_map[mat_map==0] = ves_map[mat_map==0]*(mat_map.max()+1) # merged map
    # Get binary masks and points
    inds = np.unique(mat_map)[1:] # load all indices
    points= []
    masks = []
    for ind in inds:
        mask=(mat_map == ind).astype(np.uint8) # make binary mask
        masks.append(mask)
        coords = np.argwhere(mask > 0) # get all coordinates in mask
        yx = np.array(coords[np.random.randint(len(coords))]) # choose random point/coordinate
        points.append([[yx[1], yx[0]]])
    return Img,np.array(masks),np.array(points), np.ones([len(masks),1])
The first part of this function is choosing a random image and loading it:
ent = data[np.random.randint(len(data))] # choose random entry
Img = cv2.imread(ent["image"])[...,::-1] # read image
ann_map = cv2.imread(ent["annotation"]) # read annotation
Note that OpenCV reads images as BGR while SAM expects RGB images. By using [...,::-1] we change the image from BGR to RGB.
SAM expects the image size not to exceed 1024 pixels, so we rescale the image and the annotation map so that their longer side is 1024:
r = np.min([1024 / Img.shape[1], 1024 / Img.shape[0]]) # scaling factor
Img = cv2.resize(Img, (int(Img.shape[1] * r), int(Img.shape[0] * r)))
ann_map = cv2.resize(ann_map, (int(ann_map.shape[1] * r), int(ann_map.shape[0] * r)),interpolation=cv2.INTER_NEAREST)
An important point here is that when resizing the annotation map (ann_map) we use INTER_NEAREST mode (nearest neighbors). In the annotation map, each pixel value is the index of the segment it belongs to. As a result, it is important to use a resizing method that does not introduce new values to the map.
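To see why nearest-neighbor resizing matters for an index map, here is a small NumPy-only sketch. It does not use OpenCV, and the helpers resize_nearest and resize_linear_rows are illustrative names that are not part of the tutorial code. It compares copying existing pixels with linear interpolation on a toy map:

```python
import numpy as np

def resize_nearest(label_map, new_h, new_w):
    """Nearest-neighbor style resize: every output pixel copies one input pixel."""
    h, w = label_map.shape
    rows = np.arange(new_h) * h // new_h   # source row for each output row
    cols = np.arange(new_w) * w // new_w   # source column for each output column
    return label_map[rows[:, None], cols[None, :]]

def resize_linear_rows(label_map, new_w):
    """Linear interpolation along each row (for comparison only)."""
    h, w = label_map.shape
    x_new = np.linspace(0, w - 1, new_w)
    return np.stack([np.interp(x_new, np.arange(w), row) for row in label_map])

# Toy annotation map with segment indices 0, 1 and 5
label_map = np.array([[0, 0, 5, 5],
                      [1, 1, 5, 5]], dtype=np.uint8)

nearest = resize_nearest(label_map, 4, 8)
linear = resize_linear_rows(label_map, 7)

print(np.unique(nearest))  # only indices that exist in the input
print(np.unique(linear))   # also contains in-between values such as 2.5
```

The interpolated map contains values that do not correspond to any segment, which is exactly what INTER_NEAREST avoids.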
The next block is specific to the format of the LabPics1 dataset. The annotation map (ann_map) contains a segmentation map for the vessels in the image in one channel, and another map for the materials annotation in a different channel. We are going to merge them into a single map.
mat_map = ann_map[:,:,0] # material annotation map
ves_map = ann_map[:,:,2] # vessel annotation map
mat_map[mat_map==0] = ves_map[mat_map==0]*(mat_map.max()+1) # merged map
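The merge rule can be checked on a toy example. The sketch below uses made-up 2x3 maps rather than LabPics data. As an assumption made only for this illustration, the arrays are int32 so that the multiplication cannot wrap around:

```python
import numpy as np

# Toy material and vessel index maps (0 = no annotation).
# int32 is an assumption of this sketch, chosen so the product below cannot overflow.
mat_map = np.array([[1, 1, 0],
                    [0, 2, 0]], dtype=np.int32)
ves_map = np.array([[1, 1, 1],
                    [0, 2, 2]], dtype=np.int32)

offset = mat_map.max() + 1               # first index not used by any material (3 here)
empty = mat_map == 0                     # pixels with no material annotation
merged = mat_map.copy()
merged[empty] = ves_map[empty] * offset  # vessel k becomes index k * offset

print(merged)
# [[1 1 3]
#  [0 2 6]]
```

Material pixels keep their index, vessel-only pixels get indices that cannot collide with any material index, and pixels with neither annotation stay 0.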
What this gives us is a map (mat_map) in which the value of each pixel is the index of the segment to which it belongs (for example, all cells with value 3 belong to segment 3). We want to transform this into a set of binary masks (0/1) where each mask corresponds to a different segment. In addition, from each mask, we want to extract a single point.
inds = np.unique(mat_map)[1:] # list of all indices in map
points= [] # list of all points (one for each mask)
masks = [] # list of all masks
for ind in inds:
    mask = (mat_map == ind).astype(np.uint8) # make binary mask for index ind
    masks.append(mask)
    coords = np.argwhere(mask > 0) # get all coordinates in mask
    yx = np.array(coords[np.random.randint(len(coords))]) # choose random point/coordinate
    points.append([[yx[1], yx[0]]])
return Img,np.array(masks),np.array(points), np.ones([len(masks),1])
That’s it! We got the image (Img), a list of binary masks corresponding to segments in the image (masks), and for each mask the coordinate of a single point inside the mask (points).
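As a quick check of what the reader returns, the following NumPy-only sketch repeats the mask and point extraction on a made-up 3x4 index map (not a LabPics annotation) and prints the array shapes. The variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy merged index map: 0 is background, 1 and 2 are segments
index_map = np.array([[0, 1, 1, 0],
                      [0, 1, 2, 2],
                      [0, 0, 2, 2]], dtype=np.uint8)

masks, points = [], []
for ind in np.unique(index_map)[1:]:            # skip index 0 (background)
    mask = (index_map == ind).astype(np.uint8)  # binary mask of this segment
    coords = np.argwhere(mask > 0)              # (row, col) of every mask pixel
    y, x = coords[rng.integers(len(coords))]    # one random pixel inside the mask
    masks.append(mask)
    points.append([[x, y]])                     # stored as (x, y)

masks = np.array(masks)
points = np.array(points)
labels = np.ones([len(masks), 1])               # 1 marks a foreground point

print(masks.shape, points.shape, labels.shape)  # (2, 3, 4) (2, 1, 2) (2, 1)
```

Note that np.argwhere returns (row, column) pairs, so the coordinates are swapped to (x, y) before being stored, just as in read_batch.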
Now let’s load the net:
sam2_checkpoint = "sam2_hiera_small.pt" # path to model weights
model_cfg = "sam2_hiera_s.yaml" # model config
sam2_model = build_sam2(model_cfg, sam2_checkpoint, device="cuda") # load model
predictor = SAM2ImagePredictor(sam2_model) # load net
First, we set the path to the model weights in the sam2_checkpoint parameter. We downloaded the weights earlier (see the checkpoint link above). “sam2_hiera_small.pt” refers to the small model, but the code will work for any model you choose. Whichever model you choose, you need to set the corresponding config file in the model_cfg parameter. The config files are already located in the sub folder “sam2_configs/” of the main repository.
Before setting training parameters we need to understand the basic structure of the SAM model.
SAM consists of three parts:
1) Image encoder, 2) Prompt encoder, 3) Mask decoder.
The image encoder is responsible for processing the image and creating the embedding that represents the image. This part consists of a ViT transformer and is the largest component of the net. We usually don’t want to train it, as it already gives a good representation and training it demands a lot of resources.
The prompt encoder processes the additional input to the net, in our case the input point.
The mask decoder takes the output of the image encoder and prompt encoder and produces the final segmentation masks. In general, we want to train only the mask decoder and maybe the prompt encoder. These parts are lightweight and can be fine-tuned quickly with a modest GPU.
We can enable the training of the mask decoder and prompt encoder by setting:
predictor.model.sam_mask_decoder.train(True) # enable training of mask decoder
predictor.model.sam_prompt_encoder.train(True) # enable training of prompt encoder
Next, we define the standard AdamW optimizer:
optimizer=torch.optim.AdamW(params=predictor.model.parameters(),lr=1e-5,weight_decay=4e-5)
We are also going to use mixed precision training, which is just a more memory-efficient training strategy:
scaler = torch.cuda.amp.GradScaler() # set mixed precision
Now let’s build the main training loop. The first part is reading and preparing the data:
for itr in range(100000):
    with torch.cuda.amp.autocast(): # cast to mixed precision
        image,masks,input_point, input_label = read_batch(data) # load data batch
        if masks.shape[0]==0: continue # ignore empty batches
        predictor.set_image(image) # apply SAM image encoder to the image
First we cast the data to mixed precision for efficient training:
with torch.cuda.amp.autocast():
Next, we use the reader function we created earlier to read training data:
image,masks,input_point, input_label = read_batch(data)
We take the image we loaded and pass it through the image encoder (the first part of the net):
predictor.set_image(image)
Next, we process the input points using the net’s prompt encoder:
mask_input, unnorm_coords, labels, unnorm_box = predictor._prep_prompts(input_point, input_label, box=None, mask_logits=None, normalize_coords=True)
sparse_embeddings, dense_embeddings = predictor.model.sam_prompt_encoder(points=(unnorm_coords, labels),boxes=None,masks=None,)
Note that in this part we could also input boxes or masks, but we are not going to use these options.
Now that we have encoded both the prompt (points) and the image, we can finally predict the segmentation masks:
batched_mode = unnorm_coords.shape[0] > 1 # multi mask prediction
high_res_features = [feat_level[-1].unsqueeze(0) for feat_level in predictor._features["high_res_feats"]]
low_res_masks, prd_scores, _, _ = predictor.model.sam_mask_decoder(
    image_embeddings=predictor._features["image_embed"][-1].unsqueeze(0),
    image_pe=predictor.model.sam_prompt_encoder.get_dense_pe(),
    sparse_prompt_embeddings=sparse_embeddings,
    dense_prompt_embeddings=dense_embeddings,
    multimask_output=True,
    repeat_image=batched_mode,
    high_res_features=high_res_features,)
prd_masks = predictor._transforms.postprocess_masks(low_res_masks, predictor._orig_hw[-1]) # Upscale the masks to the original image resolution
The main part in this code is model.sam_mask_decoder, which runs the mask decoder part of the net and generates the segmentation masks (low_res_masks) and their scores (prd_scores).
These masks are in lower resolution than the original input image and are resized to the original input size in the postprocess_masks function.
This gives us the final prediction of the net: 3 segmentation masks (prd_masks) for each input point we used and the mask scores (prd_scores). prd_masks contains 3 predicted masks for each input point, but we are only going to use the first mask for each point. prd_scores contains a score of how good the net thinks each mask is (or how sure it is in the prediction).
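The indexing used to keep only the first candidate can be illustrated with stand-in arrays. The sketch below assumes the layout described above (one row per input point, three candidate masks per point) and uses random NumPy arrays instead of real decoder outputs. The sizes are arbitrary:

```python
import numpy as np

num_points, masks_per_point, height, width = 4, 3, 8, 8

# Stand-ins for the decoder outputs (random values, only the shapes matter)
prd_masks = np.random.randn(num_points, masks_per_point, height, width)
prd_scores = np.random.rand(num_points, masks_per_point)

first_masks = prd_masks[:, 0]    # first candidate mask of every point
first_scores = prd_scores[:, 0]  # its predicted score

print(first_masks.shape, first_scores.shape)  # (4, 8, 8) (4,)
```

After this selection there is exactly one predicted mask and one score per input point, which lines up with the one ground truth mask per point returned by the data reader.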
Segmentation loss
Now that we have the net predictions, we can calculate the loss. First, we calculate the segmentation loss, which measures how good the predicted mask is compared to the ground truth mask. For this, we use the standard cross entropy loss.
First we need to convert the prediction masks (prd_masks) from logits into probabilities using the sigmoid function:
prd_mask = torch.sigmoid(prd_masks[:, 0]) # Turn logit map to probability map
Next we convert the ground truth masks into a torch tensor:
gt_mask = torch.tensor(masks.astype(np.float32)).cuda() # ground truth masks as a float tensor on the GPU
Finally, we calculate the cross entropy loss (seg_loss) manually using the ground truth (gt_mask) and predicted probability maps (prd_mask):
seg_loss = (-gt_mask * torch.log(prd_mask + 0.00001) - (1 - gt_mask) * torch.log((1 - prd_mask) + 0.00001)).mean() # cross entropy loss
(we add 0.00001 to prevent the log function from exploding for zero values).
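The effect of the small constant can be seen in a NumPy re-implementation of the same formula. This is a sketch for illustration only: the training code uses torch tensors, and the helper names sigmoid and bce_loss are made up here.

```python
import numpy as np

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))

def bce_loss(gt_mask, prob_mask, eps=0.00001):
    """Mean binary cross entropy between a 0/1 mask and a probability map."""
    return (-gt_mask * np.log(prob_mask + eps)
            - (1 - gt_mask) * np.log((1 - prob_mask) + eps)).mean()

gt_mask = np.array([[1.0, 0.0],
                    [1.0, 0.0]])

good_logits = np.array([[6.0, -6.0], [6.0, -6.0]])  # agrees with gt_mask
bad_logits = -good_logits                            # disagrees everywhere

good = bce_loss(gt_mask, sigmoid(good_logits))
bad = bce_loss(gt_mask, sigmoid(bad_logits))
saturated = bce_loss(gt_mask, 1.0 - gt_mask)  # probabilities exactly 0 or 1, all wrong

print(good, bad, saturated)
```

A prediction that agrees with the ground truth gives a loss close to zero and a wrong one gives a large loss. Even when the probabilities are exactly 0 or 1 the loss stays finite (about 11.5, which is -log(0.00001)) instead of becoming infinite.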
Score loss (optional)
In addition to the masks, the net also predicts a score for how good each predicted mask is. Training this part is less important but can be useful. To train this part we first need to know the true score of each predicted mask, meaning how good the predicted mask actually is. We are going to do this by comparing the GT mask and the corresponding predicted mask using the intersection over union (IOU) metric. IOU is simply the overlap between the two masks, divided by the combined area of the two masks. First, we calculate the intersection between the predicted and GT masks (the area in which they overlap):
inter = (gt_mask * (prd_mask > 0.5)).sum(1).sum(1)
We use a threshold (prd_mask > 0.5) to turn the prediction mask from a probability map into a binary mask.
Next, we get the IOU by dividing the intersection by the combined area (union) of the predicted and GT masks:
iou = inter / (gt_mask.sum(1).sum(1) + (prd_mask > 0.5).sum(1).sum(1) - inter)
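The same two lines can be tried on a toy example where the answer is easy to verify by hand. This sketch uses NumPy arrays instead of torch tensors, and the function name batch_iou is illustrative:

```python
import numpy as np

def batch_iou(gt_mask, prob_mask, threshold=0.5):
    """IoU per mask for arrays of shape (N, H, W)."""
    pred = prob_mask > threshold                 # probability map -> binary mask
    inter = (gt_mask * pred).sum(1).sum(1)       # overlapping pixels
    union = gt_mask.sum(1).sum(1) + pred.sum(1).sum(1) - inter
    return inter / union

gt = np.zeros((1, 4, 4))
gt[0, :2, :] = 1        # 8 pixels: top two rows
prob = np.zeros((1, 4, 4))
prob[0, 1:3, :] = 0.9   # 8 pixels: middle two rows

print(batch_iou(gt, prob))  # intersection 4, union 8 + 8 - 4 = 12, so IoU = 1/3
```

In the training loop the union is never zero, because every ground truth mask produced by the reader contains at least one pixel.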
We are going to use the IOU as the true score for each mask, and get the score loss as the absolute difference between the predicted scores and the IOU we just calculated.
score_loss = torch.abs(prd_scores[:, 0] - iou).mean()
Finally, we merge the segmentation loss and score loss (giving much higher weight to the first):
loss = seg_loss+score_loss*0.05 # mix losses
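A short numeric example shows how little the score term contributes with the 0.05 weight. All values below are hypothetical and chosen only to make the arithmetic easy to follow:

```python
import numpy as np

seg_loss = 0.30                               # hypothetical segmentation loss
prd_scores = np.array([0.90, 0.40])           # hypothetical predicted scores
iou = np.array([0.80, 0.70])                  # hypothetical measured IoU

score_loss = np.abs(prd_scores - iou).mean()  # (0.10 + 0.30) / 2 = 0.20
loss = seg_loss + score_loss * 0.05           # 0.30 + 0.01 = 0.31
print(score_loss, loss)
```

Here a fairly large score error of 0.20 adds only 0.01 to the total, so the gradient is dominated by the segmentation term.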
Once we get the loss everything is completely standard. We run backpropagation and update the weights using the optimizer we made earlier:
predictor.model.zero_grad() # empty gradient
scaler.scale(loss).backward() # Backpropagate
scaler.step(optimizer)
scaler.update() # Mixed precision
We also want to save the trained model once every 1000 steps:
if itr%1000==0: torch.save(predictor.model.state_dict(), "model.torch") # save model
Since we already calculated the IOU, we can display it as a moving average to see how well the model predictions are improving over time:
if itr==0: mean_iou=0
mean_iou = mean_iou * 0.99 + 0.01 * np.mean(iou.cpu().detach().numpy())
print("step",itr, "Accuracy(IOU)=",mean_iou)
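One property of this moving average is worth knowing when reading the log: because it starts at 0, the printed value underestimates the real IOU for the first few hundred steps. The pure-Python sketch below feeds a constant, made-up IOU of 0.8 into the same update rule. The function name update_mean is illustrative:

```python
def update_mean(mean_iou, new_iou, momentum=0.99):
    """One step of the exponential moving average used for logging."""
    return mean_iou * momentum + (1 - momentum) * new_iou

mean_iou = 0.0
history = {}
for step in range(1, 501):
    mean_iou = update_mean(mean_iou, 0.8)  # pretend every batch has IoU 0.8
    if step in (10, 100, 500):
        history[step] = mean_iou

print(history)  # roughly 0.08 after 10 steps, 0.51 after 100, 0.79 after 500
```

After n steps the average equals 0.8 * (1 - 0.99**n), so low values early in training reflect the warm-up of the average rather than poor predictions.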
And that’s it: we have trained/fine-tuned Segment Anything 2 in less than 60 lines of code (not including comments and imports). After about 25,000 steps you should see major improvement.
The model will be saved to “model.torch”.
You can find the full training code at:
To see how to load and use the model we just trained, check the next section.
Now that the model has been fine-tuned, let’s use it to segment an image.
We are going to do this using the following steps:
- Load the model we just trained.
- Give the model an image and a bunch of random points. For each point the net will predict the segment mask that contains this point and a score.
- Take these masks and stitch them together into one segmentation map.
The full code for doing that is available at:
First, we load the dependencies and switch to bfloat16 via autocast, which makes the model much faster to run (only possible for inference).
import numpy as np
import torch
import cv2
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
# use bfloat16 for the entire script (memory efficient)
torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
Next, we load a sample image and a mask of the image region we want to segment (download image/mask):
image_path = r"sample_image.jpg" # path to image
mask_path = r"sample_mask.png" # path to mask, the mask defines the image region to segment
def read_image(image_path, mask_path): # read and resize image and mask
    img = cv2.imread(image_path)[...,::-1] # read image as RGB
    mask = cv2.imread(mask_path,0) # mask of the region we want to segment
    # Resize image to maximum size of 1024
    r = np.min([1024 / img.shape[1], 1024 / img.shape[0]])
    img = cv2.resize(img, (int(img.shape[1] * r), int(img.shape[0] * r)))
    mask = cv2.resize(mask, (int(mask.shape[1] * r), int(mask.shape[0] * r)),interpolation=cv2.INTER_NEAREST)
    return img, mask
image,mask = read_image(image_path, mask_path)
Sample 30 random points inside the region we want to segment:
num_samples = 30 # number of points/segments to sample
def get_points(mask,num_points): # Sample points inside the input mask
    points=[]
    for i in range(num_points):
        coords = np.argwhere(mask > 0)
        yx = np.array(coords[np.random.randint(len(coords))])
        points.append([[yx[1], yx[0]]])
    return np.array(points)
input_points = get_points(mask,num_samples)
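The point sampler recomputes the list of mask coordinates on every loop iteration. As an optional variation (a sketch, not the tutorial’s code), the coordinates can be computed once and all points drawn in a single call. The function name sample_points and the toy region are made up for this example:

```python
import numpy as np

def sample_points(region_mask, num_points, rng=None):
    """Sample num_points random (x, y) pixels inside a binary mask."""
    rng = np.random.default_rng() if rng is None else rng
    coords = np.argwhere(region_mask > 0)          # computed once, as (row, col)
    picks = coords[rng.integers(len(coords), size=num_points)]
    return picks[:, None, ::-1].copy()             # (num_points, 1, 2) as (x, y)

region = np.zeros((6, 8), dtype=np.uint8)
region[2:5, 3:7] = 255                             # toy region of interest

pts = sample_points(region, 30, np.random.default_rng(0))
print(pts.shape)  # (30, 1, 2)
```

The output has the same shape and (x, y) ordering as the array returned by get_points, so it can be passed to the predictor in the same way.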
Load the standard SAM model (same as in training):
# Load model; you need to have the pretrained model already downloaded
sam2_checkpoint = "sam2_hiera_small.pt"
model_cfg = "sam2_hiera_s.yaml"
sam2_model = build_sam2(model_cfg, sam2_checkpoint, device="cuda")
predictor = SAM2ImagePredictor(sam2_model)
Next, load the weights of the model we just trained (model.torch):
predictor.model.load_state_dict(torch.load("model.torch"))
Run the fine-tuned model to predict a segmentation mask for every point we selected earlier:
with torch.no_grad(): # prevent the net from calculating gradients (more efficient inference)
    predictor.set_image(image) # image encoder
    masks, scores, logits = predictor.predict( # prompt encoder + mask decoder
        point_coords=input_points,
        point_labels=np.ones([input_points.shape[0],1])
    )
Now we have a list of predicted masks and their scores. We want to somehow stitch them into a single consistent segmentation map. However, many of the masks overlap and might be inconsistent with each other.
The approach to stitching is simple:
First we sort the predicted masks according to their predicted scores:
masks=masks[:,0].astype(bool)
shorted_masks = masks[np.argsort(scores[:,0])][::-1].astype(bool)
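The sorting step relies on np.argsort returning indices in ascending order, which are then reversed to get the best mask first. A toy example with three made-up scores and tiny 1x2 masks shows the resulting order:

```python
import numpy as np

scores = np.array([[0.2], [0.9], [0.5]])   # one score per mask, shape (N, 1)
masks = np.array([[[1, 0]],
                  [[0, 1]],
                  [[1, 1]]], dtype=bool)   # shape (N, H, W) with H=1, W=2

order = np.argsort(scores[:, 0])[::-1]     # indices from highest to lowest score
sorted_masks = masks[order]

print(order)  # [1 2 0]
```

Reversing the index array before indexing, as done here, gives the same result as indexing first and reversing the masks afterwards.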
Now let’s create an empty segmentation map and occupancy map:
seg_map = np.zeros_like(shorted_masks[0],dtype=np.uint8)
occupancy_mask = np.zeros_like(shorted_masks[0],dtype=bool)
Next, we add the masks one by one (from high to low score) to the segmentation map. We only add a mask if it is consistent with the masks that were previously added, which means only if the mask we want to add has less than 15% overlap with already occupied areas.
for i in range(shorted_masks.shape[0]):
    mask = shorted_masks[i]
    if (mask*occupancy_mask).sum()/mask.sum()>0.15: continue
    mask[occupancy_mask]=0
    seg_map[mask]=i+1
    occupancy_mask[mask]=1
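The stitching rule can be traced on three one-row masks whose overlaps are easy to count. The sketch below wraps the same logic in a function (stitch_masks is an illustrative name). It also skips empty masks explicitly, a small guard that the loop above does not have:

```python
import numpy as np

def stitch_masks(sorted_masks, max_overlap=0.15):
    """Merge boolean masks (best score first) into one index map."""
    seg_map = np.zeros(sorted_masks[0].shape, dtype=np.uint8)
    occupied = np.zeros(sorted_masks[0].shape, dtype=bool)
    for i, mask in enumerate(sorted_masks):
        mask = mask.copy()
        if mask.sum() == 0:                    # added guard: nothing to add
            continue
        if (mask & occupied).sum() / mask.sum() > max_overlap:
            continue                           # conflicts with better masks
        mask[occupied] = False                 # keep only the free pixels
        seg_map[mask] = i + 1
        occupied |= mask
    return seg_map

a = np.zeros((1, 12), dtype=bool)
a[0, 0:5] = True     # pixels 0-4, best score
b = np.zeros((1, 12), dtype=bool)
b[0, 3:7] = True     # pixels 3-6, 2 of 4 pixels already taken (50%)
c = np.zeros((1, 12), dtype=bool)
c[0, 4:12] = True    # pixels 4-11, 1 of 8 pixels already taken (12.5%)

print(stitch_masks(np.array([a, b, c])))
# [[1 1 1 1 1 3 3 3 3 3 3 3]]
```

Mask b is rejected because half of it is already occupied, while mask c is accepted and only its free pixels are written, so segment indices can have gaps (here 2 is unused).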
And that’s it.
seg_map now contains the predicted segmentation map with a different value for each segment and 0 for the background.
We can turn this into a color map using:
rgb_image = np.zeros((seg_map.shape[0], seg_map.shape[1], 3), dtype=np.uint8)
for id_class in range(1,seg_map.max()+1):
    rgb_image[seg_map == id_class] = [np.random.randint(255), np.random.randint(255), np.random.randint(255)]
And display:
cv2.imshow("annotation",rgb_image)
cv2.imshow("mix",(rgb_image/2+image/2).astype(np.uint8))
cv2.imshow("image",image)
cv2.waitKey()
The full inference code is available at: