Figure 3 - available via license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
Source publication
This study presents two methods for rapidly and effectively determining the photovoltaic (PV) potential of building roofs in urban areas using aerial photographs and point cloud data. In the first method, the Segment Anything Model (SAM) and Contrastive Language Image Pre-Training (CLIP) models are used to detect roof surfaces and obstacles from ae...
Context in source publication
... The CLIP model can perform image-text matching from instructions given in natural language. Its architecture uses two separate networks to process image and language data: a Vision Transformer (ViT) image encoder and a Transformer-based text encoder (Figure 3). CLIP learns the semantic correspondence between images and their associated text by pulling the embeddings of matching image-text pairs closer together in a shared embedding space. ...
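The matching step described above can be sketched with a toy example. This is a minimal illustration, not CLIP itself: the embeddings below are made-up placeholders standing in for the outputs of the ViT image encoder and the Transformer text encoder, and the hypothetical prompts ("a roof surface", "an obstacle") merely echo the study's use case. The core idea, L2-normalizing both sets of embeddings and taking their dot products as cosine similarities, is how CLIP scores image-text pairs.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical pre-computed embeddings; in CLIP these would come from
# the ViT image encoder and the Transformer-based text encoder.
image_embeddings = np.array([[1.0, 0.1, 0.0],    # image of a roof surface
                             [0.0, 1.0, 0.2]])   # image of an obstacle
text_embeddings = np.array([[0.9, 0.0, 0.1],     # prompt: "a roof surface"
                            [0.1, 1.0, 0.0]])    # prompt: "an obstacle"

img = l2_normalize(image_embeddings)
txt = l2_normalize(text_embeddings)

# Cosine-similarity matrix: entry (i, j) scores image i against prompt j.
# Matching pairs should score highest; training pulls them toward the diagonal.
similarity = img @ txt.T

# Each image is matched to the prompt with the highest similarity.
best_match = similarity.argmax(axis=1)
print(best_match)  # → [0 1]
```

During training, CLIP turns this same similarity matrix into a contrastive objective, maximizing the diagonal (matching pairs) relative to the off-diagonal entries.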