International Journal of Recent Advances in Engineering & Technology (IJRAET), ISSN (Online): 2347-2812, Volume-3, Issue-4, 2015
Automatic Conversion of Monoscopic Image/Video to Stereo for 3D Visualization
¹R. C. Gokul Nanda Kumar, ²Vijaykumar T
¹4th sem, M.Tech (Digital Electronics), SJBIT, Bangalore
²Assoc. Prof., Dept. of ECE, SJBIT, Bangalore
Email: ¹[email protected]
Abstract: We have developed a method that automatically converts a monoscopic video into its stereoscopic counterpart for 3D visualization by generating depth maps. The technique can be applied to arbitrary images as well as videos. Although the concept of stereoscopy has existed for years, there is a growing need to deliver real-time 3D broadcasting from conventional 2D broadcasts. Conventional 2D-to-3D conversion methods were semi-automatic: they required a human operator and were time consuming, while fully automatic processes were difficult to implement and required strong assumptions about the 3D scene. Our method learns and extracts depth maps by transferring labels to the query image from image/depth pairs held in a repository. For videos, we use localized motion information to improve the inferred depth maps, and an optical flow method is used to maintain temporal depth consistency. For learning and analysis, a Kinect-style depth-camera system is used; it generates large datasets containing stereoscopic video with known depth pairs.
Keywords: 3D conversion, 2D-to-3D, depth extraction, mono-to-stereoscopic images/videos, depth maps, label transfer.
I. INTRODUCTION:
Stereoscopy, or 3D imaging, is the technique of creating an illusion of depth in an image or a sequence of images by exploiting the depth perception and 3D structure that arise from binocular vision. The most successful stereoscopic techniques present two sets of images of the same object, captured from slightly different angles, to the left and right eyes of the observer. The brain then fuses these images to produce the illusion of three-dimensional depth [3].
Depth perception is the ability to interpret the world in three dimensions and to judge the distance of an object from the camera or the human eye. A distant object is perceived with less depth than one that is near.
Depth perception can be derived from a variety of depth cues, which fall into two major classes: binocular cues, which are perceived in three dimensions using both eyes, and monocular cues, which are perceived in two dimensions with a single eye. For an observer in motion, the relative distance of stationary objects against a background can be inferred from the relative motion of those objects [14]. Motion parallax can provide absolute depth information when the direction and velocity of movement are known. This parallax effect is readily observed while driving a vehicle: nearby objects appear to move quickly, while distant objects look almost stationary. Binocular cues render depth information by viewing the scene with both eyes. Retinal disparity, or binocular parallax, can provide absolute depth information when images of the same object captured from slightly different angles are available.
An important step in stereoscopic conversion is the generation of scene depth. Knowing the scene depth significantly reduces the effort of converting a monoscopic video into its stereoscopic counterpart [4].
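As a concrete illustration of why scene depth is the key ingredient, the following minimal sketch shows one common way to synthesize a stereo pair from a single image plus its depth map by shifting pixels horizontally in proportion to their nearness (depth-image-based rendering). The function name, the linear disparity model, and the `max_disparity` parameter are illustrative assumptions rather than details taken from the paper; a real system would also inpaint the disocclusion holes left by the forward warp.

```python
import numpy as np

def render_stereo_pair(image, depth, max_disparity=24):
    """Synthesize left/right views from one image and its depth map.

    image: (H, W, 3) uint8 array, the monoscopic input frame.
    depth: (H, W) float array scaled to [0, 1]; larger = farther away.
    max_disparity: illustrative pixel shift for the nearest objects.
    """
    h, w = depth.shape
    # Near pixels receive a large disparity, far pixels a small one.
    disparity = (max_disparity * (1.0 - depth)).astype(np.int32)
    left = np.zeros_like(image)
    right = np.zeros_like(image)
    cols = np.arange(w)
    for y in range(h):
        # Shift each pixel horizontally in opposite directions per eye.
        xl = np.clip(cols + disparity[y] // 2, 0, w - 1)
        xr = np.clip(cols - disparity[y] // 2, 0, w - 1)
        left[y, xl] = image[y, cols]
        right[y, xr] = image[y, cols]
    return left, right
```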
Scene depth can be obtained from specially designed cameras that directly generate 3D models or images along with their known depths. These are often called Kinect-style or depth-based cameras, and they combine conventional dual cameras with a laser beam to estimate depth. In this paper, we demonstrate a novel solution that generates depth maps from arbitrary 2D videos; the method can also be applied to single images.
II. LITERATURE SURVEY:
In this section, we briefly survey the methods related to our work, namely 2D-to-3D conversion techniques for images and videos.
The conventional technique proposed by Horry et al. [3] used a semiautomatic approach in which a skilled user interactively assigns planes to significant parts of an image or image sequence. Similar approaches were described by Oh et al. [5] and Zhang et al. [6]. Broadly, conversion techniques fall into two classes: standard semiautomatic conversion for 3DTV and 3D cinema production, and fully automatic conversion for lower-quality 3DTV and similar applications. In the first approach, the semiautomatic process, an operator interactively applies depth maps to different
parts of an image or image sequence. Using these coarse depth assignments as a base, a computer algorithm then evaluates the overall dense depth of the image or sequence. Fully automatic methods, on the other hand, require no user interaction; they rest on assumptions about the stereoscopic scene. Such assumptions hold only under specific conditions, and it is generally a tedious task to make a system whose assumptions cover all possible combinations of background and foreground.
To reduce user involvement in the labor-intensive spatial depth assignment step of semiautomatic conversion, Guttmann et al. [7] proposed extracting dense depth maps from their corresponding coarse depth maps.
Angot et al. [8] described an extraction method based on cross-bilateral filtering. All of these approaches rely on user interaction to assign the initial depth planes. A better approach was developed by Liao et al. [9], which further reduced the user's effort: the optical flow features of the moving objects are first estimated, motion estimation is then applied, and the boundaries of the moving objects are finally computed. Errors generated by the automatic assignment of depth planes are rectified by the user, who is also responsible for assigning depth planes to undefined areas.
In recent years, machine-learning-based methods have been employed in which the system estimates the depth map of a single monoscopic image [10], [11]. These methods can extract depth planes automatically, but so far they are applicable only to limited classes of images. They rely on large repositories containing images together with their depth planes; such databases are also used for many other computer vision tasks, such as scene parsing [13] and object recognition [12].
For video stereoscopic conversion [15], there are ample methods that convert mono videos to their stereo counterparts, but most require user involvement: the user applies planes to the scenes, and this effort is combined with motion estimation features. Optical flow features are used to handle general motion, free of the restrictions imposed by structure-from-motion methods, further reducing the user's effort. As in single-image mono-to-stereo conversion, the query video is subjected to moving object segmentation and inferred depth correction. The structure-from-motion (SfM) stage adds coarse depth planes in the form of stereo points; moving object segmentation (MOS) automatically separates the foreground from the background; and inferred depth correction (IDC) inspects how object sizes change from frame to frame to evaluate the amount of depth change across the sequence of images. After this automatic stage, the result is subjected to user inspection, which reveals the areas with known depth planes, and the user applies coarse planes to the undefined areas. Both efforts are integrated to generate dense depth for the video frames. Apart from semiautomatic conversion, a few automatic processes are available as hardware systems that generate real-time 3D video from 2D input, but these methods work well only under assumptions limited to particular types of query images or image sequences.
III. SYSTEM DESIGN:
Although the production of 3D content is well established, 2D content still dominates. Many approaches have been developed to reduce this gap. The semiautomatic conversion process requires user involvement and is time consuming [14]. Fully automatic conversion, on the other hand, is built on deterministic stereo models and works well only for limited combinations of foreground and background scenes. A 2D-to-3D machine learning process, in contrast, has an advantage in throughput and can be applied to arbitrary real-time mono-to-stereo images or videos. Our technique rests on two premises: first, ample images with known depth planes, similar to the query image to be converted, already exist in open-source repositories; second, two images that are photometrically similar tend to have similar 3D characteristics.
Figure 1: Block view of the proposed method.

In Figure 1, the query image or sequence of images is matched against those available in the repository.
GIST matching [2] and SIFT flow [1] features are computed to produce coarse depth planes. Finally, depth optimization is carried out to generate a dense depth plane suitable for human viewing. For videos, optical flow features and motion estimation are additionally computed to reduce temporal discontinuities [15].
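The candidate-selection step just described can be sketched as a simple nearest-neighbor search over precomputed GIST descriptors. This is a minimal illustration under stated assumptions: `repo_gists` is a hypothetical precomputed matrix of repository descriptors, and both the Euclidean distance and the value of k are illustrative choices; any GIST implementation could supply the feature vectors.

```python
import numpy as np

def find_candidates(query_gist, repo_gists, k=7):
    """Return the indices of the k repository images whose GIST
    descriptors lie closest to the query's descriptor.

    query_gist: (D,) GIST feature vector of the query image.
    repo_gists: (N, D) matrix of GIST vectors for the RGB-D repository.
    k: number of candidate matches to keep (illustrative value).
    """
    dists = np.linalg.norm(repo_gists - query_gist, axis=1)
    return np.argsort(dists)[:k]  # the "candidates" passed to SIFT flow
```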
In our technique, we adopt a non-parametric image parsing approach that estimates depth through the transfer of labels [13]. Instead
of training the system with a complicated parametric model, the system infers the query image's depth by transferring the labels of images, along with their depth maps, that are available in a large repository; this also reduces the time consumed in training. The main goal of non-parametric image parsing is object identification by matching: the input image is resolved into its components based on visual objects that are similar to images present in the large repository. Initially, the best matches are obtained from the database using GIST matching features [2] and SIFT flow features [1]; the labels from these best matches are then transferred to the query image. Here we use correlation-based matching to compare the input image with the images in the database; the end results are the best matches, which are termed candidates.
Once the candidates are selected from the repository, the SIFT flow algorithm [1] is applied; it establishes semantically meaningful correspondence between the query image and the database images using localized SIFT descriptors. This step is iterated for improved performance [13]. To warp the candidates, optical flow is used along with GIST matching features [2]. For a video sequence, we assume an identical warp between successive frames. Finally, to estimate the degree of similarity between two images or video frames, we take a linear combination of the optical flow feature difference and the GIST feature difference [2], as follows.
Let the GIST features be denoted $S_1$ and $S_2$ and the corresponding flow features $O_1$ and $O_2$. The matching score is then evaluated as

$$(1 - \alpha)\,\lVert S_1 - S_2 \rVert + \alpha\,\lVert O_1 - O_2 \rVert \qquad (1)$$

where α = 0.5 in our implementation.
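A direct transcription of equation (1) might look like the following sketch. The use of a Euclidean norm for the feature differences is our assumption about how the differences are measured; the paper fixes only the blending weight α = 0.5.

```python
import numpy as np

ALPHA = 0.5  # blending weight used in our implementation

def matching_score(S1, S2, O1, O2, alpha=ALPHA):
    """Matching score of Eq. (1): a linear combination of the GIST
    feature difference and the optical-flow feature difference.
    Lower scores indicate more similar images/frames."""
    gist_term = np.linalg.norm(np.asarray(S1) - np.asarray(S2))
    flow_term = np.linalg.norm(np.asarray(O1) - np.asarray(O2))
    return (1.0 - alpha) * gist_term + alpha * flow_term
```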
Each candidate warped by SIFT flow [1] contains a rough estimate of the query image's depth map. These warped depth maps certainly contain imprecision and are often spatially uneven. We therefore consider all the warped candidates together and produce a spatially smooth depth map by optimizing with spatial regularization [15].
Let I be the query image and M the depth map. We then minimize the objective

$$A(M) = -\log P(M \mid I) = \sum_{i \in \text{pixels}} \left[ A_t(M_i) + \gamma A_s(M_i) + \delta A_p(M_i) \right] + \log N \qquad (2)$$

where γ and δ are parameters (γ = 10, δ = 0.5) and N is the normalization constant. $A_t$ is the data term, $A_s$ the spatial smoothness term, and $A_p$ the data prior term.
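Schematically, the objective of equation (2) can be evaluated as below. The three per-pixel terms are passed in as functions because their exact forms are not spelled out here (they follow [15]); the stub signature and the dropping of the constant log N term are our assumptions.

```python
import numpy as np

GAMMA, DELTA = 10.0, 0.5  # smoothness and prior weights from the paper

def depth_objective(M, data_term, smooth_term, prior_term,
                    gamma=GAMMA, delta=DELTA):
    """Negative log-posterior of Eq. (2), summed over all pixels.

    M: (H, W) depth map being optimized.
    data_term, smooth_term, prior_term: functions mapping a depth map to
        per-pixel cost arrays A_t, A_s, A_p (exact forms follow [15]).
    The additive log N normalization constant is dropped because it
    does not affect the minimizer.
    """
    per_pixel = data_term(M) + gamma * smooth_term(M) + delta * prior_term(M)
    return float(np.sum(per_pixel))
```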
To improve the quality of depth estimation for a sequence of images, we incorporate temporal information from frame to frame to enforce temporal continuity. Our technique applies to arbitrary video sequences with moving objects and accommodates zooming or rotating views of the scene, where conventional 3D processes fail [15]. As with a single input image, the frames are first processed individually, and global optimization is then applied over the entire sequence at once, taking temporal information and motion estimation features into account.
Appending two terms to the single-image formulation of equation (2), we get

$$E_{\text{video}} = A(M) + \sum_{i \in \text{pixels}} \left[ \theta A_c(M_i) + \lambda A_m(M_i) \right] \qquad (3)$$

where $A_c$ and $A_m$ are the temporal coherence and motion estimation feature cues, respectively. The parameters θ and λ balance the effects of the temporal and motion cues (θ = 100, λ = 5).
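Equation (3) accumulates the same per-frame cost over the whole sequence. As before, this is a sketch under stated assumptions: `coherence` and `motion` are hypothetical callbacks returning the per-pixel A_c and A_m cost arrays for frame t, and only the weights θ = 100 and λ = 5 come from the paper.

```python
import numpy as np

THETA, LAMBDA = 100.0, 5.0  # temporal and motion weights from the paper

def video_objective(depth_maps, single_cost, coherence, motion,
                    theta=THETA, lam=LAMBDA):
    """Eq. (3): the single-image cost A(M) of Eq. (2) plus temporal
    terms, accumulated over every frame of the sequence.

    depth_maps: list of (H, W) arrays, one inferred depth map per frame.
    single_cost: function evaluating Eq. (2) for one depth map.
    coherence(t), motion(t): hypothetical callbacks returning per-pixel
        A_c and A_m cost arrays tying frame t to its temporal neighbors
        and to the motion-estimation cues, respectively.
    """
    total = 0.0
    for t, M in enumerate(depth_maps):
        temporal = theta * coherence(t) + lam * motion(t)
        total += single_cost(M) + float(np.sum(temporal))
    return total
```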
IV. RESULTS AND INFERENCES:
Figure 2: Input sequence of images.
Figure 3: Inferred depth maps of the input sequence of images.
The results show that our approach is applicable to arbitrary monoscopic images with any combination of foreground and background scenes. It works well for indoor and outdoor image sequences, with reduced conversion time. Videos involving static scenes and a dynamic camera are better handled by the method proposed by G. Zhang et al. [16]. In contrast, our algorithm can manage large amounts of depth data with a low percentage of errors in the output image or video sequence. The one disadvantage of the technique is that the more data there is, the more comparisons must be made between candidates. The GIST feature [2] provides pixel-to-pixel consistency between
the query image and the images available in the repository.
In some scenarios, the motion segmentation incorrectly identifies motion pixels, which can degrade the quality of the 3D analysis. Because the entire method is serial in nature, errors can propagate through the stages; hence care must be taken at every stage to rectify depth issues.
V. CONCLUSION:
The main goal of the proposed method is to convert a monoscopic single image or sequence of images into its stereoscopic counterpart. Our method can be applied to arbitrary single images or image sequences. Many methods have been proposed in the literature for converting mono to stereo videos, but they required user involvement for coarse depth plane allocation, followed by an algorithm to generate the dense depth mapping for the query image or video. Fully automatic processes, by contrast, are built on 3D deterministic models, which are constrained by their limited training data sets.
Our technique is independent of video training data, so the computation time is reduced. The process can be applied to arbitrary videos containing any combination of foreground and background scenes.
The process considers temporal information in order to avoid discontinuities between frames. We also maintain pixel-to-pixel features in order to retain the salient features and avoid disparity artifacts in the result [15].
In the future, YouTube and other software platforms may implement stereo viewing and uploading provisions.
REFERENCES:
[1] C. Liu, J. Yuen, and A. Torralba, "SIFT Flow: Dense correspondence across scenes and its applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 978–994, May 2011.
[2] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," IJCV, vol. 42, pp. 145–175, 2001.
[3] Y. Horry, K. Anjyo, and K. Arai, "Tour into the picture: Using a spidery mesh interface to make animation from a single image," SIGGRAPH, 1997.
[4] Jean Maria Dominic, "Recent trends in 2D to 3D conversion: A survey," International Journal for Research in Applied Science and Engineering Technology, April 2014.
[5] B. Oh, M. Chen, J. Dorsey, and F. Durand, "Image-based modeling and photo editing," SIGGRAPH, 2001.
[6] L. Zhang, G. Dugas-Phocion, J.-S. Samson, and S. Seitz, "Single view modeling of free-form scenes," Journal of Visualization and Computer Animation, vol. 13, no. 4, pp. 225–235, 2002.
[7] M. Guttmann, L. Wolf, and D. Cohen-Or, "Semiautomatic stereo extraction from video footage," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2009, pp. 136–142.
[8] L. Angot, W.-J. Huang, and K.-C. Liu, "A 2D to 3D video and image conversion technique based on a bilateral filter," Proc. SPIE, vol. 7526, p. 75260D, Feb. 2010.
[9] M. Liao, J. Gao, R. Yang, and M. Gong, "Video stereolization: Combining motion analysis with user interaction," IEEE Trans. Vis. Comput. Graphics, vol. 18, no. 7, pp. 1079–1088, Jul. 2012.
[10] B. Liu, S. Gould, and D. Koller, "Single image depth estimation from predicted semantic labels," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 1253–1260.
[11] A. Saxena, M. Sun, and A. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 824–840, May 2009.
[12] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, Nov. 2008.
[13] C. Liu, J. Yuen, and A. Torralba, "Nonparametric scene parsing: Label transfer via dense scene alignment," CVPR, 2009.
[14] Jean Maria Dominic and Arsha J K, "Automatic 2D-to-3D image and video conversion by learning examples and dual edge-confined inpainting," International Journal of Advanced Research in Computer Science and Software Engineering, August 2014.
[15] K. Karsch, C. Liu, and S. B. Kang, "Depth Transfer: Depth extraction from video using non-parametric sampling," IEEE Trans. Pattern Anal. Mach. Intell., 2013.
[16] G. Zhang, J. Jia, W. Hua, and H. Bao, "Robust bilayer segmentation and motion/depth estimation with a handheld camera," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 3, pp. 603–617, 2011.