Transcription

ACTIVE STEREO VISION: DEPTH PERCEPTION FOR NAVIGATION, ENVIRONMENTAL MAP FORMATION AND OBJECT RECOGNITION

A THESIS SUBMITTED TO
THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES
OF
THE MIDDLE EAST TECHNICAL UNIVERSITY

BY

İLKAY ULUSOY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
IN
THE DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING

SEPTEMBER 2003

Approval of the Graduate School of Natural and Applied Sciences

Prof. Dr. Canan Özgen
Director

I certify that this thesis satisfies all the requirements as a thesis for the degree of Doctor of Philosophy.

Prof. Dr. Mübeccel Demirekler
Head of Department

This is to certify that we have read this thesis and that in our opinion it is fully adequate, in scope and quality, as a thesis for the degree of Doctor of Philosophy.

Prof. Dr. Uğur Halıcı
Supervisor

Examining Committee Members

Prof. Dr. Kemal Leblebicioğlu
Prof. Dr. Uğur Halıcı
Prof. Dr. Hasan Güran
Assoc. Prof. Dr. Volkan Atalay
Prof. Dr. Erhan Nalçacı

ABSTRACT

ACTIVE STEREO VISION: DEPTH PERCEPTION FOR NAVIGATION, ENVIRONMENTAL MAP FORMATION AND OBJECT RECOGNITION

Ulusoy, İlkay
Ph.D., Department of Electrical and Electronics Engineering
Supervisor: Prof. Dr. Uğur Halıcı

September 2003, 148 pages

Stereo-vision-based navigation and mapping is used in very few mobile robotic applications, because dealing with stereo images is hard and time consuming. Despite these problems, stereo vision remains one of the most important ways for a mobile robot to know its world, since imaging provides much more information than most other sensors. Real robotic applications are complicated: besides the problem of deciding how the robot should behave to complete the task at hand, controlling the robot's internal parameters imposes a high computational load. It is therefore preferable to find the strategy to be followed in a simulated world first and then apply it to a real robot. In this study, we describe an algorithm for object recognition and cognitive map formation using stereo image data in a 3D virtual world in which 3D objects and a robot with an active stereo imaging system are simulated. The stereo imaging system is simulated so that the relevant properties of the human visual system are parameterized. Only the stereo images obtained from this world are supplied to the virtual robot. By applying our disparity algorithm, a depth map for the current stereo view is extracted. Using the depth information for the current view, a cognitive map of the environment is updated gradually while the virtual agent explores the environment. The agent explores in an intelligent way, using the current view together with the environmental map built up so far. In addition, if a new object is observed during exploration, the robot turns around it, obtains stereo images from different directions, and extracts a 3D model of the object. Using the available set of possible objects, it then recognizes the object.

Keywords: stereo vision, active vision, disparity, depth perception, environmental map, object recognition
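Read as a pipeline, the abstract describes a single perception-action loop: capture a stereo view, estimate disparity and depth, fold the depths into the cognitive map, detour to model and recognize any newly observed object, then choose the next viewpoint. The C++ outline below is only a hypothetical sketch of that loop under those assumptions, not the thesis implementation (which is realized as a camera-controller plugin; see Chapter 4 and Appendices C and D); every type and member name in it is an illustrative stand-in.

    // Hypothetical sketch of the exploration loop described in the abstract.
    // All types and names (VirtualRobot, StereoPair, ...) are illustrative
    // stand-ins, not the thesis code.
    #include <iostream>

    struct StereoPair { /* left and right eye images from the simulator */ };
    struct DepthMap   { /* depths at matched feature points */ };

    class VirtualRobot {
    public:
        // Render the two eye views for the current pose.
        StereoPair captureStereo() { return {}; }

        // Multi-scale phase-based matching -> disparity -> depth (Chapter 3).
        DepthMap depthFromDisparity(const StereoPair&) { return {}; }

        // Fold the depths of the current view into the 3D occupancy grid.
        void updateCognitiveMap(const DepthMap&) {}

        // Does the current view contain an object not yet in the map?
        bool newObjectInView() const { return false; }

        // Circle the object, image it from several directions, build its
        // 3D model, and match it against the known object set (Chapter 4).
        void circleObjectAndRecognize() {}

        // Pick the next pose using the current view and the map so far.
        void moveToNextViewpoint() { ++steps_; }

        bool explorationDone() const { return steps_ >= 100; }

    private:
        int steps_ = 0;  // stand-in termination criterion
    };

    int main() {
        VirtualRobot robot;
        while (!robot.explorationDone()) {
            StereoPair views = robot.captureStereo();
            DepthMap depth = robot.depthFromDisparity(views);
            robot.updateCognitiveMap(depth);
            if (robot.newObjectInView())
                robot.circleObjectAndRecognize();
            robot.moveToNextViewpoint();
        }
        std::cout << "exploration finished\n";
    }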

ÖZ

AKTİF STEREO GÖRME: İLERLEME, ÇEVRESEL HARİTA ÇIKARMA VE NESNE TANIMA AMAÇLARI İÇİN DERİNLİK ALGILANMASI

Ulusoy, İlkay
Ph.D., Department of Electrical and Electronics Engineering
Thesis Supervisor: Prof. Dr. Uğur Halıcı

September 2003, 148 pages

Because stereo vision analysis is very difficult and time consuming, it is not a method that is often preferred in robotics work. Nevertheless, stereo imaging has come to be preferred as the most fundamental resource through which a mobile robot can know its environment in detail; the main reason is that, although the analysis of imaging is very hard, it provides much more information than other sensors. Real robot applications are very complicated, so if the goal is to determine how a robot should behave, the preferred approach is to work first on simulations and then apply the strategy found on the real robot. In this study, a three-dimensional virtual environment has been created, containing three-dimensional objects and a virtual robot with an active stereo vision system. Using the stereo images taken from this virtual environment, the virtual robot is intended to recognize objects and extract an environmental map. The stereo imaging system is simulated according to the properties of the real human visual system. The virtual robot uses only the stereo images. Using our disparity algorithm, depth information for the current field of view is extracted from the stereo images. While the robot scans its surroundings in an intelligent way, the cognitive map is continuously filled in using this depth information. The robot navigates in the environment with the help of the current visual information and the cognitive map formed so far. If, while moving around, the robot encounters a new object, it turns around the object, takes stereo images from different directions, and extracts its three-dimensional model. From among the set of possible objects defined beforehand, it identifies the object it has seen using this three-dimensional structure information.

Keywords: stereo vision, active vision, disparity, depth perception, environmental map extraction, object recognition

To My Father, Mother, Brother and Lovely Daughter

ACKNOWLEDGEMENTS

I would like to express my gratitude to my supervisor Prof. Dr. Uğur Halıcı for her guidance and support throughout the research. I would also like to thank Prof. Dr. Kemal Leblebicioğlu, Asst. Prof. Dr. Volkan Atalay and Prof. Dr. Edwin Hancock for their contributions.

I would like to acknowledge TÜBİTAK-BAYG for the scholarship that covered my studies at the University of York, UK, and to thank both my supervisor and TÜBİTAK for providing me with such an opportunity.

TABLE OF CONTENTS

ABSTRACT ........................................................... iii
ÖZ ................................................................. v
DEDICATION ......................................................... vi
ACKNOWLEDGEMENTS ................................................... viii
LIST OF FIGURES .................................................... xii
LIST OF TABLES ..................................................... xv

CHAPTER

1. INTRODUCTION .................................................... 1
   1.1 Problem Definition and Motivation ........................... 1
   1.2 Contribution ................................................ 6
   1.3 Organization of the Thesis .................................. 9
2. LITERATURE REVIEW ............................................... 11
   2.1 Vision for Mobile Robots .................................... 11
       2.1.1 Map-based Navigation .................................. 11
       2.1.2 Map-less Navigation ................................... 12
       2.1.3 Map Building .......................................... 14
   2.2 Stereo Vision for Mobile Robots ............................. 15
   2.3 Stereo Algorithms ........................................... 18
       2.3.1 Dense Stereo Algorithms ............................... 18
       2.3.2 Sparse Stereo Algorithms .............................. 21
       2.3.3 Biological Stereo Algorithms .......................... 24
       2.3.4 Probabilistic Stereo Algorithms ....................... 28
   2.4 Reconstruction from Multiple Images ......................... 29
   2.5 Biological Navigation and Robotic Applications .............. 30
       2.5.1 Local Navigation ...................................... 31
       2.5.2 Way Finding ........................................... 32
       2.5.3 Cognitive Maps ........................................ 34
   2.6 Robotic Mapping ............................................. 35
       2.6.1 Taxonomy of Robotic Mapping ........................... 35
       2.6.2 Problems in Robotic Mapping ........................... 39
       2.6.3 Simultaneous Localization and Mapping ................. 41
   2.7 Virtual Environment Applications ............................ 43
3. BIOLOGICAL STEREO VISION WITH MULTI-SCALE PHASE-BASED FEATURES .. 45
   3.1 Introduction and Motivation ................................. 45
   3.2 Feature Extraction by Population Coding Method .............. 46
   3.3 Feature Extraction Using Steerable Filters .................. 50
   3.4 Finding Corresponding Pairs Using Multi-scale Phase ......... 56
   3.5 Finding Disparity and Depth ................................. 57
   3.6 Complexity of the Algorithm ................................. 69
   3.7 Probabilistic Model of the Disparity Algorithm .............. 71
       3.7.1 Probability Density Estimation of Phase Differences by von Mises Model ... 71
       3.7.2 Probability of Being a Pair ........................... 79
       3.7.3 Validation of the Probabilistic Model ................. 81
   3.8 Summary and Conclusion ...................................... 85
4. APPLICATION OF OUR ACTIVE STEREO VISION ALGORITHM ON A VIRTUAL ROBOT FOR COGNITIVE MAP FORMATION AND OBJECT RECOGNITION IN A VIRTUAL ENVIRONMENT ... 88
   4.1 Introduction ................................................ 88
   4.2 Design and Implementation Details of the Simulation Software  91
   4.3 Camera Controller ........................................... 96
   4.4 Active Vision and Cognitive Map Construction ................ 97
   4.5 Object Recognition .......................................... 102
   4.6 Results and Conclusion ...................................... 110
5. CONCLUSION ...................................................... 116
REFERENCES ......................................................... 120

APPENDICES

A. DEPTH PERCEPTION IN HUMAN VISUAL SYSTEM ......................... 128
   A.1 Pictorial Depth Cues ........................................ 128
       A.1.1 Occlusion (Interposition) ............................. 128
       A.1.2 Linear Perspective .................................... 129
       A.1.3 Relative Familiar Size ................................ 131
       A.1.4 Focus, Depth of Field and Accommodation ............... 132
   A.2 Binocular Disparity (Stereopsis) ............................ 134
B. GEOMETRY FILE FORMAT ............................................ 136
C. A SAMPLE CAMERACONTROLLER PLUGIN ................................ 146
D. CAMERACONTROLLER.H .............................................. 146
E. C SCRIPT FOR RGB TO HSV CONVERSION .............................. 148

LIST OF FIGURES

FIGURE

1.  Modules of the whole system ..................................... 5
2.  Flow chart of the stereo vision system .......................... 10
3.  Different models for disparity encoding cells: a. Position shift model, b. Phase shift model, c. Hybrid model. (Picture taken from [19]) ... 26
4.  a, b. Stereo image pair (Blocks 1) taken from the CMU database; c, d. Corresponding feature points with orientation given in color scale. Here the population coding method has been used ... 50
5.  The template filter used in analysis. a. The real part, which is the 4th derivative of a Gaussian; b. The imaginary part, which is a steerable approximation to the Hilbert transform of the real part ... 52
6.  Feature points obtained by steerable filtering for the Blocks 1 stereo image pair given in Figure 4. The orientation is given in color scale ... 53
7.  Feature points extracted using filters at different numbers of scales. a. Single scale, filter width 6 pixels; b. Single scale, filter width 18 pixels; c. Three scales, filter widths 6, 12 and 18 pixels; d. Five scales, filter widths 6, 10, 14, 18 and 22 pixels ... 55
8.  Stereo camera projection system .................................. 58
9.  Disparity estimated for the Blocks 1 image pair .................. 59
10. Fine tuning. Feature points are numbered from the left-most to the right-most and given on the x-axis of the plot; the y-axis shows the disparity, with rough disparity marked by (*) and fine-tuned disparity drawn as a line ... 60
11. a. Left image, b. Right image, c. Disparity. Stereo images (Blocks 2) are taken from the CMU stereo database ... 62
12. a. Left image, b. Right image, c. Disparity. Stereo images (Venus stereo pair) are taken from the Middlebury Stereo webpage ... 63
13. a. Left image, b. Right image, c. Disparity. Stereo images (Sawtooth stereo pair) are taken from the Middlebury Stereo webpage ... 64
14. a. Left image, b. Right image, c. Disparity. Stereo images (Tsukuba stereo pair) are taken from the Middlebury Stereo webpage ... 65
15. Fine tuning. Feature points are numbered from the left-most to the right-most and given on the x-axis of the plot; the y-axis shows the disparity, with rough disparity marked by (*) and fine-tuned disparity drawn as a line ... 67
16. Disparity calculated for the Sawtooth stereo pair using: a. A single scale with filter width 6 pixels; b. Three scales with filter widths 6, 12 and 18 pixels ... 68
17. Mixture of von Mises models for the correct-pair phase differences. a, c, e. Components of the mixture model for scales 1, 2 and 3 respectively; b, d, f. Mixture model and histogram of phase differences for scales 1, 2 and 3 respectively ... 78
18. Results for the Venus stereo pair. a. Disparity found by the method in Section 3.5; b. Disparity found by the probabilistic model described in Section 3.6 ... 80
19. Results for the Block stereo pair. a. Disparity found by the method in Section 3.5; b. Disparity found by the probabilistic model described in Section 3.6 ... 82
20. Results for the Sawtooth stereo pair. a. Disparity found by the method in Section 3.5; b. Disparity found by the probabilistic model described in Section 3.6 ... 83
21. Results for the Tsukuba stereo pair. a. Disparity found by the method in Section 3.5; b. Disparity found by the probabilistic model described in Section 3.6 ... 84
22. A screen shot from the virtual environment. A farm cottage and different types of trees are seen. The upper left and right panes show the left and right eye views respectively; the lower left and right panes show the top and front views respectively. Below the bottom panes there are tab-based dialog boxes ... 92
23. Interface for displaying feature point information. Locations of feature points for the cottage are shown on the left camera view as black dots, and the disparities found are shown on the right camera view in colors ... 93
24. Top view with camera and target controls ......................... 96
25. States of the camera controller DLL .............................. 100
26. Flowchart of the map construction system ......................... 101
27. 3D cognitive map. The x- and z-axes are the width and depth of the environment respectively, and the y-axis is height above the ground. Only grids with a high belief of being occupied are shown; the others have zero values, i.e. not occupied ... 102
28. a. Location and color information stored in each grid for an apple tree in the cognitive map; b. Cognitive map grids for the top of the apple tree; c. Cognitive map grids for the body of the apple tree ... 104
29. a. Location and color information stored in each grid for a pine tree in the cognitive map; b. Cognitive map grids for the top of the pine tree; c. Cognitive map grids for the body of the pine tree ... 105
30. Labeled occupancy ................................................ 109
31. a. Virtual environment with three different objects very far away from each other; b. Computed 3D map; c. Occupancy (top view of 3D map) with labeled objects ... 110
32. a. Virtual environment with three objects of the same type; b. Computed 3D map; c. Occupancy (top view of 3D map) with labeled objects ... 112
33. An illustration showing the effect of occlusion .................. 129
34. An illustration showing linear perspective ....................... 130
35. Relative size as a cue. Which one of the balloons looks closer? .. 131
36. Accommodation. Blurred image ..................................... 132
37. Accommodation. Focused image ..................................... 133