Journal of Physical Agents

3D object detection with deep learning

Félix Escalona, Angel Rodriguez, Francisco Gomez-Donoso, Jesus Martinez-Gomez, Miguel Cazorla



Finding an appropriate environment representation is a crucial problem in robotics. 3D data has recently come into use thanks to the advent of low-cost RGB-D cameras. We propose a new way to represent a 3D map based on the information provided by an expert. Namely, the expert is the output of a Convolutional Neural Network trained with deep learning techniques. Relying on this information, we propose generating 3D maps from individual semantic labels, which are associated with objects in the environment. For each label we thus obtain a partial 3D map built from the 3D perceptions, namely point clouds, whose associated probability for that label is above a given threshold. The final map is obtained by registering and merging all these partial maps. The use of semantic labels provides us with a way to build the map while recognizing objects.
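The per-label thresholding and merging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_label_maps` and `merge_maps`, the data layout, and the threshold value 0.5 are all assumptions made for illustration, and the point clouds are assumed to be already registered into a common reference frame (the registration step itself is not shown).

```python
from collections import defaultdict


def build_label_maps(frames, threshold=0.5):
    """Group registered point clouds into one partial map per semantic label.

    frames: iterable of (points, label_probs) pairs, where points is a list of
    (x, y, z) tuples already expressed in a common frame (assumption) and
    label_probs maps each label to the probability given by the classifier.
    """
    maps = defaultdict(list)
    for points, label_probs in frames:
        for label, prob in label_probs.items():
            # Keep the cloud for this label only if its probability
            # is above the threshold.
            if prob > threshold:
                maps[label].extend(points)
    return dict(maps)


def merge_maps(label_maps):
    """Merge the partial maps into a single list of labelled points."""
    merged = []
    for label, points in sorted(label_maps.items()):
        merged.extend((x, y, z, label) for (x, y, z) in points)
    return merged


# Hypothetical input: two perceptions with made-up label probabilities.
frames = [
    ([(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)], {"chair": 0.8, "desk": 0.2}),
    ([(1.0, 0.5, 2.0)], {"chair": 0.3, "desk": 0.7}),
]
label_maps = build_label_maps(frames, threshold=0.5)
full_map = merge_maps(label_maps)
```

In this sketch a perception can contribute to several partial maps if more than one label exceeds the threshold; whether the original method allows this is not stated in the abstract.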


Semantic mapping; 3D point cloud; Deep learning




Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.