Leap

This post by Savvas Chatzichristofis has been reprinted from Image processing and Retrieval Trends.

Leap represents an entirely new way to interact with your computers. It’s more accurate than a mouse, as reliable as a keyboard and more sensitive than a touchscreen.  For the first time, you can control a computer in three dimensions with your natural hand and finger movements.

This isn’t a game system that roughly maps your hand movements.  The Leap technology is 200 times more accurate than anything else on the market, at any price point. Just about the size of a flash drive, the Leap can distinguish your individual fingers and track your movements down to 1/100th of a millimeter.

This is like day one of the mouse.  Except, no one needs an instruction manual for their hands.

https://live.leapmotion.com/about/


Why your vision lab needs a reading group

This post by Tomasz Malisiewicz has been reprinted from tombone's blog.

I have a certain attitude when it comes to computer vision research -- don't do it in isolation. Reading vision papers on your own is not enough.  Learning how your peers analyze computer vision ideas will only strengthen your own understanding of the field and help you become a more critical thinker.  And that is why at places like CMU and MIT we have computer vision reading groups.  The computer vision reading group at CMU (also known as MISC-read to the CMU vision hackers) has a long tradition, and Martial Hebert has made sure it is a strong part of the CMU vision culture.  Other ex-CMU hackers such as Sanjiv Kumar have carried the vision reading group tradition to places such as Google Research in NY (correct me if this is no longer the case).  I have brought the reading group tradition to MIT (where I'm currently a postdoc) because I was surprised there wasn't one already!  In reality, we spend so much time talking about papers in an informal setting that I felt it was a shame not to do so in a more organized fashion.

My personal philosophy is that as a vision researcher, the way towards the goal of creating novel long-lasting ideas is learning how others think about the field.  There's a lot of value in being able to analyze, criticize, and re-synthesize other researchers' ideas.  Believe me when I say that a lot of new vision papers come out of top tier vision conferences every year.  You should be reading them!  But not just reading, also criticizing them among your peers.  Because once you learn to criticize others' ideas, you will become better at promulgating your own.  Do not equate criticism with nasty words for the sake of being nasty -- good criticism stems from a keen understanding of what must be done in science to convince a broad audience of your ideas.

In case you want to start your own computer vision reading group, I've collected some tips, tricks, and advice:

1. You don't need faculty.  If you can't find a seasoned vision veteran to help you organize the event, do not worry.  You just need 3+ people interested in vision and the motivation to maintain weekly meetings.  Who cares if you don't understand every detail of every paper!  Nobody besides the authors will ever understand every detail.

2. Be fearless.  Ask dumb questions.  Alyosha Efros taught me that if you're reading a paper or listening to a presentation and you don't understand something, there's a good chance you're not the only one in the audience with the same questions.  Sometimes younger PhD students are afraid of "asking a dumb question" in front of an audience.  But if you love knowledge, then it is your duty to ask.  Silence will not get you far.  Be bold, be curious, and grow wise.

3. Choose your own papers to present.  Do not present papers that others want you to present -- that is better left for a seminar course led by a faculty member.  In a reading group it is very important that you care about the problems you will be discussing with your peers.  If you keep up with this trend then when it comes to "paper writing time" you should be up to date on many relevant papers in your field and you will know about your other lab mates' research interests.

4. It is better to show a paper PDF up on a projector than cancel a meeting.  Even if everybody is busy, and the presenter didn't have time to create slides, it is important to keep the momentum going.

5. After a major conference, have all of the people who attended the conference present their "top K papers."  The week after CVPR it will be valuable to have such a massive vision brain dump onto your peers because it is unlikely that everybody got to attend.

6. Book a room every week and try to have the meeting at the same time and place.  Have either the presenter or the reading group organizer send out an announcement with the paper they will be presenting ahead of time.  At MIT we share a Google Doc with the information about interesting papers and the upcoming presenter usually chooses the paper one week in advance so that the following week's presenter doesn't choose the same paper.  If somebody has already presented your paper, don't do it a second time!  Choose another paper.  cvpapers.com is a great resource to find upcoming papers.

At CMU, there is a long rotating schedule which includes every vision student and faculty member.  Once it is your time to present, you can only get off the hook if you swap your slot with somebody else.  Being on a schedule months in advance means you'll have lots of time to prepare your slides.  At MIT, we are currently following the object recognition / scene understanding / object detection theme where we (Prof. Torralba, his students, his postdocs, his visiting students, etc.) choose a paper highly relevant to our interests.  By keeping such a focus, we can really jump into the relevant details without having to explain fundamental concepts such as SVMs, features, etc.  However, at CMU the reading group is much broader because on the queue are students/profs interested in all aspects of vision and related fields such as graphics, illumination, geometry, learning, etc.

[New Paper] Dynamic two-stage image retrieval from large multimedia databases

This post by Savvas Chatzichristofis has been reprinted from Image processing and Retrieval Trends.

Avi Arampatzis | Konstantinos Zagoris | Savvas A. Chatzichristofis

Information Processing & Management

Content-based image retrieval (CBIR) with global features is notoriously noisy, especially for image queries with low percentages of relevant images in a collection. Moreover, CBIR typically ranks the whole collection, which is inefficient for large databases. We experiment with a method for image retrieval from multimedia databases, which improves both the effectiveness and efficiency of traditional CBIR by exploring secondary media. We perform retrieval in a two-stage fashion: first rank by a secondary medium, and then perform CBIR only on the top-K items. Thus, effectiveness is improved by performing CBIR on a ‘better’ subset. Using a relatively ‘cheap’ first stage, efficiency is also improved via the fewer CBIR operations performed.

Our main novelty is that K is dynamic, i.e. estimated per query to optimize a predefined effectiveness measure. We show that our dynamic two-stage method can be significantly more effective and robust than similar setups with static thresholds previously proposed. In additional experiments using local feature derivatives in the visual stage instead of global, such as the emerging visual codebook approach, we find that two-stage does not work very well. We attribute the weaker performance of the visual codebook to the enhanced visual diversity produced by the textual stage, which diminishes the codebook’s advantage over global features. Furthermore, we compare dynamic two-stage retrieval to traditional score-based fusion of results retrieved visually and textually. We find that fusion is also significantly more effective than single-medium baselines. Although there is no clear winner between two-stage and fusion, the methods exhibit different robustness features; nevertheless, two-stage retrieval provides efficiency benefits over fusion.
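The two-stage idea described in the abstract can be sketched in a few lines of Python. This is only an illustration and not the authors' implementation: the function name, the score inputs and the toy data are assumptions, and K is a fixed argument here, whereas the paper estimates it per query. Keeping the items below the cut-off in their first-stage order is also a choice made for this sketch.

```python
from typing import Callable, Dict, List


def two_stage_retrieval(
    text_scores: Dict[str, float],
    visual_score: Callable[[str], float],
    k: int,
) -> List[str]:
    """Rank by a cheap secondary medium, then re-rank only the top-k visually.

    text_scores  -- hypothetical first-stage (e.g. textual) score per item id
    visual_score -- hypothetical, more expensive CBIR similarity to the query
    k            -- cut-off; fixed here, whereas the paper estimates it per query
    """
    # Stage 1: rank the whole collection by the secondary medium.
    first_stage = sorted(text_scores, key=text_scores.get, reverse=True)
    head, tail = first_stage[:k], first_stage[k:]
    # Stage 2: run CBIR only on the top-k items and re-rank them.
    reranked = sorted(head, key=visual_score, reverse=True)
    # Items below the cut-off keep their first-stage order (sketch assumption).
    return reranked + tail


# Toy data, invented for the example.
toy_text = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
toy_visual = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 1.0}
cbir_calls: List[str] = []


def toy_visual_score(item: str) -> float:
    cbir_calls.append(item)  # record how many CBIR operations were needed
    return toy_visual[item]


ranking = two_stage_retrieval(toy_text, toy_visual_score, k=3)
print(ranking, "CBIR operations:", len(cbir_calls))
```

In the toy run, item "d" has the highest visual score but is never scored visually because the first stage ranks it below the cut-off, which is where both the efficiency gain and the dependence on a well-chosen K come from.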

http://www.sciencedirect.com/science/article/pii/S0306457312000489

Swarmanoid

This post by Savvas Chatzichristofis has been reprinted from Image processing and Retrieval Trends.

Swarmanoid, The Movie receives the AAAI-2011 Best Video Award at the San Francisco annual event!

Reach and grasp by people with tetraplegia using a neurally controlled robotic arm

This post by Savvas Chatzichristofis has been reprinted from Image processing and Retrieval Trends.

Nature 485, 372–375 (17 May 2012)

Paralysis following spinal cord injury, brainstem stroke, amyotrophic lateral sclerosis and other disorders can disconnect the brain from the body, eliminating the ability to perform volitional movements. A neural interface system could restore mobility and independence for people with paralysis by translating neuronal activity directly into control signals for assistive devices. We have previously shown that people with long-standing tetraplegia can use a neural interface system to move and click a computer cursor and to control physical devices [6, 7, 8]. Able-bodied monkeys have used a neural interface system to control a robotic arm [9], but it is unknown whether people with profound upper extremity paralysis or limb loss could use cortical neuronal ensemble signals to direct useful arm actions. Here we demonstrate the ability of two people with long-standing tetraplegia to use neural interface system-based control of a robotic arm to perform three-dimensional reach and grasp movements. Participants controlled the arm and hand over a broad space without explicit training, using signals decoded from a small, local population of motor cortex (MI) neurons recorded from a 96-channel microelectrode array. One of the study participants, implanted with the sensor 5 years earlier, also used a robotic arm to drink coffee from a bottle. Although robotic reach and grasp actions were not as fast or accurate as those of an able-bodied person, our results demonstrate the feasibility for people with tetraplegia, years after injury to the central nervous system, to recreate useful multidimensional control of complex devices directly from a small sample of neural signals.

http://www.nature.com/nature/journal/v485/n7398/full/nature11076.html

Shuffling label colors

This post by Steve Eddins has been reprinted from Steve on Image Processing.

I've written often here about various computational and visualization techniques involving labeling connected components in binary images. Sometimes I use the function label2rgb to convert a label matrix into a color image with a different color assigned to each label.

Here's an example.

bw = imread('http://blogs.mathworks.com/images/steve/2012/rice-bw.png');
imshow(bw)

Now compute the connected components and the corresponding label matrix.

cc = bwconncomp(bw);
L = labelmatrix(cc);

In the label matrix, each foreground object in the original binary image is assigned a unique positive integer. Here, for instance, is how to display the tenth object.

imshow(L == 10)

Use the function label2rgb to "colorize" the label matrix.

rgb = label2rgb(L);
imshow(rgb)

That's a nice effect, but because of the way the colors are assigned, objects near each other tend to have very similar colors. It might be better sometimes to assign the colors differently. That's what the 'shuffle' argument to label2rgb is for.

Here's the full syntax including 'shuffle':

rgb = label2rgb(L,map,zerocolor,'shuffle');

zerocolor is a three-element vector specifying what color is used for the background pixels.

Let's try it with the jet colormap and a light gray background color.

rgb = label2rgb(L,'jet',[.7 .7 .7],'shuffle');
imshow(rgb)

Now it's easier to see where two objects might be actually touching and so receive the same label. Let's zoom in closer to see:

xlim([128 185]);
ylim([5 62]);

I hope you find this useful!


Published with MATLAB® 7.14

Document Recognition and Retrieval XX (2013)

This post by Savvas Chatzichristofis has been reprinted from Image processing and Retrieval Trends.

Document Recognition and Retrieval XX (2013), http://www.cs.rit.edu/~drr2013

San Francisco, Feb. 5-7, 2013

Paper Submission Deadline: July 23, 2012 (11:59 PST)

Document Recognition and Retrieval (DRR) is one of the leading international conferences devoted to current research in document analysis, recognition and retrieval. The 20th Document Recognition and Retrieval Conference is being held as part of SPIE Electronic Imaging, from Feb. 5-7, 2013 in San Francisco, California, USA.

One keynote speaker has been confirmed, Ray Smith of Google Research.

Ray will be presenting on the development of the widely used open source Tesseract OCR engine, relating this to changes in document recognition systems since the first DRR was held in 1994.

The Conference Chairs and Program Committee invite all researchers working on document recognition and retrieval to submit original research papers. Papers are presented in oral and poster sessions at the conference, along with invited talks by leading researchers. Accepted papers will be published by the SPIE in the conference proceedings. At the conference a Best Student Paper Award will be presented.

Papers are solicited in, but not limited to, the areas below.

Document Recognition

  • Text recognition: machine-printed, handwritten documents; paper, tablet, camera, and video sources
  • Writer/style identification, verification, and adaptation
  • Graphics recognition: vectorization (e.g. for line-art, maps and technical drawings), signature, logo and graphical symbol recognition, figure, chart and graph recognition, and diagrammatic notations (e.g. music, mathematical notation)
  • Document layout analysis and understanding: document and page region segmentation, form and table recognition, and document understanding through combined modalities (e.g. speech and images)
  • Evaluation: performance metrics, and document degradation models
  • Additional topics: document image filtering, enhancement and compression, document clustering and classification, machine learning (e.g. integration and optimization of recognition modules), historical and degraded document images (e.g. fax), multilingual document recognition, and web page analysis (including wikis and blogs)

Document Retrieval

  • Indexing and Summarization: text documents (messages, blogs, etc.), imaged documents, entity tagging from OCR output, and text categorization
  • Query Languages and Modalities: Content-Based Image Retrieval (CBIR) for documents, keyword spotting, non-textual query-by-example (e.g. tables, figures, math), querying by document geometry and/or logical structure, approximate string matching algorithms for OCR output, retrieval of noisy text documents (messages, blogs, etc.), cross and multi-lingual retrieval
  • Evaluation: relevance and performance metrics, evaluation protocols, and benchmarking
  • Additional topics: relevance feedback, impact of recognition accuracy on retrieval performance, and digital libraries including systems engineering and quality assurance

Important Dates

  • 23 July, 2012: Paper submission deadline
  • Late August, 2012: Author notifications
  • 26 November, 2012: Final paper submission deadline
  • 5-7 February, 2013: Conference dates

Paper Submission

All paper submissions should be between 8 and 12 pages in length, using the SPIE LaTeX template (available from the conference web pages). For accepted papers, final submissions will also be 8 to 12 pages in the same format. Papers should clearly identify the problem addressed in the paper, identify the original contribution(s) of the paper, relate the paper to previous work, and provide experimental and/or theoretical evaluation as appropriate. Submissions should be uploaded through the conference web site (http://www.cs.rit.edu/~drr2013/submission.html).

IEEE International Symposium on Multimedia 2012 (ISM2012)

This post by Savvas Chatzichristofis has been reprinted from Image processing and Retrieval Trends.

Irvine, CA, USA, December 10-12, 2012

http://ism.eecs.uci.edu/ISM2012/

The IEEE International Symposium on Multimedia (ISM2012) is an international forum for researchers to exchange information regarding advances in the state-of-the-art and practice of multimedia computing, as well as to identify emerging research topics and define the future of multimedia computing. The technical program of ISM2012 will consist of invited talks, paper presentations, demonstrations and panel discussions.

Please refer to the conference website for further information:

http://ism.eecs.uci.edu/

IMPORTANT DATES

  • Jun 8th, 2012: Panel Proposal Submission
  • Jul 8th, 2012: Regular & Short Paper Submission
  • Jul 8th, 2012: Industry Paper Submission
  • Jul 22nd, 2012: Demo Proposal Submission
  • Jul 22nd, 2012: PhD Workshop Paper Submission
  • Aug 24th, 2012: Panel Notification
  • Aug 24th, 2012: Paper and Demo Notification

SUBMISSIONS

Authors are invited to submit Regular Papers (8-page technical paper), Short Papers (4-page technical paper), Demonstration Papers and Posters (2 page technical paper), PhD Workshop Papers (2 pages), and Workshop Proposals as well as Industry Track Papers (8-page technical paper) which will be included in the proceedings. A main goal of this program is to present research work that exposes the academic and research communities to challenges and issues important for the industry. More information is available on the ISM2012 web page. The Conference Proceedings will be published by IEEE Computer Society Press. Distinguished quality papers presented at the conference will be selected for publication in internationally renowned journals, among them the IEEE Transactions on Multimedia.

AREAS OF INTEREST INCLUDE (but are not limited to):

*Multimedia Systems and Architectures

Architecture and applications, GPU-based architectures and systems, mobile multimedia systems and services, pervasive and interactive multimedia systems including mobile systems, pervasive gaming, and digital TV, multimedia/HD display systems, multimedia in the Cloud, software development using multimedia techniques.

*Multimedia Interfaces

Multimedia information visualization, interactive systems, multimodal interaction, including human factors, multimodal user interfaces: design, engineering, modality-abstractions, etc., multimedia tools for authoring, analyzing, editing, browsing, and navigation, novel interfaces for multimedia etc.

*Multimedia Coding, Processing, and Quality Measurement

Audio, video, image processing, and coding, coding standards, audio, video, and image compression algorithms and performance, scalable coding, multiview coding, 3D/multi-view synthesis, rendering, animation coding, noise removal techniques from multimedia, panorama, multi-resolution or superresolution algorithms, etc.

*Multimedia Content Understanding, Modeling, Management, and Retrieval

Multimedia meta-modeling techniques and operating systems, computational intelligence, vision, storage/archive systems, databases, and retrieval, multimedia/video/audio segmentation, etc.

*Multimedia Communications and Streaming

Multimedia networking and QoS, synchronization, HD video streaming, mobile audio/video streaming, wireless, scalable streaming, P2P multimedia streaming, multimedia sensor networks, internet telephony, hypermedia systems, etc.

*Multimedia Security

Multimedia security including digital watermark and encryption, copyright issues, surveillance and monitoring, face detection & recognition algorithms, human behavior analysis, multimedia forensics, etc.

*Multimedia Applications

3D multimedia: graphics, displays, sound, broadcasting, interfaces, multimedia composition and production, gaming, virtual and augmented reality, applications for mobile systems, multimedia in social network analysis: YouTube, Flickr, Twitter, Facebook, Google+, etc., elearning, etc.

How good is Google Drive’s image recognition engine?

This post by Ludwig Schmidt-Hackenberg has been reprinted from Helping The Blind.

As announced via Twitter, I took the time to test Google Drive’s image recognition feature. Google Drive was announced two weeks ago with a blog post, which contained the bold claim:

Search everything. Search by keyword and filter by file type, owner and more. … We also use image recognition so that if you drag and drop photos from your Grand Canyon trip into Drive, you can later search for [grand canyon] and photos of its gorges should pop up. This technology is still in its early stages, and we expect it to get better over time.

This sparked my curiosity, so I evaluated Google Drive’s performance as I would with the image recognition frameworks I do my research on. First I uploaded an image dataset with images containing known objects, and then counted how many of the pictures Google Drive’s search would find when I searched for these objects.

As a dataset I used the popular Caltech 101 dataset containing pictures of objects belonging to 101 different categories. There are about 40 to 800 images per category and roughly 4500 images in total. While being far from perfect, it is a well-known contender.

These are my first findings:

  • Google Drive only finds a fraction of the images, but the images it finds it categorizes correctly.
  • In numbers: Precision is 83% (std=36%) and the recall is 8% (std=11%) (averaged over all categories)
  • It achieves its best results for the two ‘comic’ categories ‘Snoopy’ and ‘Garfield’ and for iconic symbols like the dollar bill and the stop sign.
  • As the Caltech 101 dataset was created using Google’s image search, the high precision is at least partly a result of a ‘simple’ duplicate detection with the Google index and not of a successful similarity search.
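For readers who want to repeat this kind of test, the per-category averaging behind the numbers above can be sketched as follows. This is a minimal sketch and not the script used for the post: the function names and the toy data are invented, a category with no returned images is scored with precision 0 by assumption, and the population standard deviation is used because the post does not say which variant was computed.

```python
from statistics import mean, pstdev
from typing import Dict, Set, Tuple


def precision_recall(returned: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall for one category's search result."""
    hits = len(returned & relevant)
    # Assumption: an empty result list counts as precision 0.
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def macro_average(
    results: Dict[str, Tuple[Set[str], Set[str]]]
) -> Dict[str, float]:
    """Average precision and recall over categories, with their spread."""
    scores = [precision_recall(ret, rel) for ret, rel in results.values()]
    precisions = [p for p, _ in scores]
    recalls = [r for _, r in scores]
    return {
        "precision_mean": mean(precisions),
        "precision_std": pstdev(precisions),
        "recall_mean": mean(recalls),
        "recall_std": pstdev(recalls),
    }


# Toy data: category -> (images the search returned, images truly in the category).
toy_results = {
    "stop_sign": ({"s1", "s2"}, {"s1", "s2", "s3", "s4"}),
    "chair": ({"c1", "x9"}, {"c1", "c2", "c3", "c4"}),
}
summary = macro_average(toy_results)
print(summary)
```

Averaging per category rather than over all images keeps a few large categories from dominating the result, and the standard deviation shows how uneven the behaviour is from one category to the next.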

Verdict:

Like all vision systems working in such an unconstrained environment, it is far from being actually usable. One cannot rely on it, but once or twice it will surprise you by adding an image to the result list that you hadn’t thought of.


Next Instagram

This post by Ramesh has been reprinted from Ramesh Jain's Blog.

Since the acquisition of Instagram earlier, the guessing game of who is the next ‘Instagram’ to be acquired has become popular. All kinds of guesses are around. You may also have your favorite company or area that you think is the next Instagram.

I find this guessing game interesting.

Instagram was unique, and so will be the next company to be very successful, the so-called Next Instagram. The idea that one can simply find a ‘similar’ company in a slightly different space seems naive. Instagram was unique in its approach of using filters in such a simple, expressive, spontaneous way that people just could not resist the urge to share their photos. This was a photo-sharing app, like many others, but it was very easy to use, it allowed people to express themselves (note that it did not express; it just gave very easy ways to express just a few things), and it promoted spontaneity. You don’t have to spend a lot of time trying to be ‘expressive’ in your communication. The most powerful communication is usually the one that does not get in the way. If the tools for communication take more effort than the satisfaction they give the modern generation’s need for instantaneous communication, then they are likely to be rejected. Most other sharing applications, even after Instagram’s success, do not have the courage to accept the elegance offered by rejecting featuritis.

The space of communication and experience sharing is huge. Much human activity is based on this communication. Interestingly, even in 2012, most of the experience sharing mechanisms have not used even a fraction of the experience capture and sharing ability offered by mobile phones. Most applications do utilize photos/videos and location. And yes, visual information is the dominant mode in experience sharing, but even visual information becomes orders of magnitude more experiential when enhanced using the context and other experiential data. By effectively using multiple modes and sources of experiential data, it is possible to communicate a holistic experience that may one day even surpass the experience of being there. Many of us may find it difficult to believe, but that is already happening in some areas and will happen in many more areas and will become commonly available on regular devices.

Most of the current successful companies on mobile phones have used the experiential and contextual power of these devices only partially. Many interesting apps are emerging in this space and most address one component of the experience. What we require is breaking the silos and creating more realistic experiences. And that is now possible using emerging devices.

Frontiers Of Interaction – June 7-8, 2012 – Rome #foi12

This post by Alessandro Ferrari has been reprinted from Posts of Blog & News.

Frontiers of Interaction was founded in 2005 to explore topics and ideas in the field of Interaction Design. In a very short time, it has become known as the leading Innovation conference in Italy.

Concept

Frontiers of Interaction is a hybrid show that attracts inspiring international speakers and Italian talents, creating a bridge between Europe and Silicon Valley (digital cultural “hot spots” around the world).

  • Local and international speakers.
  • Multidisciplinary audience of managers, researchers and media professionals.
  • A passionate cream of the crop team of organizers.

The unusual format creates an immersive experience featuring music, interactive and artistic installations, demo sites and keynotes, and makes it an ideal venue for thinkers and doers, innovators and academics, early adopters and long-term geeks.

Numbers

  • Frontiers of Interaction 2010

    . 500+ attendees
    . 15% of the attendees coming from outside Italy
    . peaks of 3200 simultaneous users watching our free live video
    . 16 speakers (including 2 keynotes)
    . 4 workshops

  • Frontiers of Interaction 2009

    . 300+ attendees: CEO and decision makers (35%), researchers (20%) and students (15%)
    . 25 speakers coming from 6 Countries
    . 50 passionate Italian entrepreneurs, geeks and experts gave time and energy to help organize the event

Who’s Frontiers for?

Frontiers is an event born in academia, and since its birth it has always benefited from the presence of professors, top students and researchers who have brought the conference forward-thinking ideas and fresh talent, and who now represent nearly 20% of our audience.

Given our proven ability to anticipate technology waves (we’ve done so speaking about web 2.0 in 2004, about Second Life in 2006 and about Internet of Things in 2008), we are a conference for thought leaders coming from all around Europe and abroad.

Frontiers’ mission is to create an experience for the audience, involving everybody in a continuous flow of energy, knowledge and tech visions:

  • Italian start-up entrepreneurs join us because they know Frontiers is the best place to promote their product, to find partnerships, to talk to investors and angels.

  • Investors on their side join us to meet the start-ups that are going to hit the market.

  • Bloggers and tech journalists love Frontiers because they know that the high quality of the speakers and the energy of the conference will provide them with new cutting-edge content.

  • Food loving technologists know how well our buffet lunch is going to treat them with top notch Italian food and wines.

  • Frontiers is one of the best places in Europe where geeks and makers gather to network and try the latest technologies.

  • We also like to meet conference organizers who are at Frontiers to experiment with how the audience could be better entertained and involved in an experience, rather than in a “simple” conference.

CEOs and managers from tech companies are in the audience to learn from our speakers, to have a better understanding of the forthcoming technologies, and to recruit too!

Program:

5 workshops! 4 conference tracks! More than 25 speakers and guests!  But this is just the beginning. Check out the detailed schedule of the event for Day One and Day Two

June, 7th – WORKSHOPS

Day #1

“The art and the making of presenting ideas” with Garr Reynolds
“Building together an Augmented Reality app” with Mauro Rubin (Ceo, Joinpad)
“Lessons from space for your business” with Simonetta di Pippo, European Space Agency

June, 8th – CONFERENCE

Day #2

David Crane, game designer & co-founder Activision
John Rogers, Founder and CEO, Local Motors
Space X, Space expeditions
Kevin Kelly, author and co-founder, Wired (video)
Tanya Vlach, Cyborg Artist looking for projects to implant an eye-camera.
Ninja Blocks, build your web of things without coding
Romotive, a robot that uses your smartphone as its brain
Johanna Kollmann, UX pro and Product Manager, Sidekick Studios
Yael Elish, VP Product and Marketing, Waze
Christophe Duteil, CEO, ePawn


sFly Quadrotors Navigate Outdoors All By Themselves

This post by Savvas Chatzichristofis has been reprinted from Image processing and Retrieval Trends.

Article from IEEE Spectrum

Quadrotors are famous for being able to pull all sorts of crazy stunts, but inevitably, somewhere in the background of the amazing video footage of said crazy stunts you'll notice the baleful red glow of a Vicon motion tracking system. Now, we don't want to call this cheating or anything, but we're certainly looking forward to the day when quadrotors can do this outside of a lab, and the sFly project is helping to make this happen.

What makes the sFly project, led by ETH Zurich's Autonomous Systems Lab, different is that the sFly quadrotors don't rely on motion capture systems. They also don't rely on GPS, remote control, radio beacons, laser rangefinders, frantically waving undergrads, or anything else. The only thing that sFly has to go on is an IMU and an onboard camera (and an integrated computer), but using just those systems (and a "very efficient onboard inertial-aided visual simultaneous localization and mapping algorithm"), sFly is capable of navigating all by itself. And if you have a fleet of sFly quadrotors, you can use them to make cooperative 3D maps of the environment:

Each quadrotor is completely autonomous, but they're also equipped with two extra cameras that stream stereo imagery over GSM or Wi-Fi back to a central computer, which takes the data from several quadrotors and combines it into an overall 3D model of the environment. Then, the computer can guide each robot to an optimal surveillance site. The idea here is that you'd be able to rapidly deploy an sFly system with a swarm of autonomous quadrotors in a disaster area or somewhere else without any infrastructure (or even a GPS signal) and still be able to take advantage of some clever autonomous aerial mapping.

(please note: video is also available on 3D)

ICPR12 contest on Kinect-based gesture recognition

This post by Savvas Chatzichristofis has been reprinted from Image processing and Retrieval Trends.

The ICPR Kinect-based gesture recognition challenge opens on May 7 (cash prizes & more). See URL and below:

http://gesture.chalearn.org/dissemination/icpr2012

ChaLearn takes gesture recognition to the crowd with Microsoft Kinect(TM)

A competition to help improve the accuracy of gesture recognition using Microsoft Kinect(TM) motion sensor technology promises to take man-machine interfaces to a whole new level. From controlling the lights or thermostat in your home to flicking channels on the TV, all it will take is a simple wave of the hand. And the same technology may even make it possible to automatically detect more complex human behaviors, to allow surveillance systems to sound an alarm when someone is acting suspiciously, for example, or to send help whenever a bedridden patient shows signs of distress.

Through its low-cost 3D depth-sensing cameras, Microsoft Kinect(TM) has already kick-started this revolution by bringing gesture recognition into the home. Humans can recognize new gestures after seeing just one example (one-shot learning). With computers, though, recognizing even well-defined gestures, such as sign language, is much more challenging and has traditionally required thousands of training examples to teach the software.

To see what the machines are capable of, ChaLearn launched a competition hosted by Kaggle, with prizes donated by Microsoft, in the hope of giving the state of the art a rapid boost. The ChaLearn team has been organizing competitions since 2003, featuring hard problems such as discovering cause-effect relationships in data. It selected the young and dynamic startup Kaggle to host the gesture challenge because Kaggle has very rapidly established a track record of using crowdsourcing to find solutions that outperform state-of-the-art algorithms and predictive models in a wide variety of domains (from helping NASA build algorithms to map dark matter to helping insurance companies improve claims prediction). The first round of the gesture challenge has now helped narrow the gap between machine and human performance. Over a period of four months starting in December 2011, 153 contestants making 573 entries built software systems that are capable of learning from a single training example of a hand gesture (so-called one-shot learning). They lowered the error rate from more than 50% for the baseline method to less than 10%.

The winner of the challenge, Alfonso Nieto Castanon, used a method he invented, which is inspired by the human vision system. He and the second and third place winners will be awarded $5000, $3000 and $2000 respectively and get an opportunity to present their results in front of an audience of experts at the CVPR 2012 conference in Rhode Island, USA, in June. A demonstration competition of gesture recognition systems using Kinect(TM) will also be held in conjunction with this event, with similar prizes donated by Microsoft.

Now, from May 7 until September 10, new competitors can enter round 2 of the challenge and get a chance to close the gap with human performance, which is under 2% error! The entrants are given a set of examples with which to apply and test their algorithms, so that they may improve them. Compared to round 1, they will benefit from a wealth of resources, including the fact sheets and published papers of the round 1 participants, data annotations, and data transformations that were successful in round 1. During a four-month period they will be able to compare their system with those of other contestants by using it to predict gestures from a feedback sample. Throughout the competition, the evaluations of these predictions are posted on a live leaderboard, so participants can monitor their performance in real time. The contestants will then have the opportunity to put their best algorithms to the final test in an evaluation phase. Here they will be given a few days to train their system on an entirely new set of gestures, after which the one with the best recognition score will be rewarded with $5000. Those in second and third place will receive $3000 and $2000 respectively. As in round 1, the results will be discussed at a scientific conference (ICPR 2012, Tsukuba, Japan, November 2012), where a demonstration competition will also be held, with prizes of the same amounts. Microsoft will be evaluating successful participants in all challenge rounds for two potential IP agreements of $100,000 each. See the official challenge rules for more details at http://gesture.chalearn.org.

The winner of the first round believes that it is possible to reach and even beat human performance. Others will also join in the race.

According to Kaggle, that is the power of the crowd: bringing together expert talent, sometimes from previously untapped quarters. And with Microsoft interested in buying the intellectual property, the hope is that the new algorithms that emerge from the contest will not only boost accuracy but also open the doors to a whole new range of applications. These range from communicating with Kinect(TM) through sign language, or even by speaking while the algorithms interpret what you say by reading your lips, to smart homes and using gestures to control surgical robots.

The challenge was initiated by the US Defense Advanced Research Projects Agency (DARPA) Deep Learning Program and is supported by the US National Science Foundation, the European Pascal2 network of excellence, Microsoft and Texas Instruments. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors and funding agencies.

The DFT matrix and computation time

This post by Steve Eddins has been reprinted from Steve on Image Processing.

On my list of potential blog topics today I saw just this cryptic item labeled dftmtx. Hmm, the MATLAB dftmtx function. But have I written about this function before? I better double-check by searching the old blog postings:

Oh, I forgot about Loren's post! She showed how the discrete Fourier transform, which MATLAB users normally compute by calling fft, can also be computed via a matrix multiply.

x = rand(1000,1);
X = fft(x);

T = dftmtx(1000);
X2 = T*x;

max(abs(X2(:) - X(:)))
ans =

   3.9790e-13

The difference is just floating-point round-off error.

My old post talked about the value of having an independent computation method available when you are testing your algorithm.

So today let's do something a little different. Let's compare the performance of computing the discrete Fourier transform using the DFT matrix versus using the fast Fourier transform.

To help with the timing, I'm going to use a function I wrote called timeit that you can download from the MATLAB Central File Exchange.

clear
n = 100:50:3000;
for k = 1:length(n)
    nk = n(k);
    x = rand(nk,1);
    T = dftmtx(nk);

    f = @() T*x;
    g = @() fft(x);

    times_f(k) = timeit(f);
    times_g(k) = timeit(g);
end

plot(n,times_f,n,times_g)
legend({'Using T*x', 'Using fft(x)'})

That's a pretty dramatic difference. The blue curve, showing the computation time using T*x, is an n^2 curve. Compared to that, the green curve, showing the computation time using fft(x), is so low that you can hardly see it.

Let's expand the y axis.

ylim([0 0.0005])

The lower green curve is an n*log(n) curve. The dramatic difference between that and the n^2 curve is why everyone got so excited when the fast Fourier transform algorithm was invented (or re-invented) a few decades ago.
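For readers without MATLAB, the same comparison can be sketched in plain Python. This is only an illustration under stated assumptions, not the code used in this post: it builds the DFT matrix from its standard definition, T[j][k] = exp(-2*pi*i*j*k/n), and it uses a textbook recursive radix-2 FFT (input length must be a power of two) rather than the FFT library that MATLAB calls. The names dft_by_matrix and fft_radix2 are hypothetical.

```python
import cmath
import random


def dft_matrix(n):
    """Return the n-by-n DFT matrix with entries exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]


def dft_by_matrix(x):
    """Compute the DFT as a matrix-vector product: about n*n multiplications."""
    T = dft_matrix(len(x))
    return [sum(t * v for t, v in zip(row, x)) for row in T]


def fft_radix2(x):
    """Textbook recursive radix-2 FFT: about n*log2(n) multiplications.

    Assumes len(x) is a power of two.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


random.seed(0)
x = [random.random() for _ in range(64)]
X_matrix = dft_by_matrix(x)
X_fft = fft_radix2(x)

# The two results should agree up to floating-point round-off.
max_diff = max(abs(a - b) for a, b in zip(X_matrix, X_fft))
print(max_diff)
```

Both functions compute the same transform; the only difference is the amount of arithmetic, which is where the n^2 versus n*log(n) gap comes from.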



Published with MATLAB® 7.14

OpenCV 2.4 officially out!

This post by Alessandro Ferrari has been reprinted from Posts of Blog & News.

There are no meaningful changes from the 2.4 beta release. The official change log is:
  • OpenCV now provides pretty complete build information via (surprise) cv::getBuildInformation().
  • reading/writing video via ffmpeg finally works and it's now available on MacOSX too.
    note 1: we now demand reasonably fresh versions of ffmpeg/libav with libswscale included.
    note 2: if possible, do not read or write more than 1 video simultaneously (even within a single thread) with ffmpeg 0.7.x or earlier versions, since they seem to use some global structures that are destroyed by simultaneously executed codecs. Either build and install a newer ffmpeg (0.10.x is recommended), or serialize your video i/o, or use parallel processes instead of threads.
  • MOG2 background subtraction by Zoran Zivkovic was optimized using TBB.
  • The reference manual has been updated to match OpenCV 2.4.0 better (though, not perfectly).
  • Asus Xtion is now properly supported for HighGUI. For now, you have to manually specify this device by using VideoCapture(CV_CAP_OPENNI_ASUS) instead of VideoCapture(CV_CAP_OPENNI).




ACM International Conference on Multimedia Retrieval (ICMR) 2012

This post by Savvas Chatzichristofis has been reprinted from Image processing and Retrieval Trends.

ACM International Conference on Multimedia Retrieval (ICMR) 2012
June 5-8, 2012, Hong Kong
http://www.icmr2012.org/
Venue:
Kowloon Shangri-La Hotel &
Run Run Shaw Creative Media Centre, City University of Hong Kong
http://cmc.scm.cityu.edu.hk/en/
====================================================
Multimedia computing, indexing and retrieval continue to be one of the most exciting and fastest-growing research areas in the field of multimedia technology. ICMR is the premier conference in the area of multimedia retrieval, offering opportunities for the exchange of ideas between researchers, practitioners and potential users of multimedia retrieval systems. The conference, which brings together the long-standing experience of the former ACM CIVR and ACM MIR series, was set up to illuminate the state of the art in multimedia (including image, video and audio) retrieval.
ICMR 2012 offers the following highlights:

Three keynote speeches
- Cortically-coupled computing for media retrieval, by Paul Sajda from Columbia University, USA
- Aggregating local image descriptors for large-scale image retrieval and classification, by Cordelia Schmid from INRIA LEAR, France
- The road to pervasive multimedia search and multimodal interaction, by Hsiao-Wuen Hon from Microsoft Research Asia, China.

Three tutorial sessions
- Foundations of large-scale multimedia information management & retrieval, by Edward Y. Chang from Google Research, and Chih-Jen Lin from National Taiwan University
- Music information retrieval, by Markus Schedl from Johannes Kepler University, and Masataka Goto from the National Institute of Advanced Industrial Science and Technology
- 3D Video Segmentation, Recognition, and Retrieval, by B. Prabhakaran from University of Texas at Dallas

Five regular oral sessions
- Annotation and classification
- Fresh views on multimedia retrieval
- Near-duplicate and copy detection
- Machine learning and hashing for multimedia retrieval
- Best paper session

Two special sessions
- Social events in Web multimedia
- Socio-Video Semantics

Practitioner Day, including a keynote, project demonstrations, industrial sessions and a panel discussion

KEY INFORMATION
ICMR 2012 Website: http://www.icmr2012.org/
Technical Program: http://www.icmr2012.org/program.html
Registration: http://www.icmr2012.org/registration.html

A list of lists of PhD resources

This post by Ludwig Schmidt-Hackenberg has been reprinted from Helping The Blind.

On your way to becoming a PhD, you not only have to learn how to do research, you also have to learn how to communicate your ideas comprehensibly in text and speech, how to build the tools you need, and how to survive in the microcosm of supervisors, colleagues and undergrad students in your research lab. But you are not the first to go through all this, and people have written extensive advice for every problem you might encounter. And as they are so popular right now, I present you here my:

List of lists of PhD resources for computer scientists.

List of lists of lists

List of lists of lists!?

I found the most condensed summary on the website of my work group, IUPR. It is a good starting point and gives an overview of all the things one has to keep in mind and pay attention to.

From the most condensed to the most comprehensive. This collection links to nearly 100 articles on Ph.D. dissertation/research, presentations, writing,  reviewing/refereeing,  being a faculty member, job hunting, learning English and more. The list is overwhelming.

Links to documents on giving talks and writing papers and proposals.

from the UCSD VLSI CAD LABORATORY

If you did not actually study computer science (like me) or your courses mainly covered logic and reducing NP-complete problems, this site can probably help you a lot. Software carpentry is about learning the skills to write reliable software and using the existing tools efficiently. The website offers tutorials on basic programming, version control, testing, using the shell, relational databases,  matrix programming, program designing, spreadsheets, data management, and software life-cycles.

So what do you think? Do you find these resources useful? Some of them are already quite old. Do you think they are obsolete? What are your tips? Which collections did I forget?


Study on Distortion of Image and Video Thumbnails

This post by Klaus Schoeffmann has been reprinted from Image processing and Retrieval Trends.

Because digital cameras and camcorders come with many different input resolutions, computer systems need to manage images and videos with different aspect ratios (e.g., 4:3, 16:9, 16:10, etc.). Therefore, developers of large-scale image and video browsing and retrieval tools need to either present all thumbnails with their correct aspect ratio, which often conflicts with a harmonious visualization, or crop or distort thumbnails to one specific aspect ratio. In the paper "A Visual Search User Study on the Influences of Aspect Ratio Distortion of Preview Thumbnails" (to be presented at the International Workshop on Advances in Large-Scale Multimedia Data Collection, Mining and Retrieval at ICME 2012), the authors (David Ahlström and Klaus Schoeffmann) present results from a user study on the influence of aspect ratio distortion on visual search performance. The results show that even heavily distorted thumbnails do not notably influence visual search time or error rate. A preprint of the paper is available here.
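To make the two options concrete, here is a minimal sketch of what "crop" and "distort" mean numerically for a single thumbnail. It is an illustration only and is not taken from the paper: the helper names center_crop_box and stretch_factor are hypothetical, and a centered crop is just one assumed cropping policy.

```python
def center_crop_box(width, height, target_ratio):
    """Return (left, top, right, bottom) of the largest centered box
    whose width/height equals target_ratio (assumed cropping policy)."""
    if width / height > target_ratio:
        # Source is too wide: keep the full height and trim the sides.
        new_w = round(height * target_ratio)
        left = (width - new_w) // 2
        return (left, 0, left + new_w, height)
    # Source is too tall (or already matches): keep the full width.
    new_h = round(width / target_ratio)
    top = (height - new_h) // 2
    return (0, top, width, top + new_h)


def stretch_factor(width, height, target_ratio):
    """Horizontal scale, relative to vertical, needed to force the image
    into target_ratio without cropping. 1.0 means no distortion."""
    return target_ratio / (width / height)


# A 16:9 frame shown in a 4:3 thumbnail grid.
crop_box = center_crop_box(1920, 1080, 4 / 3)
distortion = stretch_factor(1920, 1080, 4 / 3)
print(crop_box, distortion)
```

Cropping discards image content outside the box, while distorting keeps all content but squeezes it by the returned factor; the study above measures the effect of the second option on visual search.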

One Part Basis to Rule them All: Steerable Part Models

This post by Tomasz Malisiewicz has been reprinted from tombone's blog.

Last week, some of us vision hackers at MIT started an Object Recognition Reading Group. The group is currently in stealth-mode, but our goal is to analyze, criticize, and re-synthesize ideas from the object detection/recognition community. To inaugurate the group, I covered Hamed Pirsiavash's Steerable Part Models paper from the upcoming CVPR 2012 conference. As background reading, I had to go over the mathematical basics of learning with tensors (i.e., multidimensional arrays), which were outlined in their earlier NIPS 2009 paper, Bilinear Classifiers for Visual Recognition. After reading up on their work, I have a better grasp of what the trace operator actually does. It is nothing more than a Hermitian inner product defined on the space of linear operators from C^N to C^M (see post here for geometric interpretations of the trace).
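The trace remark can be checked numerically. The sketch below assumes the usual definition of that inner product, <A, B> = tr(A^H B) for two M-by-N complex matrices, and verifies that it equals the element-wise sum of conj(a_ij) * b_ij. The helper names and the example matrices are made up for illustration; they do not come from either paper.

```python
def conj_transpose(A):
    """Return the conjugate transpose A^H of a matrix stored as a list of rows."""
    rows, cols = len(A), len(A[0])
    return [[A[i][j].conjugate() for i in range(rows)] for j in range(cols)]


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def trace(M):
    """Sum of the diagonal entries of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))


def trace_inner(A, B):
    """Inner product <A, B> = tr(A^H B)."""
    return trace(matmul(conj_transpose(A), B))


def elementwise_inner(A, B):
    """The same inner product written as a sum over matching entries."""
    return sum(a.conjugate() * b
               for row_a, row_b in zip(A, B)
               for a, b in zip(row_a, row_b))


# Two 2-by-3 complex matrices, i.e. linear operators from C^3 to C^2.
A = [[1 + 2j, 0 + 1j, 3 + 0j], [2 - 1j, 1 + 1j, 0 - 2j]]
B = [[0 + 1j, 2 + 0j, 1 - 1j], [1 + 1j, 3 + 0j, 2 + 2j]]

print(trace_inner(A, B), elementwise_inner(A, B))
```

Seen this way, tr(A^H B) is just the ordinary dot product of the two matrices flattened into long vectors, and tr(A^H A) is the squared Frobenius norm of A.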




Hamed Pirsiavash, Deva Ramanan, "Steerable part models", CVPR 2012


"Our representation can be seen as an approach to sharing parts." 
-- H. Pirsiavash and D. Ramanan


The idea behind this paper is relatively simple -- instead of learning category-specific part models, learn a part basis from which all category-specific part models are derived. Consider the different parts learned from a deformable part model (see Felzenszwalb's DPM page for more info about DPMs) and their depiction below. If you take a close look, you see that the parts are quite general, and it makes sense to assume that there is a finite basis from which these parts come.

Parts from a Part-model

The model learns a steerable basis by factoring the matrix of all part models into the product of two low-rank matrices, and because the basis is shared across categories, this performs both dimensionality reduction (likely to help prevent over-fitting as well as speed up the final detectors) and sharing (likely to boost performance).
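A quick back-of-the-envelope count shows why the factorization reduces dimensionality. The sketch below assumes the simplest reading of the description above: a part_dim-by-n_parts matrix of part filters is replaced by a part_dim-by-rank basis times a rank-by-n_parts coefficient matrix. All the sizes are hypothetical numbers chosen for illustration, not figures from the paper.

```python
def full_parameter_count(n_parts, part_dim):
    """Parameters when every part filter is learned independently."""
    return n_parts * part_dim


def factored_parameter_count(n_parts, part_dim, rank):
    """Parameters for a shared basis (part_dim x rank) plus
    per-part steering coefficients (rank x n_parts)."""
    return rank * (part_dim + n_parts)


# Hypothetical sizes: 480 part filters, each 6x6 HOG cells with 32 features.
n_parts = 480
part_dim = 6 * 6 * 32
rank = 20  # assumed number of basis filters

full = full_parameter_count(n_parts, part_dim)
factored = factored_parameter_count(n_parts, part_dim, rank)
print(full, factored, full / factored)
```

The saving grows with the number of parts, because the basis cost is paid once while each additional part adds only rank coefficients.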

The learned steerable basis

While the objective function is not convex, it can be tackled via a simple alternating optimization algorithm in which the resulting sub-objectives are convex and can be optimized using off-the-shelf linear SVM solvers. They call this property bi-convexity. It doesn't guarantee finding the global optimum; it just makes using standard tools easy.
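The alternating scheme is easy to see on a toy problem. The sketch below is not the paper's method: it swaps the SVM objective for a plain squared-error loss and fits a rank-1 factorization W ≈ u v^T, so that each sub-problem (solve for u with v fixed, then for v with u fixed) is a convex least-squares problem with a closed-form solution. The matrix W and all function names are hypothetical.

```python
def rank1_objective(W, u, v):
    """Squared error between W and the rank-1 matrix u v^T."""
    return sum((W[i][j] - u[i] * v[j]) ** 2
               for i in range(len(u)) for j in range(len(v)))


def update_u(W, v):
    """Best u for fixed v (a convex least-squares sub-problem)."""
    vv = sum(x * x for x in v)
    return [sum(W[i][j] * v[j] for j in range(len(v))) / vv
            for i in range(len(W))]


def update_v(W, u):
    """Best v for fixed u (a convex least-squares sub-problem)."""
    uu = sum(x * x for x in u)
    return [sum(W[i][j] * u[i] for i in range(len(u))) / uu
            for j in range(len(W[0]))]


def alternate(W, v0, n_iters=20):
    """Alternate the two convex updates and record the objective each round."""
    v = list(v0)
    history = []
    for _ in range(n_iters):
        u = update_u(W, v)
        v = update_v(W, u)
        history.append(rank1_objective(W, u, v))
    return u, v, history


# A small, nearly rank-1 toy matrix.
W = [[2.0, 4.0, 6.0],
     [1.0, 2.0, 3.5],
     [3.0, 6.0, 9.0]]
u, v, history = alternate(W, v0=[1.0, 1.0, 1.0])
print(history[0], history[-1])
```

Each update can only lower (or keep) the objective, so the recorded values never increase, but as noted above nothing guarantees that the point reached is the global optimum in general.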

While the results on PASCAL VOC2007 do not show an improvement in performance, they do show a significant computational speed-up. (VOC2007 is not a very good dataset for sharing, as there are only a few category combinations, e.g., bicycle and motorbike, which should in theory benefit significantly from sharing.) Below is a picture of the part-based car model from Felzenszwalb et al., as well as the one from their steerable basis approach. Note that the HOG visualizations look very similar.


In conclusion, this is one paper worthy of checking out if you are serious about object recognition research.  The simplicity of the approach is a strong point, and if you are a HOG-hacker (like many of us these days) then you will be able to understand the paper without a problem.

Using Panoramas for Better Scene Understanding

This post by Tomasz Malisiewicz has been reprinted from tombone's blog.

There's a lot more to automated object interpretation than merely predicting the correct category label.  If we want machines to be able to one day interact with objects in the physical world, then predicting additional properties of objects such as their attributes, segmentations, and poses is of utmost importance.  This has been one of the key motivations in my own research behind exemplar-based models of object recognition.

The same argument holds for scenes. If we want to build machines which understand environments around them, then they will have to do much more than predict some sloppy "scene category." Consider what happens when a machine automatically analyzes a picture and says that it is from the "theater" category. Well, the picture could be of the stage, the emergency exit, or just about anything else within a theater -- in each of these cases, the "theater" category would be deemed correct, but would fall short of explaining the content of the image. Most scene understanding papers either focus on getting the scene category right, or strive to obtain a pixel-wise semantic segmentation map. However, there's more to scene categories than meets the eye.

Well, there is an interesting paper which will be presented this summer at the CVPR2012 Conference in Rhode Island which tries to bring the concept of "pose" into scene understanding.  Pose-estimation has already been well established in the object recognition literature, but this is one of the first serious attempts to bring this new way of thinking into scene understanding.

J. Xiao, K. A. Ehinger, A. Oliva and A. Torralba.
Recognizing Scene Viewpoint using Panoramic Place Representation.
Proceedings of 25th IEEE Conference on Computer Vision and Pattern Recognition, 2012.

The SUN360 panorama project page also has links to code, etc.


The basic representation unit of places in their paper is that of a panorama.  If you've ever taken a vision course, then you probably stitched some of your own.  Below are some examples of cool looking panoramas from their online gallery.  A panorama roughly covers the space of all images you could take while centered within a place.

Car interior panoramas from SUN360 page
Building interior panoramas from SUN360 page

The proposed algorithm accomplishes two things. It acts like an ordinary scene categorization system, but in addition to producing a meaningful semantic label, it also predicts the likely view within a place. This is very much like predicting that there is a car in an image, and then providing an estimate of the car's orientation. Below are some pictures of inputs (left column), a compass-like visualization which shows the orientation of the picture (with respect to a cylindrical panorama), as well as a depiction of the image content likely to fall outside of the image boundary. The middle column shows per-place mean panoramas (in the style of TorralbaArt), as well as the input image aligned with the mean panorama.
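To make "orientation with respect to a cylindrical panorama" concrete, here is a small geometric sketch. It is not the paper's algorithm; it only assumes that a panorama's columns cover 360 degrees evenly, and shows which columns a photo with a given compass heading and horizontal field of view would occupy, including the wrap-around at the seam. The function name view_columns and the example numbers are hypothetical.

```python
def view_columns(pano_width, heading_deg, fov_deg):
    """Return the panorama column indices covered by a view centered at
    heading_deg with horizontal field of view fov_deg.

    Assumes column 0 corresponds to heading 0 and that the panorama's
    columns are spread evenly over 360 degrees.
    """
    cols_per_deg = pano_width / 360.0
    first = int(round((heading_deg - fov_deg / 2.0) * cols_per_deg))
    count = int(round(fov_deg * cols_per_deg))
    # The modulo wraps views that straddle the panorama's seam.
    return [(first + i) % pano_width for i in range(count)]


# A 60-degree view looking at heading 0 in a 360-column panorama
# straddles the seam between the last and first columns.
cols = view_columns(360, 0, 60)
print(cols[0], cols[-1], len(cols))
```

Everything outside the returned columns is the part of the place that falls beyond the image boundary, which is what the extrapolated content in the visualizations depicts.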


I think panoramas are a very natural representation for places, perhaps not as rich as a full 3D reconstruction of places, but definitely much richer than static photos. If we want to build better image understanding systems, then we should seriously start looking at using richer sources of information than static images. There is only so much you can do with static images and MTurk, so videos, 3D models, panoramas, etc. are likely to be big players in the upcoming years.