Tuesday, November 8, 2011

Common Evaluations

We've been having an interesting discussion at work about the best way to improve the overall rate of progress in decoding brain signals. As a community, we've been making progress, but improvements have been painfully slow. Typically, one group will announce a big breakthrough and publish it in a splashy journal, but the reality is always that "the devil is in the details" - these experiments become awfully hard to replicate since experimental details are rarely divulged. For example, if you are conducting an EEG-based BMI experiment, you might have kept the lights dim or the room temperature cool or some other detail. You might have had to reject certain trials for one reason or another - and these details are crucially important to making the experiment work. As a result, having one lab build on the success of another is damn near impossible. As a corollary, it is nearly impossible to say which lab is making the most progress or which decoding technique is the best, because there is no such thing as an apples-to-apples comparison in this field yet.
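To make the replication problem concrete, here is a minimal sketch (in Python, purely illustrative - the thresholds, data shapes, and function names are my own assumptions, not anyone's published protocol) of the kind of trial-rejection step that often goes unreported. Two labs applying different thresholds to the very same recording end up analyzing different sets of trials, and potentially reporting different results:

    # Hypothetical amplitude-based trial rejection for an EEG-style experiment.
    # All numbers here are made up for illustration.
    import numpy as np

    def reject_trials(trials, threshold_uv=100.0):
        # Keep trials whose peak absolute amplitude stays below the threshold.
        # trials: array of shape (n_trials, n_channels, n_samples), in microvolts.
        # Returns the surviving trials and the indices that were kept.
        peak = np.abs(trials).max(axis=(1, 2))   # peak amplitude per trial
        keep = np.where(peak < threshold_uv)[0]  # indices passing the criterion
        return trials[keep], keep

    # The same synthetic recording, two different (unreported) thresholds:
    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 25.0, size=(200, 32, 512))  # fake EEG-like trials
    _, kept_strict = reject_trials(data, threshold_uv=100.0)
    _, kept_loose = reject_trials(data, threshold_uv=150.0)
    print(len(kept_strict), "vs", len(kept_loose), "trials survive")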

I'd like to change that.

I learned from my department chair that human language technology research had a similar problem about 30 years ago - various labs were making outrageous claims about transcribing speech to text, but since everyone was using their own proprietary data set, the claims were very hard to sort out. The solution came from the National Institute of Standards and Technology (NIST). NIST decided to institute an annual challenge to the community: transcribe these phone conversations, identify the language being spoken, and so on. NIST provided the data sets, so that everyone was working off the same data, and it finally became possible to objectively compare the performance of various labs and algorithms. As the project grew year after year, it became necessary to set up the Linguistic Data Consortium (LDC) to design and create ever more sophisticated data sets. Part of the challenge is to design a data set properly, so that all the control cases are addressed and the desired algorithms can be tested. Once a data set is designed, LDC collects and disseminates the data. LDC will also archive and disseminate data from independent laboratories. The LDC is hosted by the University of Pennsylvania, currently has about 50 employees, and has amassed about 500 data libraries in over 60 languages.
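To make the apples-to-apples idea concrete, here's a toy sketch (in Python; the decoders, data, and accuracy metric are hypothetical stand-ins, not any lab's actual algorithm or any real consortium data set) of what a common evaluation buys you: every lab's decoder gets scored on the same frozen test split with the same metric, so the resulting numbers are directly comparable.

    # Toy common-evaluation harness: one shared data set, one split, one metric.
    import numpy as np

    def evaluate(decoder, X_test, y_test):
        # Score a decoder on the shared, frozen test set with one fixed metric.
        predictions = decoder(X_test)
        return float(np.mean(predictions == y_test))  # simple classification accuracy

    # The consortium distributes one data set and one train/test split to everyone.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 16))                   # fake neural features
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)     # fake intended-movement labels
    X_train, X_test = X[:800], X[800:]
    y_train, y_test = y[:800], y[800:]

    # Two hypothetical labs submit decoders fit on the same training split.
    w = np.linalg.lstsq(X_train, y_train - 0.5, rcond=None)[0]
    lab_a = lambda X: (X @ w > 0).astype(int)         # a linear decoder
    lab_b = lambda X: (X[:, 0] > 0).astype(int)       # a one-channel threshold decoder

    for name, decoder in [("Lab A", lab_a), ("Lab B", lab_b)]:
        print(name, "accuracy on the common test set:", evaluate(decoder, X_test, y_test))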

The advantages are manifold. First, by having a community-wide competition, attention and energy are focused on the most important problems. Program managers from federal funding agencies can be instrumental in setting these goals. Second, by having common data libraries, the community is able to effectively ferret out the real differences between various algorithms and techniques - a boon for overall progress. And finally, relative to the overall amount of money funding agencies spend to fuel all this research, the cost of collecting and distributing the data is minor.

The following graphic shows how the Common Evaluations paradigm of NIST and LDC has propelled progress in the human language technology field. Over time the challenges have become progressively harder, and yet progress keeps coming. As a bonus, these data libraries have become a real boon for industry players who wish to incorporate language technology into their products: these companies now have "industry-standard" data sets to build their algorithms around. So it's not just good for progress in research, but also in industry.

[Graphic: history of NIST human language technology benchmark evaluations - progressively harder tasks, with steady progress over time]

The challenges in making such a data consortium work are also manifold. First, it won't work without the consensus of the scientific community that this is a valuable exercise. If the main labs and key players refuse to participate, the whole exercise becomes much less useful. The main way to resolve this potential problem is to (a) directly engage the community and sell it on the importance of the concept, and (b) convince program managers at funding agencies to insist that their PIs participate in the consortium. Beyond the engagement issue, there are secondary problems such as funding, scope, organization, and so on. But none of these issues are show-stoppers. We believe there is a need for an LDC-like operation in the neural engineering world, and we are pursuing efforts to start such an endeavor, to be hosted (naturally) at Temple University.
