Workshop Resources

KEYNOTE by Rocco Oliveto.
Categorizing API Forum Discussions by Daqing Hou and Lingfeng Mo.
Can Text Mining Assistants Help to Improve Requirements Specifications? by Bahar Sateli, Elian Angius, Srinivasan Rajivelu and Rene Witte.

Panel Discussion Session

Steven Thomas (Queen's University)

Challenges Faced in the Past

The lack of tools for mining unstructured SE data was the biggest hurdle when starting out in the field.
Pre-processing SE data is very challenging, and readily available tools are lacking.
Analyzing the output of NLP analysis tools (such as LDA) is not straightforward.
Tools and methods for visualizing NLP results are virtually non-existent, so custom solutions have to be hand-crafted.

To tackle these challenges, Steve has prepared a freely available set of tools for working with NLP methods on SE data.

Road Ahead

We need better methods for linking SE artifacts (source code, documentation, ...) with higher, or at least reasonable, performance. Current tools do not do a good enough job to be useful in practice.
NLP tools are very slow, especially when applied to large SE data repositories.
There is a real need to leave the academic bubble and to corroborate and validate NLP-based methods and results with real developers.

Discussion

Who is responsible for creating new NLP tools for the SE context, or for modifying existing ones? SE researchers tend to see themselves as customers/consumers of existing technology, while NLP researchers are unaware of the SE domain. David Lo's research group has started building custom NLP tools from scratch that are adapted to the SE context.

Elliot Chikofsky: 'Unstructured' data only makes sense when we are trying to transform unstructured information into structured information (so that computers can work with that information). Mining unstructured data is closely related to transforming what appears to be noise into useful and usable patterns.

Olga Baysal (University of Waterloo)

Challenges Faced in the Past

A big question when working in MUD is "what technique/tool/model should you use for the specific problem at hand?"

Road Ahead

The community would greatly benefit from a collection of knowledge about tools and techniques, including step-by-step walkthroughs of what works and what does not in particular situations of Mining Unstructured Data.

Discussion

There currently appears to be a lack of venues for publishing such papers (that include detailed manuals and descriptions of what works and what does not). Perhaps a workshop like MUD could be a platform for collecting such knowledge and making it available to a broader audience within the Mining Repositories Community. The ultimate goal may be to develop a "cookbook" with recipes for mining unstructured data using NLP and IR techniques/tools. Elliot Chikofsky suggests that even though solutions may be very specific to certain scenarios and contexts, collecting them can still be valuable for creating a script, similar to the ones that support hotlines follow when customers call for help.

Surafel Lemma Abebe (Fondazione Bruno Kessler)

Challenges Faced in the Past

Extracting concepts from source code using identifier names is very challenging due to the mixture of natural language, programming language, and domain-specific terms.
Existing NLP tools don't work particularly well for parsing identifier names.

To tackle both challenges, Surafel adapted existing NLP tools for this specific problem.
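
To illustrate why identifier names need special handling before general-purpose NLP tools can process them, here is a minimal sketch of an identifier splitter. It is not the tooling discussed at the panel; the function name split_identifier and the splitting rules (underscores, camelCase boundaries, runs of capitals treated as acronyms) are assumptions chosen for illustration only.

```python
import re

# Assumed tokenization rules (hypothetical, for illustration):
#   1. a run of capitals followed by a capitalized word is an acronym (HTTP|Response)
#   2. an optional capital followed by lowercase letters is a word (get, Response)
#   3. any remaining run of capitals is an acronym (ID)
#   4. a run of digits is its own token
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")


def split_identifier(name):
    """Split a source code identifier into lowercase word tokens."""
    # Underscores and other separators match none of the alternatives,
    # so findall simply skips over them.
    return [token.lower() for token in _TOKEN.findall(name)]


if __name__ == "__main__":
    for identifier in ["getHTTPResponse", "parse_xml_file", "XMLParser"]:
        print(identifier, "->", split_identifier(identifier))
```

The resulting tokens can then be passed to ordinary text-processing steps. Abbreviations and domain-specific terms are left untouched by this sketch and would still need separate treatment.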

Road Ahead

MUD is still largely a manual, sometimes semi-automated process. To scale to modern repositories, the community needs to find ways to fully automate semantic knowledge extraction from unstructured data.

Anthony Cleve (University of Namur)

Challenges Faced in the Past

His main objective during his PhD was to mine poorly structured data (legacy Cobol projects).
His current work involves finding source code in emails.
From both experiences, he sees the lack of SE-specific tools for mining unstructured data as a big challenge, with one of the main problems being the ambiguous nature of unstructured data. The user/consumer of the unstructured data has to impose some sort of structure on that data, but there are many possible ways to do so.

Road Ahead

The need to mine unstructured data will only grow in the coming years, and the problem will become harder (example: mining NoSQL, structureless databases). The current shortage of researchers and practitioners with experience in MUD is evidenced by an explosion of open job positions that look specifically for this skill set.

Discussion

A possible way to tackle this ambiguity might be to compensate for weak data and techniques with a majority-vote approach: adding more data or running alternative techniques on the same data, and accepting the results that form the middle ground between multiple runs.
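
The majority-vote idea raised in this discussion can be sketched as follows. This is only an illustration under stated assumptions: the function name majority_vote, the min_share threshold, and the example labels are all hypothetical and were not part of the panel discussion.

```python
from collections import Counter


def majority_vote(labels, min_share=0.5):
    """Return the label that more than min_share of the runs agree on.

    labels: one result per technique or run for the same artifact.
    Returns None when no label reaches the required share, which
    signals that the runs disagree too much to accept any result.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_share else None


if __name__ == "__main__":
    # Hypothetical outputs of three techniques applied to two artifacts.
    runs = {
        "artifact-1": ["bug", "bug", "feature"],
        "artifact-2": ["bug", "feature", "question"],
    }
    for artifact, labels in runs.items():
        print(artifact, "->", majority_vote(labels))
```

Returning None instead of an arbitrary label keeps ambiguous cases visible, so they can be inspected manually rather than silently accepted.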

David Lo (Singapore Management University)

Challenges Faced in the Past

We currently have no good labeled SE datasets on which to build MUD tools or against which to benchmark new approaches. Good-quality, labeled SE datasets are essential for developing robust and high-performance techniques for MUD.
Our community currently has the luxury of too much data: we don't know what to look at first. That has led to a large amount of shallow research (low-hanging fruit) on the one hand, and to a lack of focus on the other (open field, nobody knows where the greatest value lies).

Road Ahead

Most unstructured data contains hidden, implicit structure (which we may not even know about unless we talk to practitioners). We need to find ways to unlock this implicit knowledge, as it might be highly valuable for creating better techniques and for extracting more knowledge from unstructured data and datasets.