Effective Problem Formulation
Effective problem formulation, focusing on understanding the problem, visualizing goals, and engaging end users, is crucial for data science success.
Hello there, it’s Fede Nolasco from datatunnel, and today I want to discuss a crucial component of data science projects – effective problem formulation. Often underestimated, this skill can steer the direction of a project and significantly influence its success.

The Importance of Problem Formulation
In data science, problem formulation is much like setting the stage for an intricate dance. It’s about understanding the problem domain, identifying your goals, and knowing what resources (in our case, data) are at your disposal. This isn’t as simple as it might seem, because the people who know the domain and the people who build the models rarely start from the same picture of the problem. Collaboration between the data science team and subject matter experts is therefore paramount to reach a common understanding of the issue at hand and its potential solutions.
By articulating a clear problem statement, we establish a shared vision of success among stakeholders and the data science team. Consequently, we can design measurable hypotheses, adding a layer of scientific rigor to our project.
Walkthrough of Problem Formulation Stages
Comprehending the Problem Domain
Problem formulation kicks off with understanding the problem domain thoroughly. We pose questions like:
• What is the problem we are aiming to solve?
• Why does this problem need a solution?
• Is a machine learning solution necessary?
• How would a potential solution be used?
Your answers here will shine a light on the problem’s details, guiding the course of your data science project.
Visualizing Goals
During the goal visualization process, we determine potential solution use cases, decide on performance metrics, and ascertain the available data. In this stage, we aim to achieve a shared understanding of the problem domain, a clear objective, and an established method for appraising potential solutions.
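One way to make the outcome of this stage concrete is to write it down as a small structured record and check it for gaps before any modeling starts. The sketch below is only an illustration: the class name ProblemFormulation, its fields, and the missing_items helper are hypothetical names chosen for this example, not part of any library or established method.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProblemFormulation:
    """Hypothetical record of what the team agreed on while visualizing goals."""
    objective: str
    use_cases: List[str] = field(default_factory=list)     # how a solution would be used
    metric: str = ""                                        # agreed performance metric
    target: Optional[float] = None                          # value of the metric that counts as success
    data_sources: List[str] = field(default_factory=list)  # data known to be available

    def missing_items(self) -> List[str]:
        """Return the parts of the formulation that are still undefined."""
        gaps = []
        if not self.use_cases:
            gaps.append("use_cases")
        if not self.metric or self.target is None:
            gaps.append("metric_and_target")
        if not self.data_sources:
            gaps.append("data_sources")
        return gaps


# A first draft that states only the objective still has every gap open.
draft = ProblemFormulation(objective="Reduce manual effort in documenting data assets")
print(draft.missing_items())  # ['use_cases', 'metric_and_target', 'data_sources']
```

An empty list from missing_items would suggest the team has at least named its use cases, its success metric, and its data, which is the shared understanding this stage aims for.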
Engaging the End User
End user engagement is a frequently overlooked yet vital part of problem formulation. By understanding end users’ current practices, their pain points, and their vision of a solution, we gain valuable insights for formulating the problem. Some guiding questions might be:
• How well does the current solution perform?
• What is the condition of the data used to build the solution?
• How does the end user or expert imagine the solution?
Real-World Application: Meta Definitions for Data Assets via Generative AI
To illustrate these principles, let’s delve into a data asset management scenario using generative AI. Many organizations grapple with maintaining precise and meaningful meta definitions for their numerous data assets. An accurately managed data asset inventory, complete with correct metadata, bolsters data discoverability, data quality, and overall data governance. Now, imagine we’re assigned to develop a generative AI model that can automate the writing of metadata descriptions for data assets.
Rather than stating our objective as “automating metadata generation,” we could refine it to “enhancing data governance by automatically generating precise and consistent meta definitions for data assets, thereby reducing manual intervention and potential human errors.”
Several factors come into play when understanding the problem. Firstly, we need to grasp the nature of the data assets and how the organization uses them. Furthermore, we need to define what makes a meta definition ‘good’: aspects such as clarity, conciseness, meaningfulness, and relevance could be essential parameters.
When visualizing the solution, we consider how the generative AI model would interact with the data assets. It should access the existing data, understand its structure and semantics, and generate a meaningful description. Performance measures might include the accuracy of the generated meta definitions, their relevance, and the time saved in manual efforts.
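To show how such performance measures could be turned into numbers, here is a minimal sketch. It assumes that a data steward reviews each generated meta definition and either accepts or rejects it, so the acceptance rate stands in for accuracy and relevance. The Review record, its fields, the summarize function, and the sample figures are all hypothetical and serve only to illustrate the arithmetic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Review:
    """Hypothetical record of a steward checking one generated meta definition."""
    asset_name: str
    accepted: bool         # steward judged the definition accurate and relevant
    manual_minutes: float  # assumed time to write the definition by hand
    review_minutes: float  # time spent checking the generated definition


def summarize(reviews: List[Review]) -> Dict[str, float]:
    """Compute the acceptance rate and the net minutes saved across all reviews."""
    if not reviews:
        raise ValueError("no reviews to summarize")
    accepted = sum(1 for r in reviews if r.accepted)
    # An accepted definition saves the manual effort minus the review time.
    # A rejected one still has to be written by hand, so the review time is lost.
    saved = sum(
        (r.manual_minutes - r.review_minutes) if r.accepted else -r.review_minutes
        for r in reviews
    )
    return {"acceptance_rate": accepted / len(reviews), "minutes_saved": saved}


# Made-up sample data, for illustration only.
reviews = [
    Review("customer_orders", accepted=True, manual_minutes=10.0, review_minutes=2.0),
    Review("web_sessions", accepted=False, manual_minutes=10.0, review_minutes=3.0),
]
print(summarize(reviews))
```

Counting rejected definitions as a cost is a deliberate choice here: it keeps the time-saved figure honest when the model produces descriptions that stewards end up rewriting anyway.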
To understand the end users, we need to engage data stewards, data scientists, and other data users. Here are some questions that might help:
• How are meta definitions currently created and updated?
• What are the challenges in the existing process?
• What attributes would they like to see in the generated meta definitions?
• How do they envision the interaction with an AI-based system?
Armed with this information, we can formulate our problem more effectively and build a robust foundation for a generative AI model that automates the creation of meta definitions for data assets.
In Conclusion
Problem formulation is undoubtedly a linchpin in data science. A well-framed problem not only provides direction but also fosters a common understanding among stakeholders. By focusing on understanding the problem domain, visualizing the goals, and incorporating the end user’s perspective, data science teams can position themselves for project success. As always, keep up with us for more insights into the world of data science. Until next time, keep formulating!