Sourcing and Validating Data: Building the Right Dataset (CBDA: Source Data)
Aliaksei Khavanski
Expert Contributor
Last Updated: June 15, 2026
Once a research question is framed, the next temptation is to grab whatever data is lying around and start analysing. The IIBA Guide to Business Data Analytics calls this out directly: Source Data is a top-down exercise. You start from the context of the problem and determine what type of data must be used — not from whatever dataset happens to be available. This guide covers why the domain matters for the CBDA exam, its four tasks, a deep dive into planning data collection with a worked example, a sourcing checklist, and the traps examiners love.
Why Source Data carries real weight
Source Data is roughly 15% of the CBDA exam — and it is the domain where the Guide most sharply distinguishes the business analysis professional from the data scientist. Data scientists see datasets as a set of variables; the business analysis professional brings the insight to determine whether a dataset is useful within a business context, because they understand the meaning behind the data variables. That is why a well-structured analytics team deliberately combines both business and data science skills when sourcing data. If you arrive here without a validated research question, go back and read From Business Problem to Research Question first — Source Data exists to serve the questions framed in Identify the Research Questions, not the other way around.
The domain's tasks demand the most effort of any in the competency model, but the starting point is always the same: understand the problem context, then determine the data.
The four tasks of Source Data
2.2.1 Plan Data Collection. Before any data is sourced, analysis determines what data is most relevant to the analytics problem. The plan covers what data is needed, its availability, the need for historical data, when and how it will be collected, and how it will be validated once collected. The deep dive below unpacks this central task.
2.2.2 Determine the Data Sets. Review the data expected from each source and pin down specifics: data types, dimensions, sample size, and relationships between data elements. Decide which whole and which partial datasets to collect: an entire spreadsheet, or specific rows within it. Identify data gaps, where data doesn't exist or is missing due to errors such as a failure in the collection process. The Guide's signature tool here is the five Vs: Volume (amount and size of data), Velocity (speed of generation and frequency of collection), Variety (sources, formats, types), Veracity (trustworthiness, meaning uncertainties and inconsistencies), and Value (the analytics must be driven by real, valuable business goals). Analysts also weigh cost versus benefit per dataset. Ideally the team collects its own data from scratch to reduce external biases, but resources rarely allow it. A hard truth the exam tests: a research question may be dropped when the data required to answer it is too expensive to obtain. Data profiling and data sampling are the workhorse techniques.
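To make data profiling and data sampling concrete, here is a minimal sketch using only the Python standard library. The rows, column names, and the `profile` and `sample` helpers are all hypothetical illustrations, not anything prescribed by the Guide; a real team would more likely use a dedicated profiling tool.

```python
import random

# Hypothetical rows standing in for one candidate dataset.
rows = [
    {"customer_id": 1, "plan": "basic", "monthly_fee": 9.99},
    {"customer_id": 2, "plan": "pro", "monthly_fee": 19.99},
    {"customer_id": 3, "plan": None, "monthly_fee": 9.99},
    {"customer_id": 4, "plan": "pro", "monthly_fee": None},
]


def profile(rows):
    """Summarise each column: observed types, missing count, distinct values."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "types": sorted({type(v).__name__ for v in present}),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


def sample(rows, k, seed=0):
    """Draw a reproducible simple random sample of k rows."""
    return random.Random(seed).sample(rows, k)


report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```

The missing counts point at data gaps, while the type and distinct-value summaries feed the "data types, dimensions, sample size" review described above.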
2.2.3 Collect Data. The Guide splits collection into two approaches. Passive data collection is unobtrusive: it uses data generated by users in their day-to-day transactions with the organization (point-of-sale data, web and mobile data), which is available without an analytics objective in mind. Active data collection seeks information from stakeholders for a specific goal, typically through surveys and self-reports, where the analyst applies best practices such as open-ended versus closed-ended questions, Likert scales, paired comparisons, and question flow. Before collecting at scale, test the approach on a small number of observations, for example by piloting the survey with a small population first. Analysts also trace data lineage: where the data comes from, what transformations are performed, and where it is finally stored. When sources differ (for example, source A codes gender numerically and source B alphabetically), the need to reconcile data elements must be identified, and some discrepancies require domain knowledge that no program can supply.
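The reconciliation problem can be sketched in a few lines of Python. The code tables below are assumptions made up for illustration (the text only says one source is numeric and the other alphabetic); in practice the meaning of each code has to be confirmed with the data owners, which is exactly the domain knowledge no program can supply.

```python
# Hypothetical code tables; the real meanings must come from the data owners.
SOURCE_A_GENDER = {1: "male", 2: "female"}
SOURCE_B_GENDER = {"M": "male", "F": "female"}


def reconcile(value, code_table):
    """Map a source-specific code to the shared label, or flag it for review."""
    return code_table.get(value, "NEEDS_REVIEW")


source_a = [1, 2, 9]        # 9 is not in the assumed code table
source_b = ["F", "M", "X"]  # neither is "X"

unified_a = [reconcile(v, SOURCE_A_GENDER) for v in source_a]
unified_b = [reconcile(v, SOURCE_B_GENDER) for v in source_b]
print(unified_a)
print(unified_b)
```

Unknown codes are flagged rather than guessed, so that a person with business context decides what they mean.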
2.2.4 Validate Data. Validation at this stage is deliberately high-level — detailed analysis comes later. It splits in two. Business validation: business stakeholders approve the data sources and establish the acceptance criteria that define the parameters for assessing accuracy. Technical validation: technical testing to assess data quality against five characteristics — Accuracy (correct, not misleading), Completeness (nothing expected is missing — no nulls in required fields), Consistency (the same value for a data element across sources), Uniqueness (no duplicates), and Timeliness (fresh, current, for the period requested). Data mapping and business rules analysis support this task.
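As a rough illustration of technical validation, the sketch below counts violations for four of the five characteristics. All data, field names, and function names are hypothetical. Accuracy is left out on purpose: judging it requires a trusted reference and the acceptance criteria that business stakeholders set.

```python
from datetime import date

# Hypothetical billing rows, plus a second source (CRM) for the same customers.
billing = [
    {"customer_id": 1, "plan": "basic", "billed_on": date(2026, 3, 1)},
    {"customer_id": 2, "plan": None, "billed_on": date(2026, 3, 1)},
    {"customer_id": 2, "plan": None, "billed_on": date(2026, 3, 1)},
    {"customer_id": 3, "plan": "pro", "billed_on": date(2025, 1, 1)},
]
crm_plans = {1: "basic", 2: "pro", 3: "basic"}


def completeness(rows, required):
    """Count rows where any required field is missing."""
    return sum(1 for r in rows if any(r.get(f) is None for f in required))


def uniqueness(rows, key_fields):
    """Count rows beyond the first occurrence of each key combination."""
    keys = [tuple(r[f] for f in key_fields) for r in rows]
    return len(keys) - len(set(keys))


def consistency(rows, key, field, reference):
    """Count rows whose non-missing value disagrees with the reference source."""
    return sum(
        1 for r in rows
        if r[field] is not None and reference.get(r[key]) != r[field]
    )


def timeliness(rows, field, start, end):
    """Count rows dated outside the requested period."""
    return sum(1 for r in rows if not (start <= r[field] <= end))


issues = {
    "completeness": completeness(billing, ["plan"]),
    "uniqueness": uniqueness(billing, ["customer_id", "billed_on"]),
    "consistency": consistency(billing, "customer_id", "plan", crm_plans),
    "timeliness": timeliness(billing, "billed_on", date(2026, 1, 1), date(2026, 3, 31)),
}
print(issues)
```

Checks this coarse match the high-level intent of the task: they flag whether a source is fit to proceed, and leave detailed analysis for later.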
Task 2.2.1 in depth: planning data collection
Planning is where the top-down discipline either happens or doesn't, so it deserves a worked example. Picture the subscription business from the previous domain: churn has risen from 3% to 5%, and the diagnostic research question is "Which factors — support-ticket volume, delivery delays, price changes, or feature usage — are most strongly associated with the rise in churn?"
A data collection plan for that question works through the Guide's considerations in order:
What data is needed. Derived from the question, not from the warehouse: support tickets, billing and price-change history, delivery records, product usage events, and cancellation reasons. Some of this is currently collected (whether used or not); cancellation reasons may be data that is not currently collected but would help answer the analytics problem.
Availability. Tickets live in the CRM, billing in the finance system, usage events in a data lake. Cancellation reasons don't exist yet — that calls for active collection, such as a short survey embedded into the cancellation workflow, the way the Guide suggests embedding point-of-sale surveys into business processes. Some support transcripts may be off-limits under privacy rules.
Historical data. The question compares two quarters against a baseline, so at least four quarters of history are required.
When and how collected. Usage and billing arrive via monthly batch; the cancellation survey collects continuously. Where the business needs data more frequently than it is currently collected, the Guide requires an assessment of the costs to obtain it at a more regular interval.
How it will be validated. Defined now, not after the fact — which acceptance criteria and quality checks will apply once data lands.
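The considerations above can be captured as a small structured plan, sketched here in Python. The `PlannedSource` class, its field names, and the validation checks assigned to each source are hypothetical; only the systems and collection frequencies come from the worked example, and the ticket refresh frequency is not stated there.

```python
from dataclasses import dataclass, field
from typing import List, Optional

HISTORY_QUARTERS = 4  # from the worked example: at least four quarters


@dataclass
class PlannedSource:
    """One entry in a data collection plan (field names are illustrative)."""
    name: str
    location: Optional[str]  # system holding the data; None if not yet collected
    collection: str          # "passive" or "active"
    frequency: str
    validation: List[str] = field(default_factory=list)  # hypothetical checks


plan = [
    PlannedSource("support tickets", "CRM", "passive", "not stated", ["completeness"]),
    PlannedSource("billing and price changes", "finance system", "passive",
                  "monthly batch", ["accuracy", "uniqueness"]),
    PlannedSource("product usage events", "data lake", "passive", "monthly batch"),
    PlannedSource("cancellation reasons", None, "active", "continuous", ["completeness"]),
]


def gaps(plan):
    """Sources the question needs but that are not collected anywhere yet."""
    return [s.name for s in plan if s.location is None]


def missing_validation(plan):
    """Sources with no validation approach defined up front."""
    return [s.name for s in plan if not s.validation]


print("Gaps:", gaps(plan))
print("No validation defined:", missing_validation(plan))
```

Writing the plan down in this form makes two review questions mechanical: which data does not exist yet, and which sources still lack agreed validation checks.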
Two more planning calls matter. First, best source selection: if churn-relevant data exists both in the centrally managed data warehouse and in a peripheral secondary source where it has already been manipulated, a direct pull plus your own manipulation may cost more — but is acceptable if the secondary source's quality is questionable. Second, the structured/unstructured split: billing rows are structured (queryable via SQL in a DBMS); support transcripts are unstructured (text, emails, audio) and demand significantly more work — use them only if the team has the tools, experience, and skills. Non-functional requirements — privacy, security, retention, volume, timing, integration, and frequency, plus constraints from existing service level agreements — get written into the plan, not discovered later.
Finally, the plan is not yours to approve. Stakeholders who are impacted or own the data review it with the analytics team, and the analyst facilitates the team to consensus to obtain approval — the same facilitator stance as in the previous domain.
The Source Data toolkit (2.2.5)
The Guide names a working set of techniques for this domain: Acceptance and Evaluation Criteria (define what "right data" means from business and technology perspectives), Data Dictionary (consistent data labels across the initiative), Data Flow Diagrams (conceptual/logical view of data collected and stored), Data Modelling (data elements and interrelationships in conceptual, logical, and physical form), Document Analysis (gather information about internal source systems), Interface Analysis (how data is captured versus how it is stored), Non-Functional Requirements Analysis (privacy, volume, frequency, retention, integrity), Survey or Questionnaire (active collection of data not readily available), Data Mapping (traceability between data elements and sources — owner, availability, frequency, constraints, transformations), and ETL and Data Management Techniques (extract and curate data without compromising ongoing business operations).
A data sourcing checklist
Before declaring a dataset ready for analysis, run it through this:
Traced to a research question — the data was chosen top-down from the problem context, not because it was convenient.
Best source selected — granularity, quality, and cost compared across candidate sources.
Gaps identified — missing or non-existent data is named, with a plan (collect actively, or drop the question).
Five Vs assessed — volume, velocity, variety, veracity, value all considered.
NFRs and SLAs cleared — privacy or security considerations can deem a dataset unfit for use.
Lineage understood — origin, transformations, and final storage are known.
Business validation done — stakeholders with authority approved the sources and set acceptance criteria.
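One way to keep the checklist honest is to treat it as a gate, as in this minimal Python sketch. The item keys and the `outstanding` helper are hypothetical names chosen to mirror the seven items above.

```python
# Hypothetical keys mirroring the seven checklist items above.
CHECKLIST = [
    "traced_to_research_question",
    "best_source_selected",
    "gaps_identified",
    "five_vs_assessed",
    "nfrs_and_slas_cleared",
    "lineage_understood",
    "business_validation_done",
]


def outstanding(status):
    """Return checklist items not yet confirmed; unknown items count as open."""
    return [item for item in CHECKLIST if not status.get(item, False)]


# Example status for one dataset: two items are still open.
status = {item: True for item in CHECKLIST}
status["lineage_understood"] = False
status["business_validation_done"] = False

open_items = outstanding(status)
print("Ready for analysis:", not open_items)
print("Still open:", open_items)
```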
The traps examiners love
Top-down, always. When a stem offers "start with the data already in the warehouse," the correct answer starts instead from the problem context and the research question.
BA vs data scientist. The business analysis professional's distinctive contribution is knowing what variables mean to the business — answers that reduce the analyst to a data extractor are wrong.
Passive vs active. Point-of-sale and web data = passive; surveys and self-reports = active. Expect to classify a scenario instantly — and remember to pilot before full-scale collection.
Five Vs vs five quality characteristics. The five Vs (volume, velocity, variety, veracity, value) select datasets in Determine the Data Sets; accuracy/completeness/consistency/uniqueness/timeliness assess quality in Validate Data. Questions love to swap them.
Two validations. Business validation = stakeholders approve sources and set acceptance criteria; technical validation = testing data quality. Validation at this stage is high-level.
Dropping a question is a valid answer. If the data is too expensive to obtain, the Guide says the research question may be dropped — exam options that admit this are often correct.
Key takeaways
Source Data is ~15% of the exam and is a top-down exercise: problem context first, data second.
Walk the four tasks in order — Plan Data Collection, Determine the Data Sets, Collect Data, Validate Data.
A data collection plan covers what data, availability, historical needs, when/how to collect, and how to validate — and is approved by stakeholder consensus, facilitated by the analyst.
Use the five Vs to choose datasets; use accuracy, completeness, consistency, uniqueness, and timeliness to validate them.
Know passive vs active collection, data lineage, and the business/technical validation split cold — they are the domain's most testable distinctions.
Source Data rewards the analyst who treats data as a means to answer a framed question, not as a starting point. Get the plan right, validate before you analyse, and the downstream analysis domains become dramatically easier.