Data Lineage Tracking for AI: Complete Guide | Quiz
By Eyal Doron / December 6, 2025 / 1 minute of reading

1. What does the EU AI Act require regarding training data according to the article?

   1. Only the model output needs to be documented
   2. Training data documentation for high-risk systems and demonstrable traceability requirements
   3. No documentation is required for any AI systems
   4. Documentation is optional for all risk levels

Why: The EU AI Act requires training data documentation for high-risk AI systems, showing what data trained the model and what its characteristics are, and it also sets traceability requirements.
Context: Lineage is the technical foundation for meeting these regulatory requirements.
Remember: Document training data and demonstrate traceability.

2. What Quick Win does the article recommend for starting lineage tracking?

   1. Document all lineage manually in spreadsheets
   2. Purchase an enterprise lineage platform immediately
   3. Mandate DVC or MLflow to version control the training dataset and model file for the highest-risk AI system
   4. Hire a dedicated lineage team

Why: The article recommends mandating DVC or MLflow to version control the training dataset and model file for your highest-risk AI system this week.
Context: This creates the essential model-to-data linkage required for basic compliance.
Remember: Version control the training data and model for the highest-risk system.

3. What tools does the article recommend for different lineage roles?

   1. Any database system works equally well
   2. Only spreadsheets and manual documentation
   3. MLflow for experiment tracking, DVC for data versioning, Apache Atlas for enterprise lineage
   4. Custom tools must be built from scratch

Why: The article recommends MLflow for experiment tracking and model-to-data linkage, DVC for dataset version control, and Apache Atlas for enterprise lineage and regulatory audits.
Context: Integrating these tools is one practical approach to implementing lineage.
Remember: MLflow for experiments, DVC for data, Atlas for enterprise.

4. Why is transformation code versioning essential according to the article?

   1. It reduces storage costs
   2. It is only needed for compliance audits
   3. Capturing the Git hash lets you know exactly which code version processed the data
   4. It makes the code run faster

Why: Capturing the Git hash of the cleaning script lets you know exactly which code version processed the data, which enables reproducibility.
Context: This is part of documenting every transformation applied to raw data during preparation.
Remember: Git hash equals reproducible transformations.

5. What metadata should be captured during the data collection stage?

   1. Only metadata required by the AI model
   2. Source system identification, collection timestamps, consent and permission metadata
   3. Only the file size and format
   4. Just the database connection string

Why: The article specifies capturing source system identification (which database or API), collection timestamps (when the data was extracted), and consent and permission metadata (the legal basis for use).
Context: This metadata becomes critical for GDPR compliance.
Remember: Source, Timestamp, Consent.

6. Why does feature engineering obscure data origins according to the article?

   1. Features are stored in different databases than source data
   2. Feature engineering deletes the original data
   3. Derived features like ratios and aggregations create indirect connections to dozens of underlying data points
   4. Engineering transforms data into unreadable formats

Why: When you derive new features such as ratios, aggregations, and embeddings, the connection to the original data becomes indirect: a customer_risk_score might derive from dozens of underlying data points.
Context: This is one of several factors that make AI lineage harder than traditional data lineage.
Remember: Derived features hide their sources.

7. What are the six components of AI lineage described in the article?

   1. Collection, Validation, Training, Testing, Production, Retirement
   2. Source, Transformation, Model, Deployment, Inference, Governance
   3. Authentication, Authorization, Encryption, Logging, Monitoring, Alerting
   4. Input, Processing, Output, Storage, Backup, Archive

Why: The article identifies source lineage, transformation lineage, model lineage, deployment lineage, inference lineage, and governance lineage as the six interconnected elements.
Context: Together these create the end-to-end chain of custody for AI systems.
Remember: Source, Transform, Model, Deploy, Infer, Govern.

8. What is the difference between forward and backward lineage?

   1. Forward is for new data while backward is for historical data
   2. Forward traces source to output while backward traces output to source
   3. Forward is automatic while backward requires manual effort
   4. Forward is for training while backward is for inference only

Why: Forward lineage traces data from source to output, answering "what happened to this data?", while backward lineage traces from output to source, answering "where did this prediction come from?".
Context: Both directions matter: forward supports compliance and auditing, while backward enables debugging and explanation.
Remember: Forward equals source to output; backward equals output to source.
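The forward and backward directions from question 8 can be pictured as two traversals of the same lineage graph. The following is a minimal sketch, not the article's implementation: the artifact names, the `EDGES` list, and the helper functions are all hypothetical and exist only for illustration. A real deployment would read these edges from a lineage store such as the tools named in question 3.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream artifact, downstream artifact).
# Every name below is invented for illustration.
EDGES = [
    ("crm_db.customers", "cleaned_customers.parquet"),
    ("payments_api.transactions", "cleaned_transactions.parquet"),
    ("cleaned_customers.parquet", "features_v3"),
    ("cleaned_transactions.parquet", "features_v3"),
    ("features_v3", "risk_model_v7"),
    ("risk_model_v7", "prediction_0001"),
]


def build_index(edges):
    """Index the edges in both directions so either trace is a simple lookup."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def trace(start, neighbours):
    """Breadth-first walk that returns every artifact reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


DOWNSTREAM, UPSTREAM = build_index(EDGES)


def forward_lineage(source):
    """Source to output: what happened to this data?"""
    return trace(source, DOWNSTREAM)


def backward_lineage(output):
    """Output to source: where did this prediction come from?"""
    return trace(output, UPSTREAM)
```

Calling `forward_lineage("crm_db.customers")` lists everything derived from that source, which is the compliance and auditing view. Calling `backward_lineage("prediction_0001")` lists every artifact that contributed to that prediction, which is the debugging and explanation view.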