Matias Carrasco Kind

University of Illinois, Urbana-Champaign

[intermediate] Anomaly Detection

Summary

In the age of big data and high-volume information, anomaly detection has many areas of application, including network security, financial data, medical data analysis, and the discovery of celestial events in astronomical surveys, among many others. The need for reliable and efficient algorithms is widespread, and many techniques have been developed over the years to address it, covering multivariate data and, more recently, streaming data, where updates must be made on data with missing variables. Anomalous data can have as much scientific value as normal data, or in some cases even more, so it is vitally important to have robust, fast, and reliable algorithms to detect and flag such anomalies. We will discuss different algorithms for identifying anomalies in all kinds of data, including multi-dimensional and time-series data, with a deep dive into the fundamentals and tips for choosing the algorithm best suited to each situation, including the ones we have developed. Code, data, and Python examples will be provided.

Syllabus

Introduction to anomalies and outlier detection
Deep dive into the statistical and probability framework for anomalies
Machine learning algorithms for anomaly detection: supervised, unsupervised, and parameter-free
Deep learning algorithms for anomaly detection, including VAEs, GANs, and others
Outlier detection for time series data
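
As a small taste of the statistical material listed above, the following is a minimal sketch of one classical outlier test, the modified z-score based on the median absolute deviation (MAD). It is not taken from the course materials: the function name `mad_outliers`, the cutoff of 3.5 (a commonly used convention), and the sample readings are all assumptions made for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of points whose modified z-score exceeds `threshold`.

    Hypothetical helper for illustration; the threshold is an assumed default.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # Simplification: with zero spread the score is undefined, so flag nothing.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are normally distributed.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Made-up sample: six similar readings and one obvious anomaly at index 5.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
flagged = mad_outliers(readings)
print(flagged)
```

Because the median and the MAD are barely affected by a single extreme value, this test is less easily masked by the anomaly itself than a mean and standard deviation rule would be. It only handles one-dimensional data, which is one reason the multivariate and time-series methods in the syllabus are needed.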

References

Zimek, Arthur; Schubert, Erich (2017), “Outlier Detection”, Encyclopedia of Database Systems, Springer New York, pp. 1–5.
Hodge, V. J.; Austin, J. (2004). “A Survey of Outlier Detection Methodologies”. Artificial Intelligence Review. 22 (2): 85–126.
S. Hariri, M. Carrasco Kind and R. J. Brunner, “Extended Isolation Forest,” in IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 4, pp. 1479-1489, 1 April 2021, doi: 10.1109/TKDE.2019.2947676.
Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

Pre-requisites

General knowledge of statistical learning. Basic knowledge of probability and statistics. Basic knowledge of linear algebra.

Short bio

Matias Carrasco Kind is currently the Director of the Data Science Research Services at the Gies College of Business at the University of Illinois at Urbana-Champaign in the U.S., where he is also a faculty member in Accountancy, Astronomy, and at the National Center for Supercomputing Applications.

He is interested in challenging problems involving data-intensive science, machine and deep learning, data mining, data analysis and visualization, image processing, generative AI models, scientific platforms and cyberinfrastructure, data management, software engineering, and scientific cloud computing, among others. Most of his research has focused on astrophysics, but given the multidisciplinary nature of his work and the data needs and tools shared across many fields, he has also applied these techniques to earth sciences, bio-imaging, veterinary science, agricultural economics, finance research, and accounting.

Matias obtained his PhD in Astronomy, with a Computational Science and Engineering option, at the University of Illinois. His doctoral work focused on machine learning techniques applied to astronomy at large scales.
