Helpful Info for AI/ML Projects

Below is a list of helpful resources for AI/ML projects. Whether you are starting from the ground up and training your own models, or using a pre-configured solution, these resources will help you get started.

Commonly Used AI/ML Frameworks and Libraries

TensorFlow

TensorFlow is an open-source software library for machine learning across a range of tasks. Google developed it to meet its need for systems that can build and train neural networks to detect and decipher patterns and correlations, analogous to the learning and reasoning that humans use. It is used for both research and production in Google products, including speech recognition, Gmail, Google Photos, and search, many of which previously relied on standard pattern recognition algorithms.

PyTorch

PyTorch is an open-source machine learning library based on the Torch library and used for applications such as computer vision and natural language processing. It was originally developed by Facebook's AI Research lab (FAIR), now part of Meta AI. It is free and open-source software released under the Modified BSD license. Although the Python interface is more polished and is the primary focus of development, PyTorch also has a C++ interface.

Keras

Keras is an open-source software library that provides a Python interface for artificial neural networks. It is best known as a high-level interface for the TensorFlow library, and Keras 3 can also run on top of JAX and PyTorch. Keras was developed to let deep learning engineers build and experiment with different models quickly. It sits at a higher level of abstraction than TensorFlow and provides additional abstractions for defining and training models. It was developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System), and its primary author and maintainer is François Chollet, a Google engineer. Chollet is also the author of the Xception deep neural network model.

Scikit-learn

Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
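As a minimal sketch of how two of the algorithms named above are typically used, the example below fits a random forest classifier and a k-means clusterer. The choice of the bundled Iris dataset, the synthetic blobs, and all parameter values are assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris, make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Classification: train a random forest on the bundled Iris dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions

# Clustering: group synthetic 2-D points into three clusters with k-means.
points, _ = make_blobs(n_samples=150, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)  # one cluster index per point
```

Both estimators follow the same pattern: construct the object with its settings, then call fit (or fit_predict) on NumPy arrays.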

NumPy

NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
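The short sketch below illustrates the array support and mathematical functions described above. The array values are arbitrary and chosen only for illustration.

```python
import numpy as np

# A 2-D array (matrix) built from nested Python lists.
a = np.array([[1.0, 2.0], [3.0, 4.0]])

# Arithmetic is applied element by element, without explicit loops.
squared = a ** 2

# High-level functions can reduce along a chosen axis (here: per column).
column_means = a.mean(axis=0)

# The @ operator performs matrix multiplication.
product = a @ a
```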

Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.
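To make the idea of tabular and time-indexed data concrete, here is a minimal sketch that uses a small, made-up table of store sales. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical panel-style data: two stores observed on two dates.
df = pd.DataFrame(
    {
        "store": ["A", "A", "B", "B"],
        "date": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"]
        ),
        "sales": [10, 12, 7, 9],
    }
)

# Aggregate the numerical column per store.
total_by_store = df.groupby("store")["sales"].sum()

# Aggregate per date to obtain a small time series.
daily_total = df.groupby("date")["sales"].sum()
```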

Matplotlib

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits such as Tkinter, wxPython, Qt, or GTK. There is also a procedural "pylab" interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB, though its use is discouraged. Matplotlib is commonly used alongside SciPy.
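The sketch below shows the object-oriented style mentioned above, in which a Figure and an Axes object are created and then modified through their methods. The non-interactive "Agg" backend and the in-memory buffer are assumptions made so that the example runs without a display or a GUI toolkit.

```python
import io

import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen, no GUI toolkit needed

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)

# Create a figure and one set of axes, then draw on the axes object.
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# Render the figure as PNG bytes into an in-memory buffer.
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
```

In an application, the same fig object could instead be handed to a GUI toolkit canvas or saved to a file path.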

SciPy

SciPy is a free and open-source Python library used for scientific and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering.
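As a small illustration of two of the modules listed above, the sketch below uses scipy.integrate for numerical integration and scipy.optimize for minimization. The functions being integrated and minimized are arbitrary examples.

```python
from scipy import integrate, optimize

# Numerically integrate x^2 over [0, 1]; the exact answer is 1/3.
# quad returns the estimate and an upper bound on its absolute error.
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 1.0)

# Find the minimum of (x - 2)^2, which lies at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)
minimum_x = result.x
```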

OpenCV

OpenCV is a library of programming functions mainly aimed at real-time computer vision. Originally developed by Intel, it was later supported by Willow Garage and then Itseez (which Intel later acquired). The library is cross-platform and free to use under an open-source license: the 3-clause BSD license for versions up to 4.4.0, and the Apache 2 license from version 4.5.0 onward.

Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

AI/ML Datasets

Kaggle

Kaggle is an online community of data scientists and machine learning practitioners, owned by Google LLC. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. Kaggle got its start by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and short-form AI education.

UCI Machine Learning Repository

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that the machine learning community uses for the empirical analysis of machine learning algorithms. The archive was created as an FTP archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since then, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets. As an indication of the impact of the archive, it has been cited over 1000 times, making it one of the top 100 most cited "papers" in all of computer science. The current version of the website was designed in 2007 by Arthur Asuncion and David Newman, and was developed by Arthur Asuncion, UC Irvine.

Google Dataset Search

Google Dataset Search is a search engine from Google that helps researchers locate online data that is freely available for use. The company launched the service on September 5, 2018, and stated that the product was targeted at scientists and data journalists. The service indexes data from government agency databases, public sources, and digital libraries. It was inspired by the Fake News Challenge and the work of Altmetric.

AWS Public Datasets

AWS hosts a variety of public datasets that anyone can access for free. This includes datasets from the U.S. Census Bureau, NASA, NOAA, and many other organizations and companies.

Microsoft Research Open Data

Microsoft Research Open Data is a collection of free datasets from Microsoft Research, intended to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences.

Stanford Large Network Dataset Collection

The Stanford Large Network Dataset Collection (SNAP) is a collection of datasets from a variety of domains and disciplines. SNAP is designed to facilitate empirical research in network science and network mining. SNAP is being developed by Jure Leskovec and collaborators at Stanford University, with the help of many contributors.

Google Cloud Public Datasets

Google Cloud Public Datasets provide a playground for those new to big data and data analysis, and offer a repository of more than 100 public datasets from different industries that you can join with your own data to produce new insights. Google Cloud Public Datasets are hosted on Google Cloud Storage and can be accessed by anyone.

Data.gov

Data.gov is a U.S. government website launched in late May 2009 by the then Federal Chief Information Officer (CIO) of the United States, Vivek Kundra. According to its website, the purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government. The site is a repository for federal, state, local, and tribal government information, made available to the public. Its datasets are available in nine languages, including English, French, German, and Spanish. It contains data from a range of federal agencies, covering agriculture, business, climate, consumer, ecosystems, education, energy, finance, health, local government, manufacturing, ocean, public safety, and science and research.

Datahub

Datahub is a community-run catalogue of useful sets of data on the Internet. You can collect links here to data from around the web for yourself and others to use, or search for data that others have collected.

Data.world

Data.world is a social network for data people. It's a platform for data scientists and analysts to find and share data, connect with other users, and work together to solve data problems.
