IBM is using its 2021 Think Conference to tout its advances in artificial intelligence (AI), including Project CodeNet.
One of the biggest challenges many companies face is translating existing codebases into another language. AI promises to help alleviate that problem, but models require extensive training to translate properly from one programming language to another.
IBM Research has released Project CodeNet, a dataset aimed at training AI models in source-to-source translation.
The dataset consists of some 14 million code samples and about 500 million lines of code in more than 55 programming languages, ranging from modern ones such as C++, Java, Python, and Go to legacy languages such as COBOL, Pascal, and FORTRAN.
IBM says Project CodeNet is “the largest, most differentiated dataset in its class and addresses three main use cases in coding today: code search (automatically translating one code into another, including legacy languages like COBOL); code similarity (identifying overlaps and similarities among different codes); and code constraints (customizing constraints based on a developer’s specific needs and parameters).”
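To give a rough sense of what "code similarity" can mean in practice, here is a minimal Python sketch that scores two snippets by the overlap of their tokens (Jaccard similarity). This is an illustration only, not IBM's method or anything drawn from Project CodeNet; the function names and sample snippets are hypothetical, and models trained on a dataset like this would be expected to capture far more than surface token overlap.

```python
import re


def tokenize(source):
    """Split source text into a set of identifiers, numbers, and symbols."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source))


def jaccard_similarity(a, b):
    """Return the share of distinct tokens that two snippets have in common."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty snippets are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical samples: the same small function in two languages,
# plus an unrelated snippet for contrast.
python_sample = "def add(a, b):\n    return a + b"
java_sample = "int add(int a, int b) { return a + b; }"
unrelated = "print('hello world')"

if __name__ == "__main__":
    print(jaccard_similarity(python_sample, java_sample))
    print(jaccard_similarity(python_sample, unrelated))
```

Even this crude measure rates the Python and Java versions of the same function as closer to each other than to the unrelated snippet, which hints at why a large, multi-language corpus is useful for training systems that compare code.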
The company believes Project CodeNet will help revolutionize source-to-source language translation, and could be a vital resource for companies that need to move legacy codebases to modern languages.