A dataset of OCaml's Opam
If you have tried to use locally-hostable language models to develop OCaml code, you will have noticed that their performance on more niche languages lags significantly behind Python or JavaScript. Jon Ludlam, Anil Madhavapeddy and I have been doing some work on this recently, and there will be more on that soon.
To improve code models, we first need data. To help with that, I've created opam-archive-dataset, which periodically takes the code for all packages from the ocaml/opam:archive Docker image, filters for the most recent version of each package, and then converts everything into the columnar Parquet format. This is a very efficient format and results in a ~800MB set of files.
To use the dataset and run queries over it, you can use the Hugging Face datasets library or, if you prefer SQL, you can do the following:
# clone the dataset from huggingface
sadiq@server:opam-archive$ git clone https://huggingface.co/datasets/sadiqj/opam-archive-dataset
Cloning into 'opam-archive-dataset'...
remote: Enumerating objects: 17, done.
remote: Total 17 (delta 0), reused 0 (delta 0), pack-reused 17 (from 1)
Unpacking objects: 100% (17/17), 4.31 KiB | 315.00 KiB/s, done.
Filtering content: 100% (3/3), 388.79 MiB | 14.30 MiB/s, done.
# grab clickhouse
sadiq@server:opam-archive$ curl https://clickhouse.com/ | sh
Successfully downloaded the ClickHouse binary, you can run it as:
./clickhouse
You can also install it:
sudo ./clickhouse install
# we do not need to install it! We use clickhouse local
sadiq@server:opam-archive$ ./clickhouse local
./clickhouse local
ClickHouse local version 25.5.1.1804 (official build).
:) -- let's have a look at a few rows
SELECT * FROM file('opam-archive-dataset/data/', Parquet) LIMIT 1;
Query id: 0f786705-1568-40ac-837b-004457c3519d
Row 1:
──────
package_name: dune-action-plugin
version: 3.18.1
license: MIT
homepage: https://github.com/ocaml/dune
dev_repo: git+https://github.com/ocaml/dune.git
file_type: dune
file_path: dune-3.18.1/test/blackbox-tests/test-cases/formatting/feature.t/enabled/dune-ocaml-syntax/dune
file_contents: (* -*- tuareg -*- *)
let
() =
Jbuild_plugin.V1.send {|
(alias
(name runtest)
(action (echo "ocaml syntax")))
|}
:) -- Let's count how many rows we have
SELECT COUNT(*) FROM file('opam-archive-dataset/data/', Parquet);
SELECT COUNT(*)
FROM file('opam-archive-dataset/data/', Parquet)
Query id: 3ee6eb4b-13b7-47aa-be67-d027c81b47b0
┌─COUNT()─┐
1. │ 198862 │
└─────────┘
1 row in set. Elapsed: 0.013 sec.
:) -- How many unique packages are spawning Domains?
SELECT COUNT(DISTINCT package_name) FROM file('opam-archive-dataset/data/', Parquet) WHERE position('Domain.spawn', file_contents) > 0;
SELECT COUNTDistinct(package_name)
FROM file('opam-archive-dataset/data/', Parquet)
WHERE position('Domain.spawn', file_contents) > 0
Query id: 6f0978d9-3907-4572-bf5e-99aa4e2fceb8
┌─COUNTDistinct(package_name)─┐
1. │ 193 │
└─────────────────────────────┘
1 row in set. Elapsed: 0.723 sec. Processed 197.86 thousand rows, 402.85 MB (273.81 thousand rows/s., 557.48 MB/s.)
Peak memory usage: 385.88 MiB.
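If you would rather stay in Python, the Domain.spawn query above can be sketched as a plain function over rows. This is a minimal sketch, not part of the dataset tooling: the function name `count_packages_containing` and the sample rows are made up for illustration, and the commented-out loading call assumes the repository loads with the Hugging Face datasets library's default configuration and a `train` split.

```python
from typing import Iterable, Mapping


def count_packages_containing(rows: Iterable[Mapping[str, str]], needle: str) -> int:
    """Count distinct package_name values whose file_contents contain needle."""
    packages = set()
    for row in rows:
        # Treat a missing or empty file_contents as an empty string.
        if needle in (row.get("file_contents") or ""):
            packages.add(row["package_name"])
    return len(packages)


# Tiny in-memory stand-in for the dataset (made-up rows, not real data).
sample_rows = [
    {"package_name": "pkg-a", "file_contents": "let d = Domain.spawn (fun () -> 1)"},
    {"package_name": "pkg-a", "file_contents": "let () = Domain.join d |> ignore"},
    {"package_name": "pkg-b", "file_contents": "let () = print_endline \"hi\""},
]

print(count_packages_containing(sample_rows, "Domain.spawn"))

# With the real data (assumption: default config and a "train" split):
#   from datasets import load_dataset
#   rows = load_dataset("sadiqj/opam-archive-dataset", split="train", streaming=True)
#   print(count_packages_containing(rows, "Domain.spawn"))
```

Like the SQL version, this is a plain substring match, so it will also count packages that only mention Domain.spawn in a comment or a string.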
We currently extract the package name, version, license, homepage, dev repo, file type (dune, opam, ml, mli, c and h), file path and the file contents.
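To give a feel for working with these fields, here is a small sketch that tallies files by their `file_type` column. The helper name `files_per_type` and the example rows are hypothetical, and it assumes each row exposes `file_type` as a plain string, as the sample row earlier suggests.

```python
from collections import Counter
from typing import Iterable, Mapping


def files_per_type(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Tally how many rows (files) there are for each file_type value."""
    return Counter(row["file_type"] for row in rows)


# Made-up rows using the column names from the dataset.
example_rows = [
    {"package_name": "pkg-a", "file_type": "ml", "file_path": "pkg-a/src/main.ml"},
    {"package_name": "pkg-a", "file_type": "mli", "file_path": "pkg-a/src/main.mli"},
    {"package_name": "pkg-b", "file_type": "ml", "file_path": "pkg-b/lib/b.ml"},
    {"package_name": "pkg-b", "file_type": "dune", "file_path": "pkg-b/lib/dune"},
]

# Print the file types from most to least common.
for file_type, count in files_per_type(example_rows).most_common():
    print(file_type, count)
```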
If there are any extra fields that would be useful, let me know. Enjoy!