Here's a sneak peek of the final product:
The Challenge
Create an application that tracks a user's face/body and makes some sort of sound (music) in response. Or, put another way: track some sort of motion on the screen using a webcam and translate it into some sort of audio.
This was a coding challenge presented by Colt Steele, one of my favorite online instructors. I figured this would be a unique project to try and complete, so I decided to give it a whirl. In this article I'll describe how I accomplished the task and the things I learned along the way.
Check out the GitHub repo and play around with the final product here.
handtrack.js Library
All of the magic for this project is provided by the handtrack.js library.
According to their docs, handtrack.js is:
A library for prototyping real-time handtracking in the browser. Handtrack.js lets you track hand position in an image or video element, right in the browser.
It makes use of TensorFlow, a machine learning platform, for hand detection. There are a lot of other technologies used to make this library possible, so if you're interested in how handtrack.js was built, you can read this article to learn more.
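To give a sense of the library before we get to the project code, here's a bare-bones sketch of its core flow. The method names (startVideo, load, detect) come from the handtrack.js docs; everything else is just illustrative:

// A bare-bones sketch of the core handtrack.js flow (method names per the library docs)
const videoEl = document.querySelector("video");

handTrack.startVideo(videoEl).then((status) => {
  if (!status) {
    console.log("Please enable video");
    return;
  }
  handTrack.load().then((model) => {
    // detect() resolves with an array of predictions (bounding box + confidence score)
    model.detect(videoEl).then((predictions) => console.log(predictions));
  });
});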
The HTML
The HTML for this project is pretty simple and includes only two files. The index.html is a landing page with a big red button prompting you to enable your video camera before entering the main page. I wanted to add this as a courtesy to the user so that they have a heads-up that the next page will ask for permission to use their camera.
The second file uses a CDN to load in the handtrack.js model. In this file, there are three elements that will play a key role in tying everything together (a sketch of the markup follows the list):
- <video> - This element will have the webcam streamed through it in real-time.
- <audio> - This element is used to play the snippets of music.
- <canvas> - This element is used to draw the music genre labels to the screen. It's also used to display the blue borders around the hand.
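Here's a stripped-down sketch of what that markup might look like. The canvas size and the app.js file name are placeholders of mine, and the script URL is the CDN build referenced in the handtrack.js docs:

<video></video> <!-- hidden webcam feed the model reads from -->
<audio></audio> <!-- plays the selected genre's snippet -->
<canvas width="640" height="480"></canvas> <!-- predictions and genre labels are drawn here -->
<div id="loading">Loading model...</div> <!-- removed by the JS once the model is ready -->

<script src="https://cdn.jsdelivr.net/npm/handtrackjs/dist/handtrack.min.js"></script>
<script src="app.js"></script>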
The CSS
I have quite a few classes defined within the CSS file, but the main ones to point out are the styles applied to the video and canvas elements.
There's a display: none being applied to the video element because the library needs access to the video element to register the hand movement; however, it uses the canvas element to render (display) the hand-tracking predictions onto the screen.
To add a little extra flair ✨ (and thanks to a great suggestion from my husband), I have grayscale, sepia, and rocknroll classes which apply their respective CSS filters based on the genre of music the hand hovers over.
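As a rough sketch (the filter values below are my own guesses, not the exact ones from the repo):

video {
  display: none; /* the library still reads frames from the hidden element */
}

/* genre classes toggled by applyFilter() in the JS -- exact values are guesses */
.grayscale {
  filter: grayscale(100%);
}

.sepia {
  filter: sepia(70%);
}

.rocknroll {
  filter: saturate(2.5) hue-rotate(45deg);
}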
The JavaScript - First Half
The first half of the code deals with the initial setup that ensures everything will run smoothly later on in the file. This portion of the file can be broken down into four main parts:
- Defining the modelParams object. This is used to configure things like how many hands we want to track and the minimum confidence threshold required before a detection is rendered as a hand.
- Using the handTrack API to load the model and start the video.
- Assigning the pertinent HTML elements to variables so that they can be referenced throughout the file to render predictions to the canvas, load the video, and play music.
- Defining a genres object where each genre contains two attributes:
  - filter: the name of the CSS class that applies the relevant CSS filter property.
  - source: the URL of the audio we want to be played.
const modelParams = {
  flipHorizontal: true, // flip e.g. for webcam video
  imageScaleFactor: 0.7, // reduce input image size for gains in speed
  maxNumBoxes: 1, // maximum number of boxes to detect
  iouThreshold: 0.5, // ioU threshold for non-max suppression
  scoreThreshold: 0.8, // confidence threshold for predictions
};
const genres = {
  classical: {
    filter: "sepia",
    source: "https://ccrma.stanford.edu/~jos/mp3/oboe-bassoon.mp3",
  },
  jazz: {
    filter: "grayscale",
    source: "https://ccrma.stanford.edu/~jos/mp3/JazzTrio.mp3",
  },
  rock: {
    filter: "rocknroll",
    source: "https://ccrma.stanford.edu/~jos/mp3/gtr-dist-jimi.mp3",
  },
};
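//Grab the video, audio, and canvas elements defined in the HTML (one of each on the page)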
const video = document.getElementsByTagName("video")[0];
const audio = document.getElementsByTagName("audio")[0];
const canvas = document.getElementsByTagName("canvas")[0];
const context = canvas.getContext("2d");
let model;
function loadModel() {
  handTrack.load().then((_model) => {
    // Initial interface after model load.
    // Store model in global model variable
    model = _model;
    model.setModelParameters(modelParams);
    runDetection();
    document.getElementById("loading").remove();
  });
}
// Returns a promise
handTrack.startVideo(video).then(function (status) {
  if (status) {
    loadModel();
  } else {
    console.log("Please enable video");
  }
});
The JavaScript - Second Half
The second half of the code deals more with the actual functionality. This part of the code is broken down into three functions:
- applyFilter - This is a simple helper method that takes in the class name of a filter, removes the filter currently on the canvas, and adds the new one.
- drawText - This is another helper method I created to assist with drawing the genre names onto the canvas.
- runDetection - This is the main method that ties everything together. The first line in the function calls model.detect, which is used to detect the hands. The detect method takes in the video element and returns an array of bounding boxes with confidence scores. This method also calls model.renderPredictions, which renders the hand predictions onto the canvas.
- After that, I draw the genres to the canvas and call requestAnimationFrame to continually update the browser with the animations. Last but not least, I take the predictions (the array of results from the detect() method) and, based on where the hand predictions land on the x-axis and y-axis of the canvas, determine whether the hand is hovering over a certain genre. If the predictions fall within the range where a genre's text was drawn on the canvas, the code plays the appropriate music snippet and applies the corresponding CSS filter.
function applyFilter(filterType) {
  if (canvas.classList.length > 0) canvas.classList.remove(canvas.classList[0]);
  canvas.classList.add(filterType);
}
function drawText(text, x, y) {
  const color = "black";
  const font = "1.5rem Rammetto One";
  context.font = font;
  context.fillStyle = color;
  context.fillText(text, x, y);
}
function runDetection() {
  model.detect(video).then((predictions) => {
    //Render hand predictions to be displayed on the canvas
    model.renderPredictions(predictions, canvas, context, video);
    //Add genres to canvas
    drawText("Rock 🎸", 25, 50);
    drawText("Classical 🎻", 250, 50);
    drawText("Jazz 🎷", 525, 50);
    requestAnimationFrame(runDetection);
    if (predictions.length > 0) {
      //Each prediction's bbox is [x, y, width, height], measured from the canvas's top-left
      let x = predictions[0].bbox[0];
      let y = predictions[0].bbox[1];
      //Apply proper music source and filter based on hand position
      if (y <= 100) {
        if (x <= 150) { //Rock
          audio.src = genres.rock.source;
          applyFilter(genres.rock.filter);
        } else if (x >= 250 && x <= 350) { //Classical
          audio.src = genres.classical.source;
          applyFilter(genres.classical.filter);
        } else if (x >= 450) { //Jazz
          audio.src = genres.jazz.source;
          applyFilter(genres.jazz.filter);
        }
        audio.play(); //Play the sound
      }
    }
  });
}
El Fin
Check out the final product here or watch the silly demo below 👇.
I think it's super cool that a library like handtrack.js abstracts away enough of the complicated bits while still providing an intuitive interface. I hope this was a fun read and showcases the interesting things you can create with JavaScript!
If you enjoy what you read, feel free to like this article or subscribe to my newsletter, where I write about programming and productivity tips.
As always, thank you for reading and happy coding!