A look at continuous mathematical modeling choices for estimating the length of an assignment, in terms of both duration and number of questions.

Our previous blog post is a good prequel to this one. In a nutshell, our platform implements a mastery-based pedagogy algorithmically across a variety of US higher education courseware products.

Imagine you are an instructor setting up an online course in our adaptive platform. You start by creating an assignment and selecting the learning objectives (LOs) it should cover. Because US higher education follows a credit-based model, a typical credit-bearing course syllabus allots hours for class, homework, and assignments. As an instructor, you will want to know: how long will a student take to complete an assignment?

In an adaptive world, the instructor chooses the LOs, and then Knewton’s “adaptive recommendation engine” chooses the questions that students will answer in an assignment. The Knewton adaptive platform (which powers WileyPLUS and Alta adaptive assignments) continually assesses each student’s proficiency and current knowledge state, and tries to close any knowledge gaps by providing remediation content.

Given this scenario, the length of an assignment depends not only on the time a student spends per question, but also on the number of questions the student must answer before the adaptive platform is satisfied with the student’s proficiency on each learning objective. For some students, this will include the time required to practice and learn the material. So instructors need visibility into upcoming work in terms of both duration and the number of questions a student typically needs to answer to complete an assignment.

Several methodologies are widely used in our industry to answer this question. Instructors typically rely on their experience, or they estimate how long it would take them to complete the assignment themselves and multiply by some factor, or they survey students on how long it took and use that information when setting up future assignments.

But why not try to answer this question mathematically, with the data and expertise already in hand? Before we dive deep into our modeling approach, what did our initial exploratory data analysis (EDA) reveal about assignment structures?

Looking over historical data from Alta showed that an assignment can sometimes focus on just one LO, and sometimes include six or more different LOs. Similarly, the number of questions in an assignment varies based on a student’s proficiency, current knowledge state, and the content recommended by the platform. Our initial EDA revealed that a student can complete an assignment by attempting anywhere from 4 to 20 questions (or sometimes more).

Further slicing of the data reveals that learning objectives fall into different difficulty levels (easy, medium, and hard), which drives the number of questions to completion.

What kind of data do we have, and what are the challenges around it?

All historical interactions by students in Alta are captured for existing knowledge graphs and LOs. But in a learning environment the dynamics keep changing: a change to a course, an LO, or its prerequisites can affect how LOs behave. There is always a range of completion times for an LO, so we have to represent uncertainty explicitly. We also often have new content with no preexisting data, as well as changes to the learning environment or content representation, so we must be able to scale from good guesses to data-driven estimates as data arrives or changes. Finally, we need good guarantees and observability to ensure a good user experience.

Considering these challenges, our initial approach is to build a model that starts with smart defaults. As we receive more LO completion information, we keep updating our best estimate based on the new data. We opted to start with a prior belief about the estimate and use Bayesian updating to transform that prior into a posterior every time new data arrives. We chose a conjugate prior Bayesian model because of the computational speed and simplicity of its closed-form update equations.
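To make the idea concrete, here is a minimal sketch of conjugate Bayesian updating for a per-LO completion-time estimate. It assumes a Normal likelihood with known observation variance and a conjugate Normal prior on the mean; the distributions, parameter values, and the `NormalConjugateEstimator` class name are all illustrative choices for this post, not a description of our production model.

```python
import math

class NormalConjugateEstimator:
    """Closed-form Bayesian updating for a mean completion time.

    Illustrative model: Normal likelihood with known observation
    variance, conjugate Normal prior on the unknown mean.
    """

    def __init__(self, prior_mean, prior_var, obs_var):
        self.mean = prior_mean  # smart default, e.g. expert judgment
        self.var = prior_var    # uncertainty around that default
        self.obs_var = obs_var  # assumed-known noise in observed times

    def update(self, observed_minutes):
        # Conjugate update: posterior precision is the sum of the
        # precisions; posterior mean is the precision-weighted average.
        prior_precision = 1.0 / self.var
        obs_precision = 1.0 / self.obs_var
        post_precision = prior_precision + obs_precision
        self.mean = (prior_precision * self.mean
                     + obs_precision * observed_minutes) / post_precision
        self.var = 1.0 / post_precision

    def credible_interval(self, z=1.96):
        # Approximate 95% credible interval for the mean estimate.
        half = z * math.sqrt(self.var)
        return (self.mean - half, self.mean + half)


# Start from a smart default of 30 minutes per LO, fairly uncertain,
# then fold in observed completion times as they arrive.
est = NormalConjugateEstimator(prior_mean=30.0, prior_var=100.0, obs_var=64.0)
for minutes in [25.0, 40.0, 33.0]:
    est.update(minutes)
print(round(est.mean, 1), est.credible_interval())
```

Each update shifts the estimate toward the observed data and shrinks the posterior variance, which is exactly the "good guess that becomes data-driven" behavior described above, with the credible interval making the remaining uncertainty explicit.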