Abdullah Bin Faisal
Ph.D., Computer Science, Tufts University
Medford, MA
https://github.com/abdullahfsm
Publications
Skills
C, C++, Python, TensorFlow, Ray, NS2, LaTeX, Haskell
I am a sixth-year Ph.D. student in the computer science department at Tufts University. My interests lie at the intersection of systems and machine learning. I completed my BSc in Electrical Engineering at LUMS in 2017. I have been excited about research since my undergrad days.
Distributed training of deep neural networks (DNNs) is a resource-hungry process. In shared compute clusters, other (training) tasks compete for different resources (e.g., network, GPUs). The resources available to a particular DNN training job can therefore vary during its training cycle, leading to high and unpredictable training times. In this space, my work focuses on two ideas: (1) designing a GPU scheduling policy that leads to predictable completion times and (2) adapting the DNN hyperparameters (e.g., architecture) to varying resource availability. Realising these ideas involves addressing interesting challenges. For example, we find that predictable GPU scheduling policies can substantially compromise performance and fairness, and that adapting DNN hyperparameters during training can lead to a sudden dip in training accuracy.
This project is about designing a learning-based scheduling policy (2D) that is robust to changes in workloads (job size distributions). 2D combines principles from existing scheduling policies with learning to meet its objective of being tail-optimal in the face of changing workloads. (CoNEXT'18)
Duplication can help alleviate tail latency when one resource becomes a straggler. However, it can double the load on the system. In this work, we investigate making duplication safe, by using prioritization and purging, and easy to implement through a high-level interface. (CoNEXT'19)