EdStem EECS 101 Class Questions Chatbot
Fine-tuning LLaMA-2 on EdStem Question and Answer Data using LoRA
In this project, the team explored fine-tuning a Large Language Model (LLM), specifically LLaMA-2, to generate accurate responses to questions about EECS 101 at UC Berkeley. The fine-tuning process aimed to adapt the model to produce coherent, contextually relevant answers similar to those found on the EECS 101 EdStem forum.
Objectives and Hypotheses:
- Fine-tuning Feasibility: Assess whether fine-tuning LLaMA-2 can generate relevant responses to EECS 101 questions.
- LoRA Hyperparameters: Determine optimal LoRA (Low-Rank Adaptation) hyperparameters, focusing on different rank values (1, 4, 16, 32) to enhance computational efficiency without sacrificing performance.
- Pre-fine-tuning Effectiveness: Investigate if pre-fine-tuning on a larger, more general QA dataset (StackExchange) improves performance on the specialized EECS 101 dataset.
Methods:
- Datasets:
  - Stack Exchange Dataset: diverse QA pairs collected from Stack Exchange forums.
  - EdStem Dataset: EECS 101 QA pairs manually compiled from the EdStem forum and processed into a training-ready format for LLaMA-2.
- Base Model: The project utilized LLaMA-2-7b, a state-of-the-art LLM with 7 billion parameters.
- Fine-tuning Approach: LoRA was employed for parameter-efficient fine-tuning, training only small low-rank adapter matrices while the base model's weights stay frozen, which preserves performance with far fewer trainable parameters (see the sketch below).
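
To make the setup concrete, here is a minimal sketch of the data formatting and LoRA configuration, assuming the Hugging Face `transformers` and `peft` libraries. The prompt template, `lora_alpha`, dropout, and target modules are illustrative assumptions; only the rank values come from the report.

```python
# Minimal sketch of the fine-tuning setup (assumes Hugging Face
# `transformers` and `peft`). The prompt template and LoRA settings
# other than the rank are illustrative assumptions, not the project's
# exact choices.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

def format_example(question: str, answer: str) -> str:
    """Wrap a QA pair in a simple instruction-style prompt (illustrative)."""
    return f"### Question:\n{question}\n\n### Answer:\n{answer}"

lora_config = LoraConfig(
    r=16,                                 # rank; the report swept 1, 4, 16, 32
    lora_alpha=32,                        # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,                    # assumed value
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Because only the adapter matrices are updated, the memory and compute footprint is a small fraction of fully fine-tuning the 7-billion-parameter model.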
Experiments:
- Hyperparameter Sweep: Tested LoRA rank values of 1, 4, 16, and 32 to find the best trade-off between adapter capacity and performance (see the sketch after this list).
- Data Point Variation: Evaluated the impact of different amounts of EdStem training data on model performance.
- Pre-fine-tuning: Assessed the benefits of pre-fine-tuning on the Stack Exchange dataset before fine-tuning on EdStem data.
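
The rank sweep and the two-stage pre-fine-tuning experiment can be expressed schematically as follows. Here `train`, `evaluate`, and the dataset names are placeholders for the project's actual training loop and data loaders, not the team's exact code.

```python
# Schematic of the rank sweep and the two-stage fine-tuning experiment.
# `train`, `evaluate`, and the dataset objects are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-2-7b-hf"

def train(model, dataset):
    """Placeholder: one fine-tuning pass (e.g. via transformers.Trainer)."""

def evaluate(model, dataset):
    """Placeholder: score generations against held-out EdStem answers."""

# Experiment: sweep the LoRA rank over the values tested in the report.
for rank in (1, 4, 16, 32):
    model = get_peft_model(
        AutoModelForCausalLM.from_pretrained(BASE),
        LoraConfig(r=rank, task_type="CAUSAL_LM"),
    )
    train(model, edstem_train)    # fine-tune on EdStem QA pairs
    evaluate(model, edstem_eval)  # compare ranks on held-out questions

# Experiment: pre-fine-tune on Stack Exchange, then continue on EdStem.
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(BASE),
    LoraConfig(r=16, task_type="CAUSAL_LM"),
)
train(model, stackexchange_train)  # stage 1: broad, general QA
train(model, edstem_train)         # stage 2: specialized EECS 101 QA
```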
Results:
- The model fine-tuned on the EdStem dataset produced responses specific to EECS 101, but these often contained inaccuracies and logical errors that the un-fine-tuned LLaMA-2 avoided.
- A LoRA rank of 16 gave the best balance of qualitative and quantitative performance among the values tested.
- Performance improved as the number of EdStem training points increased, suggesting the model would benefit from a larger dataset.
- Pre-fine-tuning on Stack Exchange data slightly improved quantitative metrics, though the un-fine-tuned model often produced more logically consistent responses.
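
Qualitative results like these can be inspected by sampling generations from the fine-tuned model. Below is a minimal sketch assuming a saved `peft` adapter; the adapter path "edstem-lora-r16" and the example question are hypothetical.

```python
# Sketch of qualitative inspection: load a saved LoRA adapter and
# generate an answer. The adapter path and the question are
# hypothetical examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE), "edstem-lora-r16"
)

prompt = "### Question:\nHow do I declare the EECS major?\n\n### Answer:\n"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```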
Conclusion:
The project demonstrated that fine-tuning LLaMA-2 for a specific academic context is feasible, though challenges such as catastrophic forgetting and the limited size of the EdStem dataset were encountered. LoRA enabled efficient fine-tuning, and pre-fine-tuning on a broader dataset showed some benefit. Future work should focus on enlarging the training dataset and on more refined hyperparameter tuning to further improve the model's performance.