EdStem EECS 101 Class Questions Chatbot

Fine-tuning LLaMA-2 on EdStem Question and Answer Data using LoRA

In this project, the team explored fine-tuning a large language model (LLM), specifically LLaMA-2, to generate accurate responses to questions about EECS 101 at UC Berkeley. The fine-tuning aimed to adapt the model to produce coherent, contextually relevant answers similar to those found on the EECS 101 EdStem forum.

Objectives and Hypotheses:

  1. Fine-tuning Feasibility: Assess whether fine-tuning LLaMA-2 can generate relevant responses to EECS 101 questions.
  2. LoRA Hyperparameters: Determine optimal LoRA (Low-Rank Adaptation) hyperparameters, focusing on rank values of 1, 4, 16, and 32 to enhance computational efficiency without sacrificing performance (see the sketch after this list).
  3. Pre-fine-tuning Effectiveness: Investigate if pre-fine-tuning on a larger, more general QA dataset (StackExchange) improves performance on the specialized EECS 101 dataset.
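
For context, LoRA keeps the pretrained weights frozen and learns a low-rank update for each adapted weight matrix. For a frozen matrix $W_0 \in \mathbb{R}^{d \times k}$, the adapted forward pass is

$$h = W_0 x + \Delta W x = W_0 x + B A x, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k),$$

so each adapted matrix trains only $r(d + k)$ parameters instead of $dk$. The rank $r$ therefore directly controls the compute/quality tradeoff that objective 2 investigates.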

Methods:

  • Datasets:
    1. Stack Exchange Dataset: A diverse collection of QA pairs drawn from Stack Exchange forums, used for general-purpose pre-fine-tuning.
    2. EdStem Dataset: EECS 101 QA pairs manually compiled from the EdStem forum and processed into a format suitable for training LLaMA-2.
  • Base Model: The project utilized LLaMA-2-7b, a state-of-the-art LLM with 7 billion parameters.
  • Fine-tuning Approach: LoRA was employed for parameter-efficient fine-tuning, aiming to maintain performance while using far fewer computational resources; a minimal configuration sketch follows this list.
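
As a rough illustration of the setup (not the project's exact code), LoRA adapters can be attached to LLaMA-2-7b with the Hugging Face peft library. The hyperparameter values and target modules below are assumptions for the sketch:

```python
# Minimal sketch of the LoRA fine-tuning setup, assuming the Hugging Face
# transformers and peft libraries. Hyperparameter values and target modules
# are illustrative assumptions, not the project's exact configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; requires access approval
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

lora_config = LoraConfig(
    r=16,                                 # rank; the sweep covered 1, 4, 16, 32
    lora_alpha=32,                        # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,                    # assumed value
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of the 7B weights
```

Because only the adapter matrices receive gradients, optimizer state and gradient memory scale with the adapter size set by the rank, not with the full 7-billion-parameter model.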

Experiments:

  1. Hyperparameter Sweep: Tested LoRA rank values of 1, 4, 16, and 32 to find the optimal balance for fine-tuning (a rough cost comparison follows this list).
  2. Data Point Variation: Evaluated the impact of different amounts of EdStem training data on model performance.
  3. Pre-fine-tuning: Assessed the benefits of pre-fine-tuning on the Stack Exchange dataset before fine-tuning on EdStem data.
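
To make the cost side of the sweep concrete, here is a back-of-the-envelope count of trainable adapter parameters per rank, assuming adapters on the q_proj and v_proj matrices only (LLaMA-2-7b has hidden size 4096 and 32 decoder layers); the figures are illustrative, not measured:

```python
# Rough trainable-parameter counts for each swept LoRA rank, assuming
# adapters on q_proj and v_proj only. LLaMA-2-7b has hidden size 4096
# and 32 decoder layers.
HIDDEN, LAYERS, ADAPTED = 4096, 32, 2  # 2 adapted matrices per layer

for r in (1, 4, 16, 32):
    # Each adapted HIDDEN x HIDDEN matrix gains B (HIDDEN x r) and A (r x HIDDEN).
    trainable = r * 2 * HIDDEN * ADAPTED * LAYERS
    print(f"rank {r:>2}: {trainable / 1e6:5.2f}M trainable "
          f"({trainable / 7e9:.4%} of 7B)")
```

Even at rank 32, the adapters amount to well under 1% of the base model's parameters, which is why sweeping ranks is cheap relative to full fine-tuning.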

Results:

The model fine-tuned on the EdStem dataset produced responses specific to EECS 101, but those responses often contained more inaccuracies and logical errors than the output of the un-fine-tuned LLaMA-2.

The optimal LoRA rank value was determined to be 16, balancing qualitative and quantitative performance.

Increasing the number of EdStem training points improved performance, indicating the model could benefit from a larger dataset.

Pre-fine-tuning on Stack Exchange data slightly improved performance metrics, though the un-fine-tuned model often provided more logically consistent responses.

Conclusion:

The project demonstrated that fine-tuning LLaMA-2 for a specific academic context is feasible, though challenges such as model forgetting and limited training data were encountered. LoRA enabled efficient fine-tuning, and pre-fine-tuning on a broader dataset showed some benefit. Future work should focus on expanding the training dataset and on more refined hyperparameter tuning to further improve the model's performance.