This paper tackles perhaps the most pressing question in labor economics right now: what happens to workers when firms gain access to large language models? The question alone warrants serious attention. The data construction — linking O*NET task descriptions to GPT capability assessments to construct an occupation-level exposure index, then tracking CPS employment outcomes — is creative and well-executed. I can see this becoming a standard approach in the AI-and-labor literature.
That said, both referees converge on a fundamental concern that I share: the paper cannot distinguish displacement from augmentation using its current design. The exposure measure captures which occupations could be affected by LLMs, not how they are affected. The negative employment result could reflect displacement, but it could equally reflect compositional shifts (firms hiring fewer but more productive workers in these occupations) or pre-existing trends (these occupations were already declining). The Methods Referee's suggestion to decompose exposure into substitution and complementarity components is, in my view, essential — not just for this paper, but for the entire literature that will follow it.
The pre-trends concern is serious but not fatal. If the authors can show a clear acceleration after November 2022 on top of the existing trend, and if the Rambachan & Roth (2023) sensitivity analysis shows the result survives reasonable violations, I would be satisfied. The Domain Referee is right that the paper should engage with Brynjolfsson et al. (2025) and the growing augmentation evidence — the paper currently reads as if augmentation doesn't exist.
Domain Referee (Structuralist): 73/100 — Major Revisions. Finds the question first-order important but the theoretical framework underdeveloped. Wants a simple model distinguishing displacement from augmentation before going to the data. Concerned that the paper doesn't engage with the augmentation evidence and will age poorly if the employment effects reverse.
Methods Referee (Credibility Revolution): 71/100 — Major Revisions. The core shift-share design is competently executed but missing standard diagnostics (Goldsmith-Pinkham et al. 2020; Adao et al. 2019). Three major concerns: the exposure measure conflates displacement and augmentation, pre-trends start in 2019, and the result is driven by 3 of 22 SOC groups. All are addressable with the specified remedies.
[FATAL if unaddressed] Decompose the exposure index into substitution and complementarity components. Show which component drives the employment decline. Without this, the paper's headline finding is uninterpretable. (Methods Referee, Major #1; Domain Referee, Major #1)
[ADDRESSABLE] Implement Rambachan & Roth (2023) sensitivity analysis to quantify how large a pre-trend violation would need to be to explain the result. Show whether the treatment effect accelerates after November 2022 rather than merely continuing the 2019+ trend. (Methods Referee, Major #2)
[ADDRESSABLE] Report the full suite of Goldsmith-Pinkham et al. (2020) shift-share diagnostics: Rotemberg weights, leave-one-out sensitivity, and Adao et al. (2019) standard errors. The result being driven by 3 SOC codes needs to be confronted directly. (Methods Referee, Major #3)
[ADDRESSABLE] Engage seriously with the augmentation evidence (Brynjolfsson et al. 2025, Noy & Zhang 2023, Peng et al. 2023). The paper currently presents a displacement-only narrative. At minimum, discuss why occupation-level results might differ from the firm-level augmentation findings. (Domain Referee, Major #2)
[ADDRESSABLE] Add a placebo test using the 2012-2018 period with 2023 exposure scores. If the exposure index predicts employment changes before LLMs existed, the identifying variation is contaminated. (Methods Referee, Technical Suggestions)
[ADDRESSABLE] Discuss the zero wage effect alongside the negative employment effect. These are jointly inconsistent under standard models and the paper needs to take a position on why. (Methods Referee, Minor #4)
[TASTE] Domain Referee wants a formal model with displacement and augmentation equilibria. I do not require this — a clear conceptual framework with testable predictions is sufficient. A full model risks overwhelming a paper whose contribution is empirical. Authors may include one in an appendix if they wish.
[TASTE] Domain Referee suggests restricting the sample to 2022-2024 to avoid the pre-trends issue. This sacrifices the pre-period event study entirely. I prefer keeping the full sample with the Rambachan & Roth approach.
The Domain Referee wants a structural model motivating the empirical design. The Methods Referee wants cleaner reduced-form identification. These are not contradictory — a conceptual framework that generates the substitution/complementarity decomposition would satisfy both. The decomposition is the key revision. I recommend the authors frame it as: "our exposure index has two components; here is how to think about each; here is what we find for each." This addresses the Domain Referee's theoretical concern and the Methods Referee's identification concern simultaneously.
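In equations (notation mine, not the paper's), the framing I have in mind amounts to splitting the index and letting each component enter the employment regression separately:

```latex
% Split the exposure index into substitution and complementarity components
E_o = E_o^{\text{sub}} + E_o^{\text{comp}}

% Let each component carry its own coefficient
\Delta \ln L_{ot} = \beta_{\text{sub}}\, E_o^{\text{sub}}
                  + \beta_{\text{comp}}\, E_o^{\text{comp}}
                  + \lambda_t + \varepsilon_{ot}
```

A displacement story predicts $\beta_{\text{sub}} < 0$ with $\beta_{\text{comp}} \approx 0$; the reverse pattern would call for substantial reframing.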
The Domain Referee also argues the paper should present firm-level evidence to complement the occupation-level analysis. I agree this would strengthen the paper but do not require it for this revision. If occupation-level data from the CPS is the contribution, that is sufficient — but the limitations relative to firm-level studies must be discussed honestly.
I expect to receive the revision within 6 months. The decomposition of the exposure index is the critical revision — if it substantially changes the results (e.g., negative effects concentrate in high-complementarity rather than high-substitution occupations), please contact me before submitting the full revision to discuss reframing.
This paper estimates the effect of LLM adoption on occupation-level employment, wages, and task composition using a difference-in-differences design that exploits variation in occupational exposure to ChatGPT capabilities. The exposure measure combines pre-period task content (from O*NET) with the timing of GPT model releases as a shift-share instrument. The question is timely and the data construction is impressive. However, the identification strategy has two fundamental concerns: the exposure measure conflates potential displacement with potential augmentation, and the parallel trends assumption is difficult to defend when high-exposure occupations were already on differential trends pre-2022.
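For reference, the exposure index as described has the generic shift-share form (notation mine; the paper's exact construction may differ):

```latex
E_o = \sum_{k} s_{ok}\, g_k
```

where $s_{ok}$ is the pre-period O*NET share of task $k$ in occupation $o$ (the shares) and $g_k$ is the GPT capability shock to task $k$ (the shifts). Identification can come from either piece, which is why the diagnostics requested below matter.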
The event study in Figure 3 shows that high-exposure occupations were already declining in employment relative to low-exposure occupations from 2019 onward. The paper's explanation — that COVID differentially affected these occupations — is plausible but untested. If these occupations were on a differential trajectory for reasons unrelated to LLMs (e.g., ongoing automation via RPA, offshoring), the DiD estimate captures this trend, not the LLM effect.
(a) Implement the Rambachan & Roth (2023) sensitivity analysis showing how large a linear pre-trend violation would need to be to explain away the result. (b) Control for pre-period automation exposure (robots, RPA software adoption) interacted with time. (c) Show that the treatment effect accelerates after November 2022 (the ChatGPT launch) relative to the pre-trend; a level shift on top of the existing trend would be more convincing than a mere continuation of it.
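Remedy (c) can be made concrete by nesting a differential linear trend inside the specification (notation mine, not the paper's):

```latex
y_{ot} = \alpha_o + \lambda_t
       + \gamma\,(E_o \times t)
       + \delta\,\big(E_o \times \mathbf{1}\{t \ge 2022\text{Q4}\}\big)
       + \varepsilon_{ot}
```

Here $\gamma$ absorbs the pre-existing differential trend and $\delta$ is the post-ChatGPT level shift; a credible LLM effect should load on $\delta$, not $\gamma$.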
The shift-share design uses pre-period O*NET task shares as weights and GPT release timing as shocks. Goldsmith-Pinkham, Sorkin & Swift (2020) show that identification in shift-share designs can come from either the shares or the shocks. The paper does not report: (a) which occupations have the largest Rotemberg weights, (b) whether the result is sensitive to dropping high-leverage occupations, or (c) the Adao, Kolesar & Morales (2019) standard errors that account for cross-occupation correlation in the shocks.
Report the full Goldsmith-Pinkham et al. (2020) diagnostics. Show the top-5 Rotemberg weight occupations. Report results dropping each one. Implement Adao et al. (2019) SEs. If the top-weight occupations are sensible and the result survives leave-one-out, this concern is resolved.
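The leave-one-out piece of this exercise is mechanical enough to sketch. A minimal illustration on synthetic data (all variable names and magnitudes are hypothetical, not the paper's; Rotemberg weights and the Adao et al. SEs require the full shift-share structure and are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic occupation-level data: 22 SOC groups with an exposure index
# and an employment change that loads on exposure plus noise.
n_occ = 22
soc_groups = [f"SOC-{i:02d}" for i in range(n_occ)]
exposure = rng.uniform(0.0, 1.0, size=n_occ)
d_log_emp = -0.5 * exposure + rng.normal(0.0, 0.1, size=n_occ)

def ols_slope(x, y):
    """Slope coefficient from OLS of y on x with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

full_sample = ols_slope(exposure, d_log_emp)

# Leave-one-out: re-estimate the slope dropping each SOC group in turn.
loo = {}
for j, g in enumerate(soc_groups):
    mask = np.arange(n_occ) != j
    loo[g] = ols_slope(exposure[mask], d_log_emp[mask])

# Flag groups whose omission moves the estimate by more than 20%.
influential = [g for g, b in loo.items()
               if abs(b - full_sample) > 0.2 * abs(full_sample)]
print(f"full-sample slope: {full_sample:.3f}; influential groups: {influential}")
```

If the paper's result survives this loop for each of the 3 high-leverage SOC groups, the concern is largely resolved.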
The CPS sample excludes self-employed workers. If LLMs enable freelancing (writers, coders, designers shifting to self-employment), the employment decline in the CPS may overstate actual job destruction. Discuss this and, if possible, supplement with ACS data that captures self-employment.
Table 2 reports the exposure index quintile cutoffs but does not report which specific occupations fall in each quintile. Add an appendix table mapping quintiles to SOC codes with occupation names. Readers need to verify that the groupings make economic sense.
The GPT capability scores (used to construct the exposure index) are generated by GPT-4 itself. This circularity should be discussed. Webb (2020) uses patent-based measures as an alternative that avoids self-assessment bias.
The wage results (Table 4) show zero effect on hourly wages despite significant employment decline. This is inconsistent with a competitive labor market model. Either wages are sticky downward, composition effects mask the wage decline (displaced workers are lower-paid), or the employment result is spurious. The paper should discuss which interpretation it favors.
Implement the Borusyak, Hull & Jaravel (2022) shift-share framework as the primary specification. This provides both the recentered instrument and the correct SEs for shift-share designs with potentially endogenous shares.
Add a placebo test using the 2012-2018 pre-period: assign "treatment" based on 2023 exposure scores and estimate the same model. If you find effects during this period, the exposure index is capturing secular trends, not LLM-specific impacts.
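The placebo logic can be sketched on synthetic data (hypothetical names and magnitudes, not the paper's code): assign the ex-post exposure scores as "treatment" intensity in the 2012-2018 panel and check that the interaction coefficient is indistinguishable from zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic pre-period panel: 22 SOC groups x 7 years (2012-2018).
n_occ = 22
years = np.arange(2012, 2019)
exposure_2023 = rng.uniform(0.0, 1.0, size=n_occ)   # measured ex post
occ_fe = rng.normal(0.0, 0.2, size=n_occ)           # occupation effects

rows = []
for j in range(n_occ):
    for t in years:
        # Pre-LLM employment: occupation effect + common trend + noise.
        log_emp = occ_fe[j] + 0.01 * (t - 2012) + rng.normal(0.0, 0.05)
        # Fake "post" indicator splitting the placebo window.
        post_t = 1.0 if t >= 2016 else 0.0
        rows.append((log_emp, exposure_2023[j], post_t))

y, expo, post_d = map(np.array, zip(*rows))

# Pooled DiD: log_emp ~ exposure + post + exposure x post.
X = np.column_stack([np.ones_like(y), expo, post_d, expo * post_d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
did = beta[3]  # placebo interaction; should be near zero pre-LLM
print(f"placebo DiD coefficient: {did:.4f}")
```

In the paper's setting, a placebo coefficient significantly different from zero would indicate that the exposure index is picking up secular occupational trends.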
Consider a triple-difference design: high vs. low exposure occupations, before vs. after ChatGPT, in industries with high vs. low AI adoption rates (from the Census Business Trends survey). This third difference controls for occupation-specific trends.
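In regression form, the suggested triple difference would look like (notation mine; $\mathrm{HighAI}_i$ taken from whatever adoption measure the authors use):

```latex
y_{oit} = \beta\,\big(E_o \times \mathrm{Post}_t \times \mathrm{HighAI}_i\big)
        + \text{(all pairwise interactions)}
        + \alpha_o + \alpha_i + \lambda_t + \varepsilon_{oit}
```

Occupation-specific trends that are common across high- and low-adoption industries are differenced out of $\beta$.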
The exposure index is constructed at the 6-digit SOC level but the analysis is at 2-digit SOC. How sensitive are results to the aggregation level? Do results hold at 4-digit SOC?
What happens to the employment result if you control for remote work feasibility (Dingel & Neiman 2020)? High-exposure occupations overlap substantially with remote-compatible occupations, which had independent labor market dynamics post-COVID.
The paper ends in 2024Q2. ChatGPT launched November 2022. This is 18 months of post-treatment data. Is this enough time for firm-level adoption decisions to translate into occupation-level employment changes? What does the technology adoption literature suggest about timing?