Bridging the Food Security Gap: an information-led approach to connect dietary nutrition, food composition and crop production.

BACKGROUND
Food security is recognised as a major global challenge, yet human food chain systems are inherently not geared towards nutrition, with decisions on crop and cultivar choice not informed by dietary composition. Currently, food compositional tables and databases (FCT/FCDB) are the primary information source for decisions relating to dietary intake. However, these only present single mean values representing major components. Establishment of a systematic controlled vocabulary to fill this gap requires representation of a more complex set of semantic relationships between terms used to describe nutritional composition and dietary function.


RESULTS
We carried out a survey of 11 FCT/FCDB and 177 peer reviewed papers describing variation in nutritional composition and dietary function for food crops in order to identify a comprehensive set of terms to construct a controlled vocabulary. We used this information to generate a Crop Dietary Nutrition Data Framework (CDN-DF), which incorporates controlled vocabularies systematically organised into major classes representing nutritional components and dietary function. We demonstrate the value of the CDN-DF for comparison of equivalent components between crop species or cultivars, for identifying data gaps, as well as potential for formal meta-analysis. The CDN-DF also enabled us to explore relationships between nutritional components and functional attributes of food.


CONCLUSION
We have generated a structured Crop Dietary Nutrition Data Framework that is generally applicable to the collation and comparison of data relevant to crop researchers, breeders and other stakeholders, and will facilitate dialogue with nutritionists. It is currently guiding establishment of a more robust formal ontology. This article is protected by copyright. All rights reserved.


Introduction
Food security is recognised as a major global challenge, underlined by the need to meet sustainable calorific and nutritional requirements of a world population projected to reach nine billion within the next 15 years 1 . Whilst food security is a pre-requisite for achieving nutritional security, it is accepted that securing food supplies alone does not guarantee optimal nutritional status of a population 2,3 . Human nutrition encompasses the energy and essential nutrients required to fulfil the dietary needs of the body 4 . It has been suggested that most human food chain systems are inherently not geared towards nutrition 5 . Within most current food systems, decisions tend to be made independently by those involved in agricultural production, plant breeding, and human nutrition, with little connectivity 6 .
In practice, production decisions determining crop and cultivar choice by farmers are primarily driven by price, yield and market preference 7,8 , and thus tend to be poorly aligned with dietary needs 9,10 . Multiple examples exist, including a shift in wheat cultivars grown in W. Australia driven by export demand for cultivars meeting market specifications for Udon noodles 11 , cultivar preference in Ethiopia driven by environmental stability and adaptability 12 , and breeding selection for quinoa in the Peruvian Andes driven by mildew resistance, shorter maturation and other yield parameters 13 . In Central India, increased price has resulted in cultivation of rice, which is of inferior nutritional value when compared with sorghum, maize and millets 14 .
Within breeding programmes, traits affecting dietary nutritional composition tend to be of less importance than those affecting yield, biotic resistance and appearance 15,16 . Plant breeders are required to produce varieties that meet market requirements for uniformity, production efficiency and product quality, including specifications from the food industry 7 . This is true for private and public sector breeding 17,18 , as well as farmer breeding cooperatives 13,19 . This is also reflected in predictive models developed for farmers to identify and rank valuable crop traits for cultivar choice 20 , where value is primarily attached to production system or market preference, rather than nutritional content. At present, decisions by nutritionists and consumers for dietary intake tend to be made at the level of individual crops or food products 21 . This has led to a recognition that the lack of cultivar-specific nutritional composition data presents a significant obstacle to wider adoption of crop varieties with improved nutritional value 9, 22, 23 . Food compositional tables and databases (FCT/FCDB) are currently the primary sources of information for formulating guidelines for food intake, such as Recommended Daily Allowances (RDA) and Dietary Reference Intake (DRI) 24, 25 , as well as for food labelling 26,27 and marketing 28 . However, in FCT/FCDB, each nutritional component is presented as a single numerical concentration that represents a mean (not median) value. Moreover, there is no indication of variation attributable to either cultivar or growing environment. Although there has been an increased appreciation of functional foods with potentially positive effects on human health beyond basic nutrition 29,30 , such data are not widely managed within FCT/FCDB 31 .
It is interesting and perhaps ironic that in contrast to human nutrition, decision making for dietary intake of livestock feed formulation is controlled with high precision for distinct developmental stages of each animal species, including identification of crop cultivars that meet specific dietary needs 32,33 . In addition, numerous software applications have been developed with increasing sophistication to automate feed production, most of which incorporate relative cost and nutritional content, and more recently details of feed mix and livestock dietary outcomes 34 .
There appear to be tangible benefits to harmonising access to crop nutritional data for different decision makers in the food supply chain, such as plant breeders and nutritionists.
However, this requires a common understanding of concepts to facilitate dialogue, as well as increased interoperability of data sources. Developing such infrastructure starts with shared terminology and can be facilitated by formal systems of knowledge representation 35 . For example, adoption of common language based on controlled vocabularies has been shown to facilitate access and use of comparative data sources in metabolomics 36 . However, in the case of crop nutritional composition and dietary function data, this is likely to require reevaluation of the information processing pipelines and standards used for their generation, organisation and dissemination (Figure 1).
It has been recognised that a systematic and formal framework for describing and organising relevant nutritional components with controlled vocabularies would add value to crop nutrition research, by increasing the interoperability of data sources between plant breeders and nutritionists 37 . However, no comprehensive formal vocabulary has yet emerged for nutritional composition of food crops. Although some attempts have been made to establish controlled vocabularies for nutritional components such as 'protein' and 'lipid', these are incomplete and inconsistent. For instance, none of Crop Ontology 38 , FoodOn 39 or OntoFood 40 represents a comprehensive set of nutritional components or dietary functions, or provides sufficient detail in terms of structured relationships between terms 41 . This limits their utility for managing and comparing data within and between crops. Establishment of a systematic controlled vocabulary to fill this gap requires representation of a more complex set of semantic relationships between terms used to describe nutritional composition and dietary function.
In this paper, we compare the quality of information available in FCT/FCDB with more detailed sources of data that better reflect variation due to crop type, cultivar and interaction with production environments. We propose and outline development of the Crop Dietary Nutrition Data Framework (CDN-DF), based on controlled vocabularies systematically organised into simple hierarchical branching trees. Two main classes of terms represent nutritional components and dietary function. We include a use case of grain legumes, in order to understand the practical challenges faced when comparing nutritional information from different major and minor crops.

Development of Crop Dietary Nutrition Data Framework
The USDA Nutrient Database for Standard Reference (NDSR) 44 was used as the initial data source representing a well-developed national FCDB, from which 144 nutritional components were identified and allocated unique entity terms within the data framework (Table 1). Where synonyms of a given entity occurred in the literature, the term allocated was that of closest relevance to nutritionists (e.g. the fatty acid term 'oleic acid' was fixed, although it may appear in different literature and in ChEBI as '(9Z)-Octadec-9-enoic acid' or 'cis-∆9-octadecanoic acid 18:1 cis9').
The CDN-DF was designed and presented in Microsoft Excel TM (2013) spreadsheets for two major classes representing nutritional components and dietary function. The vocabulary within each class consists of unique entity terms that are arranged within a hierarchical branching tree implemented using the 'Group' function within Excel (Supplementary Tables 2 & 3).
Within the nutritional component class, entity terms were allocated to six primary categories, each representing the root node with one or more branches. The class tree was extended with three additional levels representing unique terms for progressively specific sub-categories, and a final level that was intended to correspond to the smallest bioavailable molecule (Supplementary Table 2 Tables 1 & 3).
Finally, corresponding ontology terms from Chemical Entities of Biological Interest (ChEBI) 45 were assigned to entity terms within both class trees (Figure 2).

Grain legume nutritional data
Nutritional composition data were collated within the CDN-DF for a subset of five grain legume (syn. pulse) species of the Fabaceae: soybean (Glycine max), chickpea (Cicer arietinum), cowpea (Vigna unguiculata), mungbean (V. radiata), and a taxonomically related  Table 4). Data from six sources were extracted to assess the relationship between total phenolic content (TPC) and anti-oxidative capacity as determined by diphenyl-1-picrylhydrazyl (DPPH) scavenging activity (Figure 3).

Comparison of different sources of nutritional data
We evaluated a range of different data sources in order to identify suitable terms for a controlled vocabulary representing nutritional components. Analysis of the compiled datasets ( Crop cultivar-specific data that reflect variation in nutritional composition, growing season or cultivation practice are absent in all nine national FCT/FCDB (Table 1) the Survey included physical parameters such as 1,000 seed weight, hydration and swelling capacity, and cooked firmness, along with physiochemical properties such as starch characteristics (peak viscosity, peak time, pasting temperature) 43 .
Another issue that hinders the direct comparison, inter-operability and sharing of crop nutrition data is the considerable variation in units used to report nutritional composition data within FCT/FCDB and research literature. In some cases concentrations of compounds are presented in up to four different units. For example, amino acid content was variously reported as g/16gN, % protein, % dry matter and g/kg protein; monosaccharide content as either mg/g dry matter basis, g/100g sample or % of sugar; and fatty acid content as % of total fatty acid, % in oil and mg/100g of sample. The use of different units appears primarily dependent on the analytical approach taken, but may also reflect available equipment, historical adoption of specific methods, or development of in-house methodologies.
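The kind of harmonisation this implies can be sketched for the four amino acid units listed above. This is an illustrative sketch only, not part of the CDN-DF: the function name, the unit strings and the choice of g/100 g protein as the target unit are assumptions, as is the general nitrogen-to-protein factor of 6.25 (under which 16 g N corresponds to 100 g protein); crop-specific factors may differ.

```python
# Illustrative sketch: harmonise amino acid values to g/100 g protein.
# ASSUMPTION: a general nitrogen-to-protein factor of 6.25 (16 g N = 100 g protein).
N_TO_PROTEIN = 6.25


def amino_acid_to_g_per_100g_protein(value, unit, protein_pct_dm=None):
    """Convert an amino acid concentration to g/100 g protein.

    `protein_pct_dm` (protein as % of dry matter) is only needed when the
    value is reported as % dry matter.
    """
    if unit == "g/16gN":
        # 16 g N x factor gives the protein mass the value refers to.
        return value * 100.0 / (16.0 * N_TO_PROTEIN)
    if unit == "% protein":
        # Already expressed per 100 g protein.
        return value
    if unit == "g/kg protein":
        return value / 10.0
    if unit == "% dry matter":
        if protein_pct_dm is None:
            raise ValueError("protein content (% dry matter) is required")
        return value / protein_pct_dm * 100.0
    raise ValueError(f"unsupported unit: {unit}")
```

A value reported as % dry matter cannot be converted without the protein content of the same sample, which is one reason why records lacking such metadata are difficult to compare across sources.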

Crop Dietary Nutrition Data Framework
Here we propose the establishment of the CDN-DF as a structured controlled vocabulary, organised within two major classes representing nutritional components (Supplementary Table 2) and dietary function (Supplementary Table 3). The vocabulary within each class consists of unique entity terms that are arranged within a hierarchical branching tree (Figure 2). A maximum of five levels were defined for the nutritional component class and three levels for the dietary function class, corresponding to a progressively granular representation.
A key property of this organisation is that the sum of component values corresponding to entity terms described in lower branches/levels may be used as proxies for the level above.
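This roll-up property can be sketched as a small tree structure. The sketch is hypothetical and is not the spreadsheet implementation described in this paper: the `Term` class, the `rolled_up_value` function, the child term names and the numbers are placeholders chosen only to show the mechanism.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """One entity term in a hierarchical branching tree (hypothetical model)."""
    name: str
    value: Optional[float] = None  # reported value at this level, if any
    children: List["Term"] = field(default_factory=list)


def rolled_up_value(term: Term) -> Optional[float]:
    """Return the reported value, else the sum of child values as a proxy.

    Returns None when neither the term nor any descendant carries a value.
    """
    if term.value is not None:
        return term.value
    known = [v for v in (rolled_up_value(c) for c in term.children) if v is not None]
    return sum(known) if known else None


# Placeholder tree and values, for illustration only (not measured data).
lipid = Term("lipid", children=[
    Term("fatty acid", children=[
        Term("saturated fatty acid", value=1.0),
        Term("unsaturated fatty acid", value=2.0),
    ]),
])
proxy = rolled_up_value(lipid)  # sum of the two lower-level values
```

A proxy obtained this way is a lower bound whenever some lower-level terms have no reported value, so a curation workflow would probably need to record whether a value was reported directly or derived by summation.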
Within the nutritional component class, 545 entity terms were allocated to six primary categories, each representing the root node with one or more branches. These closely correspond to the major proximate components 46 : carbohydrate, protein and lipid as well as mineral, vitamin and secondary metabolite. The class tree was then extended with three additional levels representing unique terms for progressively specific sub-categories, and a final level that was intended to correspond to the smallest bioavailable molecule (Supplementary Table 2). For the dietary function class a similar approach was taken, with a preliminary, non-exhaustive list of primary categories identified: anti-nutritional factors, food toxins, phytonutrients and antioxidants. At the second level more specific functional sub-categories were allocated.
In practice, the CDN-DF can be used to facilitate literature and database searches, followed by the recovery, collation and curation of data from multiple sources. For any particular dataset, individual data records and values should be allocated to a unique term within the CDN-DF, dependent upon the appropriate level at which it has been described.
The use of the controlled vocabulary and associated term may then be incorporated within a curation database underlying comparative or meta-analyses. It is important to recognise that a key step in the collation of data from multiple sources involves ensuring that the reporting units associated with each entity term are consistent or undergo appropriate conversion. Table 4).
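The allocation step described above can be sketched as a lookup from reported names to a single controlled term. This is a minimal sketch under stated assumptions: the `SYNONYMS` table, the record fields and the `allocate` function are hypothetical, and only the 'oleic acid' synonym is taken from the text.

```python
# Hypothetical synonym table mapping reported names (lower-cased) to one
# controlled term. Only the oleic acid example comes from the text.
SYNONYMS = {
    "oleic acid": "oleic acid",
    "(9z)-octadec-9-enoic acid": "oleic acid",
}


def allocate(records):
    """Attach a controlled term to each record; collect those that do not match.

    Each record is a dict with at least a 'reported_name' key (assumed layout).
    """
    allocated, unmatched = [], []
    for rec in records:
        key = rec["reported_name"].strip().lower()
        term = SYNONYMS.get(key)
        if term is None:
            unmatched.append(rec)  # needs manual curation
        else:
            allocated.append({**rec, "term": term})
    return allocated, unmatched


# Placeholder records: names only, no measured values.
records = [
    {"source": "paper A", "reported_name": "(9Z)-Octadec-9-enoic acid"},
    {"source": "paper B", "reported_name": "Oleic acid"},
    {"source": "paper C", "reported_name": "unknown compound"},
]
allocated, unmatched = allocate(records)
```

Keeping unmatched records in a separate list, rather than discarding them, leaves a visible queue for manual curation and for extending the vocabulary.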

Concentrations of total starch, resistant starch, starch amylose and reports of in vitro and in vivo glycaemic index (GI) showed intra- and inter-species ranges (Supplementary Table 4).
We also assessed the relationship between the nutritional component TPC and the functional attribute anti-oxidative capacity, as quantified by DPPH scavenging activity (Figure 3). The results indicated that anti-oxidative capacity is affected by both crop and cultivar selection, with mungbean cultivars showing the greatest DPPH scavenging capacity, followed by cowpea, soybean and chickpea (Figure 3). Our analysis also showed that mungbean and cowpea cultivars have a greater range of TPC concentration and DPPH scavenging activity in comparison to soybean and chickpea cultivars. No adequate data sources were identified for bambara groundnut.

Discussion
We carried out a survey of 11 FCT/FCDB and 177 peer reviewed papers describing variation in nutritional composition and dietary function for food crops in order to identify a comprehensive set of terms that could be used to construct a controlled vocabulary. We used this information to generate the CDN-DF, comprised of a systematic allocation of entity terms to two major classes organised as simple branching hierarchical trees.  (Table 1), which may mask variance due to different sampling and analysis protocols 46 as well as the actual range, variance and skewness in nutritional components due to cultivar 51 or growing environment 52 .
The reporting of a single crop compositional value may also fail to reflect variation associated with regional crop production and markets, which has been recognised as limiting the impact of crop biodiversity on food systems and nutrition 53 . Increasing the availability of cultivar-specific nutritional data is particularly relevant for regional decision-making by farmers or processors for production and sale into specific markets 8,9,54 , and indeed may help in the development of markets sensitive to nutritional composition.
At the consumption end of the food system, poor awareness of crop compositional variation may distort estimates of nutrient intake, particularly in distinguishing between micronutrient deficiency and adequacy. For example, grain protein concentration of rice cultivars has been reported to range 2.8-fold (5 to 14 g/100 g) 55 , whilst banana beta-carotene content can vary dramatically from 1 µg to 8,500 µg/100 g fresh weight between varieties 56 . The limited availability of comparative dietary function data within FCT/FCDB also limits the ability to manage human diet at the level of crop or cultivar 57,58 . However, there are ongoing efforts led by the team responsible for the NDSR to widen the number of secondary metabolite compounds included in FCT/FCDB 25,44,59 . In addition, reports from crop-specific trade bodies such as the US Pulse Quality Survey 42, 43 may provide nutritional composition data at cultivar level that affects market price and farmer cultivation decisions 8,14 .
There are also notable exceptions in the use of crop and cultivar-specific nutritional data in the nutraceutical and functional food sector where cultivar development has been vertically When implemented in spreadsheets, we have found that this structured vocabulary facilitates navigation and exploration of nutritional terms, as the hierarchical tree branches may be collapsed or expanded (Supplementary Tables 2 & 3). Once populated with data curated from multiple referenced sources, this schema allows rapid comparison of equivalent components between crop or cultivar, as well as identifying data gaps (Supplementary Table 4). This also provides a potentially valuable tool for formal meta-analysis, which relies on referenced data collation and management 65 .  Table 3). Although preliminary, we recognise that there is considerable scope to extend and refine this vocabulary, and represent the various complex relationships that exist between dietary function and relevant nutritional components 66
As a use case we analysed grain legume datasets organised using the controlled vocabulary and hierarchical relationships within the CDN-DF. Although this indicated that bambara groundnut was under-represented compared with the other four crops, we were able to establish that reported variation in the concentrations of protein, fatty acid and minerals covered a similar range as for major crops such as chickpea, cowpea and mungbean 69 .
However, gaps in data available for starch digestibility, vitamins, and the majority of phytochemicals and anti-nutritional factors highlight where additional datasets could be generated. Notwithstanding these gaps, the data suggest that there is sufficient variation within the global bambara groundnut genepool to develop high protein cultivars, and improve concentrations of unsaturated fatty acids.
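Gap identification of this kind reduces to checking which crop and term combinations have no curated record. The sketch below is hypothetical: `find_gaps`, the record layout and the example entries are placeholders and do not reproduce the contents of Supplementary Table 4.

```python
def find_gaps(crops, terms, records):
    """List (crop, term) pairs for which no curated record exists.

    `records` is an iterable of dicts with 'crop' and 'term' keys (assumed layout).
    """
    covered = {(r["crop"], r["term"]) for r in records}
    return [(c, t) for c in crops for t in terms if (c, t) not in covered]


# Placeholder coverage, for illustration only.
crops = ["chickpea", "bambara groundnut"]
terms = ["protein", "resistant starch"]
records = [
    {"crop": "chickpea", "term": "protein"},
    {"crop": "chickpea", "term": "resistant starch"},
    {"crop": "bambara groundnut", "term": "protein"},
]
gaps = find_gaps(crops, terms, records)
```

Because every record is tied to a single controlled term, the same check could be run at any level of the hierarchy, for instance to ask whether a crop has any data at all under a given primary category.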
The use case also demonstrated the value of combining compositional and functional data within the same structured framework. For example, we presented a positive correlation between TPC and anti-oxidative capacity (Figure 3), in agreement with well-established findings that phenolic compounds in grain legumes contribute to their antioxidant capacity 70,71 . The variation observed also suggests that TPC is a valid target for selection within breeding programs. In contrast, the relationship between functional attribute GI and food composition is more complex, and has been associated with resistant starch, typically attributed to higher concentrations of starch amylose 72,73 . For the five grain legumes, we were able to mine the available data to illustrate the considerable intra-species variation reported for starch amylose and resistant starch concentration, as well as for in vivo and in vitro GI values (Supplementary Table 4).
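Once TPC and DPPH values from different sources have been allocated to the same terms and units, an association of this kind can be quantified with a correlation coefficient. The following is a generic sketch of Pearson's r and is not a statement of the analysis behind Figure 3; the function name and the input numbers are placeholders rather than data from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Placeholder numbers only (NOT measurements): paired TPC and DPPH values.
tpc_placeholder = [1.0, 2.0, 3.0, 4.0]
dpph_placeholder = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(tpc_placeholder, dpph_placeholder)
```

A pooled coefficient across crops can be driven by between-species differences, so a formal meta-analysis would probably also examine the relationship within each crop.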
This comparison also highlighted the lack of coherent data at the crop cultivar level, which currently prevents any valid conclusions being drawn about the interaction between these parameters. Given the growing need to manage diabetes in global populations 74 , and reports of grain legumes being 'low GI' foods 72,75,76 , more comprehensive surveys of crops and cultivars are required to establish functional interactions between specific food components and GI, with the latter determined using standardised in vivo methods.
The CDN-DF represents a first step in facilitating the harmonisation of data sources and navigation of datasets for comparative analysis both within and between crops. Extending this further to increase access, sharing and re-use of datasets requires development of a formal ontology that is both machine- and human-readable. Such features are notably lacking from FCT/FCDB. The structured vocabulary we have defined here is underpinning the establishment of a Crop Dietary Nutrition Ontology (CDNO) 41 , which is expected to increase interoperability of data sources between breeders and nutritionists.
It is timely to develop standardised frameworks for knowledge representation relating to crop nutrition that adhere to the principles of F.A.I.R. (Findable, Accessible, Interoperable and Reusable) data management 77 . Initiatives such as the Breeding API (BrAPI) 78 and MIAPPE 79 are enhancing the ability of pre-breeding scientists and plant breeders to compare and make use of data from diverse sources. Likewise, the development of formal systems of knowledge representation, including ontologies, has contributed to progress in the sophistication of nutritional epidemiology research, leading to the recent development of the Ontology for Nutritional Epidemiology (ONE) 80 .
The particular value of the CDN-DF lies in its ready implementation and immediate availability to assist in collation of diverse datasets. The framework includes a hierarchical structure with controlled vocabularies both for nutritional composition data and for dietary function. We have demonstrated its value by compiling data for grain legumes, and deriving valuable information relating composition and functional nutrition. We anticipate the CDN-DF will play a role in wider endeavours to add value from F.A.I.R. data exchange such as the DivSeek International Network 81, 82 , and increase the ability of researchers, breeders and other stakeholders to compare data. This may include supplementing current FCT/FCDB with reciprocal data links and should allow for a more robust understanding of how crop type and cultivar contribute to dietary nutrition.